GWAS Tutorial#
This tutorial describes how to perform a GWAS with simple linear
regression, where each SNP is treated independently when determining its
effect on the phenotype. Lets consider our input data to be a genotype
matrix \(X\) which has \(N\) rows (number of individuals) and
\(M\) columns (number of variants, usually SNPs). We’re using
Genotype Representation
Graphs to
represent our genotype matrix. If you have .vcf.gz data, see
Converting .vcf.gz to GRG. If you have (or want) a
more efficicient representation than .vcf.gz see Converting IGD to
GRG.
For this tutorial, we’ll just use a (really) small simulated dataset of 200 individuals that we can quickly download. Replace it with your own dataset as desired.
The other piece of information we need for performing a GWAS is the phenotype, \(Y\). We are just going to simulate a phenotype for this tutorial, but replace it with your own phenotype as desired.
What you’ll need:
Python dependencies “grapp”, “igdtools”, “seaborn”:
pip install grapp igdtools seabornCommand line tool “wget”:
sudo apt install wget(or your system’s equivalent)
Get Dataset#
%%bash
if [[ ! -e gwas.example.igd ]]; then
# Download a small example dataset
wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O gwas.example.vcf.gz
# Convert to IGD; this isn't necessary, but most of the time you will want to do this
igdtools gwas.example.vcf.gz -o gwas.example.igd
fi
# Just show some stats about the dataset
igdtools -s gwas.example.igd
Stats for gwas.example.igd
... in range 0 - 18446744073709551615
Variants in range: 10893
Average samples/var: 50.2075
Stddev samples/var: 86.1985
Average var/sample: 1367.28
Stddev var/sample: 25.9211
Variants with missing data: 0
Total missing alleles: 0
Total unique sites: 10885
Convert to GRG#
%%bash
if [[ ! -e gwas.example.grg ]]; then
# -j controls how many threads to use.
grg construct -j 1 gwas.example.igd -o gwas.example.grg
fi
Simulate Phenotype#
Since this phenotype is just for demonstration purposes, we can use the default settings for phenotype simulation. See Simulating Phenotypes for a more detailed tutorial that describes the range of options. Here, we just want a phenotype that has been derived from the genotype: that is, our simulation will choose some “causal mutations” that are responsible for some proportion of the phenotype (this is called the heritability), and compute for each individual a phenotype value from the actual mutations the individual has.
By default, phenotype simulation draws effect sizes for causal SNPs from a standard normal distribution ~\(N(0, 1)\).
%%bash
grapp pheno gwas.example.grg
The initial effect sizes are
mutation_id effect_size causal_mutation_id
0 0 -1.657757 0
1 1 -0.537562 0
2 2 1.146189 0
3 3 -0.697705 0
4 4 -0.157002 0
... ... ... ...
10888 10888 -0.217740 0
10889 10889 0.936594 0
10890 10890 0.976964 0
10891 10891 0.274724 0
10892 10892 0.711005 0
[10893 rows x 3 columns]
The genetic values of the individuals are
individual_id genetic_value causal_mutation_id
0 0 -62.758200 0
1 1 8.271010 0
2 2 -93.726026 0
3 3 -58.657459 0
4 4 -47.272752 0
.. ... ... ...
195 195 -22.379358 0
196 196 34.579149 0
197 197 -40.854451 0
198 198 -21.653168 0
199 199 -69.561989 0
[200 rows x 3 columns]
Wrote phenotypes to gwas.example.grg.phen
Perform association between genotype and phenotype#
Now we want to learn the association between the genotype \(X\) and the phenotype \(Y\). We are using simple linear regression to do this. The result is a value \(\beta_i\) for each mutation \(i\): a larger absolute value indicates a larger effect of that mutation (variant) on the phenotype. We are finding correlations between \(X\) and \(Y\). A variant \(i\) that is highly correlated with a phenotype \(Y\) might have a causal relationship to \(Y\).
GWAS via command line#
First we’ll run the shell commands that perform a GWAS.
%%bash
# Using our phenotype file, emit a tab-separated (tsv) pandas dataframe containing the results of our GWAS between X (test-200-samples.grg) and Y (test-200-samples.grg.phen)
grapp assoc -p gwas.example.grg.phen -o gwas.example.gwas.tsv gwas.example.grg
Using the column 2 (the last one) for phenotype.
Wrote results to gwas.example.gwas.tsv
Now we can examine the results by loading the dataframe into pandas.
import pandas
gwas_df = pandas.read_csv("gwas.example.gwas.tsv", delimiter="\t")
gwas_df
| POS | ALT | REF | COUNT | BETA | B0 | SE | R2 | T | P | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 55829 | G | A | 4 | -0.512129 | 0.010243 | 0.505040 | 0.005166 | -1.014035 | 0.311804 |
| 1 | 56812 | T | G | 3 | 0.794622 | -0.011919 | 0.580456 | 0.009376 | 1.368961 | 0.172562 |
| 2 | 57349 | G | T | 1 | -1.466317 | 0.007332 | 0.999621 | 0.010750 | -1.466873 | 0.143997 |
| 3 | 58785 | T | C | 10 | -0.358335 | 0.017917 | 0.324263 | 0.006130 | -1.105077 | 0.270467 |
| 4 | 59367 | A | G | 2 | -0.118437 | 0.001184 | 0.712412 | 0.000140 | -0.166248 | 0.868131 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10888 | 9997601 | G | C | 3 | 0.113106 | -0.001697 | 0.583141 | 0.000190 | 0.193959 | 0.846407 |
| 10889 | 9998038 | A | C | 21 | -0.072860 | 0.007650 | 0.209914 | 0.000608 | -0.347095 | 0.728889 |
| 10890 | 9998412 | G | T | 42 | 0.068393 | -0.014362 | 0.164342 | 0.000874 | 0.416160 | 0.677743 |
| 10891 | 9999031 | C | G | 295 | -0.015402 | 0.022718 | 0.116634 | 0.000088 | -0.132055 | 0.895075 |
| 10892 | 9999126 | T | A | 2 | 1.195692 | -0.011957 | 0.707376 | 0.014225 | 1.690320 | 0.092540 |
10893 rows × 10 columns
BETA is the effect size for the variant at base-pair position
POS with alternate allele ALT. We can plot the histogram of our
inferred BETA values and see that it does indeed recover a normal
distribution centered at \(0\), which is what we simulated with our
phenotype.
import seaborn
seaborn.histplot(data=gwas_df, x="BETA")
<Axes: xlabel='BETA', ylabel='Count'>
GWAS via Python APIs#
Now we do the same GWAS, but using Python code instead of running
grapp assoc ... on the command line. This gives us more options. In
this example, we’ll standardize the genotype matrix \(X\) prior to
performing the linear regression, which will make all mutations
(variants) have the same variance. This has the effect of putting
variants with different allele frequencies on the same scale.
from grapp.assoc import linear_assoc_no_covar
import pygrgl
GRG_FILE = "gwas.example.grg"
PHEN_FILE = "gwas.example.grg.phen"
# Load the GRG into memory
grg = pygrgl.load_immutable_grg(GRG_FILE)
# Load the phenotype into memory
Y = pandas.read_csv(PHEN_FILE, delimiter="\t")
# Perform the GWAS
gwas_df = linear_assoc_no_covar(grg, Y["phenotypes"].to_numpy(), standardize=True)
gwas_df
| POS | ALT | REF | COUNT | BETA | B0 | SE | R2 | T | P | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 55829 | G | A | 4 | -0.071335 | 0.001427 | 0.070707 | 0.005116 | -1.008875 | 0.314266 |
| 1 | 56812 | T | G | 3 | 0.096223 | -0.001443 | 0.070558 | 0.009307 | 1.363732 | 0.174200 |
| 2 | 57349 | G | T | 1 | -0.103295 | 0.000516 | 0.070508 | 0.010724 | -1.465014 | 0.144503 |
| 3 | 58785 | T | C | 10 | -0.077090 | 0.003854 | 0.070676 | 0.005988 | -1.090740 | 0.276713 |
| 4 | 59367 | A | G | 2 | -0.011755 | 0.000118 | 0.070884 | 0.000139 | -0.165830 | 0.868460 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10888 | 9997601 | G | C | 3 | 0.013696 | -0.000205 | 0.070882 | 0.000189 | 0.193225 | 0.846981 |
| 10889 | 9998038 | A | C | 21 | -0.026328 | 0.002764 | 0.070864 | 0.000704 | -0.371527 | 0.710643 |
| 10890 | 9998412 | G | T | 42 | 0.029327 | -0.006159 | 0.070857 | 0.000903 | 0.413891 | 0.679402 |
| 10891 | 9999031 | C | G | 295 | -0.009143 | 0.013486 | 0.070880 | 0.000267 | -0.128993 | 0.897494 |
| 10892 | 9999126 | T | A | 2 | 0.118671 | -0.001187 | 0.070386 | 0.014155 | 1.686008 | 0.093369 |
10893 rows × 10 columns
We can see that the output format (dataframe) is the same as the command line version. We should expect some differences in the distribution of betas because we standardized \(X\) this time, though the shape should still be a normal distribution centered at \(0\).
seaborn.histplot(data=gwas_df, x="BETA")
<Axes: xlabel='BETA', ylabel='Count'>
You can see that the range of beta values is much smaller (from \(-0.3\) to \(+0.3\)).
Finally, we can perform one more GWAS where we use an approximate
distribution for the variance of each mutation in \(X\). Each value
in the \(X\) matrix can take on a value of \(0\), \(1\), or
\(2\), based on how many copies each of the (diploid) individuals
has for that allele. By default, the linear regression is performed with
the exact sum of squared errors, based on these diploid values in
\(X\). Theoretically, in the absence of many complicating effects
(selection, genotyping error, inbreeding, etc.) the variance of each
mutation should follow a binomial distribution with mean
\(2 \times f_i\) (where \(f_i\) is the allele frequency for
mutation \(i\)). You can choose to perform the regression using the
binomial distribution, instead of the exact values, via
dist="binomial", as shown below.
gwas_df = linear_assoc_no_covar(grg, Y["phenotypes"].to_numpy(), dist="binomial")
seaborn.histplot(data=gwas_df, x="BETA")
<Axes: xlabel='BETA', ylabel='Count'>
You can see that this (non-standardized) distribution follows our original (exact) result pretty closely, even with a small sample size of 200 individuals. Partially this is because our genotype matrix \(X\) was generated using a neutral simulation; in real datasets, the binomial distribution may cause more beta values to differ from the exact value than is shown here.