GWAS Tutorial#

This tutorial describes how to perform a GWAS with simple linear regression, where each SNP is treated independently when determining its effect on the phenotype. Lets consider our input data to be a genotype matrix \(X\) which has \(N\) rows (number of individuals) and \(M\) columns (number of variants, usually SNPs). We’re using Genotype Representation Graphs to represent our genotype matrix. If you have .vcf.gz data, see Converting .vcf.gz to GRG. If you have (or want) a more efficicient representation than .vcf.gz see Converting IGD to GRG.

For this tutorial, we’ll just use a (really) small simulated dataset of 200 individuals that we can quickly download. Replace it with your own dataset as desired.

The other piece of information we need for performing a GWAS is the phenotype, \(Y\). We are just going to simulate a phenotype for this tutorial, but replace it with your own phenotype as desired.

What you’ll need:

Python dependencies “grapp”, “igdtools”, “seaborn”: pip install grapp igdtools seaborn
Command line tool “wget”: sudo apt install wget (or your system’s equivalent)

Get Dataset#

%%bash

if [[ ! -e gwas.example.igd ]]; then
  # Download a small example dataset
  wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O gwas.example.vcf.gz

  # Convert to IGD; this isn't necessary, but most of the time you will want to do this
  igdtools gwas.example.vcf.gz -o gwas.example.igd
fi

# Just show some stats about the dataset
igdtools -s gwas.example.igd

Stats for gwas.example.igd
... in range 0 - 18446744073709551615
  Variants in range: 10893
  Average samples/var: 50.2075
  Stddev samples/var: 86.1985
  Average var/sample: 1367.28
  Stddev var/sample: 25.9211
  Variants with missing data: 0
  Total missing alleles: 0
  Total unique sites: 10885

Convert to GRG#

%%bash

if [[ ! -e gwas.example.grg ]]; then
  # -j controls how many threads to use.
  grg construct -j 1 gwas.example.igd -o gwas.example.grg
fi

Simulate Phenotype#

Since this phenotype is just for demonstration purposes, we can use the default settings for phenotype simulation. See Simulating Phenotypes for a more detailed tutorial that describes the range of options. Here, we just want a phenotype that has been derived from the genotype: that is, our simulation will choose some “causal mutations” that are responsible for some proportion of the phenotype (this is called the heritability), and compute for each individual a phenotype value from the actual mutations the individual has.

By default, phenotype simulation draws effect sizes for causal SNPs from a standard normal distribution ~\(N(0, 1)\).

%%bash

grapp pheno gwas.example.grg

The initial effect sizes are
       mutation_id  effect_size  causal_mutation_id
0                0    -1.657757                   0
1                1    -0.537562                   0
2                2     1.146189                   0
3                3    -0.697705                   0
4                4    -0.157002                   0
...            ...          ...                 ...
10888        10888    -0.217740                   0
10889        10889     0.936594                   0
10890        10890     0.976964                   0
10891        10891     0.274724                   0
10892        10892     0.711005                   0

[10893 rows x 3 columns]
The genetic values of the individuals are
     individual_id  genetic_value  causal_mutation_id
0                0     -62.758200                   0
1                1       8.271010                   0
2                2     -93.726026                   0
3                3     -58.657459                   0
4                4     -47.272752                   0
..             ...            ...                 ...
195            195     -22.379358                   0
196            196      34.579149                   0
197            197     -40.854451                   0
198            198     -21.653168                   0
199            199     -69.561989                   0

[200 rows x 3 columns]

Wrote phenotypes to gwas.example.grg.phen

Perform association between genotype and phenotype#

Now we want to learn the association between the genotype \(X\) and the phenotype \(Y\). We are using simple linear regression to do this. The result is a value \(\beta_i\) for each mutation \(i\): a larger absolute value indicates a larger effect of that mutation (variant) on the phenotype. We are finding correlations between \(X\) and \(Y\). A variant \(i\) that is highly correlated with a phenotype \(Y\) might have a causal relationship to \(Y\).

GWAS via command line#

First we’ll run the shell commands that perform a GWAS.

%%bash

# Using our phenotype file, emit a tab-separated (tsv) pandas dataframe containing the results of our GWAS between X (test-200-samples.grg) and Y (test-200-samples.grg.phen)
grapp assoc -p gwas.example.grg.phen -o gwas.example.gwas.tsv gwas.example.grg

Using the column 2 (the last one) for phenotype.

Wrote results to gwas.example.gwas.tsv

Now we can examine the results by loading the dataframe into pandas.

import pandas

gwas_df = pandas.read_csv("gwas.example.gwas.tsv", delimiter="\t")
gwas_df

	POS	ALT	REF	COUNT	BETA	B0	SE	R2	T	P
0	55829	G	A	4	-0.512129	0.010243	0.505040	0.005166	-1.014035	0.311804
1	56812	T	G	3	0.794622	-0.011919	0.580456	0.009376	1.368961	0.172562
2	57349	G	T	1	-1.466317	0.007332	0.999621	0.010750	-1.466873	0.143997
3	58785	T	C	10	-0.358335	0.017917	0.324263	0.006130	-1.105077	0.270467
4	59367	A	G	2	-0.118437	0.001184	0.712412	0.000140	-0.166248	0.868131
...	...	...	...	...	...	...	...	...	...	...
10888	9997601	G	C	3	0.113106	-0.001697	0.583141	0.000190	0.193959	0.846407
10889	9998038	A	C	21	-0.072860	0.007650	0.209914	0.000608	-0.347095	0.728889
10890	9998412	G	T	42	0.068393	-0.014362	0.164342	0.000874	0.416160	0.677743
10891	9999031	C	G	295	-0.015402	0.022718	0.116634	0.000088	-0.132055	0.895075
10892	9999126	T	A	2	1.195692	-0.011957	0.707376	0.014225	1.690320	0.092540

10893 rows × 10 columns

BETA is the effect size for the variant at base-pair position POS with alternate allele ALT. We can plot the histogram of our inferred BETA values and see that it does indeed recover a normal distribution centered at \(0\), which is what we simulated with our phenotype.

import seaborn
seaborn.histplot(data=gwas_df, x="BETA")

<Axes: xlabel='BETA', ylabel='Count'>

GWAS via Python APIs#

Now we do the same GWAS, but using Python code instead of running grapp assoc ... on the command line. This gives us more options. In this example, we’ll standardize the genotype matrix \(X\) prior to performing the linear regression, which will make all mutations (variants) have the same variance. This has the effect of putting variants with different allele frequencies on the same scale.

from grapp.assoc import linear_assoc_no_covar
import pygrgl

GRG_FILE = "gwas.example.grg"
PHEN_FILE = "gwas.example.grg.phen"

# Load the GRG into memory
grg = pygrgl.load_immutable_grg(GRG_FILE)

# Load the phenotype into memory
Y = pandas.read_csv(PHEN_FILE, delimiter="\t")

# Perform the GWAS
gwas_df = linear_assoc_no_covar(grg, Y["phenotypes"].to_numpy(), standardize=True)
gwas_df

	POS	ALT	REF	COUNT	BETA	B0	SE	R2	T	P
0	55829	G	A	4	-0.071335	0.001427	0.070707	0.005116	-1.008875	0.314266
1	56812	T	G	3	0.096223	-0.001443	0.070558	0.009307	1.363732	0.174200
2	57349	G	T	1	-0.103295	0.000516	0.070508	0.010724	-1.465014	0.144503
3	58785	T	C	10	-0.077090	0.003854	0.070676	0.005988	-1.090740	0.276713
4	59367	A	G	2	-0.011755	0.000118	0.070884	0.000139	-0.165830	0.868460
...	...	...	...	...	...	...	...	...	...	...
10888	9997601	G	C	3	0.013696	-0.000205	0.070882	0.000189	0.193225	0.846981
10889	9998038	A	C	21	-0.026328	0.002764	0.070864	0.000704	-0.371527	0.710643
10890	9998412	G	T	42	0.029327	-0.006159	0.070857	0.000903	0.413891	0.679402
10891	9999031	C	G	295	-0.009143	0.013486	0.070880	0.000267	-0.128993	0.897494
10892	9999126	T	A	2	0.118671	-0.001187	0.070386	0.014155	1.686008	0.093369

10893 rows × 10 columns

We can see that the output format (dataframe) is the same as the command line version. We should expect some differences in the distribution of betas because we standardized \(X\) this time, though the shape should still be a normal distribution centered at \(0\).

seaborn.histplot(data=gwas_df, x="BETA")

<Axes: xlabel='BETA', ylabel='Count'>

You can see that the range of beta values is much smaller (from \(-0.3\) to \(+0.3\)).

Finally, we can perform one more GWAS where we use an approximate distribution for the variance of each mutation in \(X\). Each value in the \(X\) matrix can take on a value of \(0\), \(1\), or \(2\), based on how many copies each of the (diploid) individuals has for that allele. By default, the linear regression is performed with the exact sum of squared errors, based on these diploid values in \(X\). Theoretically, in the absence of many complicating effects (selection, genotyping error, inbreeding, etc.) the variance of each mutation should follow a binomial distribution with mean \(2 \times f_i\) (where \(f_i\) is the allele frequency for mutation \(i\)). You can choose to perform the regression using the binomial distribution, instead of the exact values, via dist="binomial", as shown below.

gwas_df = linear_assoc_no_covar(grg, Y["phenotypes"].to_numpy(), dist="binomial")
seaborn.histplot(data=gwas_df, x="BETA")

<Axes: xlabel='BETA', ylabel='Count'>

You can see that this (non-standardized) distribution follows our original (exact) result pretty closely, even with a small sample size of 200 individuals. Partially this is because our genotype matrix \(X\) was generated using a neutral simulation; in real datasets, the binomial distribution may cause more beta values to differ from the exact value than is shown here.