GWAS Tutorial: with covariates#

This tutorial builds on the simple GWAS tutorial and the PCA tutorial. Here we show how to perform a GWAS with covariates. In this case, we use the first 10 principal components as the covariates, but covariates can be anything relating to the phenotype \(Y\). Again, this is simple linear regression. We’re using Genotype Representation Graphs to represent our genotype matrix. If you’re not sure how to get a GRG, start with the simple GWAS tutorial instead, since we are re-using the gwas.example.grg from that tutorial.

What you’ll need:

  • Files: gwas.example.grg

  • Python dependencies “grapp”, “seaborn”: pip install grapp igdtools seaborn

Simulate Phenotype#

Like the simple GWAS tutorial, we are mostly using the default settings for phenotype simulation, which just draws effect sizes for causal SNPs from a standard normal distribution ~\(N(0, 1)\). However, we are going to add a slight complication: we are going to assume that there is a binary attribute of individuals that affects the phenotype, but is not genetic (or at least, is more environmental than genetic). We can do this by using the grg_pheno_sim Python APIs and specifying a covariate as input to the simulation.

import numpy
import pygrgl

numpy.random.seed(42)

# Load the GRG dataset
grg = pygrgl.load_immutable_grg("gwas.example.grg")

# Generate our random (binary) environment attribute that affects Y
env_attribute = ((numpy.random.uniform(size=(grg.num_individuals, 1)) > 0.5) * 1).astype(numpy.float64)
from grg_pheno_sim.phenotype import add_covariates, convert_to_phen
import pandas

# Now simulate our phenotype, taking into account that our environmental attribute affects it
heritability = 0.33   # h^2

covar = env_attribute
covar_effect = numpy.array([1000.0]) # Effect that the covariate has on the final phenotype value
# add_covariates() takes all the same keyword arguments as sim_phenotypes()
phenotypes = add_covariates(grg, covar, covar_effect, heritability=heritability)
# Write the dataframe to a file format that can be used for GWAS
convert_to_phen(phenotypes, "covar.example.grg.phen", include_header=True)
The initial effect sizes are
       mutation_id  effect_size  causal_mutation_id
0                0    -1.657757                   0
1                1    -0.537562                   0
2                2     1.146189                   0
3                3    -0.697705                   0
4                4    -0.157002                   0
...            ...          ...                 ...
10888        10888    -0.217740                   0
10889        10889     0.936594                   0
10890        10890     0.976964                   0
10891        10891     0.274724                   0
10892        10892     0.711005                   0

[10893 rows x 3 columns]
The genetic values of the individuals are
     individual_id  genetic_value  causal_mutation_id
0                0     -62.758200                   0
1                1       8.271010                   0
2                2     -93.726026                   0
3                3     -58.657459                   0
4                4     -47.272752                   0
..             ...            ...                 ...
195            195     -22.379358                   0
196            196      34.579149                   0
197            197     -40.854451                   0
198            198     -21.653168                   0
199            199     -69.561989                   0

[200 rows x 3 columns]
phenotypes
causal_mutation_id individual_id genetic_value environmental_noise phenotype covariate_value
0 0 0 -62.758200 -59.329025 -122.087225 0.0
1 0 1 8.271010 -26.729415 981.541595 1000.0
2 0 2 -93.726026 58.802882 965.076856 1000.0
3 0 3 -58.657459 29.882231 971.224772 1000.0
4 0 4 -47.272752 13.143984 -34.128768 0.0
... ... ... ... ... ... ...
195 0 195 -22.379358 52.827780 30.448422 0.0
196 0 196 34.579149 -45.371223 989.207926 1000.0
197 0 197 -40.854451 103.116600 1062.262149 1000.0
198 0 198 -21.653168 4.893175 983.240007 1000.0
199 0 199 -69.561989 7.776493 938.214504 1000.0

200 rows × 6 columns

Get the top 2 principal components#

The top 10 principal components (those associated with the 10 largest eigenvalues) of the genotype matrix \(X\) are often used to correct for population stratification when performing GWAS. These PCs are the dimensions of most variance across the dataset, which is expected to reflect genetic differences primarily due to individuals being from different ancestral populations (those are the “biggest” genetic differences). Since this dataset is so small, we just use 2 PCs here, which we compute from our GRG.

%%bash
grapp pca -d 2 gwas.example.grg -o gwas.covar.pcs.tsv
Running eigen decomposition on 200 individuals
Wrote PCs to gwas.covar.pcs.tsv
import pandas
# Load our PCs and display the dataframe
pca_df = pandas.read_csv("gwas.covar.pcs.tsv", delimiter="\t")
pca_df
PC1 PC2
0 0.029621 -0.145431
1 -0.081435 -0.085890
2 -0.009044 0.050044
3 -0.072203 0.006810
4 0.011696 0.049359
... ... ...
195 0.023435 0.030237
196 -0.002435 0.001847
197 -0.154034 0.048480
198 -0.085177 0.052992
199 0.036344 0.092449

200 rows × 2 columns

We want to combine the PCs and our attribute into a single dataframe, which we will use as the covariates input to our GWAS.

attr_df = pandas.DataFrame({"EnvAttr": env_attribute.T[0]})
covar_df = pandas.concat([pca_df, attr_df], axis=1)
covar_df.to_csv("covar.example.covar.tsv", sep="\t", index=False)
covar_df
PC1 PC2 EnvAttr
0 0.029621 -0.145431 0.0
1 -0.081435 -0.085890 1.0
2 -0.009044 0.050044 1.0
3 -0.072203 0.006810 1.0
4 0.011696 0.049359 0.0
... ... ... ...
195 0.023435 0.030237 0.0
196 -0.002435 0.001847 1.0
197 -0.154034 0.048480 1.0
198 -0.085177 0.052992 1.0
199 0.036344 0.092449 1.0

200 rows × 3 columns

Perform association with covariates#

We want a GWAS that “ignores the effects” of the covariates, which are 10 PCs + one individual attribute in our case.

GWAS via command line#

First we’ll run the shell commands that perform a GWAS.

%%bash

# Using our phenotype file, emit a tab-separated (tsv) pandas dataframe containing the results of our GWAS with covariates
grapp assoc -c covar.example.covar.tsv -p covar.example.grg.phen -o covar.example.gwascov.tsv gwas.example.grg

# Same thing, but WITHOUT covariates, just for comparison
grapp assoc -p covar.example.grg.phen -o covar.example.gwas.tsv gwas.example.grg
Using the column 2 (the last one) for phenotype.
Wrote results to covar.example.gwascov.tsv
Using the column 2 (the last one) for phenotype.
Wrote results to covar.example.gwas.tsv

Now we can examine both results by loading the dataframes into pandas.

import pandas

gwascov_df = pandas.read_csv("covar.example.gwascov.tsv", delimiter="\t")
gwas_df = pandas.read_csv("covar.example.gwas.tsv", delimiter="\t")
gwascov_df
POS ALT REF COUNT BETA SE T P
0 55829 G A 4 -37.607441 38.817199 -0.968834 0.333828
1 56812 T G 3 90.404328 41.735642 2.166118 0.031516
2 57349 G T 1 -63.659076 72.729817 -0.875282 0.382497
3 58785 T C 10 10.482727 24.477519 0.428259 0.668935
4 59367 A G 2 -24.403985 52.357712 -0.466101 0.641663
... ... ... ... ... ... ... ... ...
10888 9997601 G C 3 51.605426 42.111952 1.225434 0.221890
10889 9998038 A C 21 3.935289 15.240090 0.258220 0.796510
10890 9998412 G T 42 -2.619399 11.996761 -0.218342 0.827391
10891 9999031 C G 295 3.987359 8.705235 0.458041 0.647433
10892 9999126 T A 2 -21.656135 51.830636 -0.417825 0.676535

10893 rows × 8 columns

BETA is the effect size for the variant at base-pair position POS with alternate allele ALT. We can plot the histogram of our inferred BETA values and see that it does indeed recover a normal distribution centered at \(0\), which is what we simulated with our phenotype.

import seaborn
import matplotlib.pyplot as plt

f, axs = plt.subplots(1, 2, figsize=(12, 5))
ax = seaborn.histplot(data=gwas_df, x="BETA", ax=axs[0])
axs[0].set_title("GWAS $\it{without}$ covariates")
seaborn.histplot(data=gwascov_df, x="BETA", ax=axs[1])
axs[1].set_title("GWAS $\it{with}$ covariates")
Text(0.5, 1.0, 'GWAS $\it{with}$ covariates')
../_images/GWASCovariates_15_1.png

Since we simulated the genetic effects that “cause” the phenotype we’re studying, we know that the underlying BETA distribution is normal and centered at 0. Both of these results show a roughly normal distribution centered at 0, but you can see that the one which corrected for the covariate has a clearer distribution without the “extra” peaks around `-500-.

Using the Python API#

We can just repeat the same thing via Python, as an illustration.

from grapp.assoc import linear_assoc_covar
import pygrgl

GRG_FILE = "gwas.example.grg"
PHEN_FILE = "covar.example.grg.phen"

# Load the GRG into memory
grg = pygrgl.load_immutable_grg(GRG_FILE)

# Load the phenotype into memory
Y = pandas.read_csv(PHEN_FILE, delimiter="\t")

# Perform the GWAS
gwas_df = linear_assoc_covar(grg, Y["phenotypes"].to_numpy(), covar_df.to_numpy())
gwas_df
POS ALT REF COUNT BETA SE T P
0 55829 G A 4 -37.607441 38.817199 -0.968834 0.333828
1 56812 T G 3 90.404328 41.735642 2.166118 0.031516
2 57349 G T 1 -63.659076 72.729817 -0.875282 0.382497
3 58785 T C 10 10.482727 24.477519 0.428259 0.668935
4 59367 A G 2 -24.403985 52.357712 -0.466101 0.641663
... ... ... ... ... ... ... ... ...
10888 9997601 G C 3 51.605426 42.111952 1.225434 0.221890
10889 9998038 A C 21 3.935289 15.240090 0.258220 0.796510
10890 9998412 G T 42 -2.619399 11.996761 -0.218342 0.827391
10891 9999031 C G 295 3.987359 8.705235 0.458041 0.647433
10892 9999126 T A 2 -21.656135 51.830636 -0.417825 0.676535

10893 rows × 8 columns

We can see that the output format (dataframe) is the same as the command line version. We should expect some differences in the distribution of betas because we standardized \(X\) this time, though the shape should still be a normal distribution centered at \(0\).

seaborn.histplot(data=gwas_df, x="BETA")
<Axes: xlabel='BETA', ylabel='Count'>
../_images/GWASCovariates_20_1.png