GWAS Tutorial: with covariates#
This tutorial builds on the simple GWAS tutorial and
the PCA tutorial. Here we show how to perform a GWAS with
covariates. In this case, we use the first 10 principal components as
the covariates, but covariates can be anything relating to the phenotype
\(Y\). Again, this is simple linear regression. We’re using
Genotype Representation
Graphs to
represent our genotype matrix. If you’re not sure how to get a GRG,
start with the simple GWAS tutorial instead, since we
are re-using the gwas.example.grg from that tutorial.
What you’ll need:
Files:
gwas.example.grgPython dependencies “grapp”, “seaborn”:
pip install grapp igdtools seaborn
Simulate Phenotype#
Like the simple GWAS tutorial, we are mostly using the default settings for phenotype simulation, which just draws effect sizes for causal SNPs from a standard normal distribution ~\(N(0, 1)\). However, we are going to add a slight complication: we are going to assume that there is a binary attribute of individuals that affects the phenotype, but is not genetic (or at least, is more environmental than genetic). We can do this by using the grg_pheno_sim Python APIs and specifying a covariate as input to the simulation.
import numpy
import pygrgl
numpy.random.seed(42)
# Load the GRG dataset
grg = pygrgl.load_immutable_grg("gwas.example.grg")
# Generate our random (binary) environment attribute that affects Y
env_attribute = ((numpy.random.uniform(size=(grg.num_individuals, 1)) > 0.5) * 1).astype(numpy.float64)
from grg_pheno_sim.phenotype import add_covariates, convert_to_phen
import pandas
# Now simulate our phenotype, taking into account that our environmental attribute affects it
heritability = 0.33 # h^2
covar = env_attribute
covar_effect = numpy.array([1000.0]) # Effect that the covariate has on the final phenotype value
# add_covariates() takes all the same keyword arguments as sim_phenotypes()
phenotypes = add_covariates(grg, covar, covar_effect, heritability=heritability)
# Write the dataframe to a file format that can be used for GWAS
convert_to_phen(phenotypes, "covar.example.grg.phen", include_header=True)
The initial effect sizes are
mutation_id effect_size causal_mutation_id
0 0 -1.657757 0
1 1 -0.537562 0
2 2 1.146189 0
3 3 -0.697705 0
4 4 -0.157002 0
... ... ... ...
10888 10888 -0.217740 0
10889 10889 0.936594 0
10890 10890 0.976964 0
10891 10891 0.274724 0
10892 10892 0.711005 0
[10893 rows x 3 columns]
The genetic values of the individuals are
individual_id genetic_value causal_mutation_id
0 0 -62.758200 0
1 1 8.271010 0
2 2 -93.726026 0
3 3 -58.657459 0
4 4 -47.272752 0
.. ... ... ...
195 195 -22.379358 0
196 196 34.579149 0
197 197 -40.854451 0
198 198 -21.653168 0
199 199 -69.561989 0
[200 rows x 3 columns]
phenotypes
| causal_mutation_id | individual_id | genetic_value | environmental_noise | phenotype | covariate_value | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | -62.758200 | -59.329025 | -122.087225 | 0.0 |
| 1 | 0 | 1 | 8.271010 | -26.729415 | 981.541595 | 1000.0 |
| 2 | 0 | 2 | -93.726026 | 58.802882 | 965.076856 | 1000.0 |
| 3 | 0 | 3 | -58.657459 | 29.882231 | 971.224772 | 1000.0 |
| 4 | 0 | 4 | -47.272752 | 13.143984 | -34.128768 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 0 | 195 | -22.379358 | 52.827780 | 30.448422 | 0.0 |
| 196 | 0 | 196 | 34.579149 | -45.371223 | 989.207926 | 1000.0 |
| 197 | 0 | 197 | -40.854451 | 103.116600 | 1062.262149 | 1000.0 |
| 198 | 0 | 198 | -21.653168 | 4.893175 | 983.240007 | 1000.0 |
| 199 | 0 | 199 | -69.561989 | 7.776493 | 938.214504 | 1000.0 |
200 rows × 6 columns
Get the top 2 principal components#
The top 10 principal components (those associated with the 10 largest eigenvalues) of the genotype matrix \(X\) are often used to correct for population stratification when performing GWAS. These PCs are the dimensions of most variance across the dataset, which is expected to reflect genetic differences primarily due to individuals being from different ancestral populations (those are the “biggest” genetic differences). Since this dataset is so small, we just use 2 PCs here, which we compute from our GRG.
%%bash
grapp pca -d 2 gwas.example.grg -o gwas.covar.pcs.tsv
Running eigen decomposition on 200 individuals
Wrote PCs to gwas.covar.pcs.tsv
import pandas
# Load our PCs and display the dataframe
pca_df = pandas.read_csv("gwas.covar.pcs.tsv", delimiter="\t")
pca_df
| PC1 | PC2 | |
|---|---|---|
| 0 | 0.029621 | -0.145431 |
| 1 | -0.081435 | -0.085890 |
| 2 | -0.009044 | 0.050044 |
| 3 | -0.072203 | 0.006810 |
| 4 | 0.011696 | 0.049359 |
| ... | ... | ... |
| 195 | 0.023435 | 0.030237 |
| 196 | -0.002435 | 0.001847 |
| 197 | -0.154034 | 0.048480 |
| 198 | -0.085177 | 0.052992 |
| 199 | 0.036344 | 0.092449 |
200 rows × 2 columns
We want to combine the PCs and our attribute into a single dataframe, which we will use as the covariates input to our GWAS.
attr_df = pandas.DataFrame({"EnvAttr": env_attribute.T[0]})
covar_df = pandas.concat([pca_df, attr_df], axis=1)
covar_df.to_csv("covar.example.covar.tsv", sep="\t", index=False)
covar_df
| PC1 | PC2 | EnvAttr | |
|---|---|---|---|
| 0 | 0.029621 | -0.145431 | 0.0 |
| 1 | -0.081435 | -0.085890 | 1.0 |
| 2 | -0.009044 | 0.050044 | 1.0 |
| 3 | -0.072203 | 0.006810 | 1.0 |
| 4 | 0.011696 | 0.049359 | 0.0 |
| ... | ... | ... | ... |
| 195 | 0.023435 | 0.030237 | 0.0 |
| 196 | -0.002435 | 0.001847 | 1.0 |
| 197 | -0.154034 | 0.048480 | 1.0 |
| 198 | -0.085177 | 0.052992 | 1.0 |
| 199 | 0.036344 | 0.092449 | 1.0 |
200 rows × 3 columns
Perform association with covariates#
We want a GWAS that “ignores the effects” of the covariates, which are 10 PCs + one individual attribute in our case.
GWAS via command line#
First we’ll run the shell commands that perform a GWAS.
%%bash
# Using our phenotype file, emit a tab-separated (tsv) pandas dataframe containing the results of our GWAS with covariates
grapp assoc -c covar.example.covar.tsv -p covar.example.grg.phen -o covar.example.gwascov.tsv gwas.example.grg
# Same thing, but WITHOUT covariates, just for comparison
grapp assoc -p covar.example.grg.phen -o covar.example.gwas.tsv gwas.example.grg
Using the column 2 (the last one) for phenotype.
Wrote results to covar.example.gwascov.tsv
Using the column 2 (the last one) for phenotype.
Wrote results to covar.example.gwas.tsv
Now we can examine both results by loading the dataframes into pandas.
import pandas
gwascov_df = pandas.read_csv("covar.example.gwascov.tsv", delimiter="\t")
gwas_df = pandas.read_csv("covar.example.gwas.tsv", delimiter="\t")
gwascov_df
| POS | ALT | REF | COUNT | BETA | SE | T | P | |
|---|---|---|---|---|---|---|---|---|
| 0 | 55829 | G | A | 4 | -37.607441 | 38.817199 | -0.968834 | 0.333828 |
| 1 | 56812 | T | G | 3 | 90.404328 | 41.735642 | 2.166118 | 0.031516 |
| 2 | 57349 | G | T | 1 | -63.659076 | 72.729817 | -0.875282 | 0.382497 |
| 3 | 58785 | T | C | 10 | 10.482727 | 24.477519 | 0.428259 | 0.668935 |
| 4 | 59367 | A | G | 2 | -24.403985 | 52.357712 | -0.466101 | 0.641663 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10888 | 9997601 | G | C | 3 | 51.605426 | 42.111952 | 1.225434 | 0.221890 |
| 10889 | 9998038 | A | C | 21 | 3.935289 | 15.240090 | 0.258220 | 0.796510 |
| 10890 | 9998412 | G | T | 42 | -2.619399 | 11.996761 | -0.218342 | 0.827391 |
| 10891 | 9999031 | C | G | 295 | 3.987359 | 8.705235 | 0.458041 | 0.647433 |
| 10892 | 9999126 | T | A | 2 | -21.656135 | 51.830636 | -0.417825 | 0.676535 |
10893 rows × 8 columns
BETA is the effect size for the variant at base-pair position
POS with alternate allele ALT. We can plot the histogram of our
inferred BETA values and see that it does indeed recover a normal
distribution centered at \(0\), which is what we simulated with our
phenotype.
import seaborn
import matplotlib.pyplot as plt
f, axs = plt.subplots(1, 2, figsize=(12, 5))
ax = seaborn.histplot(data=gwas_df, x="BETA", ax=axs[0])
axs[0].set_title("GWAS $\it{without}$ covariates")
seaborn.histplot(data=gwascov_df, x="BETA", ax=axs[1])
axs[1].set_title("GWAS $\it{with}$ covariates")
Text(0.5, 1.0, 'GWAS $\it{with}$ covariates')
Since we simulated the genetic effects that “cause” the phenotype we’re studying, we know that the underlying BETA distribution is normal and centered at 0. Both of these results show a roughly normal distribution centered at 0, but you can see that the one which corrected for the covariate has a clearer distribution without the “extra” peaks around `-500-.
Using the Python API#
We can just repeat the same thing via Python, as an illustration.
from grapp.assoc import linear_assoc_covar
import pygrgl
GRG_FILE = "gwas.example.grg"
PHEN_FILE = "covar.example.grg.phen"
# Load the GRG into memory
grg = pygrgl.load_immutable_grg(GRG_FILE)
# Load the phenotype into memory
Y = pandas.read_csv(PHEN_FILE, delimiter="\t")
# Perform the GWAS
gwas_df = linear_assoc_covar(grg, Y["phenotypes"].to_numpy(), covar_df.to_numpy())
gwas_df
| POS | ALT | REF | COUNT | BETA | SE | T | P | |
|---|---|---|---|---|---|---|---|---|
| 0 | 55829 | G | A | 4 | -37.607441 | 38.817199 | -0.968834 | 0.333828 |
| 1 | 56812 | T | G | 3 | 90.404328 | 41.735642 | 2.166118 | 0.031516 |
| 2 | 57349 | G | T | 1 | -63.659076 | 72.729817 | -0.875282 | 0.382497 |
| 3 | 58785 | T | C | 10 | 10.482727 | 24.477519 | 0.428259 | 0.668935 |
| 4 | 59367 | A | G | 2 | -24.403985 | 52.357712 | -0.466101 | 0.641663 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10888 | 9997601 | G | C | 3 | 51.605426 | 42.111952 | 1.225434 | 0.221890 |
| 10889 | 9998038 | A | C | 21 | 3.935289 | 15.240090 | 0.258220 | 0.796510 |
| 10890 | 9998412 | G | T | 42 | -2.619399 | 11.996761 | -0.218342 | 0.827391 |
| 10891 | 9999031 | C | G | 295 | 3.987359 | 8.705235 | 0.458041 | 0.647433 |
| 10892 | 9999126 | T | A | 2 | -21.656135 | 51.830636 | -0.417825 | 0.676535 |
10893 rows × 8 columns
We can see that the output format (dataframe) is the same as the command line version. We should expect some differences in the distribution of betas because we standardized \(X\) this time, though the shape should still be a normal distribution centered at \(0\).
seaborn.histplot(data=gwas_df, x="BETA")
<Axes: xlabel='BETA', ylabel='Count'>