GWAS Tutorial: with covariates#

This tutorial builds on the simple GWAS tutorial and the PCA tutorial. Here we show how to perform a GWAS with covariates. In this case, we use the first 10 principal components as the covariates, but covariates can be anything relating to the phenotype \(Y\). Again, this is simple linear regression. We’re using Genotype Representation Graphs to represent our genotype matrix. If you’re not sure how to get a GRG, start with the simple GWAS tutorial instead, since we are re-using the gwas.example.grg from that tutorial.

What you’ll need:

Files: gwas.example.grg
Python dependencies “grapp”, “seaborn”: pip install grapp igdtools seaborn

Simulate Phenotype#

Like the simple GWAS tutorial, we are mostly using the default settings for phenotype simulation, which just draws effect sizes for causal SNPs from a standard normal distribution ~\(N(0, 1)\). However, we are going to add a slight complication: we are going to assume that there is a binary attribute of individuals that affects the phenotype, but is not genetic (or at least, is more environmental than genetic). We can do this by using the grg_pheno_sim Python APIs and specifying a covariate as input to the simulation.

import numpy
import pygrgl

numpy.random.seed(42)

# Load the GRG dataset
grg = pygrgl.load_immutable_grg("gwas.example.grg")

# Generate our random (binary) environment attribute that affects Y
env_attribute = ((numpy.random.uniform(size=(grg.num_individuals, 1)) > 0.5) * 1).astype(numpy.float64)

from grg_pheno_sim.phenotype import add_covariates, convert_to_phen
import pandas

# Now simulate our phenotype, taking into account that our environmental attribute affects it
heritability = 0.33   # h^2

covar = env_attribute
covar_effect = numpy.array([1000.0]) # Effect that the covariate has on the final phenotype value
# add_covariates() takes all the same keyword arguments as sim_phenotypes()
phenotypes = add_covariates(grg, covar, covar_effect, heritability=heritability)
# Write the dataframe to a file format that can be used for GWAS
convert_to_phen(phenotypes, "covar.example.grg.phen", include_header=True)

The initial effect sizes are
       mutation_id  effect_size  causal_mutation_id
              0    -1.657757                   0
              1    -0.537562                   0
              2     1.146189                   0
              3    -0.697705                   0
              4    -0.157002                   0
...            ...          ...                 ...
      10888    -0.217740                   0
      10889     0.936594                   0
      10890     0.976964                   0
      10891     0.274724                   0
      10892     0.711005                   0

[10893 rows x 3 columns]
The genetic values of the individuals are
     individual_id  genetic_value  causal_mutation_id
              0     -62.758200                   0
              1       8.271010                   0
              2     -93.726026                   0
              3     -58.657459                   0
              4     -47.272752                   0
..             ...            ...                 ...
          195     -22.379358                   0
          196      34.579149                   0
          197     -40.854451                   0
          198     -21.653168                   0
          199     -69.561989                   0

[200 rows x 3 columns]

phenotypes

	causal_mutation_id	individual_id	genetic_value	environmental_noise	phenotype	covariate_value
0	0	0	-62.758200	-59.329025	-122.087225	0.0
1	0	1	8.271010	-26.729415	981.541595	1000.0
2	0	2	-93.726026	58.802882	965.076856	1000.0
3	0	3	-58.657459	29.882231	971.224772	1000.0
4	0	4	-47.272752	13.143984	-34.128768	0.0
...	...	...	...	...	...	...
195	0	195	-22.379358	52.827780	30.448422	0.0
196	0	196	34.579149	-45.371223	989.207926	1000.0
197	0	197	-40.854451	103.116600	1062.262149	1000.0
198	0	198	-21.653168	4.893175	983.240007	1000.0
199	0	199	-69.561989	7.776493	938.214504	1000.0

200 rows × 6 columns

Get the top 2 principal components#

The top 10 principal components (those associated with the 10 largest eigenvalues) of the genotype matrix \(X\) are often used to correct for population stratification when performing GWAS. These PCs are the dimensions of most variance across the dataset, which is expected to reflect genetic differences primarily due to individuals being from different ancestral populations (those are the “biggest” genetic differences). Since this dataset is so small, we just use 2 PCs here, which we compute from our GRG.

%%bash
grapp pca -d 2 gwas.example.grg -o gwas.covar.pcs.tsv

Running eigen decomposition on 200 individuals
Wrote PCs to gwas.covar.pcs.tsv

import pandas
# Load our PCs and display the dataframe
pca_df = pandas.read_csv("gwas.covar.pcs.tsv", delimiter="\t")
pca_df

	PC1	PC2
0	0.029621	-0.145431
1	-0.081435	-0.085890
2	-0.009044	0.050044
3	-0.072203	0.006810
4	0.011696	0.049359
...	...	...
195	0.023435	0.030237
196	-0.002435	0.001847
197	-0.154034	0.048480
198	-0.085177	0.052992
199	0.036344	0.092449

200 rows × 2 columns

We want to combine the PCs and our attribute into a single dataframe, which we will use as the covariates input to our GWAS.

attr_df = pandas.DataFrame({"EnvAttr": env_attribute.T[0]})
covar_df = pandas.concat([pca_df, attr_df], axis=1)
covar_df.to_csv("covar.example.covar.tsv", sep="\t", index=False)
covar_df

	PC1	PC2	EnvAttr
0	0.029621	-0.145431	0.0
1	-0.081435	-0.085890	1.0
2	-0.009044	0.050044	1.0
3	-0.072203	0.006810	1.0
4	0.011696	0.049359	0.0
...	...	...	...
195	0.023435	0.030237	0.0
196	-0.002435	0.001847	1.0
197	-0.154034	0.048480	1.0
198	-0.085177	0.052992	1.0
199	0.036344	0.092449	1.0

200 rows × 3 columns

Perform association with covariates#

We want a GWAS that “ignores the effects” of the covariates, which are 10 PCs + one individual attribute in our case.

GWAS via command line#

First we’ll run the shell commands that perform a GWAS.

%%bash

# Using our phenotype file, emit a tab-separated (tsv) pandas dataframe containing the results of our GWAS with covariates
grapp assoc -c covar.example.covar.tsv -p covar.example.grg.phen -o covar.example.gwascov.tsv gwas.example.grg

# Same thing, but WITHOUT covariates, just for comparison
grapp assoc -p covar.example.grg.phen -o covar.example.gwas.tsv gwas.example.grg

Using the column 2 (the last one) for phenotype.

Wrote results to covar.example.gwascov.tsv

Using the column 2 (the last one) for phenotype.

Wrote results to covar.example.gwas.tsv

Now we can examine both results by loading the dataframes into pandas.

import pandas

gwascov_df = pandas.read_csv("covar.example.gwascov.tsv", delimiter="\t")
gwas_df = pandas.read_csv("covar.example.gwas.tsv", delimiter="\t")
gwascov_df

	POS	ALT	REF	COUNT	BETA	SE	T	P
0	55829	G	A	4	-37.607441	38.817199	-0.968834	0.333828
1	56812	T	G	3	90.404328	41.735642	2.166118	0.031516
2	57349	G	T	1	-63.659076	72.729817	-0.875282	0.382497
3	58785	T	C	10	10.482727	24.477519	0.428259	0.668935
4	59367	A	G	2	-24.403985	52.357712	-0.466101	0.641663
...	...	...	...	...	...	...	...	...
10888	9997601	G	C	3	51.605426	42.111952	1.225434	0.221890
10889	9998038	A	C	21	3.935289	15.240090	0.258220	0.796510
10890	9998412	G	T	42	-2.619399	11.996761	-0.218342	0.827391
10891	9999031	C	G	295	3.987359	8.705235	0.458041	0.647433
10892	9999126	T	A	2	-21.656135	51.830636	-0.417825	0.676535

10893 rows × 8 columns

BETA is the effect size for the variant at base-pair position POS with alternate allele ALT. We can plot the histogram of our inferred BETA values and see that it does indeed recover a normal distribution centered at \(0\), which is what we simulated with our phenotype.

import seaborn
import matplotlib.pyplot as plt

f, axs = plt.subplots(1, 2, figsize=(12, 5))
ax = seaborn.histplot(data=gwas_df, x="BETA", ax=axs[0])
axs[0].set_title("GWAS $\it{without}$ covariates")
seaborn.histplot(data=gwascov_df, x="BETA", ax=axs[1])
axs[1].set_title("GWAS $\it{with}$ covariates")

Text(0.5, 1.0, 'GWAS $\it{with}$ covariates')

Since we simulated the genetic effects that “cause” the phenotype we’re studying, we know that the underlying BETA distribution is normal and centered at 0. Both of these results show a roughly normal distribution centered at 0, but you can see that the one which corrected for the covariate has a clearer distribution without the “extra” peaks around `-500-.

Using the Python API#

We can just repeat the same thing via Python, as an illustration.

from grapp.assoc import linear_assoc_covar
import pygrgl

GRG_FILE = "gwas.example.grg"
PHEN_FILE = "covar.example.grg.phen"

# Load the GRG into memory
grg = pygrgl.load_immutable_grg(GRG_FILE)

# Load the phenotype into memory
Y = pandas.read_csv(PHEN_FILE, delimiter="\t")

# Perform the GWAS
gwas_df = linear_assoc_covar(grg, Y["phenotypes"].to_numpy(), covar_df.to_numpy())
gwas_df

	POS	ALT	REF	COUNT	BETA	SE	T	P
0	55829	G	A	4	-37.607441	38.817199	-0.968834	0.333828
1	56812	T	G	3	90.404328	41.735642	2.166118	0.031516
2	57349	G	T	1	-63.659076	72.729817	-0.875282	0.382497
3	58785	T	C	10	10.482727	24.477519	0.428259	0.668935
4	59367	A	G	2	-24.403985	52.357712	-0.466101	0.641663
...	...	...	...	...	...	...	...	...
10888	9997601	G	C	3	51.605426	42.111952	1.225434	0.221890
10889	9998038	A	C	21	3.935289	15.240090	0.258220	0.796510
10890	9998412	G	T	42	-2.619399	11.996761	-0.218342	0.827391
10891	9999031	C	G	295	3.987359	8.705235	0.458041	0.647433
10892	9999126	T	A	2	-21.656135	51.830636	-0.417825	0.676535