GWAS Tutorial: with covariates ============================== This tutorial builds on `the simple GWAS tutorial `__ and `the PCA tutorial `__. Here we show how to perform a GWAS with covariates. In this case, we use the first 10 principal components as the covariates, but covariates can be anything relating to the phenotype :math:`Y`. Again, this is simple linear regression. We’re using `Genotype Representation Graphs `__ to represent our genotype matrix. If you’re not sure how to get a GRG, start with `the simple GWAS tutorial `__ instead, since we are re-using the ``gwas.example.grg`` from that tutorial. **What you’ll need:** - Files: ``gwas.example.grg`` - Python dependencies “grapp”, “seaborn”: ``pip install grapp igdtools seaborn`` Simulate Phenotype ------------------ Like the simple GWAS tutorial, we are mostly using the default settings for phenotype simulation, which just draws effect sizes for causal SNPs from a standard normal distribution ~\ :math:`N(0, 1)`. However, we are going to add a slight complication: we are going to assume that there is a binary attribute of individuals that affects the phenotype, but is not genetic (or at least, is more environmental than genetic). We can do this by using the `grg_pheno_sim Python APIs `__ and specifying a covariate as input to the simulation. .. code:: ipython3 import numpy import pygrgl numpy.random.seed(42) # Load the GRG dataset grg = pygrgl.load_immutable_grg("gwas.example.grg") # Generate our random (binary) environment attribute that affects Y env_attribute = ((numpy.random.uniform(size=(grg.num_individuals, 1)) > 0.5) * 1).astype(numpy.float64) .. code:: ipython3 from grg_pheno_sim.phenotype import add_covariates, convert_to_phen import pandas # Now simulate our phenotype, taking into account that our environmental attribute affects it heritability = 0.33 # h^2 covar = env_attribute covar_effect = numpy.array([1000.0]) # Effect that the covariate has on the final phenotype value # add_covariates() takes all the same keyword arguments as sim_phenotypes() phenotypes = add_covariates(grg, covar, covar_effect, heritability=heritability) # Write the dataframe to a file format that can be used for GWAS convert_to_phen(phenotypes, "covar.example.grg.phen", include_header=True) .. parsed-literal:: The initial effect sizes are mutation_id effect_size causal_mutation_id 0 0 -1.657757 0 1 1 -0.537562 0 2 2 1.146189 0 3 3 -0.697705 0 4 4 -0.157002 0 ... ... ... ... 10888 10888 -0.217740 0 10889 10889 0.936594 0 10890 10890 0.976964 0 10891 10891 0.274724 0 10892 10892 0.711005 0 [10893 rows x 3 columns] The genetic values of the individuals are individual_id genetic_value causal_mutation_id 0 0 -62.758200 0 1 1 8.271010 0 2 2 -93.726026 0 3 3 -58.657459 0 4 4 -47.272752 0 .. ... ... ... 195 195 -22.379358 0 196 196 34.579149 0 197 197 -40.854451 0 198 198 -21.653168 0 199 199 -69.561989 0 [200 rows x 3 columns] .. code:: ipython3 phenotypes .. raw:: html
causal_mutation_id individual_id genetic_value environmental_noise phenotype covariate_value
0 0 0 -62.758200 -59.329025 -122.087225 0.0
1 0 1 8.271010 -26.729415 981.541595 1000.0
2 0 2 -93.726026 58.802882 965.076856 1000.0
3 0 3 -58.657459 29.882231 971.224772 1000.0
4 0 4 -47.272752 13.143984 -34.128768 0.0
... ... ... ... ... ... ...
195 0 195 -22.379358 52.827780 30.448422 0.0
196 0 196 34.579149 -45.371223 989.207926 1000.0
197 0 197 -40.854451 103.116600 1062.262149 1000.0
198 0 198 -21.653168 4.893175 983.240007 1000.0
199 0 199 -69.561989 7.776493 938.214504 1000.0

200 rows × 6 columns

Get the top 2 principal components ---------------------------------- The top 10 principal components (those associated with the 10 largest eigenvalues) of the genotype matrix :math:`X` are often used to correct for population stratification when performing GWAS. These PCs are the dimensions of most variance across the dataset, which is expected to reflect genetic differences primarily due to individuals being from different ancestral populations (those are the “biggest” genetic differences). Since this dataset is so small, we just use 2 PCs here, which we compute from our GRG. .. code:: bash %%bash grapp pca -d 2 gwas.example.grg -o gwas.covar.pcs.tsv .. parsed-literal:: Running eigen decomposition on 200 individuals Wrote PCs to gwas.covar.pcs.tsv .. code:: ipython3 import pandas # Load our PCs and display the dataframe pca_df = pandas.read_csv("gwas.covar.pcs.tsv", delimiter="\t") pca_df .. raw:: html
PC1 PC2
0 0.029621 -0.145431
1 -0.081435 -0.085890
2 -0.009044 0.050044
3 -0.072203 0.006810
4 0.011696 0.049359
... ... ...
195 0.023435 0.030237
196 -0.002435 0.001847
197 -0.154034 0.048480
198 -0.085177 0.052992
199 0.036344 0.092449

200 rows × 2 columns

We want to combine the PCs and our attribute into a single dataframe, which we will use as the covariates input to our GWAS. .. code:: ipython3 attr_df = pandas.DataFrame({"EnvAttr": env_attribute.T[0]}) covar_df = pandas.concat([pca_df, attr_df], axis=1) covar_df.to_csv("covar.example.covar.tsv", sep="\t", index=False) covar_df .. raw:: html
PC1 PC2 EnvAttr
0 0.029621 -0.145431 0.0
1 -0.081435 -0.085890 1.0
2 -0.009044 0.050044 1.0
3 -0.072203 0.006810 1.0
4 0.011696 0.049359 0.0
... ... ... ...
195 0.023435 0.030237 0.0
196 -0.002435 0.001847 1.0
197 -0.154034 0.048480 1.0
198 -0.085177 0.052992 1.0
199 0.036344 0.092449 1.0

200 rows × 3 columns

Perform association with covariates ----------------------------------- We want a GWAS that “ignores the effects” of the covariates, which are 10 PCs + one individual attribute in our case. GWAS via command line ~~~~~~~~~~~~~~~~~~~~~ First we’ll run the shell commands that perform a GWAS. .. code:: bash %%bash # Using our phenotype file, emit a tab-separated (tsv) pandas dataframe containing the results of our GWAS with covariates grapp assoc -c covar.example.covar.tsv -p covar.example.grg.phen -o covar.example.gwascov.tsv gwas.example.grg # Same thing, but WITHOUT covariates, just for comparison grapp assoc -p covar.example.grg.phen -o covar.example.gwas.tsv gwas.example.grg .. parsed-literal:: Using the column 2 (the last one) for phenotype. .. parsed-literal:: Wrote results to covar.example.gwascov.tsv .. parsed-literal:: Using the column 2 (the last one) for phenotype. .. parsed-literal:: Wrote results to covar.example.gwas.tsv Now we can examine both results by loading the dataframes into pandas. .. code:: ipython3 import pandas gwascov_df = pandas.read_csv("covar.example.gwascov.tsv", delimiter="\t") gwas_df = pandas.read_csv("covar.example.gwas.tsv", delimiter="\t") gwascov_df .. raw:: html
POS ALT REF COUNT BETA SE T P
0 55829 G A 4 -37.607441 38.817199 -0.968834 0.333828
1 56812 T G 3 90.404328 41.735642 2.166118 0.031516
2 57349 G T 1 -63.659076 72.729817 -0.875282 0.382497
3 58785 T C 10 10.482727 24.477519 0.428259 0.668935
4 59367 A G 2 -24.403985 52.357712 -0.466101 0.641663
... ... ... ... ... ... ... ... ...
10888 9997601 G C 3 51.605426 42.111952 1.225434 0.221890
10889 9998038 A C 21 3.935289 15.240090 0.258220 0.796510
10890 9998412 G T 42 -2.619399 11.996761 -0.218342 0.827391
10891 9999031 C G 295 3.987359 8.705235 0.458041 0.647433
10892 9999126 T A 2 -21.656135 51.830636 -0.417825 0.676535

10893 rows × 8 columns

``BETA`` is the effect size for the variant at base-pair position ``POS`` with alternate allele ``ALT``. We can plot the histogram of our inferred ``BETA`` values and see that it does indeed recover a normal distribution centered at :math:`0`, which is what we simulated with our phenotype. .. code:: ipython3 import seaborn import matplotlib.pyplot as plt f, axs = plt.subplots(1, 2, figsize=(12, 5)) ax = seaborn.histplot(data=gwas_df, x="BETA", ax=axs[0]) axs[0].set_title("GWAS $\it{without}$ covariates") seaborn.histplot(data=gwascov_df, x="BETA", ax=axs[1]) axs[1].set_title("GWAS $\it{with}$ covariates") .. parsed-literal:: Text(0.5, 1.0, 'GWAS $\\it{with}$ covariates') .. image:: GWASCovariates_files/GWASCovariates_15_1.png Since we simulated the genetic effects that “cause” the phenotype we’re studying, we know that the underlying BETA distribution is normal and centered at 0. Both of these results show a roughly normal distribution centered at 0, but you can see that the one which corrected for the covariate has a clearer distribution without the “extra” peaks around \`-500-. Using the Python API ~~~~~~~~~~~~~~~~~~~~ We can just repeat the same thing via Python, as an illustration. .. code:: ipython3 from grapp.assoc import linear_assoc_covar import pygrgl GRG_FILE = "gwas.example.grg" PHEN_FILE = "covar.example.grg.phen" # Load the GRG into memory grg = pygrgl.load_immutable_grg(GRG_FILE) # Load the phenotype into memory Y = pandas.read_csv(PHEN_FILE, delimiter="\t") # Perform the GWAS gwas_df = linear_assoc_covar(grg, Y["phenotypes"].to_numpy(), covar_df.to_numpy()) gwas_df .. raw:: html
POS ALT REF COUNT BETA SE T P
0 55829 G A 4 -37.607441 38.817199 -0.968834 0.333828
1 56812 T G 3 90.404328 41.735642 2.166118 0.031516
2 57349 G T 1 -63.659076 72.729817 -0.875282 0.382497
3 58785 T C 10 10.482727 24.477519 0.428259 0.668935
4 59367 A G 2 -24.403985 52.357712 -0.466101 0.641663
... ... ... ... ... ... ... ... ...
10888 9997601 G C 3 51.605426 42.111952 1.225434 0.221890
10889 9998038 A C 21 3.935289 15.240090 0.258220 0.796510
10890 9998412 G T 42 -2.619399 11.996761 -0.218342 0.827391
10891 9999031 C G 295 3.987359 8.705235 0.458041 0.647433
10892 9999126 T A 2 -21.656135 51.830636 -0.417825 0.676535

10893 rows × 8 columns

We can see that the output format (dataframe) is the same as the command line version. We should expect some differences in the distribution of betas because we standardized :math:`X` this time, though the shape should still be a normal distribution centered at :math:`0`. .. code:: ipython3 seaborn.histplot(data=gwas_df, x="BETA") .. parsed-literal:: .. image:: GWASCovariates_files/GWASCovariates_20_1.png Related Topics -------------- - See `GWAS `__ and `PCA `__ for for details. - See `Simulating Phenotypes `__ for a more details on generating synthetic phenotypes. - Documentation links: - `grapp.assoc `__: Python APIs for GWAS on GRG - `grg_pheno_sim.phenotype `__: Python APIs for phenotype simulation on GRG