GWAS Tutorial ============= This tutorial describes how to perform a GWAS with simple linear regression, where each SNP is treated independently when determining its effect on the phenotype. Lets consider our input data to be a genotype matrix :math:`X` which has :math:`N` rows (number of individuals) and :math:`M` columns (number of variants, usually SNPs). We’re using `Genotype Representation Graphs `__ to represent our genotype matrix. If you have ``.vcf.gz`` data, see `Converting .vcf.gz to GRG `__. If you have (or want) a more efficicient representation than ``.vcf.gz`` see `Converting IGD to GRG `__. For this tutorial, we’ll just use a (really) small simulated dataset of 200 individuals that we can quickly download. Replace it with your own dataset as desired. The other piece of information we need for performing a GWAS is the phenotype, :math:`Y`. We are just going to simulate a phenotype for this tutorial, but replace it with your own phenotype as desired. **What you’ll need:** - Python dependencies “grapp”, “igdtools”, “seaborn”: ``pip install grapp igdtools seaborn`` - Command line tool “wget”: ``sudo apt install wget`` (or your system’s equivalent) Get Dataset ~~~~~~~~~~~ .. code:: bash %%bash if [[ ! -e gwas.example.igd ]]; then # Download a small example dataset wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O gwas.example.vcf.gz # Convert to IGD; this isn't necessary, but most of the time you will want to do this igdtools gwas.example.vcf.gz -o gwas.example.igd fi # Just show some stats about the dataset igdtools -s gwas.example.igd .. parsed-literal:: Stats for gwas.example.igd ... in range 0 - 18446744073709551615 Variants in range: 10893 Average samples/var: 50.2075 Stddev samples/var: 86.1985 Average var/sample: 1367.28 Stddev var/sample: 25.9211 Variants with missing data: 0 Total missing alleles: 0 Total unique sites: 10885 Convert to GRG -------------- .. code:: bash %%bash if [[ ! -e gwas.example.grg ]]; then # -j controls how many threads to use. grg construct -j 1 gwas.example.igd -o gwas.example.grg fi Simulate Phenotype ------------------ Since this phenotype is just for demonstration purposes, we can use the default settings for phenotype simulation. See `Simulating Phenotypes `__ for a more detailed tutorial that describes the range of options. Here, we just want a phenotype that has been derived from the genotype: that is, our simulation will choose some “causal mutations” that are responsible for some proportion of the phenotype (this is called the heritability), and compute for each individual a phenotype value from the actual mutations the individual has. By default, phenotype simulation draws effect sizes for causal SNPs from a standard normal distribution ~\ :math:`N(0, 1)`. .. code:: bash %%bash grapp pheno gwas.example.grg .. parsed-literal:: The initial effect sizes are mutation_id effect_size causal_mutation_id 0 0 -1.657757 0 1 1 -0.537562 0 2 2 1.146189 0 3 3 -0.697705 0 4 4 -0.157002 0 ... ... ... ... 10888 10888 -0.217740 0 10889 10889 0.936594 0 10890 10890 0.976964 0 10891 10891 0.274724 0 10892 10892 0.711005 0 [10893 rows x 3 columns] The genetic values of the individuals are individual_id genetic_value causal_mutation_id 0 0 -62.758200 0 1 1 8.271010 0 2 2 -93.726026 0 3 3 -58.657459 0 4 4 -47.272752 0 .. ... ... ... 195 195 -22.379358 0 196 196 34.579149 0 197 197 -40.854451 0 198 198 -21.653168 0 199 199 -69.561989 0 [200 rows x 3 columns] Wrote phenotypes to gwas.example.grg.phen Perform association between genotype and phenotype -------------------------------------------------- Now we want to learn the association between the genotype :math:`X` and the phenotype :math:`Y`. We are using simple linear regression to do this. The result is a value :math:`\beta_i` for each mutation :math:`i`: a larger absolute value indicates a larger effect of that mutation (variant) on the phenotype. We are finding *correlations* between :math:`X` and :math:`Y`. A variant :math:`i` that is highly correlated with a phenotype :math:`Y` *might* have a causal relationship to :math:`Y`. GWAS via command line ~~~~~~~~~~~~~~~~~~~~~ First we’ll run the shell commands that perform a GWAS. .. code:: bash %%bash # Using our phenotype file, emit a tab-separated (tsv) pandas dataframe containing the results of our GWAS between X (test-200-samples.grg) and Y (test-200-samples.grg.phen) grapp assoc -p gwas.example.grg.phen -o gwas.example.gwas.tsv gwas.example.grg .. parsed-literal:: Using the column 2 (the last one) for phenotype. .. parsed-literal:: Wrote results to gwas.example.gwas.tsv Now we can examine the results by loading the dataframe into pandas. .. code:: ipython3 import pandas gwas_df = pandas.read_csv("gwas.example.gwas.tsv", delimiter="\t") gwas_df .. raw:: html
POS ALT REF COUNT BETA B0 SE R2 T P
0 55829 G A 4 -0.512129 0.010243 0.505040 0.005166 -1.014035 0.311804
1 56812 T G 3 0.794622 -0.011919 0.580456 0.009376 1.368961 0.172562
2 57349 G T 1 -1.466317 0.007332 0.999621 0.010750 -1.466873 0.143997
3 58785 T C 10 -0.358335 0.017917 0.324263 0.006130 -1.105077 0.270467
4 59367 A G 2 -0.118437 0.001184 0.712412 0.000140 -0.166248 0.868131
... ... ... ... ... ... ... ... ... ... ...
10888 9997601 G C 3 0.113106 -0.001697 0.583141 0.000190 0.193959 0.846407
10889 9998038 A C 21 -0.072860 0.007650 0.209914 0.000608 -0.347095 0.728889
10890 9998412 G T 42 0.068393 -0.014362 0.164342 0.000874 0.416160 0.677743
10891 9999031 C G 295 -0.015402 0.022718 0.116634 0.000088 -0.132055 0.895075
10892 9999126 T A 2 1.195692 -0.011957 0.707376 0.014225 1.690320 0.092540

10893 rows × 10 columns

``BETA`` is the effect size for the variant at base-pair position ``POS`` with alternate allele ``ALT``. We can plot the histogram of our inferred ``BETA`` values and see that it does indeed recover a normal distribution centered at :math:`0`, which is what we simulated with our phenotype. .. code:: ipython3 import seaborn seaborn.histplot(data=gwas_df, x="BETA") .. parsed-literal:: .. image:: GWAS_files/GWAS_12_1.png GWAS via Python APIs ~~~~~~~~~~~~~~~~~~~~ Now we do the same GWAS, but using Python code instead of running ``grapp assoc ...`` on the command line. This gives us more options. In this example, we’ll standardize the genotype matrix :math:`X` prior to performing the linear regression, which will make all mutations (variants) have the same variance. This has the effect of putting variants with different allele frequencies on the same scale. .. code:: ipython3 from grapp.assoc import linear_assoc_no_covar import pygrgl GRG_FILE = "gwas.example.grg" PHEN_FILE = "gwas.example.grg.phen" # Load the GRG into memory grg = pygrgl.load_immutable_grg(GRG_FILE) # Load the phenotype into memory Y = pandas.read_csv(PHEN_FILE, delimiter="\t") # Perform the GWAS gwas_df = linear_assoc_no_covar(grg, Y["phenotypes"].to_numpy(), standardize=True) gwas_df .. raw:: html
POS ALT REF COUNT BETA B0 SE R2 T P
0 55829 G A 4 -0.071335 0.001427 0.070707 0.005116 -1.008875 0.314266
1 56812 T G 3 0.096223 -0.001443 0.070558 0.009307 1.363732 0.174200
2 57349 G T 1 -0.103295 0.000516 0.070508 0.010724 -1.465014 0.144503
3 58785 T C 10 -0.077090 0.003854 0.070676 0.005988 -1.090740 0.276713
4 59367 A G 2 -0.011755 0.000118 0.070884 0.000139 -0.165830 0.868460
... ... ... ... ... ... ... ... ... ... ...
10888 9997601 G C 3 0.013696 -0.000205 0.070882 0.000189 0.193225 0.846981
10889 9998038 A C 21 -0.026328 0.002764 0.070864 0.000704 -0.371527 0.710643
10890 9998412 G T 42 0.029327 -0.006159 0.070857 0.000903 0.413891 0.679402
10891 9999031 C G 295 -0.009143 0.013486 0.070880 0.000267 -0.128993 0.897494
10892 9999126 T A 2 0.118671 -0.001187 0.070386 0.014155 1.686008 0.093369

10893 rows × 10 columns

We can see that the output format (dataframe) is the same as the command line version. We should expect some differences in the distribution of betas because we standardized :math:`X` this time, though the shape should still be a normal distribution centered at :math:`0`. .. code:: ipython3 seaborn.histplot(data=gwas_df, x="BETA") .. parsed-literal:: .. image:: GWAS_files/GWAS_16_1.png You can see that the range of beta values is much smaller (from :math:`-0.3` to :math:`+0.3`). Finally, we can perform one more GWAS where we use an approximate distribution for the variance of each mutation in :math:`X`. Each value in the :math:`X` matrix can take on a value of :math:`0`, :math:`1`, or :math:`2`, based on how many copies each of the (diploid) individuals has for that allele. By default, the linear regression is performed with the exact sum of squared errors, based on these diploid values in :math:`X`. Theoretically, in the absence of many complicating effects (selection, genotyping error, inbreeding, etc.) the variance of each mutation should follow a binomial distribution with mean :math:`2 \times f_i` (where :math:`f_i` is the allele frequency for mutation :math:`i`). You can choose to perform the regression using the binomial distribution, instead of the exact values, via ``dist="binomial"``, as shown below. .. code:: ipython3 gwas_df = linear_assoc_no_covar(grg, Y["phenotypes"].to_numpy(), dist="binomial") seaborn.histplot(data=gwas_df, x="BETA") .. parsed-literal:: .. image:: GWAS_files/GWAS_18_1.png You can see that this (non-standardized) distribution follows our original (exact) result pretty closely, even with a small sample size of 200 individuals. Partially this is because our genotype matrix :math:`X` was generated using a neutral simulation; in real datasets, the binomial distribution may cause more beta values to differ from the exact value than is shown here. Related Topics -------------- - In practice, GWAS is often performed while handling covariates (such as age, sex, principal components, etc). See `GWAS with Covariates `__. - See `Simulating Phenotypes `__ for a more details on generating synthetic phenotypes. - Documentation links: - `grapp.assoc `__: Python APIs for GWAS