Working with Real Data ====================== In this tutorial we’re going to download publicly available `1,000 Genomes Project `__ data and examine it with GRG. The goal is just to illustrate the ease with which you can work with such datasets. We also illustrate how to include population information in GRG for a real dataset. **What is demonstrated:** - Building a GRG from real data, including population information - Using GRG matrix multiplication to calculate population-specific allele frequencies - Using `grapp `__ to filter information out of a GRG **What you’ll need:** - Python dependencies “grapp”, “seaborn”: ``pip install grapp seaborn`` - Command line tools “wget”, “tabix”: ``sudo apt install wget tabix`` (or your system’s equivalent) First, download the data. This can take 10-15 minutes, depending on your internet connection (the file is roughly 425MB) .. code:: bash %%bash REMOTE_FILE="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz" if [[ ! -e kgp.chr22.vcf.gz ]]; then wget ${REMOTE_FILE} --progress=dot:mega -O kgp.chr22.vcf.gz tabix kgp.chr22.vcf.gz fi echo "Size of input file:" du -hs kgp.chr22.vcf.gz .. parsed-literal:: Size of input file: 426M kgp.chr22.vcf.gz Build the GRG ------------- GRG can be built from either `IGD `__ or directly from ``.vcf.gz``. IGD is generally smaller and faster than VCF, so if you want to manipulate this data repeatedly (outside of GRG), then convert to IGD first. Otherwise, you can just do a one-time conversion from ``.vcf.gz`` to GRG. We’ll just convert from the VCF file. Before we do that, let’s prepare information that maps each sample in the dataset to a population label, based on metadata provided from the 1,000 Genomes Project. For population information, GRG needs a tab-separated file with a header. We construct the GRG with the filename and the names of two columns: the column containing the sample identifier (individual ID), and the column containing the population label. 1,000 Genomes data has both “population” and “super-population” labels in `their data browser `__. We’ll use the “population” labels since they are easier to download from the FTP site. .. code:: bash %%bash if [[ ! -e 1000G_2504_high_coverage.sequence.index ]]; then # Two files: one for the original 2504 samples, and one for the additional 698 samples. wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_2504_high_coverage.sequence.index --progress=dot:mega wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_698_related_high_coverage.sequence.index --progress=dot:mega fi # Combine the two files into a single metadata file cat 1000G_2504_high_coverage.sequence.index 1000G_698_related_high_coverage.sequence.index > kgp.combined_metadata.tsv # Show the format head -n 25 kgp.combined_metadata.tsv .. parsed-literal:: ##FileDate=20190503 ##ENA_FILE_PATH=path to ENA file on ENA ftp site ##MD5=md5sum of file ##RUN_ID=SRA/ERA run accession ##STUDY_ID=SRA/ERA study accession ##STUDY_NAME=Name of study ##CENTER_NAME=Submission centre name ##SUBMISSION_ID=SRA/ERA submission accession ##SUBMISSION_DATE=Date sequence submitted, YYYY-MM-DD ##SAMPLE_ID=SRA/ERA sample accession ##SAMPLE_NAME=Sample name ##POPULATION=Sample population. Further information may be available with the data collection. ##EXPERIMENT_ID=Experiment accession ##INSTRUMENT_PLATFORM=Type of sequencing machine ##INSTRUMENT_MODEL=Model of sequencing machine ##LIBRARY_NAME=Library name ##RUN_NAME=Name of machine run ##INSERT_SIZE=Submitter specifed insert size/paired nominal length ##LIBRARY_LAYOUT=Library layout, this can be either PAIRED or SINGLE ##PAIRED_FASTQ=Name of mate pair file if exists (Runs with failed mates will have a library layout of PAIRED but no paired fastq file) ##READ_COUNT=Read count for the file ##BASE_COUNT=Basepair count for the file ##ANALYSIS_GROUP=Analysis group is used to identify groups, or sets, of data. Further information may be available with the data collection. #ENA_FILE_PATH MD5SUM RUN_ID STUDY_ID STUDY_NAME CENTER_NAME SUBMISSION_ID SUBMISSION_DATE SAMPLE_ID SAMPLE_NAME POPULATION EXPERIMENT_ID INSTRUMENT_PLATFORM INSTRUMENT_MODEL LIBRARY_NAME RUN_NAME INSERT_SIZE LIBRARY_LAYOUT PAIRED_FASTQ READ_COUNT BASE_COUNT ANALYSIS_GROUP ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239480/NA12718.final.cram 923ca8ff7d4cd65fddd28e855e5f173d ERR3239480 ERP114329 30X whole genome sequencing coverage of the 2504 Phase 3 1000 Genome samples. NYGC ERA1783081 2019-03-22 00:00 SRS000632 NA12718 CEU ERX3266855 ILLUMINA Illumina NovaSeq 6000 NA12718 ena-RUN-NYGC-22-03-2019-12:30:23:667-205 450 PAIRED 753247448 112987117200 high_cov The file is a bit messy for GRG to try to use as is (especially since we concatenated the two files without removing the headers). So below we use some Python code to make it cleaner. .. code:: ipython3 # List of (sample, population) pairs that we will output for use with GRG sample_pops = [] with open("kgp.combined_metadata.tsv") as f: for line in f: line = line.rstrip("\n").rstrip("\r") # Skip comments if line.startswith("#"): continue data = line.split("\t") sample_pops.append( (data[9].strip(), data[10].strip()) ) with open("kgp.popmap.tsv", "w") as fout: fout.write("SAMPLE\tPOPULATION\n") # Header fout.write("\n".join(["\t".join(s) for s in sample_pops])) Now we can use the ``--population-ids`` flag to ``grg construct`` to incorporate this information in the GRG. This is mostly a convenience, we could just create the GRG without doing any of this population mapping. The argument ``"kgp.popmap.tsv:SAMPLE:POPULATION"`` can be interpreted as ``::``. Building this GRG takes about 4 minutes (with 4 threads) on my laptop. .. code:: bash %%bash if [[ ! -e kgp.chr22.grg ]]; then grg construct -j 4 --population-ids "kgp.popmap.tsv:SAMPLE:POPULATION" kgp.chr22.vcf.gz -o kgp.chr22.grg fi Examine GRG and allele frequencies ---------------------------------- Now we can load the GRG and examine the information in it. As an example, lets compute the allele frequencies for just a few of the populations that have a large number of samples (more than 200 haplotypes) in the dataset. We’ll compute these per-population and then display them. For more complex computations on GRG, see `LinearOperators `__, `PCA `__, or `GWAS `__. .. code:: ipython3 import pygrgl grg = pygrgl.load_immutable_grg("kgp.chr22.grg", load_up_edges=False) Something really simple we can do is look at the individual identifiers for each individual in the GRG. Not all GRGs have individual IDs: only if the original data had the IDs. Here we just show the first and last individual. .. code:: ipython3 print(f"First: {grg.get_individual_id(0)}") print(f"Last: {grg.get_individual_id(grg.num_individuals - 1)}") .. parsed-literal:: First: HG00096 Last: NA21144 We want to use `GRG matrix multiplication `__ later on to compute the per-population allele frequencies. So we will: \* Find all :math:`K` populations with at least 200 haplotypes \* Create a :math:`K \times 2N` matrix (where :math:`N` is the number of diploid individuals) :math:`A` that represents our labeled samples \* If haplotype sample :math:`i` has population label :math:`j`, then :math:`A_{j, i} = 1` \* Given our (haploid) genotype matrix :math:`X` (as represented by the GRG), we can get the population specific allele counts via :math:`A \times X` .. code:: ipython3 from collections import defaultdict import numpy # Find populations with at least 100 haplotypes labels = grg.get_populations() pop_mapping = defaultdict(list) for sample in range(grg.num_samples): pop_label = labels[grg.get_population_id(sample)] pop_mapping[pop_label].append(sample) # Get labels and samples for populations exceeding our threshold K_pops = list(filter(lambda item: len(item[1]) >= 200, pop_mapping.items())) # Create the matrix K = len(K_pops) A = numpy.zeros( (K, grg.num_samples), dtype=numpy.int32 ) # Populate the matrix for j, (pop_label, sample_list) in enumerate(K_pops): print(f"{pop_label} has {len(sample_list)} haplotypes") A[j, sample_list] = 1 .. parsed-literal:: CHS has 326 haplotypes PUR has 278 haplotypes CLM has 264 haplotypes IBS has 314 haplotypes PEL has 244 haplotypes PJL has 292 haplotypes KHV has 244 haplotypes ACB has 232 haplotypes GWD has 356 haplotypes ESN has 298 haplotypes BEB has 262 haplotypes STU has 228 haplotypes ITU has 214 haplotypes CEU has 358 haplotypes YRI has 356 haplotypes CHB has 206 haplotypes JPT has 208 haplotypes TSI has 214 haplotypes GIH has 206 haplotypes Now we compute the allele counts. The frequency will just be dividing the counts by :math:`N_j`, the number of haploptypes in each population. We’re going to use `pygrgl.matmul `__ to do the work for us. *Notice*: we could also do our own graph traversal to compute these values, but ``matmul()`` is much faster (it does the traversal with optimized C++ code). .. code:: ipython3 N = [len(sample_list) for pop_label, sample_list in K_pops] # The "direction" tells us whether we are summing values from the leaves-to-roots (UP) or roots-to-leaves (DOWN). # These are equivalent to multiplication against X or X^T. # Here we have an input that is in terms of _samples_, which are the leaves, so we need to start at the samples # and sum the values until we reach the mutations (UP). allele_counts = pygrgl.matmul(grg, A, pygrgl.TraversalDirection.UP) frequencies = allele_counts / numpy.array([N]).T We’ll pick the column from the data that has an above average sum and plot it. .. code:: ipython3 import seaborn import pandas # Sum by column total_snp_freqs = numpy.sum(frequencies[:, ], axis=0) # Find the maximum column sum values avg = numpy.mean(total_snp_freqs) # Get the first index that has the max value snp_index = numpy.where(total_snp_freqs >= avg)[0][0] mutation = grg.get_mutation_by_id(snp_index) # Get the Mutation object # Now plot the frequencies at that index df = pandas.DataFrame({ "SNP": [snp_index] * K, "Frequency": frequencies[:, snp_index], "Pop Label": [label for label, smp in K_pops], }) ax = seaborn.barplot(data=df, x="Pop Label", y="Frequency", hue="Pop Label", palette="Dark2") ax.set_title(f"SN: {mutation.position}, {mutation.allele}") ax.set_xticks(ax.get_xticks()) ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right') ax.set_xlabel(None) .. parsed-literal:: Text(0.5, 0, '') .. image:: WorkingWithRealData_files/WorkingWithRealData_19_1.png Filter the GRG -------------- Now we illustrate the types of filtering that are available. At the basic level GRG can be filtered in two ways: 1. By removing a subset of samples. 2. By removing a subset of mutations. If you want to do both, then you need to do it sequentially: e.g., first remove mutations and then samples (or vise versa). The core library for filtering is `pygrgl.save_subset `__, which is very flexible and useful for calling from Python code. However, here we are going to demonstrate the filtering available in the `grapp command line `__, which implements a bunch of common filtering scenarios for you (and calls ``save_subset()`` under the hood). Filter by sample ---------------- First, we’ll filter the GRG down to just the samples that we used in our plot above. We’ll use ``grapp filter --hap-samples `` to do this. In our case, we know that we are actually filtering by individuals; even though the population information is stored at the haplotype level, the way we generated the GRG means that we will only have pairs of haplotypes that belond to individuals. In other words: the haplotype-to-diploid mapping is *maintained* by this filter. .. code:: ipython3 # Save our samples to a list all_samples = [] for pop_label, sample_list in K_pops: all_samples.extend(sample_list) with open("haplotypes.txt", "w") as fout: fout.write("\n".join(map(str, all_samples))) print(f"Wrote {len(all_samples)} samples to haplotypes.txt") .. parsed-literal:: Wrote 5100 samples to haplotypes.txt .. code:: bash %%bash # Filter the GRG. This should be pretty fast (seconds). grapp filter --hap-samples haplotypes.txt kgp.chr22.grg kgp.chr22.hap_filtered.grg grg process stats kgp.chr22.hap_filtered.grg .. parsed-literal:: Keeping 5100 haplotypes .. parsed-literal:: Construction took 119 ms .. parsed-literal:: === GRG Statistics === Nodes: 2219044 Edges: 18011864 Samples: 5100 Mutations: 1066557 Ploidy: 2 Phased: true Populations: 26 Range of mutations: 10519265 - 50808015 Specified range: 0 - 0 ====================== Notice that ``Ploidy: 2`` above was retained: this is because we filtered the haplotypes in a way that kept their pairing between diploid and haplotype. If we filter a GRG by haplotype and do not maintain this (e.g., if we keep the odd-numbered haplotypes only) then the GRG will become ``Ploidy: 1``. Another way to do the above filtering is to just specify what populations we want! We can use ``grapp filter --populations`` for that. For both of these commands, you can either specify a filename or a comma-separated list of items to keep. .. code:: bash %%bash # Filter the GRG. This should be pretty fast (seconds). grapp filter --populations "CHS,PUR,CLM,IBS,PEL,PJL,KHV,ACB,GWD,ESN,BEB,STU,ITU,CEU,YRI,CHB,JPT,TSI,GIH" kgp.chr22.grg kgp.chr22.pop_filtered.grg grg process stats kgp.chr22.pop_filtered.grg .. parsed-literal:: Keeping 5100 haplotypes .. parsed-literal:: Construction took 116 ms .. parsed-literal:: === GRG Statistics === Nodes: 2219044 Edges: 18011864 Samples: 5100 Mutations: 1066557 Ploidy: 2 Phased: true Populations: 26 Range of mutations: 10519265 - 50808015 Specified range: 0 - 0 ====================== Filter by Mutation ------------------ There are a bunch of options for filtering by Mutations. :: mutation filters: -r RANGE, --range RANGE Keep only the variants within the given range, in base pairs. Example: "lower-upper", where both are integers and lower is inclusive, upper is exclusive. -c MIN_AC, --min-ac MIN_AC Minimum allele count to keep. All Mutations with count below this value will be dropped -C MAX_AC, --max-ac MAX_AC Maximum allele count to keep. All Mutations with count above this value will be dropped -q MIN_AF, --min-af MIN_AF Minimum allele frequency to keep. All Mutations with frequency below this value will be dropped -Q MAX_AF, --max-af MAX_AF Maximum allele frequency to keep. All Mutations with frequency above this value will be dropped These filters can be combined together in a single command. Here we’ll illustrate keeping only mutations with the range ``20000000-25000000`` that have an allele count of at least 20. .. code:: bash %%bash # Filter the GRG. This should be pretty fast (seconds). grapp filter -r "20000000-25000000" -c 20 kgp.chr22.pop_filtered.grg kgp.chr22.mut_filtered.grg grg process stats kgp.chr22.mut_filtered.grg .. parsed-literal:: Kept 47945 variants. Dropped 1018612 variants .. parsed-literal:: Construction took 7 ms .. parsed-literal:: === GRG Statistics === Nodes: 274295 Edges: 2476134 Samples: 5100 Mutations: 47945 Ploidy: 2 Phased: true Populations: 26 Range of mutations: 20000241 - 24999768 Specified range: 20000000 - 25000000 ====================== .. code:: ipython3 # Lets load the graph and check the minimum allele counts! from grapp.util.simple import allele_counts smaller_grg = pygrgl.load_immutable_grg("kgp.chr22.mut_filtered.grg") mut_counts = allele_counts(smaller_grg) print(f"Minimum allele count: {numpy.min(mut_counts)}") .. parsed-literal:: Minimum allele count: 20 Related Topics -------------- - More specific applications for GRG to real data are `GWAS `__, `PCA `__. - Documentation links: - `pygrgl `__: The full Python API for GRGL, the low-level operations you can perform on a GRG. - `grapp.util `__: The Python APIs for utility functions, like counting alleles, filtering, etc.