Working with Real Data#

In this tutorial we’re going to download publicly available 1,000 Genomes Project data and examine it with GRG. The goal is just to illustrate the ease with which you can work with such datasets. We also illustrate how to include population information in GRG for a real dataset.

What is demonstrated:

Building a GRG from real data, including population information
Using GRG matrix multiplication to calculate population-specific allele frequencies
Using grapp to filter information out of a GRG

What you’ll need:

Python dependencies “grapp”, “seaborn”: pip install grapp seaborn
Command line tools “wget”, “tabix”: sudo apt install wget tabix (or your system’s equivalent)

First, download the data. This can take 10-15 minutes, depending on your internet connection (the file is roughly 425MB)

%%bash

REMOTE_FILE="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"

if [[ ! -e kgp.chr22.vcf.gz ]]; then
    wget ${REMOTE_FILE} --progress=dot:mega -O kgp.chr22.vcf.gz
    tabix kgp.chr22.vcf.gz
fi

echo "Size of input file:"
du -hs kgp.chr22.vcf.gz

Size of input file:
426M        kgp.chr22.vcf.gz

Build the GRG#

GRG can be built from either IGD or directly from .vcf.gz. IGD is generally smaller and faster than VCF, so if you want to manipulate this data repeatedly (outside of GRG), then convert to IGD first. Otherwise, you can just do a one-time conversion from .vcf.gz to GRG.

We’ll just convert from the VCF file. Before we do that, let’s prepare information that maps each sample in the dataset to a population label, based on metadata provided from the 1,000 Genomes Project.

For population information, GRG needs a tab-separated file with a header. We construct the GRG with the filename and the names of two columns: the column containing the sample identifier (individual ID), and the column containing the population label.

1,000 Genomes data has both “population” and “super-population” labels in their data browser. We’ll use the “population” labels since they are easier to download from the FTP site.

%%bash

if [[ ! -e 1000G_2504_high_coverage.sequence.index ]]; then
  # Two files: one for the original 2504 samples, and one for the additional 698 samples.
  wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_2504_high_coverage.sequence.index --progress=dot:mega
  wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_698_related_high_coverage.sequence.index --progress=dot:mega
fi

# Combine the two files into a single metadata file
cat 1000G_2504_high_coverage.sequence.index 1000G_698_related_high_coverage.sequence.index > kgp.combined_metadata.tsv

# Show the format
head -n 25 kgp.combined_metadata.tsv

##FileDate=20190503
##ENA_FILE_PATH=path to ENA file on ENA ftp site
##MD5=md5sum of file
##RUN_ID=SRA/ERA run accession
##STUDY_ID=SRA/ERA study accession
##STUDY_NAME=Name of study
##CENTER_NAME=Submission centre name
##SUBMISSION_ID=SRA/ERA submission accession
##SUBMISSION_DATE=Date sequence submitted, YYYY-MM-DD
##SAMPLE_ID=SRA/ERA sample accession
##SAMPLE_NAME=Sample name
##POPULATION=Sample population. Further information may be available with the data collection.
##EXPERIMENT_ID=Experiment accession
##INSTRUMENT_PLATFORM=Type of sequencing machine
##INSTRUMENT_MODEL=Model of sequencing machine
##LIBRARY_NAME=Library name
##RUN_NAME=Name of machine run
##INSERT_SIZE=Submitter specifed insert size/paired nominal length
##LIBRARY_LAYOUT=Library layout, this can be either PAIRED or SINGLE
##PAIRED_FASTQ=Name of mate pair file if exists (Runs with failed mates will have a library layout of PAIRED but no paired fastq file)
##READ_COUNT=Read count for the file
##BASE_COUNT=Basepair count for the file
##ANALYSIS_GROUP=Analysis group is used to identify groups, or sets, of data. Further information may be available with the data collection.
#ENA_FILE_PATH      MD5SUM  RUN_ID  STUDY_ID        STUDY_NAME      CENTER_NAME     SUBMISSION_ID   SUBMISSION_DATE SAMPLE_ID       SAMPLE_NAME     POPULATION      EXPERIMENT_ID   INSTRUMENT_PLATFORM     INSTRUMENT_MODEL        LIBRARY_NAME    RUN_NAME        INSERT_SIZE     LIBRARY_LAYOUT  PAIRED_FASTQ    READ_COUNT      BASE_COUNT      ANALYSIS_GROUP
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239480/NA12718.final.cram       923ca8ff7d4cd65fddd28e855e5f173d        ERR3239480      ERP114329       30X whole genome sequencing coverage of the 2504 Phase 3 1000 Genome samples.   NYGC    ERA1783081      2019-03-22 00:00        SRS000632       NA12718 CEU     ERX3266855      ILLUMINA        Illumina NovaSeq 6000   NA12718 ena-RUN-NYGC-22-03-2019-12:30:23:667-205        450     PAIRED          753247448       112987117200    high_cov

The file is a bit messy for GRG to try to use as is (especially since we concatenated the two files without removing the headers). So below we use some Python code to make it cleaner.

# List of (sample, population) pairs that we will output for use with GRG
sample_pops = []
with open("kgp.combined_metadata.tsv") as f:
    for line in f:
        line = line.rstrip("\n").rstrip("\r")
        # Skip comments
        if line.startswith("#"):
            continue
        data = line.split("\t")
        sample_pops.append( (data[9].strip(), data[10].strip()) )

with open("kgp.popmap.tsv", "w") as fout:
    fout.write("SAMPLE\tPOPULATION\n")  # Header
    fout.write("\n".join(["\t".join(s) for s in sample_pops]))

Now we can use the --population-ids flag to grg construct to incorporate this information in the GRG. This is mostly a convenience, we could just create the GRG without doing any of this population mapping.

The argument "kgp.popmap.tsv:SAMPLE:POPULATION" can be interpreted as <filename>:<column with sample IDs>:<column with population labels>.

Building this GRG takes about 4 minutes (with 4 threads) on my laptop.

%%bash

if [[ ! -e kgp.chr22.grg ]]; then
    grg construct -j 4 --population-ids "kgp.popmap.tsv:SAMPLE:POPULATION" kgp.chr22.vcf.gz -o kgp.chr22.grg
fi

Examine GRG and allele frequencies#

Now we can load the GRG and examine the information in it. As an example, lets compute the allele frequencies for just a few of the populations that have a large number of samples (more than 200 haplotypes) in the dataset. We’ll compute these per-population and then display them.

For more complex computations on GRG, see LinearOperators, PCA, or GWAS.

import pygrgl

grg = pygrgl.load_immutable_grg("kgp.chr22.grg", load_up_edges=False)

Something really simple we can do is look at the individual identifiers for each individual in the GRG. Not all GRGs have individual IDs: only if the original data had the IDs. Here we just show the first and last individual.

print(f"First: {grg.get_individual_id(0)}")
print(f"Last: {grg.get_individual_id(grg.num_individuals - 1)}")

First: HG00096
Last: NA21144

We want to use GRG matrix multiplication later on to compute the per-population allele frequencies. So we will: * Find all \(K\) populations with at least 200 haplotypes * Create a \(K \times 2N\) matrix (where \(N\) is the number of diploid individuals) \(A\) that represents our labeled samples * If haplotype sample \(i\) has population label \(j\), then \(A_{j, i} = 1\) * Given our (haploid) genotype matrix \(X\) (as represented by the GRG), we can get the population specific allele counts via \(A \times X\)

from collections import defaultdict
import numpy

# Find populations with at least 100 haplotypes
labels = grg.get_populations()
pop_mapping = defaultdict(list)
for sample in range(grg.num_samples):
    pop_label = labels[grg.get_population_id(sample)]
    pop_mapping[pop_label].append(sample)

# Get labels and samples for populations exceeding our threshold
K_pops = list(filter(lambda item: len(item[1]) >= 200, pop_mapping.items()))

# Create the matrix
K = len(K_pops)
A = numpy.zeros( (K, grg.num_samples), dtype=numpy.int32 )

# Populate the matrix
for j, (pop_label, sample_list) in enumerate(K_pops):
    print(f"{pop_label} has {len(sample_list)} haplotypes")
    A[j, sample_list] = 1

CHS has 326 haplotypes
PUR has 278 haplotypes
CLM has 264 haplotypes
IBS has 314 haplotypes
PEL has 244 haplotypes
PJL has 292 haplotypes
KHV has 244 haplotypes
ACB has 232 haplotypes
GWD has 356 haplotypes
ESN has 298 haplotypes
BEB has 262 haplotypes
STU has 228 haplotypes
ITU has 214 haplotypes
CEU has 358 haplotypes
YRI has 356 haplotypes
CHB has 206 haplotypes
JPT has 208 haplotypes
TSI has 214 haplotypes
GIH has 206 haplotypes

Now we compute the allele counts. The frequency will just be dividing the counts by \(N_j\), the number of haploptypes in each population.

We’re going to use pygrgl.matmul to do the work for us. Notice: we could also do our own graph traversal to compute these values, but matmul() is much faster (it does the traversal with optimized C++ code).

N = [len(sample_list) for pop_label, sample_list in K_pops]

# The "direction" tells us whether we are summing values from the leaves-to-roots (UP) or roots-to-leaves (DOWN).
# These are equivalent to multiplication against X or X^T.
# Here we have an input that is in terms of _samples_, which are the leaves, so we need to start at the samples
# and sum the values until we reach the mutations (UP).
allele_counts = pygrgl.matmul(grg, A, pygrgl.TraversalDirection.UP)

frequencies = allele_counts / numpy.array([N]).T

We’ll pick the column from the data that has an above average sum and plot it.

import seaborn
import pandas

# Sum by column
total_snp_freqs = numpy.sum(frequencies[:, ], axis=0)

# Find the maximum column sum values
avg = numpy.mean(total_snp_freqs)

# Get the first index that has the max value
snp_index = numpy.where(total_snp_freqs >= avg)[0][0]
mutation = grg.get_mutation_by_id(snp_index)             # Get the Mutation object

# Now plot the frequencies at that index
df = pandas.DataFrame({
    "SNP": [snp_index] * K,
    "Frequency": frequencies[:, snp_index],
    "Pop Label": [label for label, smp in K_pops],
})
ax = seaborn.barplot(data=df, x="Pop Label", y="Frequency", hue="Pop Label", palette="Dark2")
ax.set_title(f"SN: {mutation.position}, {mutation.allele}")
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.set_xlabel(None)

Text(0.5, 0, '')

Filter the GRG#

Now we illustrate the types of filtering that are available. At the basic level GRG can be filtered in two ways: 1. By removing a subset of samples. 2. By removing a subset of mutations. If you want to do both, then you need to do it sequentially: e.g., first remove mutations and then samples (or vise versa).

The core library for filtering is pygrgl.save_subset, which is very flexible and useful for calling from Python code.

However, here we are going to demonstrate the filtering available in the grapp command line, which implements a bunch of common filtering scenarios for you (and calls save_subset() under the hood).

Filter by sample#

First, we’ll filter the GRG down to just the samples that we used in our plot above. We’ll use grapp filter --hap-samples <filename> to do this.

In our case, we know that we are actually filtering by individuals; even though the population information is stored at the haplotype level, the way we generated the GRG means that we will only have pairs of haplotypes that belond to individuals. In other words: the haplotype-to-diploid mapping is maintained by this filter.

# Save our samples to a list
all_samples = []
for pop_label, sample_list in K_pops:
    all_samples.extend(sample_list)

with open("haplotypes.txt", "w") as fout:
    fout.write("\n".join(map(str, all_samples)))

print(f"Wrote {len(all_samples)} samples to haplotypes.txt")

Wrote 5100 samples to haplotypes.txt

%%bash
# Filter the GRG. This should be pretty fast (seconds).
grapp filter --hap-samples haplotypes.txt kgp.chr22.grg kgp.chr22.hap_filtered.grg

grg process stats kgp.chr22.hap_filtered.grg

Keeping 5100 haplotypes

Construction took 119 ms

=== GRG Statistics ===
Nodes: 2219044
Edges: 18011864
Samples: 5100
Mutations: 1066557
Ploidy: 2
Phased: true
Populations: 26
Range of mutations: 10519265 - 50808015
Specified range: 0 - 0
======================

Notice that Ploidy: 2 above was retained: this is because we filtered the haplotypes in a way that kept their pairing between diploid and haplotype. If we filter a GRG by haplotype and do not maintain this (e.g., if we keep the odd-numbered haplotypes only) then the GRG will become Ploidy: 1.

Another way to do the above filtering is to just specify what populations we want! We can use grapp filter --populations for that. For both of these commands, you can either specify a filename or a comma-separated list of items to keep.

%%bash
# Filter the GRG. This should be pretty fast (seconds).
grapp filter --populations "CHS,PUR,CLM,IBS,PEL,PJL,KHV,ACB,GWD,ESN,BEB,STU,ITU,CEU,YRI,CHB,JPT,TSI,GIH" kgp.chr22.grg kgp.chr22.pop_filtered.grg

grg process stats kgp.chr22.pop_filtered.grg

Keeping 5100 haplotypes

Construction took 116 ms

=== GRG Statistics ===
Nodes: 2219044
Edges: 18011864
Samples: 5100
Mutations: 1066557
Ploidy: 2
Phased: true
Populations: 26
Range of mutations: 10519265 - 50808015
Specified range: 0 - 0
======================

Filter by Mutation#

There are a bunch of options for filtering by Mutations.

mutation filters:
  -r RANGE, --range RANGE
                        Keep only the variants within the given range, in base pairs. Example: "lower-upper", where both are integers and lower is inclusive, upper is exclusive.
  -c MIN_AC, --min-ac MIN_AC
                        Minimum allele count to keep. All Mutations with count below this value will be dropped
  -C MAX_AC, --max-ac MAX_AC
                        Maximum allele count to keep. All Mutations with count above this value will be dropped
  -q MIN_AF, --min-af MIN_AF
                        Minimum allele frequency to keep. All Mutations with frequency below this value will be dropped
  -Q MAX_AF, --max-af MAX_AF
                        Maximum allele frequency to keep. All Mutations with frequency above this value will be dropped

These filters can be combined together in a single command. Here we’ll illustrate keeping only mutations with the range 20000000-25000000 that have an allele count of at least 20.

%%bash
# Filter the GRG. This should be pretty fast (seconds).
grapp filter -r "20000000-25000000" -c 20 kgp.chr22.pop_filtered.grg kgp.chr22.mut_filtered.grg

grg process stats kgp.chr22.mut_filtered.grg

Kept 47945 variants. Dropped 1018612 variants

Construction took 7 ms

=== GRG Statistics ===
Nodes: 274295
Edges: 2476134
Samples: 5100
Mutations: 47945
Ploidy: 2
Phased: true
Populations: 26
Range of mutations: 20000241 - 24999768
Specified range: 20000000 - 25000000
======================

# Lets load the graph and check the minimum allele counts!
from grapp.util.simple import allele_counts

smaller_grg = pygrgl.load_immutable_grg("kgp.chr22.mut_filtered.grg")
mut_counts = allele_counts(smaller_grg)

print(f"Minimum allele count: {numpy.min(mut_counts)}")

Minimum allele count: 20