Working with Simulated Data#

Simulated data from msprime or SLiM can be saved in the tskit TreeSequence format, which can be directly converted to GRG. It is worth noting that GRG’s converted from tskit.TreeSequence do not have the same performance characteristics as GRG’s converted from the corresponding tabular (e.g., .vcf.gz or .igd) data (a consideration when doing benchmarking).

We’ll use the msprime engine from stdpopsim for this tutorial.

What you’ll need:

Python dependencies “stdpopsim”, “msprime”, “grapp”: pip install stdpopsim msprime grapp

Simulate Data#

First lets simulate a small example dataset.

import stdpopsim

# We will simulate 200 diploid individuals in total, equally spread across populations.
individuals = 200
i_per_pop = individuals // 2

# Setup the options for stdpopsim
species = stdpopsim.get_species("HomSap")
samples = {"AFR": i_per_pop, "EUR": i_per_pop}
contig = species.get_contig("chr22", genetic_map="HapMapII_GRCh38")
engine = stdpopsim.get_engine("msprime")
model = species.get_demographic_model("OutOfAfrica_2T12")

# Simulate the data, which returns an ARG "ts" and then we also save that ARG to disk.
ts = engine.simulate(model, contig, samples)
ts_filename = "stdpop.ooa2.chr22.trees"
ts.dump(ts_filename)

/home/ddehaas/Py3Env/lib/python3.10/site-packages/stdpopsim/engines.py:111: UserWarning: The demographic model has mutation rate 2.36e-08, but this simulation used the contig's mutation rate 1.29e-08. Diversity levels may be different than expected for this species. For details see documentation at https://popsim-consortium.github.io/stdpopsim-docs/stable/tutorial.html
  warnings.warn(

# Count the recurrent and back mutations in the TreeSequence
recurrent = 0
back = 0
already_seen = set()
for s in ts.sites():
    for m in s.mutations:
        if m.parent != -1:
            back += 1
        key = (s.position, m.derived_state)
        if key in already_seen:
            recurrent += 1
        already_seen.add(key)

print("## TreeSequence Info ##")
print(f"Mutations: {ts.num_mutations}")
print(f"Haplotypes: {ts.num_samples}")
print(f"Nodes: {ts.num_nodes}")
print(f"Edges: {ts.num_edges}")
print(f"Recurrent Mutations: {recurrent}")
print(f"Back Mutations: {back}")

## TreeSequence Info ##
Mutations: 175356
Haplotypes: 400
Nodes: 151705
Edges: 1021127
Recurrent Mutations: 128
Back Mutations: 51

Convert to GRG#

Converting directly to GRG can be much faster than exporting the tskit.TreeSequence to VCF and then building a GRG. We illustrate both the command-line and API methods for this conversion.

%%bash
# You can also use pygrgl.grg_from_trees(), but `grg convert` produces a simplified graph.
grg convert stdpop.ooa2.chr22.trees stdpop.ooa2.chr22.trees.grg

Warning: no individual (diploid) coalescence information will be calculated unless you specify --ts-coals. This is fine, but downstream applications will need to use approximations for Mutation variance.

import pygrgl

# Load the GRG we converted above
grg = pygrgl.load_immutable_grg(ts_filename + ".grg")

# Convert the TS to GRG using the API - just for comparison
api_grg = pygrgl.grg_from_trees(ts_filename)

print("## GRG Info ##")
print(f"Mutations: {grg.num_mutations}")
print(f"Haplotypes: {grg.num_samples}")
print(f"Nodes: {grg.num_nodes}")
print(f"Edges: {grg.num_edges}")

## GRG Info ##
Mutations: 175208
Haplotypes: 400
Nodes: 197416
Edges: 971022

As you can see, the GRG is not just an exact copy of the TreeSequence: * There are fewer Mutations: this is due to recurrent and back mutations that we counted above. GRG tries to keep only a single node per unique variant (position and alleles): both grg construct and grg convert (API: pygrgl.grg_from_trees) create a single node (and Mutation object) for each unique variant. Some back mutations are no effect and are dropped by GRG: e.g., when two mutations for the same site occur on the same ARG branch (i.e. they share a node). * For this ARG, the TreeSequence has fewer nodes. As the datasets get large, this becomes increasingly unlikely. A GRG is a multitree that is created by duplicating nodes in the coalescent trees - i.e., a node in tskit.TreeSequence can have different sets of samples beneath it in different coalescent trees, but in GRG each node has a single sample set. However, GRG also removes nodes from the ARG when that node does not provide compression or information, hence in large datasets there are typically fewer nodes.

Now if we look at the GRG that was converted via the API method, we see there are substantially more nodes and edges. This is because GRG simplifies the graph during serialization to the disk. So when you convert via API you are getting the unsimplified graph - it has not yet been written to disk.

print("## GRG Info ##")
print(f"Mutations: {api_grg.num_mutations}")
print(f"Haplotypes: {api_grg.num_samples}")
print(f"Nodes: {api_grg.num_nodes}")
print(f"Edges: {api_grg.num_edges}")

## GRG Info ##
Mutations: 175208
Haplotypes: 400
Nodes: 730624
Edges: 1460403

Converting to IGD#

Now lets say we want a non-graph representation of the data. We could convert the ARG to VCF with the tskit vcf command, but (a) VCF is large and slow to work with and (b) converting to VCF can also be really slow! Instead, now that we have a GRG we can convert it to IGD and then we can use pyigd to examine the data.

from grapp.util.igd import export_igd

# Convert to IGD using 4 threads
export_igd(grg, ts_filename + ".igd", jobs=4, verbose=True)

Using temporary directory /tmp/tmpsbd8vq95.
Splitting GRG into 8 parts..
Converting GRG parts to IGD files...
Merging 8 parts into single IGD stdpop.ooa2.chr22.trees.igd...

import pyigd

with open(ts_filename + ".igd", "rb") as f:
    igd = pyigd.IGDReader(f)

    print("## IGD Info ##")
    print(f"Variants: {igd.num_variants}")
    print(f"Haplotypes: {igd.num_samples}")

## IGD Info ##
Variants: 175208
Haplotypes: 400

Working with Simulated Data#

Simulate Data#

Convert to GRG#

Converting to IGD#

Related Topics#