GRG Concepts#
GRG#
A GRG is a graph datastructure and a file format that captures the relationship between
a set of Mutations (variants) and samples. One way to think of this is as encoding the haploid
genotype matrix, which is a matrix where there is a row per haplotype sample and a column per
mutation/variant. Each value is either 0 (the sample does not have the variant allele) or 1
(the sample does have the variant allele). There can be multiple variants at a given genetic
position (site), and if a sample has 0 for all of them then that sample must have the reference
allele at that site. When this matrix is represented densely, it takes O(NM) memory, which
is huge for large genetic datasets. There are sparse matrix methods, which instead of representing
all N rows and M columns, represents pairs of indices i, j where the presence of
such a pair indicates a 1 and the absence of the pair i, j indicates that the i’th
sample has a 0 for the j’th variant.
A GRG represents this same matrix, but even more compactly than the sparse matrix representation.
It does this by capturing shared hierarchical subsets of samples that have the same mutations (variants).
A GRG is a multi-tree (the paths above any node form a tree, and the paths below any node form a tree).
There are two types of special nodes: mutation nodes, and sample nodes. A mutation node has one or more
mutations attached to it. A sample node represents a haplotype sample. A sample contains a mutation (has a 1
for that mutation in the genotype matrix) if and only if there is a path from the sample to a node
containing the mutation.
A GRG can be created from phased data (IGD, VCF, or BGEN) or by converting an Ancestral Recombination Graph (in tskit format) to a GRG.
Unphased data is also supported, but the resulting graph is much less optimal than for phased data.
The “down edges” in a GRG go from mutations to samples, and are what is stored on disk. When requested, the “up edges” are created during GRG load from disk, and allow bottom-up traversal of the graph in arbitrary ways.
Once you have a GRG, computations are done by performing a traversal of the graph in either the upwards or downwards directions.
Samples#
Samples can have arbitrary ploidy in GRG. Given a ploidy of P, and N individuals, there will
be P * N haplotypes stored in the GRG.
Sample Node#
A node which has an ID between 0...((N*P) - 1), and is a leaf in the GRG downward direction. The
haploid nodes for an individual are consecutive, so for example the first individual will have P
nodes with IDs 0, 1, ..., P-1.
Individuals#
Individuals can have identifiers associated with them, which helps keep track of properties on those
individuals when the dataset is tranformed or filtered. Individuals are not explicitly represented
in a GRG, but given a sample node with ID i you can find the associated individual “ID” by
dividing by the ploidy (i / P).
Populations#
A GRG can have K populations associated with it. Each sample node then have a population ID
between 0...(K-1) which can be used to lookup the population description.
Mutations#
Mutations are just a triple of (position, ref, alt) where position is a base-pair position (integer),
ref is an arbitrary string for the reference allele, and alt is an arbitrary string for the alternate
allele. Mutations have IDs in the range 0...(M-1). You can have multiple mutation IDs for “the same”
mutation (i.e., the (position, ref, alt) is the same).
Mutation Node#
Any node in the graph can have a Mutation “attached” to it, at which point we call it a mutation node. There can be more than one mutation attached to any given node. The mapping between mutation ID and node is 1-to-1, so if you need to attach a mutation to multiple nodes you need to make a “copy” of the mutation (so it gets another ID) and then associate the new ID with the additional node.
In practice, grg construct associates mutations with a single node. grg convert, when converting an Ancestral
Recombination Graph (ARG) to a GRG may associate the same mutation with multiple nodes, depending on how the ARG
was created.
Mutation Time#
Mutations can optionally have a time associated with them, indicating how old the mutation is. Units are unspecified, and depend on the use-case.
Missing Data#
Missing data is represented as a node-to-samples relationship, just like Mutations are. However, there is no Mutation object directly associated with the missing data. Instead, each Mutation is mapped to two (optional) ``NodeID``s: the mutation’s node, and the missingness node associated with the site (genetic position) of the Mutation.
Obtaining the Missingness Node#
You can obtain the missingness node for a Mutation the same way you obtain it’s “regular” node: by iterating over all
Mutations. In C++, this is via GRG::getNodesAndMutations() (ordered by Mutation node) or
GRG::getMutationsToNodeOrdered() (ordered by Mutation position/allele). In Python, this is via
get_node_mutation_miss() and get_mutation_node_miss().
Given a node, you can also lookup the Mutations and their missingness nodes via GRG::getMutationsForNode() (C++)
or get_muts_and_miss_for_node().
Missing Data in Matrix Multiplication#
GRG matrix multiplication handles missing data by an input/output vector (of length num_mutations) for the value of
each missingness node associated with each Mutation. For multi-allelic sites, multiple Mutations will share a missingness
node, in which case the values provided for each such Mutation will be added together when applied to the missingness node.
Discussion#
Initially, missingness in GRG was implemented as another allele . in a Mutation object. The way it is implemented
now is to make it easier to associate mutations with their missingness. Consider this scenario:
POS REF ALT
1000 A G
1000 A T
1000 A .
The missingness at this site affects the calculations for both of the variants A>G and A>T. If we had an explicit
Mutation object for ., then (a) we would have to preprocess all the mutations to create the map between them and
their missingness mutation and (b) when you performed pygrgl.matmul() you would find all the missing alleles and populate
their values differently (in the input vector) from the “regular” mutations.
Also, pygrgl.GRG.num_mutations wouldn’t really reflect the number of variants, but variants and missingness.
There are two ways that methods in grapp currently adjust calculations for missingness:
Let \(2N`\) be the number of haplotypes. Then allele frequency is \(count(allele_i) / 2N\). However, when missingness is present, let \(C_i = 2N - count(missing_i)\) and allele frequency is \(f_i = count(allele_i) / C_i\) (since \(C_i\) varies by site).
For non-standardized matrix calculations, we also mean-impute the value for missing alleles. Given a column of the genotype matrix \(X_i\), then all the values are either
0,1,2, or.. The mean is \(2 \times f_i\), so we need to “replace”.with \(2 \times f_i\), but we have to do this implicitly. If you are performing apygrgl.matmul()that is \(A \times X^T\) then we can populate the missingness inputs with \(f_i \times A\) to get the results we need.
Unphased Data#
Unphased data is supported by GRG, though GRG construction is not yet optimized for unphased data. An unphased GRG is
identical to a phased GRG: all data is stored at the haploid level, and all samples represent haplotype samples. The
only difference is that GRG::isPhased() (GRG.is_phased) returns false. It is up to the client code
to ensure that all calculations are performed at the individual level (e.g., with the by_individual flag to
matmul()).