GRG Concepts ============ GRG --- A GRG is a graph datastructure and a file format that captures the relationship between a set of Mutations (variants) and samples. One way to think of this is as encoding the haploid genotype matrix, which is a matrix where there is a row per haplotype sample and a column per mutation/variant. Each value is either 0 (the sample does not have the variant allele) or 1 (the sample does have the variant allele). There can be multiple variants at a given genetic position (site), and if a sample has 0 for all of them then that sample must have the reference allele at that site. When this matrix is represented densely, it takes ``O(NM)`` memory, which is huge for large genetic datasets. There are sparse matrix methods, which instead of representing all ``N`` rows and ``M`` columns, represents pairs of indices ``i, j`` where the presence of such a pair indicates a ``1`` and the absence of the pair ``i, j`` indicates that the i'th sample has a ``0`` for the j'th variant. A GRG represents this same matrix, but even more compactly than the sparse matrix representation. It does this by capturing shared hierarchical subsets of samples that have the same mutations (variants). A GRG is a multi-tree (the paths above any node form a tree, and the paths below any node form a tree). There are two types of special nodes: mutation nodes, and sample nodes. A mutation node has one or more mutations attached to it. A sample node represents a haplotype sample. A sample contains a mutation (has a ``1`` for that mutation in the genotype matrix) if and only if there is a path from the sample to a node containing the mutation. A GRG can be created from phased data (`IGD `_, `VCF `_, or `BGEN `_) or by converting an Ancestral Recombination Graph (in `tskit format `_) to a GRG. Unphased data is also supported, but the resulting graph is much less optimal than for phased data. The "down edges" in a GRG go from mutations to samples, and are what is stored on disk. When requested, the "up edges" are created during GRG load from disk, and allow bottom-up traversal of the graph in arbitrary ways. Once you have a GRG, computations are done by performing a `traversal `_ of the graph in either the upwards or downwards directions. Samples ------- Samples can have arbitrary ploidy in GRG. Given a ploidy of ``P``, and ``N`` individuals, there will be ``P * N`` haplotypes stored in the GRG. Sample Node ~~~~~~~~~~~ A node which has an ID between ``0...((N*P) - 1)``, and is a leaf in the GRG downward direction. The haploid nodes for an individual are consecutive, so for example the first individual will have ``P`` nodes with IDs ``0, 1, ..., P-1``. Individuals ~~~~~~~~~~~ Individuals can have identifiers associated with them, which helps keep track of properties on those individuals when the dataset is tranformed or filtered. Individuals are not explicitly represented in a GRG, but given a sample node with ID ``i`` you can find the associated individual "ID" by dividing by the ploidy (``i / P``). Populations ~~~~~~~~~~~ A GRG can have ``K`` populations associated with it. Each sample node then have a population ID between ``0...(K-1)`` which can be used to lookup the population description. Mutations --------- Mutations are just a triple of ``(position, ref, alt)`` where ``position`` is a base-pair position (integer), ``ref`` is an arbitrary string for the reference allele, and ``alt`` is an arbitrary string for the alternate allele. Mutations have IDs in the range ``0...(M-1)``. You can have multiple mutation IDs for "the same" mutation (i.e., the ``(position, ref, alt)`` is the same). Mutation Node ~~~~~~~~~~~~~ Any node in the graph can have a Mutation "attached" to it, at which point we call it a mutation node. There can be more than one mutation attached to any given node. The mapping between mutation ID and node is 1-to-1, so if you need to attach a mutation to multiple nodes you need to make a "copy" of the mutation (so it gets another ID) and then associate the new ID with the additional node. In practice, ``grg construct`` associates mutations with a single node. ``grg convert``, when converting an Ancestral Recombination Graph (ARG) to a GRG *may* associate the same mutation with multiple nodes, depending on how the ARG was created. Mutation Time ~~~~~~~~~~~~~ Mutations can optionally have a time associated with them, indicating how old the mutation is. Units are unspecified, and depend on the use-case. Missing Data ------------ Missing data is represented as a node-to-samples relationship, just like Mutations are. However, there is no Mutation object directly associated with the missing data. Instead, each Mutation is mapped to two (optional) ``NodeID``s: the mutation's node, and the missingness node associated with the site (genetic position) of the Mutation. Obtaining the Missingness Node ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can obtain the missingness node for a Mutation the same way you obtain it's "regular" node: by iterating over all Mutations. In C++, this is via :cpp:func:`GRG::getNodesAndMutations` (ordered by Mutation node) or :cpp:func:`GRG::getMutationsToNodeOrdered` (ordered by Mutation position/allele). In Python, this is via :py:meth:`get_node_mutation_miss` and :py:meth:`get_mutation_node_miss`. Given a node, you can also lookup the Mutations and their missingness nodes via :cpp:func:`GRG::getMutationsForNode` (C++) or :py:meth:`get_muts_and_miss_for_node`. Missing Data in Matrix Multiplication ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GRG matrix multiplication handles missing data by an input/output vector (of length ``num_mutations``) for the value of each missingness node associated with each Mutation. For multi-allelic sites, multiple Mutations will share a missingness node, in which case the values provided for each such Mutation will be added together when applied to the missingness node. Discussion ~~~~~~~~~~ Initially, missingness in GRG was implemented as another allele ``.`` in a ``Mutation`` object. The way it is implemented now is to make it easier to associate mutations with their missingness. Consider this scenario: :: POS REF ALT 1000 A G 1000 A T 1000 A . The missingness at this site affects the calculations for both of the variants ``A>G`` and ``A>T``. If we had an explicit ``Mutation`` object for ``.``, then (a) we would have to preprocess all the mutations to create the map between them and their missingness mutation and (b) when you performed :py:meth:`pygrgl.matmul` you would find all the missing alleles and populate their values differently (in the input vector) from the "regular" mutations. Also, :py:attr:`pygrgl.GRG.num_mutations` wouldn't really reflect the number of variants, but variants and missingness. There are two ways that methods in `grapp `_ currently adjust calculations for missingness: * Let :math:`2N`` be the number of haplotypes. Then allele frequency is :math:`count(allele_i) / 2N`. However, when missingness is present, let :math:`C_i = 2N - count(missing_i)` and allele frequency is :math:`f_i = count(allele_i) / C_i` (since :math:`C_i` varies by site). * For non-standardized matrix calculations, we also mean-impute the value for missing alleles. Given a column of the genotype matrix :math:`X_i`, then all the values are either ``0``, ``1``, ``2``, or ``.``. The mean is :math:`2 \times f_i`, so we need to "replace" ``.`` with :math:`2 \times f_i`, but we have to do this implicitly. If you are performing a :py:meth:`pygrgl.matmul` that is :math:`A \times X^T` then we can populate the missingness inputs with :math:`f_i \times A` to get the results we need. Unphased Data ------------- Unphased data is supported by GRG, though GRG construction is not yet optimized for unphased data. An unphased GRG is identical to a phased GRG: all data is stored at the haploid level, and all samples represent haplotype samples. The only difference is that :cpp:func:`GRG::isPhased` (:py:attr:`GRG.is_phased`) returns false. It is up to the client code to ensure that all calculations are performed at the *individual* level (e.g., with the ``by_individual`` flag to :py:meth:`matmul`).