GRG Concepts ============ GRG --- A GRG is a graph datastructure and a file format that captures the relationship between a set of Mutations (variants) and samples. One way to think of this is as encoding the haploid genotype matrix, which is a matrix where there is a row per haplotype sample and a column per mutation/variant. Each value is either 0 (the sample does not have the variant allele) or 1 (the sample does have the variant allele). There can be multiple variants at a given genetic position (site), and if a sample has 0 for all of them then that sample must have the reference allele at that site. When this matrix is represented densely, it takes ``O(NM)`` memory, which is huge for large genetic datasets. There are sparse matrix methods, which instead of representing all ``N`` rows and ``M`` columns, represents pairs of indices ``i, j`` where the presence of such a pair indicates a ``1`` and the absence of the pair ``i, j`` indicates that the i'th sample has a ``0`` for the j'th variant. A GRG represents this same matrix, but even more compactly than the sparse matrix representation. It does this by capturing shared hierarchical subsets of samples that have the same mutations (variants). A GRG is a multi-tree (the paths above any node form a tree, and the paths below any node form a tree). There are two types of special nodes: mutation nodes, and sample nodes. A mutation node has one or more mutations attached to it. A sample node represents a haplotype sample. A sample contains a mutation (has a ``1`` for that mutation in the genotype matrix) if and only if there is a path from the sample to a node containing the mutation. A GRG can be created from phased data (`IGD `_, `VCF `_, or `BGEN `_) or by converting an Ancestral Recombination Graph (in `tskit format `_) to a GRG. Unphased data is also supported, but the resulting graph is much less optimal than for phased data. The "down edges" in a GRG go from mutations to samples, and are what is stored on disk. When requested, the "up edges" are created during GRG load from disk, and allow bottom-up traversal of the graph in arbitrary ways. Once you have a GRG, computations are done by performing a `traversal `_ of the graph in either the upwards or downwards directions. Samples ------- Samples can have arbitrary ploidy in GRG. Given a ploidy of ``P``, and ``N`` individuals, there will be ``P * N`` haplotypes stored in the GRG. Sample Node ~~~~~~~~~~~ A node which has an ID between ``0...((N*P) - 1)``, and is a leaf in the GRG downward direction. The haploid nodes for an individual are consecutive, so for example the first individual will have ``P`` nodes with IDs ``0, 1, ..., P-1``. Individuals ~~~~~~~~~~~ Individuals can have identifiers associated with them, which helps keep track of properties on those individuals when the dataset is tranformed or filtered. Individuals are not explicitly represented in a GRG, but given a sample node with ID ``i`` you can find the associated individual "ID" by dividing by the ploidy (``i / P``). Populations ~~~~~~~~~~~ A GRG can have ``K`` populations associated with it. Each sample node then have a population ID between ``0...(K-1)`` which can be used to lookup the population description. Mutations --------- Mutations are just a triple of ``(position, ref, alt)`` where ``position`` is a base-pair position (integer), ``ref`` is an arbitrary string for the reference allele, and ``alt`` is an arbitrary string for the alternate allele. Mutations have IDs in the range ``0...(M-1)``. You can have multiple mutation IDs for "the same" mutation (i.e., the ``(position, ref, alt)`` is the same). Mutation Node ~~~~~~~~~~~~~ Any node in the graph can have a Mutation "attached" to it, at which point we call it a mutation node. There can be more than one mutation attached to any given node. The mapping between mutation ID and node is 1-to-1, so if you need to attach a mutation to multiple nodes you need to make a "copy" of the mutation (so it gets another ID) and then associate the new ID with the additional node. In practice, ``grg construct`` associates mutations with a single node. ``grg convert``, when converting an Ancestral Recombination Graph (ARG) to a GRG *may* associate the same mutation with multiple nodes, depending on how the ARG was created. Mutation Time ~~~~~~~~~~~~~ Mutations can optionally have a time associated with them, indicating how old the mutation is. Units are unspecified, and depend on the use-case. Missing Data ------------ Missing data is represented as a node-to-samples relationship, just like Mutations are. However, there is no Mutation object directly associated with the missing data. Instead, each Mutation is mapped to two (optional) ``NodeID``: the mutation's node, and the missingness node associated with the site (genetic position) of the Mutation. Obtaining the Missingness Node ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can obtain the missingness node for a Mutation the same way you obtain it's "regular" node: by iterating over all Mutations. In C++, this is via :cpp:func:`GRG::getNodesAndMutations` (ordered by Mutation node) or :cpp:func:`GRG::getMutationsToNodeOrdered` (ordered by Mutation position/allele). In Python, this is via :py:meth:`get_node_mutation_miss` and :py:meth:`get_mutation_node_miss`. Given a node, you can also lookup the Mutations and their missingness nodes via :cpp:func:`GRG::getMutationsForNode` (C++) or :py:meth:`get_muts_and_miss_for_node`. Missing Data in Matrix Multiplication ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GRG matrix multiplication handles missing data by an input/output vector (of length ``num_mutations``) for the value of each missingness node associated with each Mutation. For multi-allelic sites, multiple Mutations will share a missingness node, in which case the values provided for each such Mutation will be added together when applied to the missingness node. Discussion ~~~~~~~~~~ Initially, missingness in GRG was implemented as another allele ``.`` in a ``Mutation`` object. The way it is implemented now is to make it easier to associate mutations with their missingness. Consider this scenario: :: POS REF ALT 1000 A G 1000 A T 1000 A . The missingness at this site affects the calculations for both of the variants ``A>G`` and ``A>T``. If we had an explicit ``Mutation`` object for ``.``, then (a) we would have to preprocess all the mutations to create the map between them and their missingness mutation and (b) when you performed :py:meth:`pygrgl.matmul` you would find all the missing alleles and populate their values differently (in the input vector) from the "regular" mutations. Also, :py:attr:`pygrgl.GRG.num_mutations` wouldn't really reflect the number of variants, but variants and missingness. There are two ways that methods in `grapp `_ currently adjust calculations for missingness: * Let :math:`2N`` be the number of haplotypes. Then allele frequency is :math:`count(allele_i) / 2N`. However, when missingness is present, let :math:`C_i = 2N - count(missing_i)` and allele frequency is :math:`f_i = count(allele_i) / C_i` (since :math:`C_i` varies by site). * For non-standardized matrix calculations, we also mean-impute the value for missing alleles. Given a column of the genotype matrix :math:`X_i`, then all the values are either ``0``, ``1``, ``2``, or ``.``. The mean is :math:`2 \times f_i`, so we need to "replace" ``.`` with :math:`2 \times f_i`, but we have to do this implicitly. If you are performing a :py:meth:`pygrgl.matmul` that is :math:`A \times X^T` then we can populate the missingness inputs with :math:`f_i \times A` to get the results we need. Unphased Data ------------- Unphased data is supported by GRG, though GRG construction is not yet optimized for unphased data. An unphased GRG is identical to a phased GRG: all data is stored at the haploid level, and all samples represent haplotype samples. The only difference is that :cpp:func:`GRG::isPhased` (:py:attr:`GRG.is_phased`) returns false. It is up to the client code to ensure that all calculations are performed at the *individual* level (e.g., with the ``by_individual`` flag to :py:meth:`matmul`). Variations of GRGs ------------------ The typical GRG that most users/tools will interact with has the following properties: 1. It is immutable (:py:meth:`pygrgl.load_immutable_grg`), meaning the edges and nodes cannot be changed. 2. The nodes are numbered according to bottom-up topological order. That means a node with ID ``i`` is gauranteed to have all of it's children visited if you visit all nodes with IDs less than ``i``. This can be checked with :py:attr:`pygrgl.GRG.nodes_are_ordered`. 3. The :py:class:`pygrgl.Mutation`s are numbered in ascending order of ``(position, ref_allele, allele)``. This can be checked with :py:attr:`pygrgl.GRG.mutations_are_ordered`. 4. The Sample nodes are numbered ``0...(num_samples - 1)``, and are leaves of the graph. This can be checked with :py:attr:`pygrgl.GRG.samples_are_ordered`. The above are true for any immutable GRG that has been loaded. However, you can (optionally) break the above properties by performing modifications to the graph. For example: * An immutable GRG can still change it's Mutations (just not the nodes/edges of the graph). So if you call :py:meth:`pygrgl.GRG.add_mutation` or :py:meth:`pygrgl.GRG.remove_mutation` then :py:attr:`pygrgl.GRG.mutations_are_ordered` will become ``False``. * Changes to mutable GRGs can result in violating the above properties. * :py:meth:`pygrgl.MutableGRG.connect` can affect the node topological ordering (see :py:meth:`pygrgl.MutableGRG.connect` for details). If :py:attr:`pygrgl.GRG.nodes_are_ordered` is ``False``, then you need to use :py:attr:`pygrgl.GRG.get_topo_order` (with ``seed_list=None``) to do the traversal. * :py:meth:`pygrgl.MutableGRG.set_samples` will cause :py:attr:`pygrgl.GRG.samples_are_ordered` to become ``False``. You can no longer rely on the numbering of the samples, but this is not usually impactful, as :py:meth:`pygrgl.GRG.get_sample_nodes` and :py:meth:`pygrgl.GRG.is_sample` abstracts this away anyways. If you have modified a GRG, you can always get all of these properties back by writing it to disk and reloading it as an immutable GRG. Negative nodes ~~~~~~~~~~~~~~ "Negative" nodes are just regular nodes in the GRG that have been created with ``grg.make_node(negative=True)``. This tells the GRG that the node should be considered from the **bottom** of the graph for it's topological order. Regular nodes are always checked against the **top** of the graph. For example: * We create a regular node with ID ``A``. We attach ``A`` as a parent to an existing node ``B``. This maintains the topological order, because the NodeIDs increase in number, and that matched the topological order (``A`` is a parent of a node with a smaller ID) * We create a regular node with ID ``A``. We attach ``A`` as a **child** to an existing node ``B``. This breaks topological order! ``B`` is a smaller NodeID than ``A``, but is its parent. * We create a negative node with ID ``A``. We attach ``A`` as a child to an existing node ``B``. This maintains the topological order, as negative nodes are considered "less than" all regular nodes, for topological order. The actual NodeID returned by :py:meth:`pygrgl.GRG.make_node` is still a positive value. The only thing that makes a node negative is how the GRG tracks it internally. Also, when you connect a negative node to another node you must put a negative sign in front of it, like this: :: new_node = grg.make_node(negative=True) grg.connect(old_node, -new_node) # Passed the negation! This keeps the topological order ... grg.get_up_edges(new_node) # All other GRG APIs use the positive node ID Advantages of immutable GRGs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * An immutable GRG uses 2-3x less RAM than a mutable GRG. * An immutable GRG has a faster matrix multiplication than a mutable GRG. * :py:attr:`pygrgl.GRG.nodes_are_ordered` and :py:attr:`pygrgl.GRG.samples_are_ordered` will *always* be true in an immutable GRG.