C++ API#

Exception Types#

class ApiMisuseFailure : public std::runtime_error#

Exception thrown when the GRG API is used incorrectly (bad parameters, incorrect preconditions, etc.).

class BadInputFileFailure : public std::runtime_error#

Exception thrown when the input file does not meet conditions needed for GRG.

class SerializationFailure : public std::runtime_error#

Exception thrown when a GRG fails to serialize/deserialize.

Main GRG Classes#

class GRG : public std::enable_shared_from_this<GRG>#

Abstract GRG base class.

Subclassed by grgl::CSRGRG, grgl::MutableGRG

Public Types

using NodeAndMut = std::pair<NodeID, MutationId>#

A node and a mutation attached to that node.

using NodeMutMiss = std::tuple<NodeID, MutationId, NodeID>#

A node and a mutation attached to that node, and the missingness node associated with the mutation.

using MutAndNode = std::pair<MutationId, NodeID>#

A mutation and node, where the mutation comes first.

using MutNodeMiss = std::tuple<MutationId, NodeID, NodeID>#

A mutation, node, and missingness node, where the mutation comes first.

Public Functions

inline bool isSample(const NodeID nodeId) const#

Is the given nodeID associated with a sample node?

The first numSamples node IDs are reserved for samples, so you can also just use nodeId < grg.numSamples().

Parameters:

nodeId[in] The node ID to check.

inline uint16_t getPloidy() const#

How many haploid samples are there per individual?

Returns:

The ploidy, usually 1 or 2. Individual coalescence support only works when ploidy==2.

inline bool isPhased() const#

Is the dataset phased?

If not, GRG still represents the data as a mapping between mutations and haploid samples, but you cannot depend on any haplotype-based analyses representing the true haplotypes. Individual-based analyses are still completely accurate.

virtual bool nodesAreOrdered() const = 0#

Returns true if nodes are ordered in bottom-up topological order.

virtual size_t numNodes() const = 0#

Number of nodes (including sample nodes) in the graph.

virtual size_t numEdges() const = 0#

Number of edges (total) in the graph.

virtual size_t numDownEdges(NodeID nodeId) = 0#

Number of down edges for a particular node in the graph.

virtual size_t numUpEdges(NodeID nodeId) = 0#

Number of up edges for a particular node in the graph.

virtual NodeIDList getDownEdges(NodeID nodeId) = 0#

Get list of down edges for a particular node in the graph.

virtual NodeIDList getUpEdges(NodeID nodeId) = 0#

Get list of up edges for a particular node in the graph.

virtual bool hasUpEdges() const = 0#

True if this graph stores up edges, false if only down edges are stored.

virtual bool edgesAreOrdered() const = 0#

True if the edges in this graph are always in ascending NodeID order.

inline bool mutationsAreOrdered() const#

True iff the mutations are ordered according to position and the alternate allele (in that order).

inline std::pair<BpPosition, BpPosition> getBPRange()#

Get the base-pair range this GRG covers as a pair {first, last+1} where first is the base-pair position of the first mutation, and last is similarly defined. In general we treat all ranges as left-inclusive and right-exlusive, hence the +1 on the last.

inline std::pair<BpPosition, BpPosition> getSpecifiedBPRange() const#

Get the specified base-pair range this GRG covers as a pair {first, last+1}. Unlike getBPRange, this function just returns a pair of values that were specified by a tool/user at some point.

inline void setSpecifiedBPRange(std::pair<BpPosition, BpPosition> bpRange)#

Set the specified base-pair range this GRG covers as a pair {first, last+1}, i.e. the lower value is inclusive and the higher value is exclusive.

inline NodeIDList getSampleNodes() const#

Get the NodeIDList for all samples in the graph.

inline NodeIDList getRootNodes()#

Get the NodeIDList for all roots in the graph.

inline bool hasMissingData() const#

Does this GRG have Mutations which have missing data represented?

template<typename T = NodeAndMut>
inline NodeAndMutIterator<T> getNodesAndMutations()#

Get the list of all corresponding nodes and mutations, for iterating over the mapping between graph nodes and Mutations. By default, just returns the (NodeID, MutationID) pairs, but using the template argument of GRG::NodeMutMiss will return a tuple (NodeID, MutationID, NodeID) where the additional NodeID is the missingness node (or INVALID_NODE_ID) associated with this Mutation.

Returns:

An iterator over (NodeID, MutationId) pairs or (NodeID, MutationId, NodeID) tuples, depending on the function template argument.

template<typename T = MutationId>
inline std::vector<T> getUnmappedMutations()#

Get the mutations associated with no nodes. By default, the template argument is MutationId, so only the vector of mutations is returned. If you set the template argument to GRG::MutAndNode then the mutation and it’s corresponding missingness node will be returned.

Returns:

A vector of either MutationId or GRG::MutAndNode (a std::pair), depending on the template parameter.

template<typename T = MutationId>
inline std::vector<T> getMutationsForNode(const NodeID nodeId)#

Get the mutations associated with the node. By default, the template argument is MutationId, so only the vector of mutations is returned. If you set the template argument to GRG::MutAndNode then the mutation and it’s corresponding missingness node will be returned.

Parameters:

nodeId[in] The node to lookup.

Returns:

A vector of either MutationId or GRG::MutAndNode (a std::pair), depending on the template parameter.

inline bool nodeHasMutations(const NodeID nodeId)#

Does the node have at least one Mutation associated with it?

Parameters:

nodeId[in] The node to lookup.

Returns:

true if the node has one or more Mutation.

template<typename T = MutAndNode>
inline std::vector<T> getMutationsToNodeOrdered()#

Get corresponding mutation IDs and node IDs, ordered by the mutation position + allele (ascending). By default, a MutAndNode is returned (std::pair). If you set the template argument to MutIdNodeMiss (std::tuple) then the third item in the returned tuple will be the missingness node associated with the Mutation.

Returns:

A vector of pairs, MutationID and NodeID (in that order), or a vector of tuples of (MutationID, NodeID, NodeID), depending on function template argument.

void visitBfs(GRGVisitor &visitor, TraversalDirection direction, const NodeIDList &seedList, ssize_t maxQueueWidth = -1)#

Visit nodes breadth-first, starting at the given nodes and following up or down edges.

Parameters:
  • visitor[in] The visitor that will be called back. Typically owns any state that is being computed.

  • direction[in] Whether to go up or down the graph.

  • seedList[in] The nodes to start traversal from.

  • maxQueueWidth[in] The maximum width of the queue; restricts the number of end-to-end paths that will be visited.

void visitDfs(GRGVisitor &visitor, TraversalDirection direction, const NodeIDList &seedList, bool forwardOnly = false)#

Visit nodes depth-first, starting at the given nodes and following up or down edges. A node is never visited more than one time in either pass (forward or backward), unless forwardOnly is true.

Parameters:
  • visitor[in] The visitor that will be called back. Typically owns any state that is being computed.

  • direction[in] Whether to go up or down the graph.

  • seedList[in] The nodes to start traversal from.

  • maxPathDepth[in] The maximum depth of any path that will be explored. Traversal stops when this is reached. Typically only useful when forwardOnly is true.

  • forwardOnly[in] Typically depth-first search reaches a given node, computes (recursively) the values for all successors, and then computes the current node value. This causes each node to be visited twice. However, setting forwardOnly will only visit nodes in the forward direction. It also causes nodes to be visited an arbitrary number of times.

inline size_t addPopulation(std::string populationDescription)#

Add a population to the GRG.

inline const std::vector<std::string> &getPopulations() const#

The populations (their descriptions) represented by this GRG.

MutationId addMutation(const Mutation &mutation, NodeID nodeId, NodeID missNodeId = INVALID_NODE_ID)#

Add the given mutation to the GRG, associated with the given node. If there is no node associated with the mutation, use INVALID_NODE_ID.

Parameters:
  • mutation[in] The Mutation to add.

  • nodeId[in] The nodeId to associated it with, or INVALID_NODE_ID.

  • missNodeId[in] The nodeId for the missingness node of this Mutation, or INVALID_NODE_ID if none.

Returns:

The MutationId for the newly added Mutation.

void removeMutation(MutationId mutId, NodeID nodeId)#

Remove the Mutation with the given ID and nodeId from the graph. If the Mutation has a node (i.e., has more than 0 samples) then you MUST provide it for the deletion to work properly. Marks the Mutation with the given MutationId (mapped to the particular NodeId) as removed. The Mutation information will be cleared out, and the sorted order of the mutations will be invalidated (so subsequent analyses may re-number the

Parameters:
  • mutId[in] The MutationId.

  • nodeId[in] The NodeId.

void sortMutations()#

If needed, sort the mutations by their (position, allele) and renumber them so that the MutationId ascending order matches this order. Can be a memory-intensive operation. Alternatively, you can just write the GRG to disk (saveGrg()) and then reload it (loadImmutableGRG()) and it will use less memory (but be slower).

bool mutationsAreUnique() const#

Check whether the Mutations in GRG are unique: there is a single Mutation (with MutationId) for each unique (position, ref allele, alt allele) combination. Since the mapping between MutationId and nodes is many-to-1 (not many-to-many), having unique mutations means that each Mutation is associated with a single node in the graph. When a GRG is constructed via grg construct this is always true of the resulting GRG. When a GRG is constructed arbitrarily from the API, it may not be true.

This operation is O(M), where M is the number of mutations.

template<typename T>
inline std::vector<T> matMul(const std::vector<T> &inputMatrix, const size_t numRows, TraversalDirection direction)#

Compute one of two possible matrix multiplications across the entire graph. The input matrix \(V\) can be either \(K \times N\) ( \(N\) is number of samples) or \(K \times M\) ( \(M\) is number of mutations). The given direction determines which input matrix is expected. Let \(X\) be the \(N \times M\) genotype matrix. For an \(K \times N\) input \(V\), the product performed is \(V \times X\) which gives a \(K \times M\) result. I.e., the input matrix is a column per sample and the output matrix is a column per mutation. For an \(K \times M\) input \(V\), the product performed is \(V \times X^T\) which gives a \(K \times N\) result. I.e., the input matrix is a column per mutation and the output matrix is a column per sample.

The simplest case to consider is a vector input (e.g., a \(1 \times N\) matrix). This vector-matrix product in the graph works by seeding the input nodes (samples in this example) with the corresponding values from the input vector and then traversing the graph in the relevant direction (up or down). The ancestor/descendant values are summed at each node, until the terminal nodes (mutations in this example) are reached. The values at the terminal nodes are then the output vector. When a \(K\)-row matrix is input, instead of a vector, the only difference is that each node stores \(K\) values instead of 1.

Note: the RAM used will be \(O(K * nodes)\) where \(nodes\) is the total number of nodes in the graph.

Parameters:
  • inputMatrix[in] The \(K \times N\) or \(K \times M\) input matrix in row-major order. For samples, the \(i\)th column are the values for sample \(i\). Similarly for mutations, the ordering follows the MutationId ordering.

  • numRows[in] The number of rows in the matrix. The size of the input must be divisible by numRows, and determines the number of columns.

  • direction[in] Use TraversalDirection::DIRECTION_DOWN for mutations as input, and TraversalDirection::DIRECTION_UP for samples as input.

Returns:

The resulting matrix as described above, in row-major order.

inline PopulationID getPopulationId(NodeID nodeId)#

Get the ID of the population associated with the node. Will be POPULATION_UNSPECIFIED if the node is not a sample, or if there is no population associated.

Parameters:

nodeId[in] The node to retrieve.

inline void setPopulationId(NodeID nodeId, PopulationID popId)#

Set the ID of the population associated with the node.

Parameters:
  • nodeId[in] The node to modify.

  • popId[in] The population ID, which is already associated with a population description via addPopulation().

inline NodeIDSizeT getNumIndividualCoals(NodeID nodeId)#

Get the number of individuals that coalescence at this node (not below it, but exactly at it).

Parameters:

nodeId[in] The node to retrieve.

Returns:

The number of individuals. For diploid data this is a number between 0…numSamples()/2.

inline void setNumIndividualCoalsGrow(NodeID nodeId, NodeIDSizeT coals)#

Set the number of individuals that coalesce at this node, growing the underlying data size if needed.

NOT THREADSAFE.

Parameters:
  • nodeId[in] The node to modify.

  • coals[in] The number of individuals that coalesce, between 0…numSamples()/ploidy.

inline void setNumIndividualCoals(NodeID nodeId, NodeIDSizeT coals)#

Set the number of individuals that coalesce at this node.

The value is updated atomically, so this can be used from threaded code (as long as the rest of the GRG, e.g. number of nodes, is not changing).

Parameters:
  • nodeId[in] The node to modify.

  • coals[in] The number of individuals that coalesce, between 0…numSamples()/ploidy.

inline bool hasIndividualCoals() const#

Does this dataset have any individual coalescence information? If not, you won’t be able to quickly get the exact sample variance for each mutation (diploid genotypes only).

inline void addIndividualId(const std::string &identifier)#

Add the next individual’s identifier. Must be called in order, for individuals 0…(N-1).

Parameters:

identifier – The identifier to add.

inline void clearIndividualIds()#

Clear all individual identifiers.

inline bool hasIndividualIds() const#

Are there any string identifiers for the individuals in this GRG?

Returns:

true if individual identifiers are available.

inline std::string getIndividualId(NodeIDSizeT individualIndex)#

Get the string identifier for the k’th individual. Returns empty string if there is no identifier for the given individual.

Returns:

String identifier for the given individual.

template<typename T>
class NodeAndMutIterator#

Iterator for traversing the NodeID<–>MutationID mapping, along with the (sometimes present) NodeID for the missingness node associated with the Mutation.

Each Mutation can have two different “kinds” of nodes associated with it:

  1. The Mutation Node, which defines the mapping between a Mutation and all of the samples that have that Mutation.

  2. The Missingness Node, which defines the mapping between a Mutation and all of the samples that are missing an allele at the site associated with the Mutation.

Public Functions

inline size_t size() const#

The number of items in the vector underlying the iterator.

inline VectIter3 begin() const#

The beginning of the vector.

inline VectIter3 end() const#

The end of the vector.

class VectIter3#
class MutableGRG : public grgl::GRG#

A MutableGRG can be changed by adding/removing nodes and edges.

Public Functions

inline explicit MutableGRG(size_t numSamples, uint16_t ploidy, bool phased = true, size_t initialNodeCapacity = DEFAULT_NODE_CAPACITY, bool useUpEdges = true)#

Construct a GRG for a given number of samples.

Parameters:
  • The[in] number of samples that will be used to construct the graph.

  • (Optional)[in] the initial capacity of the node vector. If you know in advance roughly how many nodes will be created this can improve performance.

inline virtual bool nodesAreOrdered() const override#

Returns true if nodes are ordered in bottom-up topological order.

inline virtual size_t numNodes() const override#

Number of nodes (including sample nodes) in the graph.

inline virtual size_t numEdges() const override#

Number of edges (total) in the graph.

inline virtual size_t numDownEdges(NodeID nodeId) override#

Number of down edges for a particular node in the graph.

inline virtual size_t numUpEdges(NodeID nodeId) override#

Number of up edges for a particular node in the graph.

inline virtual NodeIDList getDownEdges(NodeID nodeId) override#

Get list of down edges for a particular node in the graph.

inline virtual NodeIDList getUpEdges(NodeID nodeId) override#

Get list of up edges for a particular node in the graph.

inline virtual bool hasUpEdges() const override#

True if this graph stores up edges, false if only down edges are stored.

inline virtual bool edgesAreOrdered() const override#

True if the edges in this graph are always in ascending NodeID order.

inline NodeID makeNode(const size_t count = 1, bool forceOrdered = false)#

Create a new node in the GRG.

This is the only valid way to construct a GRG node. When you create a GRG you specify the number of sample nodes, and those nodes are created right away. Each call to makeNode() after that will generate sequential-ID nodes. Use connect() to connect the newly created node to other nodes.

Parameters:
  • count[in] The number of nodes to create.

  • forceOrdered[in] If true, do not “break” the topological ordering property of the graph if it already exists. Use only when you are certain that newly added nodes (and their edges) will main topological order.

void connect(NodeID srcId, NodeID tgtId)#

Create a graph edge from one node (the source) to another node (the target).

In GRG, “source node” always refers to the node “above” the other node. I.e., this function constructs a down edge from source to target, and an up edge from target to source. In ARG parlance, the source node is the parent and the target is the child.

Parameters:
  • srcId[in] The ID of the source node.

  • tgtId[in] The ID of the target node.

bool disconnect(NodeID srcId, NodeID tgtId)#

Remove a graph edge between two nodes.

The source and target nodes have the same meaning as in connect().

Parameters:
  • srcId[in] The ID of the source node.

  • tgtId[in] The ID of the target node.

NodeIDSizeT ensureUniqueMutations()#

Ensure that every Mutation in the GRG is unique: there is a single Mutation (with MutationId) for each unique (position, ref allele, alt allele) combination. It does so by finding duplicate Mutations, adding a new node that becomes a parent to them, and creating a new (single) mutation representing them all. At the end of this method, the mutations will be unordered, and either serializing the GRG to disk or calling sortMutations() will be necessary to get the new, correct MutationIds.

This operation is O(M), where M is the number of mutations.

Returns:

The number of nodes that were added to the graph.

void merge(const std::list<std::string> &otherGrgFiles, bool combineNodes = true, bool useSampleSets = false, bool verbose = false, bool ignoreRangeViolations = false, std::vector<BpPosition> positionAdjust = {})#

Merge one or more GRGs into this one. Only succeeds if all GRGs have the same number of samples.

This assumes that the GRGs were constructed from the same sampleset &#8212; e.g., they could be constructed from two subsets of the same sampleset (as long as both were constructed with the same sample node numbering) or from a subset of mutations against the same sampleset.

The specified range of the resulting GRG will be (min(range of any input), max(range of any input)), even if the provided GRGs do not span a contiguous region. It is up to the caller of this method to ensure that either (1) the span is contiguous or (2) they adjust the specified range appropriately afterwards.

Parameters:
  • otherGrgFiles[in] The list of GRG filenames to load and merge.

  • combineNodes[in] Set to false to never combine nodes from the graphs.

  • useSampleSets[in] Set to true to use the slower, more RAM intensive version that tracks sets of samples beneath each graph node, and combines nodes that have the same sample set. The default algorithm just uses the (mapped) children to determine if two nodes can be combined, which combines fewer nodes overall, but also retains more hierarchy in the final graph.

  • verbose[in] Emit more information to stdout.

  • ignoreRangeViolations[in] Merge graphs that might have overlapping positions.

  • positionAdjust[in] Adjust each otherGrgFiles GRG by the given basepair position. I.e., for each Mutation in otherGrgFiles[i], add positionAdjust[i] to its mutation position.

inline GRGNode &getNode(const NodeID nodeId) const#

Retrieve a node by ID.

Parameters:

nodeId[in] The ID of the node in question.

void compact(NodeID nodeId = INVALID_NODE_ID)#

Compact the edge vectors in the GRG. If done infrequently won’t affect the amortized edge addition cost but will reduce overall RAM usage by a non-trivial amount.

class Mutation#

A mutation in the GRG.

Public Functions

inline std::pair<std::string, std::string> getBothAlleles() const#

Get a pair containing the reference allele (first) and the alternate allele (second) as strings.

inline std::string alleleStorageString() const#

Returns the concatenation of the reference allele with the alternate allele. There is no delimiter, see getAltIndex() to get the index of the first character in the alternate allele.

inline void setAlleleStorageString(const std::string &storageStr)#

INTERNAL SERIALIZATION METHOD.

inline size_t getAltIndex() const#

The position of the first character of the alternate allele in the allele storage string.

Warning

doxygenclass: Cannot find class “grgl::GRG::NodeAndMut” in doxygen xml output for project “grgl” from directory: doc/xml/

Warning

doxygenclass: Cannot find class “grgl::GRG::NodeMutMiss” in doxygen xml output for project “grgl” from directory: doc/xml/

Warning

doxygenclass: Cannot find class “grgl::GRG::MutAndNode” in doxygen xml output for project “grgl” from directory: doc/xml/

Warning

doxygenclass: Cannot find class “grgl::GRG::MutNodeMiss” in doxygen xml output for project “grgl” from directory: doc/xml/

Serialization and Conversion#

std::pair<NodeIDSizeT, EdgeSizeT> grgl::writeGrg(const GRGPtr &grg, std::ostream &out, bool allowSimplify)#

Serialize the GRG to the given outstream.

Parameters:
  • grg[in] The GRG to be serialized.

  • out[in] The (binary) output stream.

MutableGRGPtr grgl::readMutableGrg(IFSPointer &inStream, bool loadUpEdges)#

Deserialize the GRG from the given input stream.

Parameters:

inStream[in] The (binary) input stream.

GRGPtr grgl::readImmutableGrg(IFSPointer &inStream, bool loadUpEdges)#
MutableGRGPtr grgl::convertTreeSeqToGRG(const tsk_treeseq_t *treeSeq, bool binaryMutations, bool useNodeTimes, bool maintainTopology, bool computeCoals, std::pair<size_t, size_t> treeRange)#

Convert a tskit tree-sequence object into a GRG.

Parameters:
  • treeSeq[in] The tskit TreeSequence object.

  • binaryMutations[in] Set to true to emit Mutations as 0 (ref) and 1 (any alt).

  • useNodeTimes[in] Set to true to set the Mutation’s time based on its tskit node time, instead of its tskit mutation time.

  • maintainTopology[in] Set to false to save some time and space when converting to GRG, at the cost of not capturing all topology changes due to recombinations “unseen” by Mutations above them. WARNING: This causes inaccurate sample-to-mutation mapping when there are back mutations in the trees (nested mutations at the same site).

  • computeCoals[in] Set to true to compute the number of individuals that coalesce at each node. Required if you want to compute the exact variance for each mutation when performing diploid genotype matrix operations (e.g., GWAS) later. However, you can often just use the binomial approximation to variance and leave this as false. Computing the coalescences is expensive on large TreeSequences.

  • treeRange[in] Optional range of tree indices. If provided, only trees within that range [first, last) are converted (first is inclusive, last is exclusive).

std::pair<NodeIDSizeT, EdgeSizeT> grgl::simplifyAndSerialize(const GRGPtr &grg, std::ostream &outStream, const GRGOutputFilter &filter, bool allowSimplify)#

Visitors#

enum grgl::TraversalDirection#

Values:

enumerator DIRECTION_DOWN#
enumerator DIRECTION_UP#
enum grgl::DfsPass#

Values:

enumerator DFS_PASS_NONE#
enumerator DFS_PASS_THERE#
enumerator DFS_PASS_BACK_AGAIN#
class GRGVisitor#

Visitor abstract base class for traversing the nodes of a GRG.

To traverse a GRG, create a class that inherits from GRGVisitor and overrides the visit() function. Then use the GRG::visitDfs(), GRG::visitBfs(), or GRG::visitTopo() functions and pass your visitor as an argument.

Subclassed by AlleleFreqVisitor, ZygosityInfoVisitor, grgl::DfsSampleCountVisitor, grgl::FrontierVisitor, grgl::GRGCoalCalcVisitor, grgl::GRGCoalMarkVisitor, grgl::NodeHasherVisitor, grgl::NodeMapperVisitor, grgl::RenumberAndWriteVisitor, grgl::TopoCandidateCollectorVisitor, grgl::TopoCoalCounter, grgl::TopoOrderVisitor, grgl::TopoSampleSetVisitor

Public Functions

virtual bool visit(const GRGPtr &grg, NodeID node, TraversalDirection direction, DfsPass dfsPass = DFS_PASS_NONE) = 0#

Visit a node in the GRG.

Parameters:
  • grg[in] The graph.

  • node[in] The current node we are visiting.

  • direction[in] Whether we’re traversing bottom-up (DIRECTION_UP) or top-down (DIRECTION_DOWN).

  • dfsPass[in] For DFS traversals, is this the forward (DFS_PASS_THERE) pass or the backward (DFS_PASS_BACK_AGAIN) pass after we have computed values for all successors.