Python API#

pygrgl.get_bfs_order(grg: _grgl.GRG, direction: _grgl.TraversalDirection, seed_list: list[int], max_queue_width: int = -1) → list[int]#

Get a list of NodeIDs in breadth-first-search (BFS) order, starting from the given seeds and traversing in the provided TraversalDirection (up or down).

Parameters:

grg (pygrgl.GRG or pygrgl.MutableGRG) – The GRG to get nodes for.
direction (pygrgl.TraversalDirection) – The direction to traverse, up or down.
seed_list (List[int]) – The list of NodeIDs that represent the starting place of the traversal. For example, if you use pygrgl.GRG.get_sample_nodes() and pygrgl.TraversalDirection.UP then the entire graph will be traversed from bottom to top.
max_queue_width (int) – The maximum width of the queue used for the breadth-first search. The default is -1, which means there is no maximum width. Setting this can help reduce traversal cost but will result in an incomplete traversal.

Returns:

The ordered list of NodeIDs.

Return type:

List[int]
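As a rough illustration of what BFS order means for an upward traversal, the sketch below walks a toy graph stored as a plain dictionary of up edges. It does not call pygrgl: the bfs_order helper and the up_edges dictionary are hypothetical stand-ins, and the real function may break ties between nodes differently.

```python
from collections import deque


def bfs_order(up_edges, seed_list):
    """Return node IDs in breadth-first order, visiting each node once."""
    order = []
    seen = set(seed_list)
    queue = deque(seed_list)
    while queue:
        node = queue.popleft()
        order.append(node)
        for parent in up_edges.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


# Toy graph: samples 0-3, internal nodes 4 and 5, root 6.
up_edges = {0: [4], 1: [4], 2: [5], 3: [5], 4: [6], 5: [6]}
bfs_from_samples = bfs_order(up_edges, [0, 1, 2, 3])
```

Seeding with all of the samples visits the whole toy graph level by level, from bottom to top.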

pygrgl.get_dfs_order(grg: _grgl.GRG, direction: _grgl.TraversalDirection, seed_list: list[int], forward_only: bool = False) → list[int]#

Get a list of NodeIDs in depth-first-search (DFS) order, starting from the given seeds and traversing in the provided TraversalDirection (up or down). A node never appears more than one time, unless forward_only is True.

Parameters:

grg (pygrgl.GRG or pygrgl.MutableGRG) – The GRG to get nodes for.
direction (pygrgl.TraversalDirection) – The direction to traverse, up or down.
seed_list (List[int]) – The list of NodeIDs that represent the starting place of the traversal. For example, if you use pygrgl.GRG.get_sample_nodes() and pygrgl.TraversalDirection.UP then the entire graph will be traversed from bottom to top.
forward_only (bool) – If True, enumerates nodes in the given direction from seeds and outputs them in the order they are first visited. If False, enumerates nodes in the order they are visited the _second_ time, i.e. when they are popped off the stack that is used for the depth-first-search. This provides a topological order to the nodes when this parameter is set to False, and does _not_ when set to True. Default is False.

Returns:

The ordered list of NodeIDs.

Return type:

List[int]
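The difference between the first-visit order and the pop order can be seen on a toy graph. The sketch below is pure Python and does not call pygrgl; dfs_orders and the down_edges dictionary are hypothetical. Unlike the documented forward_only=True mode, this toy tracks visited nodes in both orders, so it never repeats a node.

```python
def dfs_orders(down_edges, seed):
    """Return (first_visit_order, pop_order) for a downward DFS from one seed."""
    first_visit, popped, seen = [], [], set()

    def visit(node):
        seen.add(node)
        first_visit.append(node)
        for child in down_edges.get(node, []):
            if child not in seen:
                visit(child)
        popped.append(node)  # recorded when the node leaves the stack

    visit(seed)
    return first_visit, popped


# Node 1 is shared by nodes 4 and 5.
down_edges = {6: [4, 5], 4: [0, 1], 5: [1, 2]}
first_visit, popped = dfs_orders(down_edges, 6)
```

In the pop order every child precedes all of its parents, which is a topological order. In the first-visit order node 1 is emitted before its parent 5, so that order is not topological.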

pygrgl.get_topo_order(grg: _grgl.GRG, direction: _grgl.TraversalDirection, seed_list: list[int]) → list[int]#

Get a list of NodeIDs in topological order, starting from the given seeds and traversing in the provided TraversalDirection (up or down). This order is similar to DFS order, but with the seeds being the “endpoints”. The NodeIDs in this order are guaranteed to have the property: if a NodeID X is at position P in the list, then all of the nodes above/below X (depending on TraversalDirection) have positions before P.

Parameters:

grg (pygrgl.GRG or pygrgl.MutableGRG) – The GRG to get nodes for.
direction (pygrgl.TraversalDirection) – The direction to traverse, up or down.
seed_list (List[int]) – The list of NodeIDs that represent the starting place of the traversal. For example, if you use pygrgl.GRG.get_sample_nodes() and pygrgl.TraversalDirection.UP then the entire graph will be traversed from bottom to top. If you pass in an empty list, then all of the roots or leaves of the graph will be used as seeds.

Returns:

The ordered list of NodeIDs.

Return type:

List[int]
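The ordering guarantee can be checked mechanically. The sketch below tests the upward case (every node appears after the nodes below it) against a toy dictionary of down edges. It is pure Python; is_topological_up and the toy data are hypothetical and only illustrate the stated property.

```python
def is_topological_up(order, down_edges):
    """True if every node in `order` appears after all of its listed children."""
    position = {node: index for index, node in enumerate(order)}
    for node, children in down_edges.items():
        if node not in position:
            continue
        for child in children:
            if child in position and position[child] > position[node]:
                return False
    return True


down_edges = {4: [0, 1], 5: [1, 2], 6: [4, 5]}
good_order = [0, 1, 2, 4, 5, 6]
bad_order = [0, 1, 4, 6, 2, 5]  # node 6 is listed before its child 5
```

Checking only direct children is enough here, because the property then follows for all descendants by transitivity.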

pygrgl.grg_from_trees(filename: str, binary_mutations: bool = False, use_node_times: bool = False, maintain_topology: bool = True, compute_coals: bool = False) → _grgl.MutableGRG#

Convert a .trees (TSKit tree-sequence) file to a GRG.

Parameters:

filename (str) – The tree-sequence (.trees) file to load.
binary_mutations (bool) – Set to True to flatten all mutations to be bi-allelic (optional).
use_node_times (bool) – Mutations will be assigned the time from the node below them, instead of the tskit Mutation object.
maintain_topology (bool) – Default: True. Required for ensuring the correct sample-to-Mutation mapping in the presence of nested (back) Mutations on the local tree. When True, we capture all tree topology changes induced by recombination, not just the changes that result in a different set of samples beneath each Mutation in the tree.
compute_coals (bool) – Compute the per-node coalescence counts. I.e., how many individuals coalesced exactly at the node (separate children have both haploid copies of the individual). This is an expensive computation, slowing down the TS to GRG conversion when there are a lot of samples.

Returns:

The GRG.

Return type:

pygrgl.MutableGRG

pygrgl.grg_to_cyto_json(grg: GRG, start_from=[], show_mutations=True) → Dict[str, Any]#: Create a dictionary that can be used with ipycytoscape to display the graph. Right now this only emits down edges (source is older than target).

pygrgl.hwe_exact_pv(hets_A: int, homs_A: int, other: int) → float#

Hardy-Weinberg equilibrium exact p-value test from Wigginton et al., 2005, “A Note on Exact Tests of Hardy-Weinberg Equilibrium”. This is a slightly modified wrapper of the C code from https://csg.sph.umich.edu/abecasis/Exact/.

Parameters:

hets_A (int) – Number of heterozygotes containing the allele of interest.
homs_A (int) – Number of homozygotes containing the allele of interest.
other (int) – Number of genotypes that do not include the allele of interest at all.

Returns:

The p-value for the two-sided test.

Return type:

float
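For readers who want to see what the test computes, here is a pure-Python sketch of the standard recurrence for the exact Hardy-Weinberg test at a bi-allelic site. It is not the pygrgl implementation. The function name hwe_exact_sketch is hypothetical, and the argument order (heterozygotes, then the two homozygote counts) is an assumption made to mirror the parameters above.

```python
def hwe_exact_sketch(hets, homs_a, homs_b):
    """Two-sided exact HWE p-value for one bi-allelic site (toy version)."""
    n = hets + homs_a + homs_b
    if n == 0:
        return 1.0
    rare = 2 * min(homs_a, homs_b) + hets  # copies of the rarer allele
    probs = [0.0] * (rare + 1)

    # Start near the expected heterozygote count, matching the parity of `rare`.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs[mid] = 1.0

    # Relative probabilities for fewer heterozygotes than `mid`.
    h, hom_r = mid, (rare - mid) // 2
    hom_c = n - h - hom_r
    while h >= 2:
        probs[h - 2] = probs[h] * h * (h - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        h, hom_r, hom_c = h - 2, hom_r + 1, hom_c + 1

    # Relative probabilities for more heterozygotes than `mid`.
    h, hom_r = mid, (rare - mid) // 2
    hom_c = n - h - hom_r
    while h <= rare - 2:
        probs[h + 2] = probs[h] * 4.0 * hom_r * hom_c / ((h + 2) * (h + 1))
        h, hom_r, hom_c = h + 2, hom_r - 1, hom_c - 1

    # Sum every outcome that is no more likely than the observed one.
    total = sum(probs)
    tail = sum(p for p in probs if p <= probs[hets])
    return min(1.0, tail / total)
```

Counts that sit exactly at the Hardy-Weinberg expectation (for example 50 heterozygotes and 25 of each homozygote) give a p-value of 1, while a sample with no heterozygotes at all gives a value close to 0.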

pygrgl.load_immutable_grg(filename: str, load_up_edges: bool = False) → _grgl.GRG#

Load a GRG file from disk. Immutable GRGs are much faster to traverse than mutable GRGs and take up less RAM, so this is the preferred method if you are using a GRG for calculation or annotation, and not modifying the graph structure itself.

Parameters:

filename (str) – The file to load.
load_up_edges (bool) – If True, load both “up” and “down” edges of graph (uses more RAM). Default: False.

Returns:

The GRG.

Return type:

pygrgl.GRG

pygrgl.load_mutable_grg(filename: str, load_up_edges: bool = True) → _grgl.MutableGRG#

Load a GRG file from disk. Mutable GRGs can have nodes and edges added/removed from them.

Parameters:: filename (str) – The file to load.
Returns:: The GRG.
Return type:: pygrgl.MutableGRG

pygrgl.map_mutations(grg: _grgl.MutableGRG, mutations: list[_grgl.Mutation], samples: list[list[int]], verbose: bool = False, mutation_batch_size: int = 0) → _grgl.MutationMappingStats#

Map the provided mutations into a MutableGRG. By default, the entire input is processed as one batch (a single graph traversal; RAM intensive). Set the mutation_batch_size parameter to perform the mapping in batches, which saves RAM. Or you can pass in smaller lists of mutations and call the function multiple times.

Parameters:

grg (pygrgl.MutableGRG) – The MutableGRG that will be modified in-place.
mutations (List[pygrgl.Mutation]) – The list of Mutation objects to insert.
samples (List[List[int]]) – List of sample NodeID lists, parallel to mutations.
verbose (bool) – Emit periodic progress information.
mutation_batch_size (int) – Number of mutations to accumulate before mapping.

Returns:

Mapping statistics.

Return type:

pygrgl.MutationMappingStats
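If you prefer to split the input yourself and call the function several times, the two parallel lists have to be sliced together. A minimal sketch, using plain strings in place of pygrgl.Mutation objects; iter_batches is a hypothetical helper, not part of pygrgl.

```python
def iter_batches(mutations, samples, batch_size):
    """Yield (mutations, samples) chunks of at most batch_size items each."""
    if len(mutations) != len(samples):
        raise ValueError("mutations and samples must be parallel lists")
    for start in range(0, len(mutations), batch_size):
        stop = start + batch_size
        yield mutations[start:stop], samples[start:stop]


# Stand-in data: strings instead of pygrgl.Mutation objects.
toy_mutations = ["m0", "m1", "m2", "m3", "m4"]
toy_samples = [[0], [1, 2], [3], [0, 3], [2]]
toy_batches = list(iter_batches(toy_mutations, toy_samples, 2))
```

Each yielded pair could then be passed as the mutations and samples arguments of one map_mutations call.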

pygrgl.matmul(grg: _grgl.GRG, input: object, direction: _grgl.TraversalDirection, emit_all_nodes: bool = False, by_individual: bool = False, init: object = None, miss: object = None) → numpy.ndarray#

Compute one of two possible matrix multiplications across the entire graph. The input matrix \(V\) can be either \(K \times N\) (\(N\) is the number of samples) or \(K \times M\) (\(M\) is the number of mutations). The given direction determines which input matrix is expected. Let \(X\) be the \(N \times M\) genotype matrix. For a \(K \times N\) input \(V\), the product performed is \(V \times X\) which gives a \(K \times M\) result. I.e., the input matrix is a column per sample and the output matrix is a column per mutation. For a \(K \times M\) input \(V\), the product performed is \(V \times X^T\) which gives a \(K \times N\) result. I.e., the input matrix is a column per mutation and the output matrix is a column per sample.

The simplest case to consider is a vector input (e.g., a \(1 \times N\) matrix). This vector-matrix product in the graph works by seeding the input nodes (samples in this example) with the corresponding values from the input vector and then traversing the graph in the relevant direction (up or down). The ancestor/descendant values are summed at each node, until the terminal nodes (mutations in this example) are reached. The values at the terminal nodes are then the output vector. When a \(K\)-row matrix is input, instead of a vector, the only difference is that each node stores \(K\) values instead of 1.

Note: the RAM used will be \(O(K * nodes)\) where \(nodes\) is the total number of nodes in the graph.

Parameters:

grg (pygrgl.GRG or pygrgl.MutableGRG) – The GRG to perform the computation against.
input (numpy.array) – The numpy 2-dimensional array of input values \(V\).
direction (pygrgl.TraversalDirection) – The direction to traverse, up (input is per sample) or down (input is per mutation).
emit_all_nodes (bool) – False by default. Set to True if you want each output row in the matrix to have a value for every node, not just every sample/mutation (depending on direction).
by_individual (bool) – The dimension that is for samples (either the input or output, depending on the direction parameter) uses individuals instead of haploid samples. Instead of outputting vectors of \(N\) (num_samples) columns, it is \(N / ploidy\) (num_individuals) columns.
init (Union[str, numpy.array]) – Initialization of the nodes of the graph during matrix multiplication. By default (when this is set to None), nodes are initialized to 0. There are three possible types this can take on: 1. A string “xtx” which means to initialize the nodes with twice their coalescence counts. Using this and performing an UP multiplication (with 1s as input) produces the X.T * X product needed for GWAS. 2. A one dimensional numpy array (vector) of length K. The value at position k is assigned to all nodes when performing the multiplication for row k of the input matrix. 3. A two dimensional numpy array (matrix) of size KxT, where T is the total number of nodes in the graph (grg.num_nodes). This fully specifies every node value for the entire matrix operation.
miss (numpy.array) – Optional. This is an _input_ when the direction is DOWN and an _output_ when the direction is UP. In both cases (input or output), it is a 2D matrix that matches the shape of the corresponding input or output matrix. Each value miss[i][j] in the matrix is the “missing data quantity” associated with the input/output row “i” and MutationID “j”. There exists a missingness node in the graph representing the sample set with missing data for each site s. When “miss” is an input then the missingness node is initialized by adding to it the value miss[i][j] for each mutation “j” that corresponds to site “s”. When “miss” is an output, miss[i][j] gets set to the sum of all values at the missingness node. I.e., “miss” behaves exactly the same as the input or output matrix (depending on direction) except that it captures sums for missing genotypes, instead of sums for known Mutations.

Returns:

The numpy 2-dimensional array of output values.

Return type:

numpy.array
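The vector case of the upward product described for matmul can be sketched in pure Python on a toy graph and compared against the explicit genotype matrix. Nothing here calls pygrgl: up_product, the down_edges dictionary, and the mutation_nodes list are hypothetical, and the sketch assumes that children always have smaller NodeIDs than their parents.

```python
def up_product(down_edges, num_nodes, mutation_nodes, v):
    """Compute v @ X for a 1 x N vector v by summing values up the graph."""
    values = [0.0] * num_nodes
    for sample_id, weight in enumerate(v):  # seed the sample nodes
        values[sample_id] = float(weight)
    for node in range(len(v), num_nodes):   # children are numbered before parents
        values[node] = sum(values[child] for child in down_edges.get(node, []))
    return [values[node] for node in mutation_nodes]


# Samples 0-3; node 4 covers {0, 1}, node 5 covers {2, 3}, node 6 covers {0, 1, 2}.
down_edges = {4: [0, 1], 5: [2, 3], 6: [4, 2]}
mutation_nodes = [4, 5, 6]  # mutation j sits on node mutation_nodes[j]
v = [1.0, 2.0, 3.0, 4.0]
graph_result = up_product(down_edges, 7, mutation_nodes, v)

# The same product against the explicit 4 x 3 genotype matrix X.
X = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
dense_result = [sum(v[i] * X[i][j] for i in range(4)) for j in range(3)]
```

Both computations give the same three values, but the graph version reuses the partial sum at node 4 when computing node 6 instead of touching samples 0 and 1 again.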

pygrgl.save_grg(grg: _grgl.GRG, filename: str, allow_simplify: bool = True) → tuple[int, int]#

Save the GRG to disk, simplifying it (if possible) in the process.

Parameters:

grg – The GRG
filename (str) – The file to save to.
allow_simplify (bool) – Set to False to disallow removing nodes/edges from the graph that do not significantly contribute to the mutation-to-samples mapping.

pygrgl.save_subset(grg: _grgl.GRG, filename: str, direction: _grgl.TraversalDirection, seed_list: list[int], bp_range: tuple[int, int] = (0, 0)) → bool#

Save a subset of the GRG to disk, specified by a vector masking either mutation IDs or sample IDs.

Parameters:

grg – The GRG
filename (str) – The file to save to.
direction (pygrgl.TraversalDirection) – Downward means the seeds should be a list of MutationID that should be kept in the graph. Upward means the seeds should be a list of sample NodeID that should be kept.
seed_list (List[int]) – The list of MutationIDs or SampleIDs (NodeIDs of samples) to keep. For samples, the order of this list matters - it defines the order of the samples in the new GRG. For example, [0, 1, 2, 3] keeps the first four samples in their original order, but [3, 2, 1, 0] keeps the same samples but reverses their order in the new GRG. This means that the SampleIDs will change as 0->3, 1->2, 2->1, 3->0. The IndividualIDs mapped to individuals will be properly maintained if present.
bp_range (Tuple[int, int]) – A pair of integers specifying the base-pair range that this GRG covers. This is just meta-data, and does not change the filtering behavior.

Returns:

Whether the output graph was created. If the resulting GRG would be empty (no mutations, or no samples) then it will not save the graph.

Return type:

bool
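The renumbering implied by the order of seed_list for samples can be written down directly. A minimal sketch; sample_id_remap is a hypothetical helper and only reproduces the mapping described for seed_list.

```python
def sample_id_remap(seed_list):
    """Map each kept old SampleID to its SampleID in the new GRG."""
    return {old_id: new_id for new_id, old_id in enumerate(seed_list)}


reversed_map = sample_id_remap([3, 2, 1, 0])
identity_map = sample_id_remap([0, 1, 2, 3])
```

With the reversed list the result maps 0 to 3, 1 to 2, 2 to 1, and 3 to 0, matching the example given for seed_list.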

pygrgl.shared_frontier(grg: _grgl.GRG, direction: _grgl.TraversalDirection, seeds: list[int]) → list[int]#

Get the list of nodes that corresponds to the shared frontier in the graph in the given direction. The frontier is the set of _first_ nodes, along each path, that are reached by all seed nodes in the input.

Do not pass duplicate Node IDs in “seeds”! There is no checking for duplication, but you will get incorrect results if you pass in duplicates.

Parameters:

grg (pygrgl.GRG or pygrgl.MutableGRG) – The GRG to perform the computation against.
direction (pygrgl.TraversalDirection) – The direction to traverse, up (terminate at mutations or shared nodes) or down (terminate at samples or shared nodes).
seeds (List[int]) – List of node IDs to start the search from.

Returns:

A list of node IDs representing the frontier.

Return type:

List[int]
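One way to read the frontier definition, for the upward direction, is sketched below on a toy dictionary of up edges. This is an interpretation, not the pygrgl algorithm: shared_frontier_up is hypothetical, and it treats a node as "first" when it is reached by every seed and none of its children is.

```python
def shared_frontier_up(up_edges, seeds):
    """Toy upward shared frontier over a dict of node -> list of parents."""
    reached_by = {}
    for seed in seeds:
        seen, stack = set(), [seed]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(up_edges.get(node, []))
        for node in seen:
            reached_by[node] = reached_by.get(node, 0) + 1
    shared = {node for node, count in reached_by.items() if count == len(seeds)}
    # A parent of a shared node is not "first": a shared child lies below it.
    has_shared_child = {
        parent for node in shared for parent in up_edges.get(node, [])
    }
    return sorted(shared - has_shared_child)


up_edges = {0: [4], 1: [4], 2: [5], 3: [5], 4: [6], 5: [6], 6: [7]}
frontier_siblings = shared_frontier_up(up_edges, [0, 1])
frontier_cousins = shared_frontier_up(up_edges, [0, 2])
```

Samples 0 and 1 first meet at node 4, while samples 0 and 2 first meet at node 6. Node 7 is reached by all seeds in both cases but is never part of the frontier.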

pygrgl.display.grg_to_cyto(grg: GRG, start_from=[], show_mutations=True) → None#: Return a CytoscapeWidget that can be displayed in Jupyter. This is only practical with small GRGs (maybe 100 samples and 100 variants at most?). Useful for illustration and/or debugging on small tests.

class pygrgl.Mutation#

__init__(self: _grgl.Mutation, position: float, allele: str, ref_allele: str = '', time: float = -1.0) → None#: Construct a new Mutation object, to use as a lookup key or to add to a GRG.

property allele#: (Read-only) Allele value associated with the Mutation. Can be a single nucleotide or a sequence of them.

property position#: (Read-only) Position in the genome. Can be absolute or relative (genomic-distance based or otherwise normalized).

property ref_allele#: (Read-only) Reference allele at the position that this Mutation occurs. Can be empty string if not provided.

property time#: (Read/write) Time value associated with the Mutation, or -1.0 if unused.

class pygrgl.GRG#

A Genotype Representation Graph (GRG) representing a particular dataset. This is the immutable portion of the API, so every graph has these operations. See MutableGRG for an extension of this that includes the ability to add/remove nodes and edges from the graph.

add_individual_id(self: _grgl.GRG, identifier: str) → None#

Add the next string identifier for an individual in the dataset. If the individual IDs are already set, this will throw an exception. This must be called in order, from the 0th to the (N-1)st individual.

Parameters:: identifier (str) – The string identifier for the next individual.

add_mutation(self: _grgl.GRG, mutation: _grgl.Mutation, node_id: int, miss_node_id: int = 2147483647) → int#

Add a new Mutation to the GRG, and associate it with the given NodeID.

Parameters:

mutation (pygrgl.Mutation) – The Mutation object.
node_id (int) – The NodeID to attach the Mutation to.

add_population(self: _grgl.GRG, pop_desc: str) → int#

Add a new population to the GRG, and return the ID associated with it.

Parameters:: pop_desc (str) – The population description/name.
Returns:: The PopulationID.
Return type:: int

property bp_range#

The range in base-pair positions that this GRG covers, from its list of mutations.

A pair (lower, upper) of the range covered by this GRG, where lower is inclusive and upper is exclusive.

calculate_missing_coals(self: _grgl.GRG) → int#

Calculate any missing coalescence information in a GRG. You can use this to compute coalescences for a large graph, though it is likely to scale poorly (RAM-wise). Generally, this is used to “fix up” the coalescences after modifying a GRG – any modifications that break coalescence should be marked with COAL_COUNT_NOT_SET and this will fix them.

Returns:: The number of nodes that had their coalescences updated.
Return type:: int

clear_individual_ids(self: _grgl.GRG) → None#: Remove all individual IDs from the current GRG.

get_down_edges(self: _grgl.GRG, arg0: int) → list[int]#

Get a list of NodeIDs that are connected to the given NodeID, via “down” edges (i.e., children).

Parameters:: node_id (int) – The NodeID to get children for.
Returns:: The children of the given node as a list of NodeIDs.
Return type:: List[int]
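A small traversal built from get_down_edges() and is_sample() collects the samples beneath a node. The function below relies only on those two documented methods. The ToyGRG class is a hypothetical stand-in so the sketch runs without a GRG file, and it assumes sample nodes are the lowest-numbered nodes.

```python
def samples_beneath(grg, node_id):
    """Collect the sample NodeIDs reachable from node_id via down edges."""
    found, seen, stack = [], set(), [node_id]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if grg.is_sample(node):
            found.append(node)
        else:
            stack.extend(grg.get_down_edges(node))
    return sorted(found)


class ToyGRG:
    """Hypothetical stand-in exposing the two methods used above."""

    def __init__(self, down_edges, num_samples):
        self._down = down_edges
        self._num_samples = num_samples

    def get_down_edges(self, node_id):
        return self._down.get(node_id, [])

    def is_sample(self, node_id):
        return node_id < self._num_samples


toy = ToyGRG({4: [0, 1], 5: [1, 2], 6: [4, 5]}, num_samples=3)
```

With a real graph, the same function could be given the object returned by load_immutable_grg().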

get_individual_id(self: _grgl.GRG, individual_index: int) → str#

Get the string identifiers for each of the N individuals in the dataset, if available. These are optional, so the empty list will be returned if the GRG does not have them. See has_individual_ids.

Parameters:: individual_index (int) – The individual to retrieve. The individuals are numbered from 0…(num_individuals-1), and correspond to the sample NodeIDs divided by their ploidy.
Returns:: String identifier for the given individual.
Return type:: str

get_mutation_by_id(self: _grgl.GRG, mut_id: int) → _grgl.Mutation#

Get the Mutation associated with the given MutationID.

Parameters:: mut_id (int) – The MutationID to get the Mutation for.
Returns:: The mutation.
Return type:: pygrgl.Mutation

get_mutation_node_miss(self: _grgl.GRG) → list[tuple[int, int, int]]#

Get a list of triples (MutationID, NodeID, “missingness” NodeID). Each Mutation typically is associated to a single Node, but rarely it can have more than one Node, in which case it will show up in more than one row. Results are ordered by MutationID, ascending.

Returns:: A list of tuples of (MutationID, NodeID, “missingness” NodeID).
Return type:: List[Tuple[int, int, int]]

get_mutation_node_pairs(self: _grgl.GRG) → list[tuple[int, int]]#

Get a list of pairs (MutationID, NodeID). Each Mutation typically is associated to a single Node, but rarely it can have more than one Node, in which case it will show up in more than one pair. Results are ordered by MutationID, ascending.

Returns:: A list of pairs of MutationID and NodeID.
Return type:: List[Tuple[int, int]]
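Because a Mutation can appear in more than one pair, it is often convenient to group the pairs by MutationID. A minimal sketch operating on a plain list of pairs; nodes_by_mutation and the toy data are hypothetical.

```python
from collections import defaultdict


def nodes_by_mutation(pairs):
    """Group (MutationID, NodeID) pairs into MutationID -> [NodeID, ...]."""
    grouped = defaultdict(list)
    for mut_id, node_id in pairs:
        grouped[mut_id].append(node_id)
    return dict(grouped)


toy_pairs = [(0, 12), (1, 7), (1, 9), (2, 12)]
grouped = nodes_by_mutation(toy_pairs)
multi_node = [mut_id for mut_id, nodes in grouped.items() if len(nodes) > 1]
```

Here MutationID 1 is the only one attached to more than one node.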

get_mutations_for_node(self: _grgl.GRG, node_id: int, allow_sort: bool = True) → list[int]#

Get all the (zero or more) Mutations associated with the given NodeID.

Parameters:

node_id (int) – The NodeID to get mutations for.
allow_sort – Allow the mutations-to-node mapping to be resorted; if false and the mapping is not sorted then an exception will be thrown.

Returns:

A list of MutationIDs.

Return type:

List[int]

get_muts_and_miss_for_node(self: _grgl.GRG, node_id: int, allow_sort: bool) → list[tuple[int, int]]#

Get all the (zero or more) Mutations associated with the given NodeID, and for each Mutation the associated missingness node.

Parameters:

node_id (int) – The NodeID to get mutations for.
allow_sort – Allow the mutations-to-node mapping to be resorted; if false and the mapping is not sorted then an exception will be thrown.

Returns:

A list of pairs, (MutationID, NodeID) where the NodeID is for the missingness node.

Return type:

List[Tuple[int, int]]

get_node_mutation_miss(self: _grgl.GRG, allow_sort: bool = True) → _grgl.GRG_NodeMutMissIterator#

Get a list of triples (NodeID, MutationID, “missingness” NodeID). Each Mutation typically is associated to a single Node, but rarely it can have more than one Node, in which case it will show up in more than one row. Results are ordered by NodeID, ascending.

Parameters:: allow_sort – Allow the mutations-to-node mapping to be resorted; if false and the mapping is not sorted then an exception will be thrown.
Returns:: A list of tuples of (NodeID, MutationID, “missingness” NodeID).
Return type:: List[Tuple[int, int, int]]

get_node_mutation_pairs(self: _grgl.GRG, allow_sort: bool = True) → _grgl.GRG_NodeMutIterator#

Get a list of pairs (NodeID, MutationID). Each Mutation typically is associated to a single Node, but rarely it can have more than one Node, in which case it will show up in more than one pair. Results are ordered by NodeID, ascending.

Parameters:: allow_sort – Allow the mutations-to-node mapping to be resorted; if false and the mapping is not sorted then an exception will be thrown.
Returns:: A list of pairs of NodeID and MutationID.
Return type:: List[Tuple[int, int]]

get_num_individual_coals(self: _grgl.GRG, node_id: int) → int#

Get the number of individuals that coalesced at the given node (not below or above).

Parameters:: node_id (int) – The node to retrieve.
Returns:: The number of individuals that coalesced, or pygrgl.COAL_COUNT_NOT_SET.
Return type:: int

get_population_id(self: _grgl.GRG, node_id: int) → int#

Get the population ID for the given node. Can be used to index into the list returned by get_populations().

Parameters:: node_id (int) – The node to retrieve.
Returns:: The population ID.
Return type:: int

get_populations(self: _grgl.GRG) → list[str]#

Get the (possibly empty) list of population descriptions for this GRG.

Returns:: The population descriptions.
Return type:: List[str]

get_root_nodes(self: _grgl.GRG) → list[int]#

Get the NodeIDs for nodes that have no up edges: the roots of the GRG.

Returns:: The list of NodeIDs that are root nodes.
Return type:: List[int]

get_sample_nodes(self: _grgl.GRG) → list[int]#

Get the NodeIDs for the sample nodes.

Returns:: The list of NodeIDs that are sample nodes.
Return type:: List[int]

get_up_edges(self: _grgl.GRG, arg0: int) → list[int]#

Get a list of NodeIDs that are connected to the given NodeID, via “up” edges (i.e., parents).

Parameters:: node_id (int) – The NodeID to get parents for.
Returns:: The parents of the given node as a list of NodeIDs.
Return type:: List[int]

property haploid_shape#: Get the tuple (num_samples, num_mutations), which is the shape of the haploid genotype matrix (0, 1 matrix) represented by the GRG.

property has_individual_coals#: Does this dataset have any individual coalescence information? If not, you won’t be able to quickly get the exact sample variance for each mutation (diploid genotypes only).

property has_individual_ids#: True if this GRG has string identifiers for each individual. See get_individual_id().

property has_missing_data#

Return true if there is any missing data in this GRG.

Returns:: Whether there is missing data.
Return type:: bool

property has_up_edges#: Returns true if this graph has loaded up edges. All graphs have down edges.

property is_phased#

Is the dataset phased?

If not, GRG still represents the data as a mapping between mutations and haploid samples, but you cannot depend on any haplotype-based analyses representing the true haplotypes. Individual-based analyses are still completely accurate.

is_sample(self: _grgl.GRG, arg0: int) → bool#

Returns true if the given NodeID is associated with a sample.

Parameters:: node_id (int) – The NodeID to check.
Returns:: True iff it is a sample node.
Return type:: bool

property mutations_are_ordered#

Returns true if the MutationID order matches the (position, allele) sorted order. That is, MutationID of 0 is the lowest position value and MutationID of num_mutations-1 is the highest. Ties are broken by the lexicographic order of the allele.

If False, you can make this become True by either (1) writing the GRG to disk (save_grg()) and reloading it, or (2) calling sort_mutations() explicitly.
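The (position, allele) order is the ordinary tuple order: position first, with ties broken by the allele string. A minimal sketch on plain tuples rather than pygrgl.Mutation objects; is_position_allele_ordered is a hypothetical helper.

```python
def is_position_allele_ordered(keys):
    """True if (position, allele) keys are in ascending order."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


toy_keys = [(200.0, "T"), (100.0, "G"), (200.0, "A"), (100.0, "C")]
sorted_keys = sorted(toy_keys)  # tuples compare by position first, then allele
```

After sorting, the two mutations at position 100.0 come first, with allele "C" before "G".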

mutations_are_unique(self: _grgl.GRG) → bool#

Check whether the Mutations in GRG are unique: there is a single Mutation (with MutationId) for each unique (position, ref allele, alt allele) combination. Since the mapping between MutationId and nodes is many-to-1 (not many-to-many), having unique mutations means that each Mutation is associated with a single node in the graph. When a GRG is constructed via grg construct this is always true of the resulting GRG. When a GRG is constructed arbitrarily from the API, it may not be true.

This operation is O(M), where M is the number of mutations.

Returns:: True if the mutations are unique.
Return type:: bool

node_has_mutations(self: _grgl.GRG, node_id: int, allow_sort: bool = True) → bool#

Return true if there is one or more Mutations associated with the given NodeID.

Parameters:

node_id (int) – The NodeID to check for mutations.
allow_sort – Allow the mutations-to-node mapping to be resorted; if false and the mapping is not sorted then an exception will be thrown.

Returns:

True if the node has at least one mutation.

Return type:

bool

property nodes_are_ordered#

Returns true if the NodeIDs are already in topological order from the bottom-up. If this is true, then the first S NodeIDs starting at 0 will be the sample Nodes, and then the next N-S NodeIDs will be in order as emitted by a DFS of the GRG starting from the roots and emitting the NodeIDs in post-order.

If this is true, you can often just iterate the NodeIDs from 0… num_nodes instead of performing actual graph traversals, depending on what you are trying to accomplish.
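As an example of replacing a traversal with a plain loop, the sketch below counts the samples beneath every node in one pass over the NodeIDs. It uses only the documented members nodes_are_ordered, num_nodes, num_samples, and get_down_edges(). The ToyOrderedGRG class is a hypothetical stand-in, and the counts assume each sample is reached along a single path from any node above it.

```python
def sample_counts(grg):
    """Count samples beneath every node by a single pass over the NodeIDs."""
    if not grg.nodes_are_ordered:
        raise ValueError("NodeIDs are not in bottom-up topological order")
    counts = [0] * grg.num_nodes
    for node_id in range(grg.num_nodes):
        if node_id < grg.num_samples:
            counts[node_id] = 1  # each sample counts itself
        else:
            # Children have smaller NodeIDs, so their counts are already final.
            counts[node_id] = sum(counts[c] for c in grg.get_down_edges(node_id))
    return counts


class ToyOrderedGRG:
    """Hypothetical stand-in with the four members used above."""

    nodes_are_ordered = True
    num_samples = 4
    num_nodes = 7
    _down = {4: [0, 1], 5: [2, 3], 6: [4, 5]}

    def get_down_edges(self, node_id):
        return self._down.get(node_id, [])


toy_counts = sample_counts(ToyOrderedGRG())
```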

num_down_edges(self: _grgl.GRG, arg0: int) → int#

Count the number of children. This can be more efficient than getting the list of children and computing the length.

Parameters:: node_id (int) – The NodeID to get edge count for.
Returns:: The number of down edges (children) for the node.
Return type:: int

property num_edges#: Return the total number of down edges in the graph. Down and up edges are always symmetric, so the count is the same.

property num_individuals#: The number of individuals in the GRG. The corresponding samples can be found by multiplying the individual index \(I\) by the ploidy: \((ploidy \times I) + 0, (ploidy \times I) + 1, ..., (ploidy \times I) + (ploidy - 1)\)
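The index arithmetic between individuals and their sample NodeIDs is small enough to write out. Both helpers below are hypothetical and only restate the formula above and its inverse.

```python
def samples_for_individual(individual_index, ploidy):
    """Sample NodeIDs that belong to one individual."""
    return [ploidy * individual_index + k for k in range(ploidy)]


def individual_for_sample(sample_node_id, ploidy):
    """Index of the individual that a sample NodeID belongs to."""
    return sample_node_id // ploidy
```

For a diploid dataset, individual 3 owns sample NodeIDs 6 and 7, and both of those map back to individual 3.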

property num_mutations#

Get the total number of mutations in the GRG.

Returns:: The mutation count.
Return type:: int

property num_nodes#: Get the total number of nodes (including sample and mutation nodes) in the GRG.

property num_samples#: The number of sample nodes in the GRG.

num_up_edges(self: _grgl.GRG, arg0: int) → int#

Count the number of parents. This can be more efficient than getting the list of parents and computing the length.

Parameters:: node_id (int) – The NodeID to get edge count for.
Returns:: The number of up edges (parents) for the node.
Return type:: int

property ploidy#: The ploidy of each individual.

remove_mutation(self: _grgl.GRG, mut_id: int, node_id: int) → None#

Mark the Mutation with the given MutationId (mapped to the particular NodeId) as removed. The Mutation information will be cleared out, and the sorted order of the mutations will be invalidated. To restore the sorted order of the mutations either save the GRG to disk and reload it, or call sort_mutations().

Parameters:

mut_id (int) – The MutationId of an existing Mutation.
node_id (int) – The NodeID the Mutation is attached to (or INVALID_NODE_ID if not attached to a node).

set_mutation_by_id(self: _grgl.GRG, mut_id: int, mutation: _grgl.Mutation) → None#

Set the Mutation associated with the given MutationID.

Parameters:

mut_id (int) – The MutationID to set the Mutation for.
mutation (pygrgl.Mutation) – The mutation. Users can associate whatever Mutation they want with a particular ID, but usually this is the same as the previous Mutation at this ID, with some non-essential properties changed, like “time”.

set_num_individual_coals(self: _grgl.GRG, node_id: int, num_coals: int) → None#

Set the number of individuals that coalesced at the given node (not below or above).

Parameters:

node_id (int) – The node to access.
num_coals (int) – The number of individuals that coalesced, or pygrgl.COAL_COUNT_NOT_SET.

set_population_id(self: _grgl.GRG, node_id: int, population_id: int) → None#

Set the population ID for the given node.

Parameters:

node_id (int) – The node to access.
population_id (int) – The population ID to associate with the node.

property shape#: Get the tuple (num_individuals, num_mutations), which is the shape of the genotype matrix represented by the GRG.

sort_mutations(self: _grgl.GRG) → None#: If needed, sort the mutations by their (position, allele) and renumber them so that the MutationId ascending order matches this order. Can be a memory-intensive operation. Alternatively, you can just write the GRG to disk (save_grg()) and then reload it (load_immutable_grg()) and it will use less memory (but be slower).

property specified_bp_range#

The range in base-pair positions that this GRG covers, as specified during the GRG construction. This range may exceed the bp_range if the GRG was constructed from a range of the genome that did not immediately start/end with a mutation.

A pair (lower, upper) of the range covered by this GRG, where lower is inclusive and upper is exclusive.

class pygrgl.MutableGRG#

connect(self: _grgl.MutableGRG, source: int, target: int) → None#

Add a down edge from source to target, and an up edge from target to source. If you create edges that break the topological order of the nodes (where parent node > child node) this will detect it and downstream analyses will have worse performance (nodes_are_ordered will become False).

You must pass in the negative of a NodeID for either source or target, if that node has been created as a negative (see make_node()). If you are inconsistent between make_node() and connect(), for negative nodes, then the behavior of the topological order is undefined.

Parameters:

source (int) – The NodeID for the source node (edge starts here). Can be negative.
target (int) – The NodeID for the target node (edge ends here). Can be negative.

disconnect(self: _grgl.MutableGRG, source: int, target: int) → bool#

Remove the down edge from source to target, and the up edge from target to source.

Parameters:

source (int) – The NodeID for the source node (edge starts here).
target (int) – The NodeID for the target node (edge ends here).

Returns:

Whether the edge was actually deleted. This will only be False if you asked for deletion of a non-existent edge.

Return type:

bool

ensure_unique_mutations(self: _grgl.MutableGRG) → int#

Ensure that every Mutation in the GRG is unique: there is a single Mutation (with MutationId) for each unique (position, ref allele, alt allele) combination. It does so by finding duplicate Mutations, adding a new node that becomes a parent to them, and creating a new (single) mutation representing them all. At the end of this method, the mutations will be unordered, and either serializing the GRG to disk or calling sort_mutations() will be necessary to get the new, correct MutationIds.

This operation is O(M), where M is the number of mutations.

Returns:: The number of nodes that were added to the graph (the number of unique mutations that previously had duplicates).

make_node(self: _grgl.MutableGRG, count: int = 1, negative: bool = False) → int#

Create one or more new nodes in the graph.

Parameters:

count (int) – How many nodes to create (optional, default to 1).
negative (bool) – Treat this node as a negative node, meaning that it is topologically _before_ any other currently existing node. By default, nodes are “positive”, meaning they are topologically _after_ any currently existing node. In both cases (positive and negative), you don’t have to maintain the topological order, and when you connect edges it will detect whether you have maintained the order or not. If you violate the order, then nodes_are_ordered will become False.

Returns:

The NodeID of the first newly created node. If you created count nodes, then the count consecutive NodeIDs starting at this returned value correspond to the new nodes.

Return type:

int

merge(self: _grgl.MutableGRG, other_grg_files: list[str], combine_nodes: bool = True, use_sample_sets: bool = False, verbose: bool = False, ignore_range_violations: bool = False, position_adjust: list[int] = []) → None#

Merge one or more GRGs into this one. Only succeeds if all GRGs have the same number of samples.

This assumes that the GRGs were constructed from the same sampleset – e.g., they could be constructed from two subsets of the same sampleset (as long as both were constructed with the same sample node numbering) or from a subset of mutations against the same sampleset.

The specified range of the resulting GRG will be (min(range of any input), max(range of any input)), even if the provided GRGs do not span a contiguous region. It is up to the caller of this method to ensure that either (1) the span is contiguous or (2) they adjust the specified range appropriately afterwards.

Parameters:

other_grg_files (List[str]) – A list of filenames for the GRG to merge into the current one.
combine_nodes (bool) – True by default. Combine nodes from different GRGs if the node has the same samples beneath it.
use_sample_sets (bool) – False by default. Use the more expensive merging algorithm that combines nodes if they have the same samples beneath. This produces a smaller graph, but loses hierarchy.
verbose (bool) – False by default. Set to True to get more output on stdout.
ignore_range_violations (bool) – False by default. Set to True to allow merging of graphs that overlap in base-pair positions.
position_adjust (List[int]) – Empty list by default. A list of adjustments to make to the positions of the mutations in the other_grg_files. If specified, the length of the list must be the same as the length of other_grg_files. Each position is an offset that is applied to all Mutation positions in the corresponding GRG, before merging it in.

reduce(self: _grgl.MutableGRG) → int#

Make a single pass over the mutable GRG and reduce the size by building hierarchy where there should be some.

Returns:: Number of edges that were removed.
Return type:: int

reduce_until(self: _grgl.MutableGRG, iterations: int = 10, min_dropped: int = 1000, fraction_dropped: float = 0.8, verbose: bool = False) → int#

Reduce the GRG until one of the conditions is met.

Parameters:

mutGRG (pygrgl.MutableGRG) – The MutableGRG that will be modified.
iterations (int) – Maximum number of graph iterations.
min_dropped (int) – Minimum number of edges dropped by a single iteration.
fraction_dropped (float) – Total fraction of edges that have been dropped.
verbose (bool) – Set to True to get stdout information per iteration. Default: False.

Returns:

The number of iterations that were performed.

Return type:

int
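How the three stopping conditions might combine can be sketched as a loop around reduce(). This is an assumption about the semantics, not the pygrgl implementation: here the loop stops once the iteration limit is reached, once a single pass drops fewer than min_dropped edges, or once the total dropped fraction reaches fraction_dropped. reduce_until_sketch and ToyReducible are hypothetical.

```python
def reduce_until_sketch(grg, iterations=10, min_dropped=1000, fraction_dropped=0.8):
    """Call grg.reduce() repeatedly; return how many passes were performed."""
    start_edges = grg.num_edges
    total_dropped = 0
    performed = 0
    while performed < iterations:
        dropped = grg.reduce()
        performed += 1
        total_dropped += dropped
        if dropped < min_dropped:
            break  # this pass removed too few edges to be worth continuing
        if start_edges and total_dropped / start_edges >= fraction_dropped:
            break  # enough of the original edges are gone
    return performed


class ToyReducible:
    """Hypothetical stand-in: each reduce() call removes half of the edges."""

    def __init__(self, num_edges):
        self.num_edges = num_edges

    def reduce(self):
        dropped = self.num_edges // 2
        self.num_edges -= dropped
        return dropped
```

Starting from 10000 edges, the toy drops 5000, 2500, and 1250 edges in successive passes, so the default settings stop after the third pass, when 87.5% of the edges are gone.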

class pygrgl.TraversalDirection#

Members:

DOWN: Traverse the graph via “down” edges.
UP: Traverse the graph via “up” edges.