Matrix multiplication with GRG#

GRG is a graph representation of the genotype matrix. GRG’s internal nodes maintain a hierarchy between Mutations and samples, making it significantly more compact than even the sparse representation of the genotype matrix. Non-GRG methods typically have to perform matrix multiplication block-wise, because the dense/sparse genotype matrix cannot fit in RAM, whereas GRG does not usually require block-wise computation.

Let \(N_H\) be the number of haploid samples, \(N\) be the number of individual samples, and \(M\) be the number of mutations. There are two different genotype matrices that we can consider:

The haploid genotype matrix (we denote as \(X\)) which has entries of either \(0\) or \(1\), and dimensions \(N_H \times M\).
The diploid genotype matrix (we denote as \(X_u\)) which has entries of either \(0\), \(1\), or \(2\), and dimensions \(N \times M\).

GRGs can be constructed from phased data or unphased data. Haploid matrix operations can only be used with phased data, but the diploid matrix operations can be used with either phased or unphased data.

For some applications, it might be useful to treat \(M\) as the number of polymorphic sites as opposed to the number of mutations. When the data is bi-allelic, these are equivalent. When the data is multi-allelic, users can sum the results for the mutations that share sites.

See the grapp preprint for a detailed description of how the matrix multiplication works as a graph traversal. Practically speaking, given a matrix \(A\) that is of dimension \(K \times N\) or a matrix \(B\) that is \(K \times M\), you can use pygrgl.matmul() to get the matrix product \(A \times X_u\) or \(B \times X_u^T\) (or \(B \times X^T\)). The parameter by_individual controls whether the multiplication is against \(X_u\) (by_individual=True) or \(X\) (by_individual=False).

See pygrgl.matmul() for more details on usage.

Allele Frequency#

Allele frequency calculates for each allele the value \(\frac{a}{N_H}\), where \(a\) is the number of samples containing the particular allele. This can be computed as a bottom-up graph traversal where every sample node is given the initial value \(\frac{1}{N_H}\) and the sums are propagated upwards until they reach mutation nodes. The value saved at the mutation node is then the allele frequency.

To calculate allele frequency using pygrgl.matmul() we construct an input vector \(\overrightarrow{v}\) of length \(N_H\) with \(\frac{1}{N_H}\) for every value. The resulting vector \(\overrightarrow{y}\) (of length \(M\)) is the allele frequency for each mutation. Because matmul expects matrix input/output (not vector), we use row vectors.

import numpy as np
import pygrgl

grg = pygrg.load_immutable_grg("test.grg")
input = np.ones( (1, grg.num_samples) ) / grg.num_samples
result = pygrgl.matmul(grg, input, pygrgl.TraversalDirection.UP)[0]

The above causes pygrgl to perform floating point addition at every node in the graph. Instead, we can just get the count of every allele and then divide by grg.num_samples afterwards, which will be much faster (if we use the correct numpy.dtype).

import numpy as np
import pygrgl

grg = pygrg.load_immutable_grg("test.grg")
input = np.ones( (1, grg.num_samples), dtype=np.uint32 )
result = pygrgl.matmul(grg, input, pygrgl.TraversalDirection.UP)[0] / grg.num_samples

For very large datasets, using integers can make a non-trvial performance difference.

Hamming Distance#

To compute the Hamming distance between two samples (i.e., how many variants differ between them), we can also use matmul. The operation that we can do efficiently is a 1-vs-all query, which can be used for finding the nearest neighbors to a sample, for example. Given a GRG containing the sample set \(S\), consider a query sample (haplotype) \(q \in S\). Then to find the Hamming distance between \(q\) and all of \(S\) we perform two matmul operations:

Create a vector \(v\) which is all 0, except that \(v_q = 1\) (the position associated with sample \(q\)). Perform an upward matmul with \(v\) to get result \(h\), which is \(q\)’s haplotype.
Perform a matmul in the downward direction with input \(1 - (2 \times h)\). Given \(r\), the result of the matrix multiplication, the Hamming distance is then \(r + \sum h\).

import numpy as np
import pygrgl

grg = pygrg.load_immutable_grg("test.grg")
q = 0  # Use the 0th haplotype as the query.

# Get q's haplotype
v = np.zero( (1, grg.num_samples), dtype=np.int32 )
v[q] = 1
h = pygrgl.matmul(grg, v, pygrgl.TraversalDirection.UP)

# Get all N_H hamming distances (note: hamming_q[q] = 0)
r = pygrgl.matmul(grg, np.ones( (1, grgl.num_mutations) ) - (2 * h), pygrgl.TraversalDirection.DOWN)[0]
hamming_q = r + numpy.sum(h)

Explanation#

The Hamming distance between two binary sequences can be described in terms of sets of elements, where the presence of that element is a \(1\) and absence is a \(0\). Then \(H(A, B) = |A| + |B| - 2|A \cap B|\), i.e., the total number of \(1\) s minus twice the shared \(1\) s. Given \(q\), our query, and all other samples \(b\), we get \(H(q, b) = |q| + |b| - 2|q \cap b|\) where \(|b| - 2|q \cap b|\) is computed by the second matmul above. The dot product between the query haplotype and the genotype matrix, \(hX\) produces the sum of the intersection between \(q\) and each other haplotype. However, we perform \((1 - 2h)X = 1X - 2hX\) which gives us the sum of each haplotype (i.e., \(|b|\) for each \(b\)) minus twice the intersection. Then we just need to add \(|q|\) (numpy.sum(h)) to get the hamming distance.

Transformation of the genotype matrix#

The grapp library provides instantiations of scipy.sparse.linalg.LinearOperator that perform matrix multiplication against transformations of the genotype matrix. Some of the more useful ones are:

grapp.linalg.SciPyStdXOperator - the standardized genotype matrix.
grapp.linalg.SciPyStdXTXOperator - the \(M \times M\) covariance matrix based on the standardized genotype matrix.
grapp.linalg.SciPyStdXXTOperator - the \(N \times N\) genetic related matrix (GRM), based on the standardized genotype matrix.

There are unstandardized versions of these operators as well.

Whole genome (all chromosomes)#

The “multi” operators in grapp behave the same way as the other operators, except they operate on a list of GRGs instead of a single one. Typically, a GRG is created for each chromosome, and then whole genome operations use a “multi” operator.