Creating IGDs and Converting IGD to GRG#
This tutorial covers two topics:
1. Creating an IGD file from a .vcf.gz file.
2. Converting an IGD file to a GRG.
IGD files store reference-aligned genotype data in a very simple sparse
matrix format, without compression. An IGD file is often smaller than
the corresponding .vcf.gz file, and is almost always faster (usually
15x or more) to access. See the
paper if you’re
interested in comparison experiments.
Many datasets are stored in .vcf.gz format by “default.” If these
datasets are large, they are usually compressed with
BGZIP so that they can be
indexed for semi-random access. BGZIP-compressed files can have two
different kinds of index files, produced by tabix or by
bcftools. We
only support tabix-style indexes.
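Because indexing only works on BGZIP-compressed input, it can help to check whether a given .vcf.gz was written by BGZIP rather than plain gzip. The following is a minimal sketch, not part of igdtools: the helper name `is_bgzf` is hypothetical, and it assumes the BGZIP block header layout in which the first gzip "extra" subfield is tagged `BC`.

```python
import os
import struct


def is_bgzf(path):
    """Return True if the file appears to start with a BGZIP (BGZF) block header."""
    with open(path, "rb") as f:
        header = f.read(18)
    if len(header) < 18:
        return False
    # Standard gzip fields: magic bytes 0x1f 0x8b, deflate method (8),
    # and the FEXTRA flag (bit value 4) must be set for BGZF.
    if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
        return False
    if not (header[3] & 4):
        return False
    # Bytes 10-15: XLEN, then the first extra subfield (SI1, SI2, SLEN).
    # Assumption: the 'B','C' subfield comes first, with a 2-byte payload.
    xlen, si1, si2, slen = struct.unpack("<HBBH", header[10:16])
    return xlen >= 6 and (si1, si2, slen) == (ord("B"), ord("C"), 2)


# Usage on the tutorial's file name, if it has already been downloaded.
if os.path.exists("igd_convert.example.vcf.gz"):
    print("BGZIP-compressed:", is_bgzf("igd_convert.example.vcf.gz"))
```

A file that fails this check can still be read by single-threaded tools that accept plain gzip, but it would likely need to be recompressed with bgzip before tabix can index it.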
What you’ll need:
* Python dependencies “pygrgl” and “igdtools”:
  pip install pygrgl igdtools
* Command line tools “wget” and “tabix”:
  sudo apt install wget tabix
  (or your distribution’s equivalent)
Get Dataset#
For our example, we’ll just download a very small simulated dataset that
is stored as .vcf.gz.
%%bash
# Download a small example dataset
wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O igd_convert.example.vcf.gz
--2026-05-11 09:49:04--  https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz [following]
--2026-05-11 09:49:04--  https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 494022 (482K) [application/octet-stream]
Saving to: ‘igd_convert.example.vcf.gz’

     0K .......... .......... .......... .......... .......... 10% 2.25M 0s
    50K .......... .......... .......... .......... .......... 20% 6.40M 0s
   100K .......... .......... .......... .......... .......... 31% 5.05M 0s
   150K .......... .......... .......... .......... .......... 41% 7.50M 0s
   200K .......... .......... .......... .......... .......... 51% 6.98M 0s
   250K .......... .......... .......... .......... .......... 62% 6.04M 0s
   300K .......... .......... .......... .......... .......... 72% 12.1M 0s
   350K .......... .......... .......... .......... .......... 82% 11.2M 0s
   400K .......... .......... .......... .......... .......... 93% 21.0M 0s
   450K .......... .......... .......... ..                   100% 21.6M=0.07s

2026-05-11 09:49:04 (6.47 MB/s) - ‘igd_convert.example.vcf.gz’ saved [494022/494022]
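Before converting, it can be useful to sanity-check the downloaded file. The sketch below uses only Python's standard `gzip` module (which can read BGZIP files, since they are valid multi-member gzip) to count the sample columns and variant records. The helper name `summarize_vcf` is hypothetical. Its record count is the number of VCF data lines, which may differ from the variant count reported by other tools if a line lists several alternate alleles.

```python
import gzip
import os


def summarize_vcf(path):
    """Count sample columns and data records in a gzip/BGZIP-compressed VCF."""
    num_samples = 0
    num_records = 0
    with gzip.open(path, "rt") as f:
        for line in f:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # The first 9 columns (CHROM through FORMAT) are fixed;
                # every column after them is one sampled individual.
                columns = line.rstrip("\n").split("\t")
                num_samples = max(0, len(columns) - 9)
            elif line.strip():
                num_records += 1
    return num_samples, num_records


# Usage on the tutorial's file name, if it has already been downloaded.
if os.path.exists("igd_convert.example.vcf.gz"):
    samples, records = summarize_vcf("igd_convert.example.vcf.gz")
    print(f"{samples} individuals, {records} VCF records")
```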
Convert to IGD#
If we want to convert from .vcf.gz to IGD with a single thread, then
we do not need a tabix index. However, if we want to use multiple
threads (e.g., -j 2 below) then a tabix index will provide a
performance improvement. So here we first index the dataset with tabix,
and then convert it to IGD with igdtools.
%%bash
# Index the file. Note: usually your dataset will come with an index already.
tabix igd_convert.example.vcf.gz
# -j controls how many threads to use.
igdtools -j 2 igd_convert.example.vcf.gz -o igd_convert.example.igd
Wrote 5446 total variants
Of which 3170 were written sparsely
Wrote 5447 total variants
Of which 3058 were written sparsely
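As the comment in the cell above notes, datasets usually ship with an index, in which case the `tabix` step can be skipped. A minimal sketch of such a check follows. It assumes the index sits next to the data file with a `.tbi` suffix (tabix's default naming), and the helper name `tabix_index_status` is hypothetical. It also flags an index that is older than the data file, since that usually means the data was rewritten after indexing.

```python
import os


def tabix_index_status(vcf_path):
    """Return 'missing', 'stale', or 'ok' for the .tbi index next to vcf_path."""
    index_path = vcf_path + ".tbi"  # assumed default tabix index name
    if not os.path.exists(index_path):
        return "missing"
    # An index older than the data file probably does not match its contents.
    if os.path.getmtime(index_path) < os.path.getmtime(vcf_path):
        return "stale"
    return "ok"


# Usage on the tutorial's file name, if it is present.
if os.path.exists("igd_convert.example.vcf.gz"):
    print("index status:", tabix_index_status("igd_convert.example.vcf.gz"))
```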
Convert to GRG#
If we have an IGD file (either because we converted it ourselves or because the dataset was distributed that way), then we can construct a GRG easily:
%%bash
# -j controls how many threads to use.
grg construct -j 1 igd_convert.example.igd -o igd_convert.example.grg
Processing input file in 85 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 85/85 [00:00<00:00, 203.33it/s]
Merging...
=== GRG Statistics ===
Nodes: 15481
Edges: 93351
Samples: 400
Mutations: 10893
Ploidy: 2
Phased: true
Populations: 0
Range of mutations: 55829 - 9999127
Specified range: 0 - 10894
======================
Wrote simplified GRG with:
Nodes: 15481
Edges: 93351
Wrote GRG to igd_convert.example.grg
What if I have metadata?#
IGD and GRG share the general philosophy that a lot of metadata is kept separate from the genotype data. The metadata that the two formats store natively is:
* IGD: Individual identifiers (one per sampled individual, not per haplotype) - see pyigd.IGDReader.get_individual_ids.
* IGD: Variant identifiers (one per variant) - see pyigd.IGDReader.get_variant_ids.
* GRG: Individual identifiers (same as IGD) - see pygrgl.GRG.get_individual_id.
Beyond that, we suggest keeping metadata in a simple format.
For example, igdtools supports exporting metadata to .txt
files
in a format that is loadable by
numpy.loadtxt.
This metadata can be accessed by the index of the variant (mutation),
\(i\), and you can also keep a mapping from variant identifier
\(v_i\) to \(i\) so that you can easily look up other metadata by
the variant ID (igdtools exports such a mapping).
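The lookup pattern described here can be sketched in plain Python. Everything specific in this example is an assumption made for illustration: the file names, the one-value-per-line layout (line \(i\) holds the value for variant index \(i\)), and the variant IDs `rs1` to `rs3`. The actual files exported by igdtools may be laid out differently, so check them before relying on this.

```python
import os
import tempfile


def read_column(path):
    """Read a text file with one value per line; line i belongs to variant index i."""
    with open(path) as f:
        return f.read().splitlines()


def build_id_to_index(variant_ids):
    """Map each variant identifier v_i to its index i (first occurrence wins)."""
    id_to_index = {}
    for i, variant_id in enumerate(variant_ids):
        id_to_index.setdefault(variant_id, i)
    return id_to_index


# Demo with small hypothetical metadata files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    ids_path = os.path.join(tmp, "variant_ids.txt")    # hypothetical file name
    freq_path = os.path.join(tmp, "allele_freq.txt")   # hypothetical file name
    with open(ids_path, "w") as f:
        f.write("rs1\nrs2\nrs3\n")
    with open(freq_path, "w") as f:
        f.write("0.10\n0.25\n0.40\n")
    variant_ids = read_column(ids_path)
    allele_freq = [float(x) for x in read_column(freq_path)]

id_to_index = build_id_to_index(variant_ids)
i = id_to_index["rs2"]
print(i, allele_freq[i])  # prints: 1 0.25
```

Keeping every metadata file aligned to the same variant index means one dictionary lookup is enough to reach any of them.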