Create IGDs and Converting IGD to GRG#

This tutorial covers two topics: 1. Creating an IGD file from a .vcf.gz. 2. Converting an IGD file to a GRG

IGD files store referenced-aligned genotype data in a very simple sparse matrix format, without compression. An IGD file is often smaller than the corresponding .vcf.gz file, and is almost always faster (usually 15x or more) to access. See the paper if you’re interested in comparison experiments.

Many datasets are stored in .vcf.gz format by “default.” If these datasets are large, they are usually stored using BGZIP so that they can be indexed for semi-random access. The two different kinds of index files for BGZIP are tabix or bcftools. We only support tabix-style indexes.

What you’ll need:

  • Python dependencies “pygrgl” and “igdtools”: pip install pygrgl igdtools

  • Command line tool “wget” and “tabix”: sudo apt install wget tabix (or your distribution’s equivalent)

Get Dataset#

For our example, we’ll just download a very small simulated dataset that is stored as .vcf.gz.

%%bash

# Download a small example dataset
wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O igd_convert.example.vcf.gz
--2026-05-11 09:49:04--  aprilweilab/grg_pheno_sim
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz [following]
--2026-05-11 09:49:04--  https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 494022 (482K) [application/octet-stream]
Saving to: ‘igd_convert.example.vcf.gz’

     0K .......... .......... .......... .......... .......... 10% 2.25M 0s
    50K .......... .......... .......... .......... .......... 20% 6.40M 0s
   100K .......... .......... .......... .......... .......... 31% 5.05M 0s
   150K .......... .......... .......... .......... .......... 41% 7.50M 0s
   200K .......... .......... .......... .......... .......... 51% 6.98M 0s
   250K .......... .......... .......... .......... .......... 62% 6.04M 0s
   300K .......... .......... .......... .......... .......... 72% 12.1M 0s
   350K .......... .......... .......... .......... .......... 82% 11.2M 0s
   400K .......... .......... .......... .......... .......... 93% 21.0M 0s
   450K .......... .......... .......... ..                   100% 21.6M=0.07s

2026-05-11 09:49:04 (6.47 MB/s) - ‘igd_convert.example.vcf.gz’ saved [494022/494022]

Convert to IGD#

If we want to convert from .vcf.gz to IGD with a single thread, then we do not need a tabix index. However, if we want to use multiple threads (i.e., -j 2 below) then a tabix index will provide a performance improvement. So here we first tabix index the dataset, and then convert to IGD with igdtools.

%%bash

# Index the file. Note: usually your dataset will come with an index already.
tabix igd_convert.example.vcf.gz

# -j controls how many threads to use.
igdtools -j 2 igd_convert.example.vcf.gz -o igd_convert.example.igd
Wrote 5446 total variants
Of which 3170 were written sparsely
Wrote 5447 total variants
Of which 3058 were written sparsely

Convert to GRG#

If we have an IGD file (either because we converted it, or that is how our dataset came) then we can construct a GRG easily:

%%bash

# -j controls how many threads to use.
grg construct -j 1 igd_convert.example.igd -o igd_convert.example.grg
Processing input file in 85 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 85/85 [00:00<00:00, 203.33it/s]
Merging...
=== GRG Statistics ===
Nodes: 15481
Edges: 93351
Samples: 400
Mutations: 10893
Ploidy: 2
Phased: true
Populations: 0
Range of mutations: 55829 - 9999127
Specified range: 0 - 10894
======================
Wrote simplified GRG with:
  Nodes: 15481
  Edges: 93351
Wrote GRG to igd_convert.example.grg

What if I have metadata?#

IGD and GRG have the general philosophy that a lot of metadata is kept separate from the genotype data. The metadata that the two formats contain natively are: * IGD: Individual identifiers (one per sampled individual, not haplotype) - see pyigd.IGDReader.get_individual_ids. * IGD: Variant identifiers (one per variant) - see pyigd.IGDReader.get_variant_ids. * GRG: Individual identifiers (same as IGD) - see pygrgl.GRG.get_individual_id.

Beyond that, it is suggested that you keep metadata in a simple format. For example, igdtools supports exporting metadata to .txt files in a format that is loadable by numpy.loadtxt. This metadata can be accessed by the index of the variant (mutation), \(i\), and you can also keep a mapping from variant identifier \(v_i\) to \(i\) so that you can easily lookup other metadata by the variant id (igdtools exports such a mapping).