Create IGDs and Converting IGD to GRG ===================================== This tutorial covers two topics: 1. Creating an IGD file from a ``.vcf.gz``. 2. Converting an IGD file to a GRG IGD files store referenced-aligned genotype data in a very simple sparse matrix format, without compression. An IGD file is often smaller than the corresponding ``.vcf.gz`` file, and is almost always faster (usually ``15x`` or more) to access. See `the paper `__ if you’re interested in comparison experiments. Many datasets are stored in ``.vcf.gz`` format by “default.” If these datasets are large, they are usually stored using `BGZIP `__ so that they can be indexed for semi-random access. The two different kinds of index files for ``BGZIP`` are `tabix `__ or `bcftools `__. We only support ``tabix``-style indexes. **What you’ll need:** - Python dependencies “pygrgl” and “igdtools”: ``pip install pygrgl igdtools`` - Command line tool “wget” and “tabix”: ``sudo apt install wget tabix`` (or your distribution’s equivalent) Get Dataset ~~~~~~~~~~~ For our example, we’ll just download a very small simulated dataset that is stored as ``.vcf.gz``. .. code:: bash %%bash # Download a small example dataset wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O igd_convert.example.vcf.gz .. parsed-literal:: --2026-05-11 09:49:04-- https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz Resolving github.com (github.com)... 140.82.112.3 Connecting to github.com (github.com)|140.82.112.3|:443... connected. HTTP request sent, awaiting response... 302 Found Location: https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz [following] --2026-05-11 09:49:04-- https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 494022 (482K) [application/octet-stream] Saving to: ‘igd_convert.example.vcf.gz’ 0K .......... .......... .......... .......... .......... 10% 2.25M 0s 50K .......... .......... .......... .......... .......... 20% 6.40M 0s 100K .......... .......... .......... .......... .......... 31% 5.05M 0s 150K .......... .......... .......... .......... .......... 41% 7.50M 0s 200K .......... .......... .......... .......... .......... 51% 6.98M 0s 250K .......... .......... .......... .......... .......... 62% 6.04M 0s 300K .......... .......... .......... .......... .......... 72% 12.1M 0s 350K .......... .......... .......... .......... .......... 82% 11.2M 0s 400K .......... .......... .......... .......... .......... 93% 21.0M 0s 450K .......... .......... .......... .. 100% 21.6M=0.07s 2026-05-11 09:49:04 (6.47 MB/s) - ‘igd_convert.example.vcf.gz’ saved [494022/494022] Convert to IGD -------------- If we want to convert from ``.vcf.gz`` to IGD with a single thread, then we do not need a ``tabix`` index. However, if we want to use multiple threads (i.e., ``-j 2`` below) then a tabix index will provide a performance improvement. So here we first ``tabix`` index the dataset, and then convert to IGD with ``igdtools``. .. code:: bash %%bash # Index the file. Note: usually your dataset will come with an index already. tabix igd_convert.example.vcf.gz # -j controls how many threads to use. igdtools -j 2 igd_convert.example.vcf.gz -o igd_convert.example.igd .. parsed-literal:: Wrote 5446 total variants Of which 3170 were written sparsely Wrote 5447 total variants Of which 3058 were written sparsely Convert to GRG -------------- If we have an IGD file (either because we converted it, or that is how our dataset came) then we can construct a GRG easily: .. code:: bash %%bash # -j controls how many threads to use. grg construct -j 1 igd_convert.example.igd -o igd_convert.example.grg .. parsed-literal:: Processing input file in 85 parts. Auto-calculating number of trees per part. Converting segments of input data to graphs 100%|██████████| 85/85 [00:00<00:00, 203.33it/s] Merging... .. parsed-literal:: === GRG Statistics === Nodes: 15481 Edges: 93351 Samples: 400 Mutations: 10893 Ploidy: 2 Phased: true Populations: 0 Range of mutations: 55829 - 9999127 Specified range: 0 - 10894 ====================== Wrote simplified GRG with: Nodes: 15481 Edges: 93351 Wrote GRG to igd_convert.example.grg What if I have metadata? ------------------------ IGD and GRG have the general philosophy that a lot of metadata is kept separate from the genotype data. The metadata that the two formats contain natively are: \* IGD: Individual identifiers (one per sampled individual, not haplotype) - see `pyigd.IGDReader.get_individual_ids `__. \* IGD: Variant identifiers (one per variant) - see `pyigd.IGDReader.get_variant_ids `__. \* GRG: Individual identifiers (same as IGD) - see `pygrgl.GRG.get_individual_id `__. Beyond that, it is suggested that you keep metadata in a simple format. For example, ``igdtools`` supports `exporting metadata to .txt files `__ in a format that is loadable by `numpy.loadtxt `__. This metadata can be accessed by the index of the variant (mutation), :math:`i`, and you can also keep a mapping from variant identifier :math:`v_i` to :math:`i` so that you can easily lookup other metadata by the variant id (``igdtools`` exports such a mapping). Related Topics -------------- - The `igdtools documentation `__ - An overview of the `IGD file format `__