Converting .vcf.gz to GRG
=========================

Many datasets are stored in ``.vcf.gz`` format by “default.” If these
datasets are large, they are usually stored using `BGZIP `__ so that
they can be indexed for semi-random access. The two kinds of index
files for ``BGZIP`` are `tabix `__ and `bcftools `__.

In this tutorial we’ll show the (very simple) process of converting
``.vcf.gz`` data to ``GRG`` format, which is much smaller (usually at
least ``25x`` smaller) and faster (many orders of magnitude) for
performing computations.

**What you’ll need:**

-  Python dependencies “pygrgl”: ``pip install pygrgl``
-  Command line tools “wget” and “tabix”:
   ``sudo apt install wget tabix`` (or your distribution’s equivalent)

Get Dataset
~~~~~~~~~~~

For our example, we’ll just download a very small simulated dataset that
is stored as ``.vcf.gz``.

.. code:: bash

    %%bash

    # Download a small example dataset
    wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O vcf_convert.example.vcf.gz

.. parsed-literal::

    --2026-05-11 09:46:08--  https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz
    Resolving github.com (github.com)... 140.82.113.3
    Connecting to github.com (github.com)|140.82.113.3|:443... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz [following]
    --2026-05-11 09:46:08--  https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
    HTTP request sent, awaiting response...
    200 OK
    Length: 494022 (482K) [application/octet-stream]
    Saving to: ‘vcf_convert.example.vcf.gz’

         0K .......... .......... .......... .......... .......... 10% 2.43M 0s
        50K .......... .......... .......... .......... .......... 20% 7.11M 0s
       100K .......... .......... .......... .......... .......... 31% 3.88M 0s
       150K .......... .......... .......... .......... .......... 41% 7.37M 0s
       200K .......... .......... .......... .......... .......... 51% 26.7M 0s
       250K .......... .......... .......... .......... .......... 62% 10.3M 0s
       300K .......... .......... .......... .......... .......... 72% 5.27M 0s
       350K .......... .......... .......... .......... .......... 82% 14.6M 0s
       400K .......... .......... .......... .......... .......... 93% 10.1M 0s
       450K .......... .......... .......... ..                    100% 7.97M=0.07s

    2026-05-11 09:46:09 (6.35 MB/s) - ‘vcf_convert.example.vcf.gz’ saved [494022/494022]

Convert to GRG
--------------

Let's first attempt to convert to GRG without having an index for the
file.

.. code:: bash

    %%bash

    # -j controls how many threads to use.
    grg construct -j 1 vcf_convert.example.vcf.gz -o vcf_convert.example.grg || true

.. parsed-literal::

    Will not count variants in VCF files (too slow)
    Could not count number of variants in vcf_convert.example.vcf.gz. Using the default of 100 (use --parts to override).
    Processing input file in 100 parts.
    Auto-calculating number of trees per part.
    Converting segments of input data to graphs
      0%|          | 0/100 [00:00.
    """

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/ddehaas/Py3Env/bin/grg", line 6, in
        sys.exit(main())
      File "/home/ddehaas/GrgProject/public/grgl/pygrgl/cli.py", line 70, in main
        construct.from_tabular(args)
      File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/construct.py", line 422, in from_tabular
        list(
      File "/home/ddehaas/Py3Env/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
        for obj in iterable:
      File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
        raise value
    subprocess.CalledProcessError: Command '['/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/../../grgl', 'vcf_convert.example.vcf.gz', '--trees', 'optimal', '--lf-no-tree', '10', '--reduce', '5', '-r', '0.0:0.01', '-o', 'vcf_convert.example.grg.part0.grg']' died with .
    WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
    terminate called after throwing an instance of 'grgl::ApiMisuseFailure'
      what():  WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.

The ``grg`` tool did not like that. Why?

``WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.``

For large datasets, converting a ``.vcf.gz`` to GRG without an index
will be very slow. This warning (and the failure that comes with it) is
there to prevent you from doing this by accident. However, we know that
our dataset is really small, so the slowdown does not matter here: we
can use the ``--force`` flag to force GRG construction.

.. code:: bash

    %%bash

    # -j controls how many threads to use.
    grg construct --force -j 1 vcf_convert.example.vcf.gz -o vcf_convert.example.grg

.. parsed-literal::

    Will not count variants in VCF files (too slow)
    Could not count number of variants in vcf_convert.example.vcf.gz. Using the default of 100 (use --parts to override).
    Processing input file in 100 parts.
    Auto-calculating number of trees per part.
    Converting segments of input data to graphs
      0%|          | 0/100 [00:00

`__ first and then convert to GRG. IGD files can be substantially faster
to access than ``.vcf.gz`` files. See `Converting IGD to GRG `__.
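
Because ``grg construct`` refuses un-indexed input unless ``--force`` is
given, it can help to inspect a ``.vcf.gz`` file before starting a
conversion. The sketch below is a hypothetical pre-flight helper and is
not part of ``pygrgl``: the function names ``is_bgzf``,
``find_vcf_index``, and ``preflight`` are invented for illustration. It
assumes that a BGZIP file begins with the gzip header that ``bgzip``
writes (the ``FEXTRA`` flag plus a ``BC`` extra subfield), and that an
index sits next to the data file with a ``.tbi`` (tabix) or ``.csi``
(the default output of ``bcftools index``) suffix. Whether
``grg construct`` accepts both index types is not stated in this
tutorial, so treat the ``.csi`` case as an assumption.

.. code:: bash

    #!/usr/bin/env bash
    # Hypothetical pre-flight checks for a .vcf.gz file (not part of pygrgl).

    # Return 0 if the file starts with a BGZF block header.
    # Assumed layout: gzip magic 1f 8b, method 08, flags 04 (FEXTRA),
    # then at bytes 12-15 the extra subfield 'B' 'C' with length 2.
    is_bgzf() {
        local header
        # Dump the first 16 bytes as one contiguous lowercase hex string.
        header=$(head -c 16 "$1" | od -An -v -tx1 | tr -d ' \n')
        # Bytes 4-11 (mtime, xfl, os, xlen) are not checked: 16 wildcards.
        [[ "$header" == 1f8b0804????????????????42430200 ]]
    }

    # Print the first index file found next to the VCF; return 1 if none.
    find_vcf_index() {
        local vcf=$1 ext
        for ext in tbi csi; do
            if [[ -f "$vcf.$ext" ]]; then
                printf '%s\n' "$vcf.$ext"
                return 0
            fi
        done
        return 1
    }

    # Report whether the file is BGZF-compressed and has an index.
    # Returns 1 for "not BGZF" and 2 for "no index found".
    preflight() {
        local vcf=$1 index
        if ! is_bgzf "$vcf"; then
            echo "$vcf: not BGZF-compressed, so it cannot be indexed" >&2
            return 1
        fi
        if index=$(find_vcf_index "$vcf"); then
            echo "$vcf: found index $index"
        else
            echo "$vcf: no .tbi or .csi index found" >&2
            return 2
        fi
    }

Calling ``preflight vcf_convert.example.vcf.gz`` after the download step
would report whether the file is BGZIP-compressed and whether an index
is present. If only the index is missing,
``tabix -p vcf vcf_convert.example.vcf.gz`` would create a ``.tbi`` file
next to the data, provided the file is BGZIP-compressed and sorted by
position.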