Converting .vcf.gz to GRG#
Many datasets are stored in .vcf.gz format by “default.” If these
datasets are large, they are usually stored using
BGZIP so that they can be
indexed for semi-random access. The two different kinds of index files
for BGZIP are tabix or
bcftools.
In this tutorial we’ll show the (very simple) process of converting
.vcf.gz data to GRG format, which is much smaller (usually at
least 25x smaller) and faster (many orders of magnitude) for
performing computations.
What you’ll need:
Python dependencies “pygrgl”:
pip install pygrglCommand line tools “wget” and “tabix”:
sudo apt install wget tabix(or your distribution’s equivalent)
Get Dataset#
For our example, we’ll just download a very small simulated dataset that
is stored as .vcf.gz.
%%bash
# Download a small example dataset
wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O vcf_convert.example.vcf.gz
--2026-05-11 09:46:08-- aprilweilab/grg_pheno_sim Resolving github.com (github.com)... 140.82.113.3 Connecting to github.com (github.com)|140.82.113.3|:443... connected. HTTP request sent, awaiting response... 302 Found Location: https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz [following] --2026-05-11 09:46:08-- https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 494022 (482K) [application/octet-stream] Saving to: ‘vcf_convert.example.vcf.gz’ 0K .......... .......... .......... .......... .......... 10% 2.43M 0s 50K .......... .......... .......... .......... .......... 20% 7.11M 0s 100K .......... .......... .......... .......... .......... 31% 3.88M 0s 150K .......... .......... .......... .......... .......... 41% 7.37M 0s 200K .......... .......... .......... .......... .......... 51% 26.7M 0s 250K .......... .......... .......... .......... .......... 62% 10.3M 0s 300K .......... .......... .......... .......... .......... 72% 5.27M 0s 350K .......... .......... .......... .......... .......... 82% 14.6M 0s 400K .......... .......... .......... .......... .......... 93% 10.1M 0s 450K .......... .......... .......... .. 100% 7.97M=0.07s 2026-05-11 09:46:09 (6.35 MB/s) - ‘vcf_convert.example.vcf.gz’ saved [494022/494022]
Convert to GRG#
Lets first attempt to convert to GRG without having an index for the file.
%%bash
# -j controls how many threads to use.
grg construct -j 1 vcf_convert.example.vcf.gz -o vcf_convert.example.grg || true
Will not count variants in VCF files (too slow)
Could not count number of variants in vcf_convert.example.vcf.gz. Using the default of 100 (use --parts to override).
Processing input file in 100 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
0%| | 0/100 [00:00<?, ?it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
terminate called after throwing an instance of 'grgl::ApiMisuseFailure'
what(): WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
0%| | 0/100 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/construct.py", line 301, in star_build_grg
return build_grg(*args)
File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/construct.py", line 267, in build_grg
shape_grg = build_shape(range_triple, args, auto_args, input_file, output_file)
File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/construct.py", line 254, in build_shape
tb_time = time_call(command, stdout=sys.stdout)
File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/common.py", line 48, in time_call
subprocess.check_call(command, **kwargs)
File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/../../grgl', 'vcf_convert.example.vcf.gz', '--trees', 'optimal', '--lf-no-tree', '10', '--reduce', '5', '-r', '0.0:0.01', '-o', 'vcf_convert.example.grg.part0.grg']' died with <Signals.SIGABRT: 6>.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ddehaas/Py3Env/bin/grg", line 6, in <module>
sys.exit(main())
File "/home/ddehaas/GrgProject/public/grgl/pygrgl/cli.py", line 70, in main
construct.from_tabular(args)
File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/construct.py", line 422, in from_tabular
list(
File "/home/ddehaas/Py3Env/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
raise value
subprocess.CalledProcessError: Command '['/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/../../grgl', 'vcf_convert.example.vcf.gz', '--trees', 'optimal', '--lf-no-tree', '10', '--reduce', '5', '-r', '0.0:0.01', '-o', 'vcf_convert.example.grg.part0.grg']' died with <Signals.SIGABRT: 6>.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
terminate called after throwing an instance of 'grgl::ApiMisuseFailure'
what(): WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
The grg tool did not like that. Why?
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
For large datasets, trying to convert a .vcf.gz to GRG without an
index will be very slow. This warning (and failure) is there to prevent
you from accidentally doing this. However, we know that our dataset is
really small so we don’t care – we can use the --force flag to force
GRG construction.
%%bash
# -j controls how many threads to use.
grg construct --force -j 1 vcf_convert.example.vcf.gz -o vcf_convert.example.grg
Will not count variants in VCF files (too slow)
Could not count number of variants in vcf_convert.example.vcf.gz. Using the default of 100 (use --parts to override).
Processing input file in 100 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
0%| | 0/100 [00:00<?, ?it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
2%|▏ | 2/100 [00:00<00:07, 13.33it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
4%|▍ | 4/100 [00:00<00:09, 9.66it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
6%|▌ | 6/100 [00:00<00:11, 7.85it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
7%|▋ | 7/100 [00:00<00:11, 8.13it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
8%|▊ | 8/100 [00:00<00:12, 7.64it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
9%|▉ | 9/100 [00:01<00:11, 8.09it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
10%|█ | 10/100 [00:01<00:12, 7.23it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
12%|█▏ | 12/100 [00:01<00:08, 9.82it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
14%|█▍ | 14/100 [00:01<00:07, 11.77it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
16%|█▌ | 16/100 [00:01<00:06, 13.41it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
18%|█▊ | 18/100 [00:01<00:05, 14.54it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
20%|██ | 20/100 [00:01<00:05, 15.41it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
22%|██▏ | 22/100 [00:01<00:04, 15.93it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
24%|██▍ | 24/100 [00:02<00:04, 15.59it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
26%|██▌ | 26/100 [00:02<00:04, 16.06it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
28%|██▊ | 28/100 [00:02<00:04, 16.36it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
30%|███ | 30/100 [00:02<00:04, 16.54it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
32%|███▏ | 32/100 [00:02<00:04, 16.83it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
34%|███▍ | 34/100 [00:02<00:04, 16.15it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
36%|███▌ | 36/100 [00:02<00:03, 16.42it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
38%|███▊ | 38/100 [00:02<00:03, 16.24it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
40%|████ | 40/100 [00:03<00:03, 16.32it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
42%|████▏ | 42/100 [00:03<00:03, 16.39it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
44%|████▍ | 44/100 [00:03<00:03, 16.48it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
46%|████▌ | 46/100 [00:03<00:03, 16.44it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
48%|████▊ | 48/100 [00:03<00:03, 16.14it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
50%|█████ | 50/100 [00:03<00:03, 16.19it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
52%|█████▏ | 52/100 [00:03<00:03, 15.94it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
54%|█████▍ | 54/100 [00:03<00:02, 16.04it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
56%|█████▌ | 56/100 [00:04<00:02, 16.14it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
58%|█████▊ | 58/100 [00:04<00:02, 16.16it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
60%|██████ | 60/100 [00:04<00:02, 15.78it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
62%|██████▏ | 62/100 [00:04<00:02, 16.20it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
64%|██████▍ | 64/100 [00:04<00:02, 16.56it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
66%|██████▌ | 66/100 [00:04<00:02, 16.91it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
68%|██████▊ | 68/100 [00:04<00:01, 17.43it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
70%|███████ | 70/100 [00:04<00:01, 17.04it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
72%|███████▏ | 72/100 [00:04<00:01, 16.72it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
74%|███████▍ | 74/100 [00:05<00:01, 16.84it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
76%|███████▌ | 76/100 [00:05<00:01, 16.46it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
78%|███████▊ | 78/100 [00:05<00:01, 16.35it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
80%|████████ | 80/100 [00:05<00:01, 17.08it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
82%|████████▏ | 82/100 [00:05<00:01, 17.30it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
84%|████████▍ | 84/100 [00:05<00:00, 17.32it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
86%|████████▌ | 86/100 [00:05<00:00, 17.48it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
88%|████████▊ | 88/100 [00:05<00:00, 17.09it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
90%|█████████ | 90/100 [00:06<00:00, 17.15it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
92%|█████████▏| 92/100 [00:06<00:00, 16.25it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
94%|█████████▍| 94/100 [00:06<00:00, 15.67it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
96%|█████████▌| 96/100 [00:06<00:00, 16.01it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
98%|█████████▊| 98/100 [00:06<00:00, 16.36it/s]WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.
100%|██████████| 100/100 [00:06<00:00, 15.01it/s]
Merging...
=== GRG Statistics ===
Nodes: 12945
Edges: 117253
Samples: 400
Mutations: 10893
Ploidy: 2
Phased: true
Populations: 0
Range of mutations: 55829 - 9999127
Specified range: 0 - 100000001
======================
Wrote simplified GRG with:
Nodes: 12945
Edges: 117253
Wrote GRG to vcf_convert.example.grg
That gave us a valid GRG, by essentially “brute force” accessing the VCF
file without an index. This works fine on a small file, but not a large
one. So instead, lets try indexing this file with tabix and trying
against. NOTE: grg only supports tabix-style indexes, and
not bcftools-style indexes on .vcf.gz files.
%%bash
tabix vcf_convert.example.vcf.gz
Now we can construct a GRG without having to use --force, and we
don’t get any warnings.
%%bash
# -j controls how many threads to use.
grg construct -j 1 vcf_convert.example.vcf.gz -o vcf_convert.example.grg
Will not count variants in VCF files (too slow)
Could not count number of variants in vcf_convert.example.vcf.gz. Using the default of 100 (use --parts to override).
Processing input file in 100 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 100/100 [00:03<00:00, 28.33it/s]
Merging...
=== GRG Statistics ===
Nodes: 12945
Edges: 117253
Samples: 400
Mutations: 10893
Ploidy: 2
Phased: true
Populations: 0
Range of mutations: 55829 - 9999127
Specified range: 0 - 100000001
======================
Wrote simplified GRG with:
Nodes: 12945
Edges: 117253
Wrote GRG to vcf_convert.example.grg