.. _examples:
Examples and applications using GRG
===================================
GWAS
----
The ``grapp`` tool supports GWAS using linear association affects for diploid data, using a GRG. install
via
::
pip install grapp
See
::
grapp --help
for details.
Phenotype Simulation
--------------------
Given a GRG, you can simulate phenotypes for the individual samples contained within it. This is done using the
package `grg_pheno_sim `_, which can be installed via::
pip install grg_pheno_sim
Basic Example
~~~~~~~~~~~~~
Below is a minimal Python example using the `sim_phenotypes` function:
.. code-block:: python
import pygrgl
from grg_pheno_sim.phenotype import sim_phenotypes
# 1. Load an immutable GRG
grg = pygrgl.load_immutable_grg("test.grg")
# 2. Set desired heritability (h²)
heritability = 0.33
# 3. Simulate phenotypes
phenotypes = sim_phenotypes(grg, heritability=heritability)
The returned DataFrame contains one row per individual and has the following columns:
- ``causal_mutation_id`` – the identifier for each causal mutation (for multivariate simulations)
- ``individual_id`` – the external individual identifier
- ``genetic_value`` – the sum of all causal-mutation effects for the individual
- ``environmental_noise`` – the simulated environmental component
- ``phenotype`` – the total phenotype value
Extended Example with additional parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``sim_phenotypes`` function supports a range of optional parameters for customizing
the simulation process. This example shows how to specify the effect-size model,
control the number of causal variants, set reproducibility parameters, normalize phenotypes,
and save intermediate outputs.
.. code-block:: python
import pygrgl
from grg_pheno_sim.phenotype import sim_phenotypes
from grg_pheno_sim.model import grg_causal_mutation_model
# 1. Load an immutable GRG
grg_1 = pygrgl.load_immutable_grg("test.grg")
# 2. Define effect-size model
model_type = "normal"
mean = 0
var = 1
model = grg_causal_mutation_model(model_type, mean=mean, var=var)
# 3. Simulation parameters
num_causal = 1000
random_seed = 1
normalize_phenotype = True # normalize final phenotypes
normalize_genetic_values_before_noise = True # normalize genetic values before adding noise
heritability = 0.33
effect_output_required = True # save effect sizes to .par file
effect_path = 'univariate_sample_effect_sizes.par'
standardized_output = True
output_path = 'normal_pheno_normalized.phen' # output phenotype file
header = True # include header row in output
# 4. Run simulation
phenotypes = sim_phenotypes(
grg_1,
model,
num_causal,
random_seed,
normalize_phenotype=normalize_phenotype,
normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
heritability=heritability,
save_effect_output=effect_output_required,
effect_path=effect_path,
standardized_output=standardized_output,
path=output_path,
header=header
)
This produces a phenotype DataFrame (and optional saved files) based on the chosen model
and parameters.
Inspecting output files
~~~~~~~~~~~~~~~~~~~~~~~
When ``save_effect_output=True`` is specified an effect sizes .par file is produced and when ``standardized_output=True`` is specified a phenotype file :
1. **Effect sizes file** – contains effect sizes for each mutation in the GRG.
Example from ``univariate_sample_effect_sizes.par``:
.. list-table::
:header-rows: 1
:widths: 15 15 15 15 15 15
* - mutation_id
- AlternateAllele
- Position
- RefAllele
- Frequency
- Effect
* - 20
- A
- 73883
- C
- 0.000
- -1.810258
* - 28
- G
- 81128
- A
- 0.335
- 1.151768
* - 62
- G
- 117367
- C
- 0.840
- 1.681257
* - 76
- A
- 134028
- C
- 0.010
- 2.346698
* - 119
- A
- 180670
- T
- 0.840
- -0.286668
- ``mutation_id`` – unique ID for each mutation in the GRG
- ``AlternateAllele`` / ``RefAllele`` – allele information
- ``Position`` – genomic position
- ``Frequency`` – allele frequency in the dataset
- ``Effect`` – effect size used in phenotype simulation
2. **Phenotype file** – contains the simulated phenotypes for each individual.
Example from ``normal_pheno_normalized.phen``:
.. list-table::
:header-rows: 1
:widths: 20 20
* - person_id
- phenotype
* - 0
- 0.715620
* - 1
- -0.706513
* - 2
- 0.416523
* - 3
- 0.657675
* - 4
- -0.747519
- ``person_id`` – numeric identifier for each individual
- ``phenotype`` – simulated phenotype value
Binary phenotype (case–control) example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Supports binary phenotype simulation.
This step follows the standard pipeline to simulate a continuous phenotype,
then employs a ``population_prevalence`` parameter to establish a Gaussian threshold
to convert continuous values into binary outcomes.
Lastly, the system overwrites continuous phenotypes with their binary counterparts.
.. code-block:: python
import pygrgl
from grg_pheno_sim.phenotype import sim_binary_phenotypes
# 1. Load an immutable GRG
grg_1 = pygrgl.load_immutable_grg("test.grg")
# 2. Set model parameters
heritability = 0.33 # liability-scale h^2
population_prevalence = 0.10 # 1 in 10 individuals are cases on average
# 3. Simulate binary phenotypes (0 = control, 1 = case)
phenotypes_binary = sim_binary_phenotypes(
grg_1,
population_prevalence,
heritability=heritability,
# optional:
# num_causal=1000,
# random_seed=42,
# standardized=True,
)
The output is the same DataFrame as the previous example except ``phenotype`` is now a binary outcome (``0`` control, ``1`` case) determined by the prevalence threshold.
Standardized genotype example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``sim_phenotypes`` function also supports internally standardizing the genotype matrix
before computing genetic values. This is done by setting ``standardized=True``, which
centers and scales each mutation's genotype values to have mean zero and variance one.
.. code-block:: python
import pygrgl
from grg_pheno_sim.phenotype import sim_phenotypes
# 1. Load an immutable GRG
grg_1 = pygrgl.load_immutable_grg("test.grg")
# 2. Set desired heritability (h²)
heritability = 0.33
# 3. Simulate phenotypes with standardized genotypes
phenotypes_standardized_genes = sim_phenotypes(
grg_1,
heritability=heritability,
standardized=True
)
This produces the same output format as the basic example, but with genotypes standardized before effect application.
Custom effect sizes example
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``sim_phenotypes_custom`` function allows you to provide your own effect sizes instead
of sampling from a distribution-based model. This can be useful for testing, reproducing known
phenotypes, or running simulations with fixed effect patterns.
Supported effect size input types:
- **List** – a list of effect sizes in mutation order
- **Dictionary** – keys are mutation IDs, values are effect sizes
- **Pandas DataFrame** – must contain at least ``mutation_id`` and ``effect_size`` columns
(for multivariate simulations, also include ``causal_mutation_id``)
.. code-block:: python
import numpy as np
import pandas as pd
import pygrgl
import random
from grg_pheno_sim.phenotype import sim_phenotypes_custom
# 1. Load an immutable GRG
grg_1 = pygrgl.load_immutable_grg("test.grg")
# 2. Prepare effect sizes
n = grg_1.num_mutations
specific_effects = [1.0 for _ in range(n)] # list input, fixed effects
effect_sizes = np.random.randn(n) # NumPy array
mutation_dict = {i: effect_sizes[i] for i in range(n)} # dictionary input
input_df = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size']) # DataFrame input
# 3. Simulation parameters
heritability = 0.33
normalize_genetic_values_before_noise = True
standardized_output = True
output_path = 'custom_pheno.phen'
# 4. Simulate phenotypes using custom effect sizes
phenotypes_list = sim_phenotypes_custom(
grg_1,
specific_effects,
normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
heritability=heritability,
standardized_output=standardized_output,
path=output_path
)
The returned DataFrame has the same structure as in the basic example, with phenotypes generated using
your provided effect sizes rather than randomly sampled ones.
Quantile normalization example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This example demonstrates quantile normalization for a single causal mutation's phenotypic DataFrame.
.. code-block:: python
import numpy as np
import pygrgl
import matplotlib.pyplot as plt
from grg_pheno_sim.effect_size import sim_grg_causal_mutation, additive_effect_sizes, samples_to_individuals
from grg_pheno_sim.model import grg_causal_mutation_model
from grg_pheno_sim.noise_sim import sim_env_noise
from grg_pheno_sim.normalization import quantile_normalize
# 1. Load an immutable GRG
grg_1 = pygrgl.load_immutable_grg("test.grg")
# 2. Define model for causal effects
mean_1 = 0.0
var_1 = 1.0
model_normal = grg_causal_mutation_model("normal", mean=mean_1, var=var_1)
# 3. Simulate genetic values
trait_df_normal = sim_grg_causal_mutation(grg_1, num_causal=1000, model=model_normal, random_seed=1)
sample_nodes_df = additive_effect_sizes(grg_1, trait_df_normal)
individual_genetic_value_df = samples_to_individuals(sample_nodes_df)
# 4. Add environmental noise without normalizing genetic values
phenotypes = sim_env_noise(individual_genetic_value_df, h2=0.5)
phenotype_df = phenotypes.phenotype_df
# 5. Quantile normalize to the normal distribution
quantile_normalize_phenotype_df = quantile_normalize(phenotype_df)
More usage examples can be found in the `example jupyter notebooks `_.
**TODO: Provide a simple end-to-end example that performs both phenotype simulation and GWAS.**
Splitting GRGs
--------------
See command ``grg split --help`` and Python API :py:meth:`save_subset`.
Splitting equally
-----------------
You can use ``grg split input.grg -s `` to split a GRG into equal graphs, each of which cover
a base-pair range as specified by the ``-s`` flag. If you specify the ``--rec-map `` option,
the size is assumed to be in centimorgans (cM).
If you have a list of ranges that you want to split a GRG into, you can put them in a text file with two columns
(space separated) and a header line. Here is an example file:
::
start end
0 1000000
1000000 2000000
5000000 7000000
Then save that file (e.g., as ``ranges.txt``) and pass it to the split command like: ``grg split input.grg -f ranges.txt``.
This will create 3 GRG files, each spanning one of the ranges from the text file. These ranges are inclusive on "start"
and exlusive on "end", so to get full coverage of the genome you want each consecutive range to have a "start" that is
identical to the previous range's "end".
By default, for a file ``input.grg`` the split command will produce a directory ``input.grg.split/`` and place all the
resulting split GRGs in that directory. You can change the directory using the ``-o`` flag, like:
``grg split input.grg -s 1000000 -o my_split_grgs``.
The output directory must not already exist, or splitting will fail and ask you to remove the directory or
specify a different one.