GRG-Backed Genotypes
This notebook demonstrates how to use GRG (Genealogical Record of Genotypes) files as a compressed genotype backend in xftsim. GRG is a graph-based representation of genotype data developed by the grg project that can represent large cohort genotype data in a fraction of the space required by conventional matrix formats (VCF, PLINK BED).
Key advantages of the GRG representation:
Compression: GRG exploits sharing of identical-by-state haplotype segments across individuals, achieving substantial compression without information loss.
Graph traversal matvec: Matrix-vector products (genotypes times effect sizes) can be computed directly on the graph via tree traversal, without ever materializing the full genotype matrix. This is O(nodes) per variant rather than O(n) per variant.
Seamless integration: xftsim’s
GraphHaplotypeOperatorimplements the sameHaplotypeOperatorinterface asDenseHaplotypeArray, so genetic architectures work identically with either backend.
We will cover:
Loading a GRG file with
load_grg()Inspecting the
GraphHaplotypeOperator(dimensions, metadata, allele frequencies)Matrix-vector operations and verifying equivalence with dense materialization
Running a full simulation with GRG-backed founder haplotypes
Current limitations (meiosis materializes to dense)
Note: GRG support requires the pygrgl package, which provides Python bindings to the GRG C++ library. If pygrgl is not installed, the GRG-specific cells will not run.
1. Imports and pygrgl Availability Check
We first check whether pygrgl is available. All subsequent GRG operations are guarded by this check. If you do not have pygrgl installed, the notebook still serves as documentation of the API.
[ ]:
import numpy as np
import os
# Check for pygrgl availability
try:
import pygrgl
HAS_PYGRGL = True
print(f"pygrgl is available (version: {getattr(pygrgl, '__version__', 'unknown')})")
except ImportError:
HAS_PYGRGL = False
print("pygrgl is NOT installed. GRG cells will be skipped.")
print("Install with: pip install pygrgl")
from xftsim.io import load_grg
from xftsim.struct import (
GraphHaplotypeOperator, DenseHaplotypeArray, HaplotypeOperator,
SampleMeta, VariantMeta,
)
from xftsim.effect import AdditiveEffects
from xftsim.arch import Architecture
from xftsim.mate import RandomMating
from xftsim.reproduce import RecombinationMap
from xftsim.sim import Simulation
from xftsim.stats import SampleStatistics
2. Locate GRG Fixture Files
We use test fixture datasets that ship with the glink test suite. Two sizes are available:
Dataset |
Individuals |
Variants |
GRG size |
BED size |
|---|---|---|---|---|
|
20 |
100 |
~6 KB |
~0.5 KB |
|
100 |
1000 |
~74 KB |
~25 KB |
Each dataset directory contains a .grg file (the graph) and a .bim file (PLINK-format variant metadata with chromosome, position, and allele codes). The BIM file is optional but provides richer variant metadata than what the GRG stores natively.
[ ]:
# Paths to GRG fixture files
FIXTURE_BASE = os.path.expanduser(
"~/Dropbox/grg/glink/tests/fixtures/datasets"
)
TINY_GRG = os.path.join(FIXTURE_BASE, "tiny_clean", "genotypes.grg")
TINY_BIM = os.path.join(FIXTURE_BASE, "tiny_clean", "genotypes.bim")
SMALL_GRG = os.path.join(FIXTURE_BASE, "small_clean", "genotypes.grg")
SMALL_BIM = os.path.join(FIXTURE_BASE, "small_clean", "genotypes.bim")
# Verify files exist
for label, path in [("tiny GRG", TINY_GRG), ("tiny BIM", TINY_BIM),
("small GRG", SMALL_GRG), ("small BIM", SMALL_BIM)]:
exists = os.path.exists(path)
size = os.path.getsize(path) if exists else 0
print(f"{label:>10s}: {'OK' if exists else 'MISSING':>7s} ({size:,} bytes)")
3. Loading a GRG File
The load_grg() function from xftsim.io loads a .grg file and returns a GraphHaplotypeOperator. It accepts an optional bim_path argument for richer variant metadata (chromosome, position, ref/alt alleles).
load_grg(path, generation=0, bim_path=None)
Without a BIM file, variant IDs are constructed as "{position}:{ref}:{alt}" from the GRG’s native mutation metadata. Chromosome information is only available when a BIM file is provided.
[ ]:
if HAS_PYGRGL:
# Load the tiny dataset with BIM metadata
hap_tiny = load_grg(TINY_GRG, bim_path=TINY_BIM)
print(hap_tiny)
print(f"Type: {type(hap_tiny).__name__}")
print(f"Is HaplotypeOperator: {isinstance(hap_tiny, HaplotypeOperator)}")
else:
print("Skipped: pygrgl not available.")
[ ]:
if HAS_PYGRGL:
# Load the small dataset for a larger example
hap_small = load_grg(SMALL_GRG, bim_path=SMALL_BIM)
print(hap_small)
else:
print("Skipped: pygrgl not available.")
4. Inspecting the GraphHaplotypeOperator
The GraphHaplotypeOperator exposes the same core properties as DenseHaplotypeArray:
Property |
Description |
|---|---|
|
Number of diploid individuals |
|
Number of biallelic variants |
|
Generation number |
|
|
|
|
[ ]:
if HAS_PYGRGL:
print("--- Tiny dataset ---")
print(f" Individuals (n): {hap_tiny.n}")
print(f" Variants (m): {hap_tiny.m}")
print(f" Generation: {hap_tiny.generation}")
print()
print(f" Sample IIDs (first 5): {hap_tiny.samples.iid[:5]}")
print(f" Variant IDs (first 5): {hap_tiny.variants.vid[:5]}")
print(f" Chromosomes (first 5): {hap_tiny.variants.chrom[:5]}")
print(f" Positions (first 5): {hap_tiny.variants.pos_bp[:5]}")
print()
print("--- Small dataset ---")
print(f" Individuals (n): {hap_small.n}")
print(f" Variants (m): {hap_small.m}")
else:
print("Skipped: pygrgl not available.")
4.1 Allele Frequencies
Allele frequencies are computed via an UP traversal of the GRG graph: G^T @ 1 / (2n). This computes the count of alternate alleles across all haplotypes without materializing the genotype matrix.
The result is cached after the first call, so subsequent calls to recompute_af() return the cached array immediately.
[ ]:
if HAS_PYGRGL:
af = hap_tiny.recompute_af()
print(f"Allele frequency array shape: {af.shape}")
print(f"AF range: [{af.min():.4f}, {af.max():.4f}]")
print(f"Mean AF: {af.mean():.4f}")
print(f"First 10 AFs: {np.round(af[:10], 4)}")
print()
# Verify caching: second call returns the same object
af2 = hap_tiny.recompute_af()
print(f"Cached (same object): {af is af2}")
else:
print("Skipped: pygrgl not available.")
4.2 Loading Without a BIM File
When no BIM file is provided, variant metadata is extracted directly from the GRG’s mutation records. Variant IDs are formatted as "{position}:{ref_allele}:{alt_allele}". Chromosome information is not available from GRG metadata alone.
[ ]:
if HAS_PYGRGL:
hap_no_bim = load_grg(TINY_GRG) # no bim_path
print(f"Without BIM -- variant IDs (first 5): {hap_no_bim.variants.vid[:5]}")
print(f"Without BIM -- chrom: {hap_no_bim.variants.chrom}")
print()
print(f"With BIM -- variant IDs (first 5): {hap_tiny.variants.vid[:5]}")
print(f"With BIM -- chrom (first 5): {hap_tiny.variants.chrom[:5]}")
else:
print("Skipped: pygrgl not available.")
5. Matrix-Vector Operations
The core purpose of GraphHaplotypeOperator is to compute genotype-by-effect products without materializing the full (n x m) genotype matrix. The operator provides:
Method |
Operation |
Description |
|---|---|---|
|
G @ v |
Diploid genotype matrix times vector |
|
G.T @ v |
Transpose times vector |
|
G_mat @ v |
Maternal haplotype times vector |
|
G_pat @ v |
Paternal haplotype times vector |
|
(G - 2p) @ v |
Centered genotype matvec |
These are computed via pygrgl’s matmul function which traverses the GRG graph in either DOWN (samples from variants) or UP (variants from samples) direction.
We will verify that GRG matvec produces identical results to the dense materialization.
[ ]:
if HAS_PYGRGL:
# Materialize to dense for comparison
dense_tiny = hap_tiny.to_dense()
print(f"Dense materialization: {type(dense_tiny).__name__}")
print(f"Genotype array shape: {dense_tiny.genotypes.shape}")
print(f"Values are binary: {set(np.unique(dense_tiny.genotypes))}")
else:
print("Skipped: pygrgl not available.")
5.1 Diploid Matvec (G @ v)
The basic matvec computes the diploid genotype matrix times a vector. For a single effect vector of length m, the result is a vector of length n giving each individual’s genetic value.
[ ]:
if HAS_PYGRGL:
rng = np.random.RandomState(42)
v = rng.randn(hap_tiny.m)
# GRG matvec
result_grg = hap_tiny.matvec(v)
# Dense matvec for comparison
result_dense = dense_tiny.matvec(v)
print(f"GRG matvec shape: {result_grg.shape}")
print(f"Dense matvec shape: {result_dense.shape}")
print(f"Max absolute difference: {np.max(np.abs(result_grg - result_dense)):.2e}")
print(f"Results match: {np.allclose(result_grg, result_dense, atol=1e-10)}")
print()
print(f"First 5 values (GRG): {np.round(result_grg[:5], 4)}")
print(f"First 5 values (dense): {np.round(result_dense[:5], 4)}")
else:
print("Skipped: pygrgl not available.")
5.2 Multi-Column Matvec (G @ V)
Both matvec and rmatvec support 2-D inputs. If v has shape (m, k), the result is (n, k) – computing k genetic values simultaneously. This is efficient for multivariate traits.
[ ]:
if HAS_PYGRGL:
V = rng.randn(hap_tiny.m, 3) # 3 traits
result_grg_2d = hap_tiny.matvec(V)
result_dense_2d = dense_tiny.matvec(V)
print(f"Input shape: {V.shape} (m x k_traits)")
print(f"Output shape: {result_grg_2d.shape} (n x k_traits)")
print(f"Max difference: {np.max(np.abs(result_grg_2d - result_dense_2d)):.2e}")
print(f"All close: {np.allclose(result_grg_2d, result_dense_2d, atol=1e-10)}")
else:
print("Skipped: pygrgl not available.")
5.3 Maternal and Paternal Haplotype Matvec
GRG stores haploid data (2n haplotypes). The operator provides separate maternal and paternal matvec operations. The diploid matvec is the sum of maternal and paternal contributions: matvec(v) = matvec_maternal(v) + matvec_paternal(v).
[ ]:
if HAS_PYGRGL:
v = np.ones(hap_tiny.m, dtype=np.float64)
mat = hap_tiny.matvec_maternal(v)
pat = hap_tiny.matvec_paternal(v)
dip = hap_tiny.matvec(v)
print(f"Maternal sum (first 5): {mat[:5]}")
print(f"Paternal sum (first 5): {pat[:5]}")
print(f"Diploid sum (first 5): {dip[:5]}")
print(f"Mat + Pat (first 5): {(mat + pat)[:5]}")
print()
print(f"mat + pat == diploid: {np.allclose(mat + pat, dip, atol=1e-10)}")
# Also verify against dense
mat_dense = dense_tiny.matvec_maternal(v)
pat_dense = dense_tiny.matvec_paternal(v)
print(f"Maternal matches dense: {np.allclose(mat, mat_dense, atol=1e-10)}")
print(f"Paternal matches dense: {np.allclose(pat, pat_dense, atol=1e-10)}")
else:
print("Skipped: pygrgl not available.")
5.4 Standardized Matvec
The standardized_matvec computes (G - 2p) @ v without materializing G. This is the centered genotype matvec used by GeneticComponent when standardized=True (the default). The centering correction is computed as:
(G - 2p) @ v = G @ v - 2p @ v
Both terms are computed in compressed form: G @ v via graph traversal and 2p @ v as a simple dot product of allele frequencies with the effect vector.
[ ]:
if HAS_PYGRGL:
v = rng.randn(hap_tiny.m)
af = hap_tiny.recompute_af()
std_grg = hap_tiny.standardized_matvec(v, af=af)
std_dense = dense_tiny.standardized_matvec(v, af=af)
print(f"Standardized matvec shape: {std_grg.shape}")
print(f"Max difference: {np.max(np.abs(std_grg - std_dense)):.2e}")
print(f"Results match: {np.allclose(std_grg, std_dense, atol=1e-10)}")
print()
print(f"Mean of centered result: {std_grg.mean():.6f} (should be near 0)")
else:
print("Skipped: pygrgl not available.")
5.5 Reverse Matvec (G^T @ v)
The reverse matvec computes G^T @ v via an UP traversal. This is useful for operations like computing allele-level statistics from individual-level data (e.g., GWAS score statistics).
[ ]:
if HAS_PYGRGL:
v_n = rng.randn(hap_tiny.n)
rmat_grg = hap_tiny.rmatvec(v_n)
rmat_dense = dense_tiny.rmatvec(v_n)
print(f"Input shape: {v_n.shape} (n,)")
print(f"Output shape: {rmat_grg.shape} (m,)")
print(f"Max difference: {np.max(np.abs(rmat_grg - rmat_dense)):.2e}")
print(f"Results match: {np.allclose(rmat_grg, rmat_dense, atol=1e-10)}")
else:
print("Skipped: pygrgl not available.")
6. Dense Materialization and Subsetting
When you need direct access to the full genotype array (e.g., for visualization or debugging), use to_dense() to materialize the GRG into a DenseHaplotypeArray with shape (n, m, 2).
Indexing a GraphHaplotypeOperator with [] also triggers materialization, returning a dense subset.
[ ]:
if HAS_PYGRGL:
# Full materialization
dense = hap_tiny.to_dense()
print(f"Full dense: {dense.genotypes.shape}")
print(f" n={dense.n}, m={dense.m}, generation={dense.generation}")
print(f" Unique values: {sorted(np.unique(dense.genotypes))}")
print()
# Subset: first 5 individuals, first 20 variants
sub = hap_tiny[:5, :20]
print(f"Subset [5 individuals, 20 variants]: {sub.genotypes.shape}")
print(f" Type: {type(sub).__name__}")
print()
# Show a small slice of the genotype matrix (diploid sums)
diploid = sub.genotypes[:, :10, 0] + sub.genotypes[:, :10, 1]
print("Diploid genotypes (5 individuals x 10 variants):")
print(diploid)
else:
print("Skipped: pygrgl not available.")
7. Running a Simulation with GRG-Backed Haplotypes
Since GraphHaplotypeOperator implements the HaplotypeOperator interface, it can be used directly as founder_haplotypes in Simulation. The architecture’s genetic components call matvec / standardized_matvec on the operator, which are dispatched to the GRG graph traversal.
We will use the small_clean dataset (100 individuals, 1000 variants) for a more realistic example.
Important: The founder population must have balanced sex for mating to work. We override the sample metadata to ensure this.
[ ]:
if HAS_PYGRGL:
# Load the small dataset
hap_founders = load_grg(SMALL_GRG, bim_path=SMALL_BIM)
print(f"Founder haplotypes: {hap_founders}")
# Assign balanced sex (required for mating)
n = hap_founders.n
sex = np.tile([0, 1], (n + 1) // 2)[:n]
samples_with_sex = SampleMeta(
iid=hap_founders.samples.iid,
sex=sex,
generation=0,
)
# Reconstruct the operator with balanced sex metadata
hap_founders = GraphHaplotypeOperator(
hap_founders._grg,
samples=samples_with_sex,
variants=hap_founders.variants,
)
print(f"With sex assigned: n={hap_founders.n}, sex counts={np.bincount(sex)}")
else:
print("Skipped: pygrgl not available.")
[ ]:
if HAS_PYGRGL:
m = hap_founders.m # 1000 variants
# Define effects and architecture
eff = AdditiveEffects.from_h2(h2=0.5, m=m, seed=123)
formula = """
Y.G ~ genetic(eff)
Y.E ~ noise(0.5)
Y ~ Y.G + Y.E
"""
arch = Architecture(formula=formula, effects={'eff': eff})
print(arch)
else:
print("Skipped: pygrgl not available.")
[ ]:
if HAS_PYGRGL:
# Configure mating and recombination
mating = RandomMating(offspring_per_pair=2)
recomb = RecombinationMap.constant_map(m=m, p=0.5)
# Create and run the simulation
sim = Simulation(
founder_haplotypes=hap_founders,
architecture=arch,
mating_regime=mating,
recombination_map=recomb,
retain_haplotypes=2,
retain_phenotypes=5,
statistics=[SampleStatistics()],
seed=42,
)
sim.run(4)
print(f"Simulation finished at generation {sim.generation}")
print(sim)
else:
print("Skipped: pygrgl not available.")
7.1 Inspect Results
After the simulation completes, we can inspect the phenotype history and statistics just as with dense-backed simulations.
[ ]:
if HAS_PYGRGL:
print("Variance decomposition across generations:")
print(f"{'Gen':<5} {'Var(Y)':>10} {'Var(G)':>10} {'Var(E)':>10} {'h2':>8}")
print("-" * 48)
for result in sim.results:
stats = result.statistics['SampleStatistics']
keys = stats['keys']
variances = stats['var']
y_idx = keys.index('Y')
yg_idx = keys.index('Y.G')
ye_idx = keys.index('Y.E')
var_y = variances[y_idx]
var_g = variances[yg_idx]
var_e = variances[ye_idx]
h2 = var_g / var_y if var_y > 0 else 0.0
print(f"{result.generation:<5} {var_y:>10.4f} {var_g:>10.4f} {var_e:>10.4f} {h2:>8.4f}")
else:
print("Skipped: pygrgl not available.")
[ ]:
if HAS_PYGRGL:
# Show phenotype distribution at the final generation
pheno = sim.phenotypes
y = pheno['Y']
yg = pheno['Y.G']
ye = pheno['Y.E']
print(f"Generation {sim.generation} phenotypes:")
print(f" n = {pheno.samples.n}")
print(f" Y: mean={y.mean():.4f}, std={y.std():.4f}, range=[{y.min():.3f}, {y.max():.3f}]")
print(f" Y.G: mean={yg.mean():.4f}, std={yg.std():.4f}")
print(f" Y.E: mean={ye.mean():.4f}, std={ye.std():.4f}")
print(f" Cov(G, E): {np.cov(yg, ye)[0, 1]:.6f} (should be near 0)")
else:
print("Skipped: pygrgl not available.")
7.2 Haplotype Representation Across Generations
An important detail: the GRG-backed operator is only used for the founder generation. After meiosis, offspring genotypes are represented as DenseHaplotypeArray, because GRG has no native support for recombination.
This means generation 0 has a GraphHaplotypeOperator, while all subsequent generations have DenseHaplotypeArray.
[ ]:
if HAS_PYGRGL:
print("Haplotype types by generation:")
for gen, hap in sorted(sim.haplotype_history.items()):
print(f" Gen {gen}: {type(hap).__name__} (n={hap.n}, m={hap.m})")
print()
print("Note: Only the founder generation (if retained) uses GraphHaplotypeOperator.")
print("All offspring generations revert to DenseHaplotypeArray after meiosis.")
else:
print("Skipped: pygrgl not available.")
8. Verifying GRG vs Dense Simulation Equivalence
To confirm that the GRG backend produces identical results to dense, we run the same simulation with dense-materialized founders and compare the generation-0 phenotypes. (Subsequent generations will differ due to random mating assignments, but the founder phenotypes should match exactly since they depend only on the genotypes and effect sizes.)
[ ]:
if HAS_PYGRGL:
# Materialize founders to dense
dense_founders = hap_founders.to_dense()
print(f"Dense founders: {type(dense_founders).__name__} ({dense_founders.n} x {dense_founders.m})")
# Run the same simulation with dense founders
sim_dense = Simulation(
founder_haplotypes=dense_founders,
architecture=arch,
mating_regime=mating,
recombination_map=recomb,
retain_haplotypes=2,
retain_phenotypes=5,
statistics=[SampleStatistics()],
seed=42,
)
sim_dense.run(4)
# Compare generation 0 phenotypes
# Note: phenotype history retention may have pruned gen 0;
# compare the earliest available generation
gen0_grg = sim.results[0].statistics['SampleStatistics']
gen0_dense = sim_dense.results[0].statistics['SampleStatistics']
keys = gen0_grg['keys']
print(f"\nGeneration 0 variance comparison (GRG vs Dense):")
print(f"{'Component':<10} {'GRG Var':>10} {'Dense Var':>10} {'Match':>8}")
print("-" * 42)
for key in keys:
idx = keys.index(key)
v_grg = gen0_grg['var'][idx]
v_dense = gen0_dense['var'][idx]
match = np.isclose(v_grg, v_dense, atol=1e-8)
print(f"{key:<10} {v_grg:>10.6f} {v_dense:>10.6f} {'YES' if match else 'NO':>8}")
else:
print("Skipped: pygrgl not available.")
9. Current Limitations and Future Directions
The GRG integration has a few known limitations:
Meiosis materializes to dense. When
GraphHaplotypeOperator.meiosis()is called, it first converts to aDenseHaplotypeArrayviato_dense(), then performs standard dense meiosis. This means the GRG compression advantage is limited to the founder generation. Offspring are always dense.Subsetting materializes to dense. Indexing with
[]triggers full materialization. For large datasets, prefer usingmatvec/rmatvecover direct genotype access.AF caching.
recompute_af()caches its result. This is appropriate for a static founder GRG, but the cache is not invalidated if the underlying data were to change. (In practice, GRG objects are immutable, so this is not a problem.)pygrgl dependency. The
pygrglpackage must be installed separately. All GRG imports are lazy (inside function bodies), so the rest of xftsim works without pygrgl.
Despite these limitations, the GRG backend is useful for:
Large founder populations where the full genotype matrix does not fit in memory but the GRG does.
Fast founder-generation phenotyping via graph-traversal matvec.
Real genotype data stored in GRG format (e.g., from tree sequence inference or direct GRG construction from VCF/BED files).
Summary
This notebook covered the GRG (Genealogical Record of Genotypes) integration in xftsim:
Loading GRG files with
load_grg(), optionally enriched by a PLINK BIM fileInspecting the
GraphHaplotypeOperator: dimensions, sample/variant metadata, and allele frequencies computed via graph traversalMatrix-vector operations:
matvec,rmatvec,matvec_maternal,matvec_paternal, andstandardized_matvec– all producing results identical to the dense representationFull simulation using GRG-backed founder haplotypes, demonstrating seamless interchangeability with
DenseHaplotypeArrayLimitations: meiosis and subsetting trigger dense materialization; GRG compression benefits are concentrated in the founder generation
The GraphHaplotypeOperator is a drop-in replacement for DenseHaplotypeArray in any xftsim workflow. As the GRG ecosystem matures, future versions may support graph-native recombination, extending the compression benefits beyond the founder generation.