GRG-Backed Genotypes

This notebook demonstrates how to use GRG (Genealogical Record of Genotypes) files as a compressed genotype backend in xftsim. GRG is a graph-based representation of genotype data developed by the grg project that can represent large cohort genotype data in a fraction of the space required by conventional matrix formats (VCF, PLINK BED).

Key advantages of the GRG representation:

Compression: GRG exploits sharing of identical-by-state haplotype segments across individuals, achieving substantial compression without information loss.
Graph traversal matvec: Matrix-vector products (genotypes times effect sizes) can be computed directly on the graph via tree traversal, without ever materializing the full genotype matrix. This is O(nodes) per variant rather than O(n) per variant.
Seamless integration: xftsim’s GraphHaplotypeOperator implements the same HaplotypeOperator interface as DenseHaplotypeArray, so genetic architectures work identically with either backend.

We will cover:

Loading a GRG file with load_grg()
Inspecting the GraphHaplotypeOperator (dimensions, metadata, allele frequencies)
Matrix-vector operations and verifying equivalence with dense materialization
Running a full simulation with GRG-backed founder haplotypes
Current limitations (meiosis materializes to dense)

Note: GRG support requires the pygrgl package, which provides Python bindings to the GRG C++ library. If pygrgl is not installed, the GRG-specific cells will not run.

1. Imports and pygrgl Availability Check

We first check whether pygrgl is available. All subsequent GRG operations are guarded by this check. If you do not have pygrgl installed, the notebook still serves as documentation of the API.

[ ]:

import numpy as np
import os

# Check for pygrgl availability
try:
    import pygrgl
    HAS_PYGRGL = True
    print(f"pygrgl is available (version: {getattr(pygrgl, '__version__', 'unknown')})")
except ImportError:
    HAS_PYGRGL = False
    print("pygrgl is NOT installed. GRG cells will be skipped.")
    print("Install with: pip install pygrgl")

from xftsim.io import load_grg
from xftsim.struct import (
    GraphHaplotypeOperator, DenseHaplotypeArray, HaplotypeOperator,
    SampleMeta, VariantMeta,
)
from xftsim.effect import AdditiveEffects
from xftsim.arch import Architecture
from xftsim.mate import RandomMating
from xftsim.reproduce import RecombinationMap
from xftsim.sim import Simulation
from xftsim.stats import SampleStatistics

2. Locate GRG Fixture Files

We use test fixture datasets that ship with the glink test suite. Two sizes are available:

Dataset	Individuals	Variants	GRG size	BED size
`tiny_clean`	20	100	~6 KB	~0.5 KB
`small_clean`	100	1000	~74 KB	~25 KB

Each dataset directory contains a .grg file (the graph) and a .bim file (PLINK-format variant metadata with chromosome, position, and allele codes). The BIM file is optional but provides richer variant metadata than what the GRG stores natively.

[ ]:

# Paths to GRG fixture files
FIXTURE_BASE = os.path.expanduser(
    "~/Dropbox/grg/glink/tests/fixtures/datasets"
)

TINY_GRG = os.path.join(FIXTURE_BASE, "tiny_clean", "genotypes.grg")
TINY_BIM = os.path.join(FIXTURE_BASE, "tiny_clean", "genotypes.bim")

SMALL_GRG = os.path.join(FIXTURE_BASE, "small_clean", "genotypes.grg")
SMALL_BIM = os.path.join(FIXTURE_BASE, "small_clean", "genotypes.bim")

# Verify files exist
for label, path in [("tiny GRG", TINY_GRG), ("tiny BIM", TINY_BIM),
                     ("small GRG", SMALL_GRG), ("small BIM", SMALL_BIM)]:
    exists = os.path.exists(path)
    size = os.path.getsize(path) if exists else 0
    print(f"{label:>10s}: {'OK' if exists else 'MISSING':>7s}  ({size:,} bytes)")

3. Loading a GRG File

The load_grg() function from xftsim.io loads a .grg file and returns a GraphHaplotypeOperator. It accepts an optional bim_path argument for richer variant metadata (chromosome, position, ref/alt alleles).

load_grg(path, generation=0, bim_path=None)

Without a BIM file, variant IDs are constructed as "{position}:{ref}:{alt}" from the GRG’s native mutation metadata. Chromosome information is only available when a BIM file is provided.

[ ]:

if HAS_PYGRGL:
    # Load the tiny dataset with BIM metadata
    hap_tiny = load_grg(TINY_GRG, bim_path=TINY_BIM)
    print(hap_tiny)
    print(f"Type: {type(hap_tiny).__name__}")
    print(f"Is HaplotypeOperator: {isinstance(hap_tiny, HaplotypeOperator)}")
else:
    print("Skipped: pygrgl not available.")

[ ]:

if HAS_PYGRGL:
    # Load the small dataset for a larger example
    hap_small = load_grg(SMALL_GRG, bim_path=SMALL_BIM)
    print(hap_small)
else:
    print("Skipped: pygrgl not available.")

4. Inspecting the GraphHaplotypeOperator

The GraphHaplotypeOperator exposes the same core properties as DenseHaplotypeArray:

Property	Description
`.n`	Number of diploid individuals
`.m`	Number of biallelic variants
`.generation`	Generation number
`.samples`	`SampleMeta` (IID, sex, FID, etc.)
`.variants`	`VariantMeta` (VID, chrom, pos, alleles)

[ ]:

if HAS_PYGRGL:
    print("--- Tiny dataset ---")
    print(f"  Individuals (n): {hap_tiny.n}")
    print(f"  Variants (m):    {hap_tiny.m}")
    print(f"  Generation:      {hap_tiny.generation}")
    print()
    print(f"  Sample IIDs (first 5): {hap_tiny.samples.iid[:5]}")
    print(f"  Variant IDs (first 5): {hap_tiny.variants.vid[:5]}")
    print(f"  Chromosomes (first 5): {hap_tiny.variants.chrom[:5]}")
    print(f"  Positions   (first 5): {hap_tiny.variants.pos_bp[:5]}")
    print()
    print("--- Small dataset ---")
    print(f"  Individuals (n): {hap_small.n}")
    print(f"  Variants (m):    {hap_small.m}")
else:
    print("Skipped: pygrgl not available.")

4.1 Allele Frequencies

Allele frequencies are computed via an UP traversal of the GRG graph: G^T @ 1 / (2n). This computes the count of alternate alleles across all haplotypes without materializing the genotype matrix.

The result is cached after the first call, so subsequent calls to recompute_af() return the cached array immediately.

[ ]:

if HAS_PYGRGL:
    af = hap_tiny.recompute_af()
    print(f"Allele frequency array shape: {af.shape}")
    print(f"AF range: [{af.min():.4f}, {af.max():.4f}]")
    print(f"Mean AF:  {af.mean():.4f}")
    print(f"First 10 AFs: {np.round(af[:10], 4)}")
    print()

    # Verify caching: second call returns the same object
    af2 = hap_tiny.recompute_af()
    print(f"Cached (same object): {af is af2}")
else:
    print("Skipped: pygrgl not available.")

4.2 Loading Without a BIM File

When no BIM file is provided, variant metadata is extracted directly from the GRG’s mutation records. Variant IDs are formatted as "{position}:{ref_allele}:{alt_allele}". Chromosome information is not available from GRG metadata alone.

[ ]:

if HAS_PYGRGL:
    hap_no_bim = load_grg(TINY_GRG)  # no bim_path
    print(f"Without BIM -- variant IDs (first 5): {hap_no_bim.variants.vid[:5]}")
    print(f"Without BIM -- chrom: {hap_no_bim.variants.chrom}")
    print()
    print(f"With BIM    -- variant IDs (first 5): {hap_tiny.variants.vid[:5]}")
    print(f"With BIM    -- chrom (first 5):       {hap_tiny.variants.chrom[:5]}")
else:
    print("Skipped: pygrgl not available.")

5. Matrix-Vector Operations

The core purpose of GraphHaplotypeOperator is to compute genotype-by-effect products without materializing the full (n x m) genotype matrix. The operator provides:

Method	Operation	Description
`matvec(v)`	G @ v	Diploid genotype matrix times vector
`rmatvec(v)`	G.T @ v	Transpose times vector
`matvec_maternal(v)`	G_mat @ v	Maternal haplotype times vector
`matvec_paternal(v)`	G_pat @ v	Paternal haplotype times vector
`standardized_matvec(v, af)`	(G - 2p) @ v	Centered genotype matvec

These are computed via pygrgl’s matmul function which traverses the GRG graph in either DOWN (samples from variants) or UP (variants from samples) direction.

We will verify that GRG matvec produces identical results to the dense materialization.

[ ]:

if HAS_PYGRGL:
    # Materialize to dense for comparison
    dense_tiny = hap_tiny.to_dense()
    print(f"Dense materialization: {type(dense_tiny).__name__}")
    print(f"Genotype array shape:  {dense_tiny.genotypes.shape}")
    print(f"Values are binary:     {set(np.unique(dense_tiny.genotypes))}")
else:
    print("Skipped: pygrgl not available.")

5.1 Diploid Matvec (G @ v)

The basic matvec computes the diploid genotype matrix times a vector. For a single effect vector of length m, the result is a vector of length n giving each individual’s genetic value.

[ ]:

if HAS_PYGRGL:
    rng = np.random.RandomState(42)
    v = rng.randn(hap_tiny.m)

    # GRG matvec
    result_grg = hap_tiny.matvec(v)

    # Dense matvec for comparison
    result_dense = dense_tiny.matvec(v)

    print(f"GRG matvec   shape: {result_grg.shape}")
    print(f"Dense matvec shape: {result_dense.shape}")
    print(f"Max absolute difference: {np.max(np.abs(result_grg - result_dense)):.2e}")
    print(f"Results match: {np.allclose(result_grg, result_dense, atol=1e-10)}")
    print()
    print(f"First 5 values (GRG):   {np.round(result_grg[:5], 4)}")
    print(f"First 5 values (dense): {np.round(result_dense[:5], 4)}")
else:
    print("Skipped: pygrgl not available.")

5.2 Multi-Column Matvec (G @ V)

Both matvec and rmatvec support 2-D inputs. If v has shape (m, k), the result is (n, k) – computing k genetic values simultaneously. This is efficient for multivariate traits.

[ ]:

if HAS_PYGRGL:
    V = rng.randn(hap_tiny.m, 3)  # 3 traits

    result_grg_2d = hap_tiny.matvec(V)
    result_dense_2d = dense_tiny.matvec(V)

    print(f"Input shape:  {V.shape}  (m x k_traits)")
    print(f"Output shape: {result_grg_2d.shape}  (n x k_traits)")
    print(f"Max difference: {np.max(np.abs(result_grg_2d - result_dense_2d)):.2e}")
    print(f"All close: {np.allclose(result_grg_2d, result_dense_2d, atol=1e-10)}")
else:
    print("Skipped: pygrgl not available.")

5.3 Maternal and Paternal Haplotype Matvec

GRG stores haploid data (2n haplotypes). The operator provides separate maternal and paternal matvec operations. The diploid matvec is the sum of maternal and paternal contributions: matvec(v) = matvec_maternal(v) + matvec_paternal(v).

[ ]:

if HAS_PYGRGL:
    v = np.ones(hap_tiny.m, dtype=np.float64)

    mat = hap_tiny.matvec_maternal(v)
    pat = hap_tiny.matvec_paternal(v)
    dip = hap_tiny.matvec(v)

    print(f"Maternal sum (first 5):  {mat[:5]}")
    print(f"Paternal sum (first 5):  {pat[:5]}")
    print(f"Diploid sum  (first 5):  {dip[:5]}")
    print(f"Mat + Pat    (first 5):  {(mat + pat)[:5]}")
    print()
    print(f"mat + pat == diploid: {np.allclose(mat + pat, dip, atol=1e-10)}")

    # Also verify against dense
    mat_dense = dense_tiny.matvec_maternal(v)
    pat_dense = dense_tiny.matvec_paternal(v)
    print(f"Maternal matches dense: {np.allclose(mat, mat_dense, atol=1e-10)}")
    print(f"Paternal matches dense: {np.allclose(pat, pat_dense, atol=1e-10)}")
else:
    print("Skipped: pygrgl not available.")

5.4 Standardized Matvec

The standardized_matvec computes (G - 2p) @ v without materializing G. This is the centered genotype matvec used by GeneticComponent when standardized=True (the default). The centering correction is computed as:

(G - 2p) @ v = G @ v - 2p @ v

Both terms are computed in compressed form: G @ v via graph traversal and 2p @ v as a simple dot product of allele frequencies with the effect vector.

[ ]:

if HAS_PYGRGL:
    v = rng.randn(hap_tiny.m)
    af = hap_tiny.recompute_af()

    std_grg = hap_tiny.standardized_matvec(v, af=af)
    std_dense = dense_tiny.standardized_matvec(v, af=af)

    print(f"Standardized matvec shape: {std_grg.shape}")
    print(f"Max difference: {np.max(np.abs(std_grg - std_dense)):.2e}")
    print(f"Results match: {np.allclose(std_grg, std_dense, atol=1e-10)}")
    print()
    print(f"Mean of centered result: {std_grg.mean():.6f}  (should be near 0)")
else:
    print("Skipped: pygrgl not available.")

5.5 Reverse Matvec (G^T @ v)

The reverse matvec computes G^T @ v via an UP traversal. This is useful for operations like computing allele-level statistics from individual-level data (e.g., GWAS score statistics).

[ ]:

if HAS_PYGRGL:
    v_n = rng.randn(hap_tiny.n)

    rmat_grg = hap_tiny.rmatvec(v_n)
    rmat_dense = dense_tiny.rmatvec(v_n)

    print(f"Input shape:  {v_n.shape}  (n,)")
    print(f"Output shape: {rmat_grg.shape}  (m,)")
    print(f"Max difference: {np.max(np.abs(rmat_grg - rmat_dense)):.2e}")
    print(f"Results match: {np.allclose(rmat_grg, rmat_dense, atol=1e-10)}")
else:
    print("Skipped: pygrgl not available.")

6. Dense Materialization and Subsetting

When you need direct access to the full genotype array (e.g., for visualization or debugging), use to_dense() to materialize the GRG into a DenseHaplotypeArray with shape (n, m, 2).

Indexing a GraphHaplotypeOperator with [] also triggers materialization, returning a dense subset.

[ ]:

if HAS_PYGRGL:
    # Full materialization
    dense = hap_tiny.to_dense()
    print(f"Full dense: {dense.genotypes.shape}")
    print(f"  n={dense.n}, m={dense.m}, generation={dense.generation}")
    print(f"  Unique values: {sorted(np.unique(dense.genotypes))}")
    print()

    # Subset: first 5 individuals, first 20 variants
    sub = hap_tiny[:5, :20]
    print(f"Subset [5 individuals, 20 variants]: {sub.genotypes.shape}")
    print(f"  Type: {type(sub).__name__}")
    print()

    # Show a small slice of the genotype matrix (diploid sums)
    diploid = sub.genotypes[:, :10, 0] + sub.genotypes[:, :10, 1]
    print("Diploid genotypes (5 individuals x 10 variants):")
    print(diploid)
else:
    print("Skipped: pygrgl not available.")

7. Running a Simulation with GRG-Backed Haplotypes

Since GraphHaplotypeOperator implements the HaplotypeOperator interface, it can be used directly as founder_haplotypes in Simulation. The architecture’s genetic components call matvec / standardized_matvec on the operator, which are dispatched to the GRG graph traversal.

We will use the small_clean dataset (100 individuals, 1000 variants) for a more realistic example.

Important: The founder population must have balanced sex for mating to work. We override the sample metadata to ensure this.

[ ]:

if HAS_PYGRGL:
    # Load the small dataset
    hap_founders = load_grg(SMALL_GRG, bim_path=SMALL_BIM)
    print(f"Founder haplotypes: {hap_founders}")

    # Assign balanced sex (required for mating)
    n = hap_founders.n
    sex = np.tile([0, 1], (n + 1) // 2)[:n]
    samples_with_sex = SampleMeta(
        iid=hap_founders.samples.iid,
        sex=sex,
        generation=0,
    )

    # Reconstruct the operator with balanced sex metadata
    hap_founders = GraphHaplotypeOperator(
        hap_founders._grg,
        samples=samples_with_sex,
        variants=hap_founders.variants,
    )
    print(f"With sex assigned: n={hap_founders.n}, sex counts={np.bincount(sex)}")
else:
    print("Skipped: pygrgl not available.")

[ ]:

if HAS_PYGRGL:
    m = hap_founders.m  # 1000 variants

    # Define effects and architecture
    eff = AdditiveEffects.from_h2(h2=0.5, m=m, seed=123)

    formula = """
    Y.G ~ genetic(eff)
    Y.E ~ noise(0.5)
    Y ~ Y.G + Y.E
    """

    arch = Architecture(formula=formula, effects={'eff': eff})
    print(arch)
else:
    print("Skipped: pygrgl not available.")

[ ]:

if HAS_PYGRGL:
    # Configure mating and recombination
    mating = RandomMating(offspring_per_pair=2)
    recomb = RecombinationMap.constant_map(m=m, p=0.5)

    # Create and run the simulation
    sim = Simulation(
        founder_haplotypes=hap_founders,
        architecture=arch,
        mating_regime=mating,
        recombination_map=recomb,
        retain_haplotypes=2,
        retain_phenotypes=5,
        statistics=[SampleStatistics()],
        seed=42,
    )

    sim.run(4)
    print(f"Simulation finished at generation {sim.generation}")
    print(sim)
else:
    print("Skipped: pygrgl not available.")

7.1 Inspect Results

After the simulation completes, we can inspect the phenotype history and statistics just as with dense-backed simulations.

[ ]:

if HAS_PYGRGL:
    print("Variance decomposition across generations:")
    print(f"{'Gen':<5} {'Var(Y)':>10} {'Var(G)':>10} {'Var(E)':>10} {'h2':>8}")
    print("-" * 48)

    for result in sim.results:
        stats = result.statistics['SampleStatistics']
        keys = stats['keys']
        variances = stats['var']

        y_idx = keys.index('Y')
        yg_idx = keys.index('Y.G')
        ye_idx = keys.index('Y.E')

        var_y = variances[y_idx]
        var_g = variances[yg_idx]
        var_e = variances[ye_idx]
        h2 = var_g / var_y if var_y > 0 else 0.0

        print(f"{result.generation:<5} {var_y:>10.4f} {var_g:>10.4f} {var_e:>10.4f} {h2:>8.4f}")
else:
    print("Skipped: pygrgl not available.")

[ ]:

if HAS_PYGRGL:
    # Show phenotype distribution at the final generation
    pheno = sim.phenotypes
    y = pheno['Y']
    yg = pheno['Y.G']
    ye = pheno['Y.E']

    print(f"Generation {sim.generation} phenotypes:")
    print(f"  n = {pheno.samples.n}")
    print(f"  Y:   mean={y.mean():.4f}, std={y.std():.4f}, range=[{y.min():.3f}, {y.max():.3f}]")
    print(f"  Y.G: mean={yg.mean():.4f}, std={yg.std():.4f}")
    print(f"  Y.E: mean={ye.mean():.4f}, std={ye.std():.4f}")
    print(f"  Cov(G, E): {np.cov(yg, ye)[0, 1]:.6f}  (should be near 0)")
else:
    print("Skipped: pygrgl not available.")

7.2 Haplotype Representation Across Generations

An important detail: the GRG-backed operator is only used for the founder generation. After meiosis, offspring genotypes are represented as DenseHaplotypeArray, because GRG has no native support for recombination.

This means generation 0 has a GraphHaplotypeOperator, while all subsequent generations have DenseHaplotypeArray.

[ ]:

if HAS_PYGRGL:
    print("Haplotype types by generation:")
    for gen, hap in sorted(sim.haplotype_history.items()):
        print(f"  Gen {gen}: {type(hap).__name__}  (n={hap.n}, m={hap.m})")

    print()
    print("Note: Only the founder generation (if retained) uses GraphHaplotypeOperator.")
    print("All offspring generations revert to DenseHaplotypeArray after meiosis.")
else:
    print("Skipped: pygrgl not available.")

8. Verifying GRG vs Dense Simulation Equivalence

To confirm that the GRG backend produces identical results to dense, we run the same simulation with dense-materialized founders and compare the generation-0 phenotypes. (Subsequent generations will differ due to random mating assignments, but the founder phenotypes should match exactly since they depend only on the genotypes and effect sizes.)

[ ]:

if HAS_PYGRGL:
    # Materialize founders to dense
    dense_founders = hap_founders.to_dense()
    print(f"Dense founders: {type(dense_founders).__name__} ({dense_founders.n} x {dense_founders.m})")

    # Run the same simulation with dense founders
    sim_dense = Simulation(
        founder_haplotypes=dense_founders,
        architecture=arch,
        mating_regime=mating,
        recombination_map=recomb,
        retain_haplotypes=2,
        retain_phenotypes=5,
        statistics=[SampleStatistics()],
        seed=42,
    )

    sim_dense.run(4)

    # Compare generation 0 phenotypes
    # Note: phenotype history retention may have pruned gen 0;
    # compare the earliest available generation
    gen0_grg = sim.results[0].statistics['SampleStatistics']
    gen0_dense = sim_dense.results[0].statistics['SampleStatistics']

    keys = gen0_grg['keys']
    print(f"\nGeneration 0 variance comparison (GRG vs Dense):")
    print(f"{'Component':<10} {'GRG Var':>10} {'Dense Var':>10} {'Match':>8}")
    print("-" * 42)
    for key in keys:
        idx = keys.index(key)
        v_grg = gen0_grg['var'][idx]
        v_dense = gen0_dense['var'][idx]
        match = np.isclose(v_grg, v_dense, atol=1e-8)
        print(f"{key:<10} {v_grg:>10.6f} {v_dense:>10.6f} {'YES' if match else 'NO':>8}")
else:
    print("Skipped: pygrgl not available.")

9. Current Limitations and Future Directions

The GRG integration has a few known limitations:

Meiosis materializes to dense. When GraphHaplotypeOperator.meiosis() is called, it first converts to a DenseHaplotypeArray via to_dense(), then performs standard dense meiosis. This means the GRG compression advantage is limited to the founder generation. Offspring are always dense.
Subsetting materializes to dense. Indexing with [] triggers full materialization. For large datasets, prefer using matvec / rmatvec over direct genotype access.
AF caching. recompute_af() caches its result. This is appropriate for a static founder GRG, but the cache is not invalidated if the underlying data were to change. (In practice, GRG objects are immutable, so this is not a problem.)
pygrgl dependency. The pygrgl package must be installed separately. All GRG imports are lazy (inside function bodies), so the rest of xftsim works without pygrgl.

Despite these limitations, the GRG backend is useful for:

Large founder populations where the full genotype matrix does not fit in memory but the GRG does.
Fast founder-generation phenotyping via graph-traversal matvec.
Real genotype data stored in GRG format (e.g., from tree sequence inference or direct GRG construction from VCF/BED files).

Summary

This notebook covered the GRG (Genealogical Record of Genotypes) integration in xftsim:

Loading GRG files with load_grg(), optionally enriched by a PLINK BIM file
Inspecting the GraphHaplotypeOperator: dimensions, sample/variant metadata, and allele frequencies computed via graph traversal
Matrix-vector operations: matvec, rmatvec, matvec_maternal, matvec_paternal, and standardized_matvec – all producing results identical to the dense representation
Full simulation using GRG-backed founder haplotypes, demonstrating seamless interchangeability with DenseHaplotypeArray
Limitations: meiosis and subsetting trigger dense materialization; GRG compression benefits are concentrated in the founder generation

The GraphHaplotypeOperator is a drop-in replacement for DenseHaplotypeArray in any xftsim workflow. As the GRG ecosystem matures, future versions may support graph-native recombination, extending the compression benefits beyond the founder generation.