Getting Started: Univariate Simulation
This notebook demonstrates how to set up and run a simple forward-time genetics simulation using the new xftsim API. We will:
Generate founder haplotypes
Define a genetic architecture with a formula DSL
Configure mating and recombination
Run the simulation for 5 generations
Inspect the results
The phenotype model is: Y = G + E, where G is an additive genetic component with heritability h2 = 0.5, and E is independent environmental noise.
1. Imports
We import the core modules needed for a simulation: effect specifications, architecture (via formula DSL), mating, recombination, the simulation engine, founder haplotype generation, and statistics/filters.
[ ]:
import numpy as np
from xftsim.effect import AdditiveEffects
from xftsim.arch import Architecture
from xftsim.mate import RandomMating
from xftsim.reproduce import RecombinationMap
from xftsim.sim import Simulation
from xftsim.stats import SampleStatistics
from xftsim.founders import founder_haplotypes_uniform_AFs
2. Create Founder Haplotypes
We generate haplotypes for 1000 founder individuals at 200 biallelic variants. Allele frequencies are drawn uniformly between 0.1 and 0.9 (the default). The result is a DenseHaplotypeArray with shape (1000, 200, 2) – the last dimension holds the two haplotype copies (maternal and paternal).
[ ]:
np.random.seed(42)
n = 1000 # number of founders
m = 200 # number of variants
founder_hap = founder_haplotypes_uniform_AFs(n=n, m=m)
print(f"Founder haplotypes: {founder_hap.n} individuals, {founder_hap.m} variants")
print(f"Genotype array shape: {founder_hap.genotypes.shape}")
print(f"Sample metadata: {founder_hap.samples}")
3. Define Genetic Effects
AdditiveEffects.from_h2() generates effect sizes drawn from N(0, h2/m) so that E[sum(beta^2)] = h2 under standardized genotypes. All 200 variants are causal.
The standardized=True default means effects are calibrated for standardized genotypes (zero mean, unit variance per SNP).
[ ]:
eff = AdditiveEffects.from_h2(h2=0.5, m=m, seed=123)
print(eff)
print(f"Effect sizes shape: {eff.effects.shape}")
print(f"Sum of squared effects: {np.sum(eff.effects**2):.4f} (target ~ 0.5)")
4. Define the Architecture via Formula DSL
The architecture defines how phenotypes are computed from haplotypes each generation. We use the formula DSL with one statement per line:
Y.G ~ genetic(eff): genetic value = genotypes @ effect_sizesY.E ~ noise(0.5): environmental noise ~ N(0, 0.5)Y ~ Y.G + Y.E: total phenotype = genetic + environment
The effects dict maps the name 'eff' used in the formula to our AdditiveEffects object.
Important: the formula parser expects exactly one component per line. Do NOT write genetic(eff) + noise(0.5) on a single line.
[ ]:
formula = """
Y.G ~ genetic(eff)
Y.E ~ noise(0.5)
Y ~ Y.G + Y.E
"""
arch = Architecture(formula=formula, effects={'eff': eff})
print(arch)
print(f"\nNodes in topological order:")
for node in arch.nodes:
print(f" {node.outputs} <- {node.component}")
5. Configure Mating and Recombination
RandomMating pairs individuals randomly each generation. With offspring_per_pair=2, population size stays constant (each pair of parents produces 2 offspring).
RecombinationMap.constant_map creates free recombination (p=0.5) across all 200 loci, meaning crossover positions are independent at each variant.
[ ]:
mating = RandomMating(offspring_per_pair=2)
recomb = RecombinationMap.constant_map(m=m, p=0.5)
print(mating)
print(f"Recombination map: {m} variants, constant p=0.5")
6. Set Up and Run the Simulation
Simulation brings together all components. Key parameters:
retain_haplotypes=2: keep haplotypes from the last 2 generations in memoryretain_phenotypes=5: keep phenotypes from the last 5 generationsstatistics=[SampleStatistics()]: compute sample covariance each generationseed=42: for reproducibility
sim.run(n) runs n generations including generation 0 (the founders). So run(5) simulates generations 0, 1, 2, 3, 4.
[ ]:
sim = Simulation(
founder_haplotypes=founder_hap,
architecture=arch,
mating_regime=mating,
recombination_map=recomb,
retain_haplotypes=2,
retain_phenotypes=5,
statistics=[SampleStatistics()],
seed=42,
)
sim.run(5)
print(f"Simulation finished at generation {sim.generation}")
print(sim)
7. Access Phenotype History
After running, sim.phenotype_history is a dict mapping generation number to PhenotypeArray. Each phenotype array stores named 1-D arrays (the components and total phenotype) and has .samples metadata.
Note: pheno.keys is a property (not a method call), returning dict_keys.
[ ]:
print("Generations with phenotypes stored:", list(sim.phenotype_history.keys()))
# Access the current (latest) generation's phenotypes
pheno = sim.phenotypes # shortcut for phenotype_history[sim.generation]
print(f"\nCurrent generation ({sim.generation}) phenotype:")
print(f" Keys: {list(pheno.keys)}")
print(f" n individuals: {pheno.samples.n}")
print(f" Y mean: {np.mean(pheno['Y']):.4f}")
print(f" Y std: {np.std(pheno['Y']):.4f}")
8. Access Haplotype History
Haplotype history is also a dict keyed by generation. With retain_haplotypes=2, only the most recent 2 generations are kept in memory (older ones are pruned to save RAM).
[ ]:
print("Generations with haplotypes stored:", list(sim.haplotype_history.keys()))
current_hap = sim.haplotypes # shortcut for haplotype_history[sim.generation]
print(f"\nCurrent generation haplotypes:")
print(f" {current_hap.n} individuals, {current_hap.m} variants")
9. Inspect Statistics Results
sim.results is a list of GenerationResult objects, one per generation. Each result has .generation (int) and .statistics (dict).
Since we used SampleStatistics(), each generation’s statistics contain: - 'cov': the sample covariance matrix across all phenotype keys - 'var': the diagonal (variances) - 'keys': the phenotype component names
[ ]:
print(f"Number of results: {len(sim.results)}")
print()
for result in sim.results:
stats = result.statistics['SampleStatistics']
keys = stats['keys']
variances = stats['var']
# Find indices for specific components
y_idx = keys.index('Y')
yg_idx = keys.index('Y.G')
ye_idx = keys.index('Y.E')
var_y = variances[y_idx]
var_g = variances[yg_idx]
var_e = variances[ye_idx]
h2_realized = var_g / var_y
print(f"Gen {result.generation}: Var(Y)={var_y:.3f}, Var(G)={var_g:.3f}, "
f"Var(E)={var_e:.3f}, h2={h2_realized:.3f}")
10. Variance Decomposition
Let us verify that the phenotypic variance decomposes additively: Var(Y) should approximately equal Var(G) + Var(E), since G and E are independent.
Note that the realized heritability may differ from the design h2=0.5 because standardized_matvec uses centered-but-unscaled genotypes. The realized Var(G) depends on the allele frequency distribution.
[ ]:
# Check variance additivity for the founder generation
pheno_0 = sim.phenotype_history[min(sim.phenotype_history.keys())]
gen_0_label = min(sim.phenotype_history.keys())
var_g = np.var(pheno_0['Y.G'])
var_e = np.var(pheno_0['Y.E'])
var_y = np.var(pheno_0['Y'])
cov_ge = np.cov(pheno_0['Y.G'], pheno_0['Y.E'])[0, 1]
print(f"Generation {gen_0_label} variance decomposition:")
print(f" Var(G): {var_g:.4f}")
print(f" Var(E): {var_e:.4f}")
print(f" Cov(G, E): {cov_ge:.4f} (should be ~0)")
print(f" Var(G) + Var(E): {var_g + var_e:.4f}")
print(f" Var(Y): {var_y:.4f}")
print(f" Ratio: {var_y / (var_g + var_e):.4f} (should be ~1.0)")
Summary
In this notebook we covered the basic xftsim workflow:
Founder haplotypes – generated from uniform allele frequencies
Effect sizes –
AdditiveEffects.from_h2()targeting a design heritabilityArchitecture – defined via the formula DSL with one component per line
Mating + recombination –
RandomMatingandRecombinationMap.constant_mapSimulation –
Simulation.run()executing the full forward-time loopResults – accessing
phenotype_history,haplotype_history, andresults
The next notebooks cover bivariate traits with assortative mating and vertical transmission.