Core Data Structures

Foundational types used throughout xftsim: sample and variant metadata, haplotype operators, phenotype arrays, and pedigree arrays.

class xftsim.struct.GeneticMap(chrom, pos_bp, pos_cM)[source]

Bases: object

Map between physical and genetic distances.

Parameters:
  • chrom (Iterable) – Chromsomes variants are located on

  • pos_bp (Iterable) – Physical positions of variants

  • pos_cM (Iterable) – Map distances in cM

Parameters:
frame

Pandas DataFrame with the above columns

Type:

pd.DataFrame

chroms

Unique chromosomes present in map

Type:

np.ndarray

Parameters:
classmethod from_pyrho_maps(paths, sep='\\t', **kwargs)[source]

Construct genetic map objects from maps provided at https://github.com/popgenmethods/pyrho Please cite their work if you use their maps.

Parameters:
  • paths (Iterable) – Paths for each chromosome

  • sep (str, optional) – Passed to pd.read_csv()

  • **kwargs – Additional arguments to pd.read_csv()

Return type:

GeneticMap

Returns:

GeneticMap

Parameters:

paths (Iterable)

interpolate_cM_chrom(pos_bp, chrom, **kwargs)[source]

Interpolate cM values in a specified chromosome based on genetic map information.

Parameters:
  • pos_bp (Iterable) – Physical positions for which to interpolate cM values

  • chrom (str) – Chromosome on which to interpolate

  • **kwargs – Additional keyword arguments to be passed to scipy.interpolate.interp1d.

Parameters:
class xftsim.struct.SampleMeta(iid, fid=None, sex=None, generation=0, extra=<factory>)[source]

Bases: object

Immutable metadata for samples/individuals.

Parameters:
  • iid (np.ndarray) – Individual IDs (required).

  • fid (np.ndarray, optional) – Family IDs. Defaults to iid if not provided.

  • sex (np.ndarray, optional) – Biological sex (0=female, 1=male). Defaults to alternating 0,1.

  • generation (int, optional) – Generation number. Default is 0.

  • extra (dict, optional) – Arbitrary metadata arrays (ancestry PCs, batch IDs, etc.).

Parameters:
n

Number of individuals.

Type:

int

n_fam

Number of unique families.

Type:

int

n_female

Number of biological females (sex=0).

Type:

int

n_male

Number of biological males (sex=1).

Type:

int

Parameters:
iid: ndarray
fid: ndarray = None
sex: ndarray = None
generation: int = 0
extra: dict
property n: int

Number of individuals.

property n_fam: int

Number of unique families.

property n_female: int

Number of biological females (sex=0).

property n_male: int

Number of biological males (sex=1).

property unique_identifier: ndarray

Unique identifier for each sample, combining generation, iid, and fid. Format: ‘{generation}.{iid}.{fid}’

subset(idx)[source]

Return a new SampleMeta with a subset of samples.

Return type:

SampleMeta

with_generation(generation)[source]

Return a new SampleMeta with a different generation.

Parameters:

generation (int)

Return type:

SampleMeta

to_sample_index()[source]

Convert to legacy SampleIndex for compatibility with PhenotypeArray.

Return type:

SampleIndex

Returns:

xft.index.SampleIndex – A SampleIndex with the same data.

class xftsim.struct.VariantMeta(vid, chrom=None, pos_bp=None, pos_cM=None, af=None, zero_allele=None, one_allele=None, extra=<factory>)[source]

Bases: object

Immutable metadata for genetic variants.

Parameters:
  • vid (np.ndarray) – Variant IDs (required).

  • chrom (np.ndarray, optional) – Chromosome for each variant.

  • pos_bp (np.ndarray, optional) – Base pair position.

  • pos_cM (np.ndarray, optional) – Centimorgan position.

  • af (np.ndarray, optional) – Allele frequencies.

  • zero_allele (np.ndarray, optional) – Reference allele (e.g., ‘A’).

  • one_allele (np.ndarray, optional) – Alternate allele (e.g., ‘G’).

  • extra (dict, optional) – Arbitrary metadata arrays (annotation flags, etc.).

Parameters:
m

Number of variants.

Type:

int

Parameters:
vid: ndarray
chrom: ndarray = None
pos_bp: ndarray = None
pos_cM: ndarray = None
af: ndarray = None
zero_allele: ndarray = None
one_allele: ndarray = None
extra: dict
property m: int

Number of variants.

subset(idx)[source]

Return a new VariantMeta with a subset of variants.

Return type:

VariantMeta

to_variant_index(af=None)[source]

Convert to legacy DiploidVariantIndex for compatibility.

Parameters:

af (np.ndarray, optional) – Allele frequencies. If not provided, uses stored af or NaN.

Return type:

DiploidVariantIndex

Returns:

xft.index.DiploidVariantIndex – A DiploidVariantIndex with the same data.

Parameters:

af (ndarray)

class xftsim.struct.HaplotypeOperator[source]

Bases: ABC

Abstract base class for all genotype representations.

Defines the interface for matrix-vector operations on haplotype data, used by the architecture’s genetic components. All implementations must provide samples (SampleMeta) and variants (VariantMeta) attributes.

Concrete implementations:

  • DenseHaplotypeArray – NumPy-backed (n, m, 2) array

  • GraphHaplotypeOperator – GRG wrapper (pygrgl graph traversal)

samples

Sample metadata (set by concrete implementations).

Type:

SampleMeta

variants

Variant metadata (set by concrete implementations).

Type:

VariantMeta

abstract property n: int

Number of samples (individuals).

abstract property m: int

Number of diploid variants.

abstract matvec(v)[source]

Diploid genotype-vector product: G @ v.

Parameters:

v (np.ndarray) – Effect vector of shape (m,) or (m, k).

Return type:

ndarray

Returns:

np.ndarray – Result of shape (n,) or (n, k).

Parameters:

v (ndarray)

abstract rmatvec(v)[source]

Transpose genotype-vector product: G.T @ v.

Parameters:

v (np.ndarray) – Vector of shape (n,) or (n, k).

Return type:

ndarray

Returns:

np.ndarray – Result of shape (m,) or (m, k).

Parameters:

v (ndarray)

abstract matvec_maternal(v)[source]

Maternal haplotype matvec: hap[:,:,0] @ v.

Parameters:

v (np.ndarray) – Effect vector of shape (m,).

Return type:

ndarray

Returns:

np.ndarray – Result of shape (n,).

Parameters:

v (ndarray)

abstract matvec_paternal(v)[source]

Paternal haplotype matvec: hap[:,:,1] @ v.

Parameters:

v (np.ndarray) – Effect vector of shape (m,).

Return type:

ndarray

Returns:

np.ndarray – Result of shape (n,).

Parameters:

v (ndarray)

abstract standardized_matvec(v, af=None)[source]

Per-SNP standardized diploid matvec.

Computes ((G - 2p) / sqrt(2p(1-p))) @ v where p is the allele frequency vector. Each column is centered AND scaled to unit variance under HWE.

Parameters:
  • v (np.ndarray) – Effect vector of shape (m,).

  • af (np.ndarray, optional) – Allele frequencies. If None, recomputed from data.

Return type:

ndarray

Returns:

np.ndarray – Result of shape (n,).

Parameters:
abstract recompute_af()[source]

Recompute empirical allele frequencies from current data.

Return type:

ndarray

Returns:

np.ndarray – Allele frequencies of shape (m,).

abstract to_dense()[source]

Materialize as a DenseHaplotypeArray.

Return type:

DenseHaplotypeArray

Returns:

DenseHaplotypeArray – Dense representation of the haplotype data.

abstract meiosis(assignment, recombination_map, rng=None)[source]

Perform meiosis to produce offspring haplotypes.

Parameters:
  • assignment (MateAssignment) – Mate assignment with maternal/paternal indices and offspring metadata.

  • recombination_map (RecombinationMap) – Recombination probabilities between loci.

  • rng (numpy.random.RandomState, optional) – Master RNG used to derive per-offspring crossover seeds. Pass self.rng from the simulation loop to keep crossover sampling tied to the simulation’s seed; None falls back to a fresh RandomState (non-deterministic).

Return type:

HaplotypeOperator

Returns:

HaplotypeOperator – Offspring haplotypes.

class xftsim.struct.DenseHaplotypeArray(genotypes, generation=0, samples=None, variants=None)[source]

Bases: HaplotypeOperator

Dense numpy-backed haplotype array implementing both old HaplotypeArray interface and new HaplotypeOperator ABC.

Stores haplotypes as a 3D array with shape (n_samples, m_variants, 2) where the last dimension represents the two haplotype copies. Convention: genotypes[:,:,0] = maternal, genotypes[:,:,1] = paternal.

Parameters:
  • genotypes (np.ndarray) – 3D array of shape (n, m, 2) containing haplotype data.

  • generation (int, optional) – Generation number. Default is 0.

  • samples (SampleMeta, optional) – Sample metadata (iid, fid, sex).

  • variants (VariantMeta, optional) – Variant metadata (vid, chrom, pos_bp, pos_cM, af, alleles).

Parameters:
property generation: int

Generation number.

property n: int

Number of samples.

property m: int

Number of diploid variants.

property diploid_genotypes: ndarray

Return diploid genotype counts (0, 1, or 2) as 2D array (n, m).

property iid: ndarray

Individual IDs.

property fid: ndarray

Family IDs.

property sex: ndarray

Biological sex (0=female, 1=male).

property n_fam: int

Number of unique families.

property n_female: int

Number of biological females.

property n_male: int

Number of biological males.

property vid: ndarray

Variant IDs.

property chrom: ndarray | None

Chromosome for each variant.

property pos_bp: ndarray | None

Base pair position.

property pos_cM: ndarray | None

Centimorgan position.

property af: ndarray | None

Stored allele frequencies (from VariantMeta).

matvec(v)[source]

Diploid matrix-vector product: G @ v where G is the (n, m) dosage matrix.

Parameters:

v (ndarray)

Return type:

ndarray

rmatvec(v)[source]

Transpose matrix-vector product: G.T @ v.

Parameters:

v (ndarray)

Return type:

ndarray

standardized_matvec(v, af=None)[source]

Per-SNP standardized matvec: ((G - 2p) / sqrt(2p(1-p))) @ v.

Parameters:
Return type:

ndarray

standardized_rmatvec(v, af=None)[source]

Per-SNP standardized rmatvec: ((G - 2p) / sqrt(2p(1-p))).T @ v.

Parameters:
Return type:

ndarray

diploid_matvec(u)[source]

(G[:, :, 0] + G[:, :, 1]) @ u.

Parameters:

u (ndarray)

Return type:

ndarray

standardized_haploid_matvec(u, haploid)[source]

Standardized matvec for one haplotype (0 or 1): center & scale each variant column, then multiply by u.

Parameters:
Return type:

ndarray

subset(sample_idx=None, variant_idx=None, copy=True)[source]

Return a new DenseHaplotypeArray with a subset of samples and/or variants.

Parameters:
  • sample_idx (array-like or slice, optional) – Indices of samples to keep.

  • variant_idx (array-like or slice, optional) – Indices of variants to keep.

  • copy (bool, optional) – If True, copy the data. Default True.

Return type:

DenseHaplotypeArray

Returns:

DenseHaplotypeArray – Subsetted haplotype array.

Parameters:

copy (bool)

drop_isel(sample=None, variant=None)[source]

Drop samples or variants by index (xarray-style compatibility).

Parameters:
  • sample (array-like, optional) – Indices of samples to drop.

  • variant (array-like, optional) – Indices of variants to drop.

Return type:

DenseHaplotypeArray

Returns:

DenseHaplotypeArray – Haplotype array with specified samples/variants removed.

property shape: Tuple[int, int]

Shape as (n_samples, 2*m_variants) for compatibility with 2D expectations.

property data: ndarray

Return genotypes in 2D interleaved format (n, 2m) for compatibility.

property values: ndarray

Alias for data property.

property attrs: dict

Return attributes dict for compatibility with xarray interface.

property af_empirical: ndarray

Compute empirical allele frequencies from genotype data.

to_diploid_standardized(af=None, scale=False)[source]

Return standardized diploid genotypes.

Parameters:
  • af (np.ndarray, optional) – Allele frequencies to use for standardization. If None, uses empirical.

  • scale (bool, optional) – If True, scale by sqrt(2*p*(1-p)). Default False.

Return type:

ndarray

Returns:

np.ndarray – Standardized diploid genotypes (n, m).

Parameters:
get_sample_indexer()[source]

Create a SampleIndex from this haplotype array.

Deprecated since version Use: the samples attribute directly instead.

Return type:

SampleIndex

Returns:

xft.index.SampleIndex – Sample indexer with data from this array.

get_variant_indexer()[source]

Create a DiploidVariantIndex from this haplotype array.

Deprecated since version Use: the variants attribute directly instead.

Return type:

DiploidVariantIndex

Returns:

xft.index.DiploidVariantIndex – Variant indexer with data from this array.

matvec_maternal(v)[source]

Maternal haplotype matvec: hap[:,:,0] @ v.

Parameters:

v (ndarray)

Return type:

ndarray

matvec_paternal(v)[source]

Paternal haplotype matvec: hap[:,:,1] @ v.

Parameters:

v (ndarray)

Return type:

ndarray

recompute_af()[source]

Recompute and return empirical allele frequencies.

Return type:

ndarray

to_dense()[source]

Return self (already dense).

Return type:

DenseHaplotypeArray

meiosis(assignment, recombination_map, rng=None)[source]

Perform meiosis to produce offspring haplotypes.

Delegates to the existing numba-jitted _meiosis_3d kernel in reproduce.py.

Parameters:
  • assignment (MateAssignment) – Mate assignment with maternal/paternal indices and offspring metadata.

  • recombination_map (RecombinationMap) – Recombination probabilities between loci.

  • rng (numpy.random.RandomState, optional) – Master RNG; forwarded to reproduce.meiosis for per-offspring crossover seeding. None preserves the prior non-deterministic behavior.

Return type:

DenseHaplotypeArray

Returns:

DenseHaplotypeArray – Offspring haplotypes with inherited VariantMeta.

property xft: HaplotypeArrayAccessor

Return accessor object for compatibility with xarray .xft interface.

class xftsim.struct.GraphHaplotypeOperator(grg, generation=0, samples=None, variants=None)[source]

Bases: HaplotypeOperator

GRG-backed haplotype operator using pygrgl graph traversal.

Provides O(nodes)-per-variant matvec without materializing the full genotype matrix. After meiosis, offspring revert to DenseHaplotypeArray since GRG has no native recombination support.

Parameters:
  • grg (pygrgl.GRG) – Loaded GRG object (via pygrgl.load_immutable_grg).

  • generation (int) – Generation number (default 0).

  • samples (SampleMeta, optional) – Sample metadata. If None, extracted from GRG individual IDs.

  • variants (VariantMeta, optional) – Variant metadata. If None, extracted from GRG mutation data.

Parameters:
property n: int

Number of samples (individuals).

property m: int

Number of diploid variants.

property generation: int
matvec(v)[source]

Diploid G @ v via GRG DOWN traversal (by_individual=True).

Parameters:

v (ndarray)

Return type:

ndarray

rmatvec(v)[source]

G.T @ v via GRG UP traversal (by_individual=True).

Parameters:

v (ndarray)

Return type:

ndarray

matvec_maternal(v)[source]

Maternal haplotype matvec via GRG DOWN haploid, even indices.

Parameters:

v (ndarray)

Return type:

ndarray

matvec_paternal(v)[source]

Paternal haplotype matvec via GRG DOWN haploid, odd indices.

Parameters:

v (ndarray)

Return type:

ndarray

standardized_matvec(v, af=None)[source]

Per-SNP standardized diploid matvec (no materialization).

Computes ((G - 2p) / sqrt(2pq)) @ v = (G - 2p) @ (v / sqrt(2pq)).

Parameters:
Return type:

ndarray

recompute_af()[source]

Compute allele frequencies via GRG UP traversal: G.T @ 1 / (2n).

Return type:

ndarray

to_dense()[source]

Materialize full genotype matrix via identity matvec.

Return type:

DenseHaplotypeArray

meiosis(assignment, recombination_map, rng=None)[source]

Perform meiosis natively on the GRG via the bubble-insertion algorithm.

The underlying pygrgl.MutableGRG is mutated in place: offspring sample nodes and bubble nodes are added, then set_samples demotes the parent generation’s samples to internal nodes. The returned operator wraps the same GRG with the offspring SampleMeta; the original parent operator becomes stale.

Crossover positions are sampled per locus from recombination_map._probabilities (matching the dense _meiosis_3d kernel’s distribution), then translated into bp-space segments using self.variants.pos_bp.

Parameters:
  • assignment (MateAssignment) – Maternal/paternal indices into the parent generation and offspring SampleMeta.

  • recombination_map (RecombinationMap) – Per-locus recombination probabilities.

  • rng (numpy.random.RandomState, optional) – Master RNG. Used to derive one independent seed per offspring via SeedSequence.spawn; each seed is applied via np.random.seed immediately before that offspring’s two _meiosis_i phase draws, so crossover sampling is deterministic given rng’s state. Matches the seed-derivation strategy used by the dense kernel — both paths consume one rng.randint draw before spawning.

Return type:

GraphHaplotypeOperator

Returns:

GraphHaplotypeOperator – Offspring operator wrapping the (mutated) same GRG.

class xftsim.struct.StandardizedHaplotypeOperator(haplotypes, means=None, stds=None)[source]

Bases: HaplotypeOperator

Wraps a HaplotypeOperator so that matvec/rmatvec act on the column-standardized matrix S = (X - mu) / sigma without materializing S.

Identities used (mu, sigma broadcast across rows of X):

  • S @ u   = H.matvec(u / sigma) - <mu, u / sigma>

  • S.T @ v = (H.rmatvec(v) - mu * sum(v)) / sigma

Defaults follow the HWE convention used elsewhere: mu = 2p, sigma = sqrt(2 p (1 - p)) where p comes from H.recompute_af(). Loci with sigma == 0 (monomorphic) keep sigma = 1 to avoid division by zero, matching the existing standardized_matvec implementations.

All other HaplotypeOperator methods (matvec_maternal, matvec_paternal, recompute_af, to_dense, meiosis, __getitem__) forward to the underlying operator and return raw (un-standardized) results. Re-wrap explicitly if you need standardized semantics on the result.

Parameters:
  • haplotypes (HaplotypeOperator) – Underlying operator providing raw genotype matvec.

  • means (np.ndarray, optional) – Per-variant means (length m). Defaults to 2 * af.

  • stds (np.ndarray, optional) – Per-variant standard deviations (length m). Defaults to sqrt(2 * af * (1 - af)). Zeros are replaced with 1.

Parameters:
property samples
property variants
property n: int

Number of samples (individuals).

property m: int

Number of diploid variants.

property means: ndarray
property stds: ndarray
matvec(v)[source]

S @ v = H.matvec(v / sigma) - <mu, v / sigma>.

Parameters:

v (ndarray)

Return type:

ndarray

rmatvec(v)[source]

S.T @ v = (H.rmatvec(v) - mu * sum(v)) / sigma.

Parameters:

v (ndarray)

Return type:

ndarray

standardized_matvec(v, af=None)[source]

Already standardized: equivalent to matvec.

The af argument is ignored; standardization parameters are fixed at construction time.

Parameters:
Return type:

ndarray

matvec_maternal(v)[source]

Maternal haplotype matvec: hap[:,:,0] @ v.

Parameters:

v (np.ndarray) – Effect vector of shape (m,).

Return type:

ndarray

Returns:

np.ndarray – Result of shape (n,).

Parameters:

v (ndarray)

matvec_paternal(v)[source]

Paternal haplotype matvec: hap[:,:,1] @ v.

Parameters:

v (np.ndarray) – Effect vector of shape (m,).

Return type:

ndarray

Returns:

np.ndarray – Result of shape (n,).

Parameters:

v (ndarray)

recompute_af()[source]

Recompute empirical allele frequencies from current data.

Return type:

ndarray

Returns:

np.ndarray – Allele frequencies of shape (m,).

to_dense()[source]

Materialize as a DenseHaplotypeArray.

Return type:

DenseHaplotypeArray

Returns:

DenseHaplotypeArray – Dense representation of the haplotype data.

meiosis(assignment, recombination_map, rng=None)[source]

Perform meiosis to produce offspring haplotypes.

Parameters:
  • assignment (MateAssignment) – Mate assignment with maternal/paternal indices and offspring metadata.

  • recombination_map (RecombinationMap) – Recombination probabilities between loci.

  • rng (numpy.random.RandomState, optional) – Master RNG used to derive per-offspring crossover seeds. Pass self.rng from the simulation loop to keep crossover sampling tied to the simulation’s seed; None falls back to a fresh RandomState (non-deterministic).

Return type:

HaplotypeOperator

Returns:

HaplotypeOperator – Offspring haplotypes.

class xftsim.struct.HaplotypeArrayAccessor(haplotypes)[source]

Bases: object

Accessor class that mimics the xarray .xft interface for DenseHaplotypeArray. Provides compatibility with code expecting xarray-style access.

Parameters:

haplotypes (DenseHaplotypeArray)

property n: int

Number of samples.

property m: int

Number of variants.

property generation: int

Generation number.

property samples: SampleMeta

Sample metadata.

property variants: VariantMeta

Variant metadata.

property af_empirical: ndarray

Empirical allele frequencies.

get_sample_indexer()[source]

Get sample indexer (deprecated).

Return type:

SampleIndex

get_variant_indexer()[source]

Get variant indexer (deprecated).

Return type:

DiploidVariantIndex

to_diploid()[source]

Return diploid genotype counts (0, 1, or 2) as 2D array (n, m).

Return type:

ndarray

to_diploid_standardized(af=None, scale=False)[source]

Return standardized diploid genotypes.

Parameters:
Return type:

ndarray

class xftsim.struct.PhenotypeArray(samples, values=None)[source]

Bases: object

Thin wrapper around a flat dict of named 1-D arrays.

Each key is a component/phenotype name (e.g. ‘height.G’, ‘height’). The dot is purely a human convention — not parsed.

Parameters:
  • samples (SampleMeta) – Sample metadata that travels with the data.

  • values (dict, optional) – Initial name → (n,) array mapping.

Parameters:
property keys

Return the names of all stored components.

subset(idx)[source]

Return a new PhenotypeArray with a subset of samples.

Return type:

PhenotypeArray

class xftsim.struct.PedigreeArray(offspring_samples, maternal_idx, paternal_idx, parent_n)[source]

Bases: object

Integer index arrays linking offspring to parents.

Produced at reproduction time; consumed by parent/mother/father references and by filters (TrioFilter, SibPairFilter).

Parameters:
  • offspring_samples (SampleMeta) – Metadata for the offspring generation.

  • maternal_idx (np.ndarray) – (n,) indices into the parent generation’s SampleMeta for each offspring’s mother.

  • paternal_idx (np.ndarray) – (n,) indices into the parent generation’s SampleMeta for each offspring’s father.

  • parent_n (int) – Number of individuals in the parent generation (for bounds checking).

Parameters:
offspring_samples: SampleMeta
maternal_idx: ndarray
paternal_idx: ndarray
parent_n: int