Sibling Effects and Family Structure

This notebook demonstrates sibling aggregation effects in xftsim. In many behavioral and social traits, an individual’s phenotype is influenced by their siblings’ phenotypes (peer effects, shared environment, competition).

We will:

Define an architecture with a sibling mean effect
Run a simulation with multiple offspring per pair to ensure siblings exist
Compare sibling correlation with and without sibling effects
Use SibPairFilter to extract sibling pairs and compute ICC
Compare with HasemanElstonEstimator results
Demonstrate other sibling aggregation functions: sibling_count, sibling_any

1. Imports

[ ]:

import numpy as np

from xftsim.effect import AdditiveEffects
from xftsim.arch import Architecture
from xftsim.mate import RandomMating
from xftsim.reproduce import RecombinationMap
from xftsim.sim import Simulation
from xftsim.stats import SampleStatistics, HasemanElstonEstimator
from xftsim.filters import SibPairFilter
from xftsim.founders import founder_haplotypes_uniform_AFs

2. Define Architecture with Sibling Mean Effect

The phenotype model is:

Y = Y.G + Y.E + Y.sib

where: - Y.G = additive genetic value (h2 ~ 0.4) - Y.E = environmental noise (variance = 0.4) - Y.sib = sibling mean effect: the average of Y within each family (FID group)

The sibling_mean(Y) component averages the total phenotype Y across all individuals sharing the same FID. This creates a feedback: Y depends on Y.sib which depends on Y. At execution time, Y is computed from Y.G + Y.E first (since Y.sib requires Y as input), then Y.sib is computed, and finally Y is updated to include Y.sib.

Important: When using the formula parser, sibling_mean(Y) automatically detects that Y is an input. When adding programmatically, you must pass inputs=['Y'] explicitly.

[ ]:

np.random.seed(42)

n = 1000
m = 200

founder_hap = founder_haplotypes_uniform_AFs(n=n, m=m)
eff = AdditiveEffects.from_h2(h2=0.4, m=m, seed=123)

# Architecture with sibling mean effect
formula_sib = """
Y.G ~ genetic(eff)
Y.E ~ noise(0.4)
Y.sib ~ sibling_mean(Y)
Y ~ Y.G + Y.E + Y.sib
"""

arch_sib = Architecture(formula=formula_sib, effects={'eff': eff})
print(arch_sib)
print(f"\nNodes in execution order:")
for node in arch_sib.nodes:
    print(f"  {node.outputs} <- {node.component}  (inputs={node.inputs})")

3. Define a Control Architecture (No Sibling Effect)

For comparison, we build the same model without the sibling mean component. The noise variance is adjusted to keep total phenotypic variance comparable.

[ ]:

formula_ctrl = """
Y.G ~ genetic(eff)
Y.E ~ noise(0.6)
Y ~ Y.G + Y.E
"""

arch_ctrl = Architecture(formula=formula_ctrl, effects={'eff': eff})
print("Control architecture (no sibling effect):")
print(arch_ctrl)

4. Run Simulations

We use offspring_per_pair=4 to ensure each family has multiple siblings. With 2 offspring per pair, each family would have exactly 2 members, which still works but 4 gives richer family structure.

We include SibPairFilter in the filters dict and HasemanElstonEstimator in the statistics list. The HE estimator requires a SibPairFilter keyed by the name 'sibpair'.

Important: filters is a dict (not a list): {'sibpair': SibPairFilter()}.

[ ]:

mating = RandomMating(offspring_per_pair=4)
recomb = RecombinationMap.constant_map(m=m, p=0.5)

# Simulation WITH sibling effect
sim_sib = Simulation(
    founder_haplotypes=founder_hap,
    architecture=arch_sib,
    mating_regime=mating,
    recombination_map=recomb,
    retain_haplotypes=2,
    retain_phenotypes=5,
    statistics=[SampleStatistics(), HasemanElstonEstimator(filter_name='sibpair')],
    filters={'sibpair': SibPairFilter()},
    seed=42,
)

sim_sib.run(5)
print(f"Sibling effect simulation: generation {sim_sib.generation}")
print(f"Phenotype keys: {list(sim_sib.phenotypes.keys)}")

# Simulation WITHOUT sibling effect (control)
np.random.seed(42)
founder_hap_ctrl = founder_haplotypes_uniform_AFs(n=n, m=m)

sim_ctrl = Simulation(
    founder_haplotypes=founder_hap_ctrl,
    architecture=arch_ctrl,
    mating_regime=mating,
    recombination_map=recomb,
    retain_haplotypes=2,
    retain_phenotypes=5,
    statistics=[SampleStatistics(), HasemanElstonEstimator(filter_name='sibpair')],
    filters={'sibpair': SibPairFilter()},
    seed=42,
)

sim_ctrl.run(5)
print(f"Control simulation: generation {sim_ctrl.generation}")

5. Compare Sibling Correlations

The sibling mean effect should inflate sibling correlations relative to the control. Under the control (purely genetic), sibling correlation for the total phenotype Y is approximately h2/2 (for full siblings sharing ~50% of additive genetic variance). With the sibling effect, the correlation is inflated by the shared sibling mean component.

[ ]:

def compute_sibling_icc(sim, generation):
    """Compute sibling ICC for trait Y at a given generation."""
    sibpair_filter = SibPairFilter()
    view = sibpair_filter.apply(
        generation,
        sim.phenotype_history,
        sim.pedigree_history,
    )
    if view is None or view.n_pairs == 0:
        return None, 0

    y1 = view.sib1_phenotypes['Y']
    y2 = view.sib2_phenotypes['Y']

    # Sibling ICC via Pearson correlation of paired data
    r = np.corrcoef(y1, y2)[0, 1]
    return r, view.n_pairs

print("Sibling ICC for trait Y over generations:")
print(f"{'Gen':<5} {'Sib Effect ICC':<18} {'Control ICC':<18} {'n_pairs':<10}")
print("-" * 55)

for gen in range(1, 5):  # gen 0 has no siblings (founders)
    if gen in sim_sib.phenotype_history and gen in sim_ctrl.phenotype_history:
        icc_sib, n_pairs_sib = compute_sibling_icc(sim_sib, gen)
        icc_ctrl, n_pairs_ctrl = compute_sibling_icc(sim_ctrl, gen)

        icc_sib_str = f"{icc_sib:.3f}" if icc_sib is not None else "N/A"
        icc_ctrl_str = f"{icc_ctrl:.3f}" if icc_ctrl is not None else "N/A"

        print(f"{gen:<5} {icc_sib_str:<18} {icc_ctrl_str:<18} {n_pairs_sib:<10}")

print(f"\nExpected control ICC ~ h2/2 = {0.4/2:.2f}")
print("Sibling effect inflates ICC above the genetic-only prediction.")

6. Extract Sibling Pairs with SibPairFilter

The SibPairFilter groups individuals by FID and forms all unique within-family pairs. A SibPairView contains:

sib1_phenotypes: dict of name -> (n_pairs,) array
sib2_phenotypes: dict of name -> (n_pairs,) array
n_pairs: number of sibling pairs
sib1_idx, sib2_idx: original sample indices

[ ]:

# Extract sibling pairs from the last generation
sibpair_filter = SibPairFilter()
last_gen = sim_sib.generation

sib_view = sibpair_filter.apply(
    last_gen,
    sim_sib.phenotype_history,
    sim_sib.pedigree_history,
)

print(f"Sibling pairs at generation {last_gen}:")
print(f"  Number of pairs: {sib_view.n_pairs}")
print(f"  Phenotype keys: {list(sib_view.sib1_phenotypes.keys())}")
print(f"  Sib1 indices shape: {sib_view.sib1_idx.shape}")

# With offspring_per_pair=4, each family of 4 produces C(4,2) = 6 pairs
n_families = sim_sib.haplotypes.n // 4
expected_pairs = n_families * 6
print(f"  Expected pairs (~{n_families} families x 6): {expected_pairs}")

[ ]:

# Compute component-wise sibling correlations
print("Sibling correlations by component (with sibling effect):")
for key in sib_view.sib1_phenotypes:
    y1 = sib_view.sib1_phenotypes[key]
    y2 = sib_view.sib2_phenotypes[key]
    if len(y1) > 1:
        r = np.corrcoef(y1, y2)[0, 1]
        print(f"  {key:>6s}: ICC = {r:.3f}")

# Same for control
sib_view_ctrl = sibpair_filter.apply(
    last_gen, sim_ctrl.phenotype_history, sim_ctrl.pedigree_history
)

print("\nSibling correlations by component (control, no sibling effect):")
for key in sib_view_ctrl.sib1_phenotypes:
    y1 = sib_view_ctrl.sib1_phenotypes[key]
    y2 = sib_view_ctrl.sib2_phenotypes[key]
    if len(y1) > 1:
        r = np.corrcoef(y1, y2)[0, 1]
        print(f"  {key:>6s}: ICC = {r:.3f}")

7. Haseman-Elston Estimator Results

The HasemanElstonEstimator computes h2 estimates from sibling pair covariance. For full sibs under an additive model:

h2 = 2 * r_sib

where r_sib is the sibling intraclass correlation. With the sibling effect, the HE estimate will be inflated above the true genetic h2, because the sibling mean component adds non-genetic familial resemblance.

[ ]:

print("Haseman-Elston h2 estimates for Y over generations:")
print(f"{'Gen':<5} {'Sib Effect h2':<18} {'Control h2':<18} {'n_pairs':<10}")
print("-" * 55)

for i, (r_sib, r_ctrl) in enumerate(zip(sim_sib.results, sim_ctrl.results)):
    gen = r_sib.generation

    he_sib = r_sib.statistics.get('HasemanElstonEstimator')
    he_ctrl = r_ctrl.statistics.get('HasemanElstonEstimator')

    if he_sib is not None and 'Y' in he_sib:
        h2_sib = he_sib['Y']['h2']
        n_p = he_sib['Y']['n_pairs']
        h2_sib_str = f"{h2_sib:.3f}"
    else:
        h2_sib_str = "N/A"
        n_p = 0

    if he_ctrl is not None and 'Y' in he_ctrl:
        h2_ctrl = he_ctrl['Y']['h2']
        h2_ctrl_str = f"{h2_ctrl:.3f}"
    else:
        h2_ctrl_str = "N/A"

    print(f"{gen:<5} {h2_sib_str:<18} {h2_ctrl_str:<18} {n_p:<10}")

print(f"\nTrue genetic h2: 0.40")
print("The sibling effect inflates the HE h2 estimate above the true value.")

8. Other Sibling Aggregation Functions

xftsim provides several sibling aggregation functions beyond sibling_mean:

Function	Description
`sibling_mean(X)`	Mean of X within family
`sibling_sum(X)`	Sum of X within family
`sibling_count(X)`	Number of family members
`sibling_any(X)`	1.0 if any member has X > 0, else 0.0
`sibling_eldest(X)`	X value of the first (eldest) family member
`sibling_youngest(X)`	X value of the last (youngest) family member

All sibling functions group by FID by default. Grouping can be overridden with the | operator.

Let us demonstrate sibling_count and sibling_any in a new architecture.

[ ]:

# Architecture with sibling_count and sibling_any
formula_multi = """
Y.G ~ genetic(eff)
Y.E ~ noise(0.4)
Y ~ Y.G + Y.E
Y.fam_size ~ sibling_count(Y)
Y.any_positive ~ sibling_any(Y)
"""

arch_multi = Architecture(formula=formula_multi, effects={'eff': eff})
print(arch_multi)
print(f"\nNodes:")
for node in arch_multi.nodes:
    print(f"  {node.outputs} <- {node.component}  (inputs={node.inputs})")

[ ]:

# Run a quick simulation with the multi-sibling architecture
np.random.seed(42)
founder_hap_multi = founder_haplotypes_uniform_AFs(n=n, m=m)

sim_multi = Simulation(
    founder_haplotypes=founder_hap_multi,
    architecture=arch_multi,
    mating_regime=RandomMating(offspring_per_pair=4),
    recombination_map=recomb,
    retain_haplotypes=2,
    retain_phenotypes=3,
    statistics=[SampleStatistics()],
    seed=42,
)

sim_multi.run(3)
print(f"Simulation completed at generation {sim_multi.generation}")
print(f"Phenotype keys: {list(sim_multi.phenotypes.keys)}")

[ ]:

# Inspect the sibling components at the last generation
pheno = sim_multi.phenotypes

fam_size = pheno['Y.fam_size']
any_positive = pheno['Y.any_positive']

print("Family size (sibling_count):")
unique_sizes, size_counts = np.unique(fam_size, return_counts=True)
for sz, cnt in zip(unique_sizes, size_counts):
    print(f"  Size {int(sz)}: {cnt} individuals")

print(f"\nsibling_any(Y) -- proportion with any Y > 0 in family:")
print(f"  Fraction = 1.0: {np.mean(any_positive == 1.0):.3f}")
print(f"  Fraction = 0.0: {np.mean(any_positive == 0.0):.3f}")
print("  (Most families will have at least one member with Y > 0,")
print("   so fraction=1.0 dominates.)")

9. Variance Decomposition with Sibling Effect

Let us examine how the sibling mean effect contributes to total phenotypic variance. Since Y.sib is correlated with Y.G (siblings share genes) and Y.E (siblings share the sibling mean), the covariance between Y.sib and other components is informative.

[ ]:

# Get statistics from the sibling effect simulation
last_result = sim_sib.results[-1]
stats = last_result.statistics['SampleStatistics']
keys = stats['keys']
cov = stats['cov']

print(f"Covariance matrix at generation {last_result.generation}:")
print(f"Components: {keys}\n")

header = f"{'':>10s}" + "".join(f"{k:>10s}" for k in keys)
print(header)
for i, ki in enumerate(keys):
    row = f"{ki:>10s}" + "".join(f"{cov[i, j]:>10.4f}" for j in range(len(keys)))
    print(row)

# Variance proportions
if 'Y' in keys:
    y_idx = keys.index('Y')
    var_y = cov[y_idx, y_idx]
    print(f"\nTotal Var(Y) = {var_y:.4f}")
    for key in ['Y.G', 'Y.E', 'Y.sib']:
        if key in keys:
            k_idx = keys.index(key)
            var_k = cov[k_idx, k_idx]
            print(f"  Var({key}) = {var_k:.4f}  ({100*var_k/var_y:.1f}% of Var(Y))")

Summary

This notebook demonstrated:

Sibling mean effect via sibling_mean(Y) in the formula DSL
Family structure with offspring_per_pair=4 for multiple siblings per family
Sibling ICC inflation – the sibling effect increases sibling correlations beyond genetic prediction
SibPairFilter for extracting aligned sibling pair phenotype data
HasemanElstonEstimator for sibling-based heritability estimation
Other sibling functions: sibling_count and sibling_any
Variance decomposition showing how the sibling component contributes to total variance

Key findings: - Sibling effects inflate sibling ICC above the h2/2 prediction from genetics alone - This inflates HE h2 estimates, demonstrating confounding of genetic and shared-environment effects - The sibling aggregation functions (mean, count, any, etc.) provide flexible family-level features - All sibling functions group by FID by default; the | operator can override grouping