GWAS and Polygenic Scores

This notebook demonstrates how to run a genome-wide association study (GWAS) on simulated data and compute polygenic scores (PGS) using xftsim.

We will:

  1. Run a simulation with known genetic architecture (h2 = 0.5)

  2. Run GWAS on the simulated data using GWAS(haplotypes, phenotypes)

  3. Examine Manhattan-style output (variant index vs -log10(p))

  4. Show beta recovery (true effects vs estimated effects scatter)

  5. Compute PGS with true weights and with GWAS-estimated weights

  6. Compare PGS R-squared with known h2

  7. Use UnrelatedFilter to obtain independent individuals for GWAS

1. Imports

[ ]:
import numpy as np

from xftsim.effect import AdditiveEffects
from xftsim.arch import Architecture
from xftsim.mate import RandomMating
from xftsim.reproduce import RecombinationMap
from xftsim.sim import Simulation
from xftsim.stats import SampleStatistics
from xftsim.filters import UnrelatedFilter
from xftsim.founders import founder_haplotypes_uniform_AFs
from xftsim.gwas import GWAS, PGS

2. Run a Simulation with Known Genetic Architecture

We simulate a simple univariate trait with h2 = 0.5 using 2000 individuals and 500 variants. A larger sample size gives GWAS more power to detect effects. We run for 3 generations to create family structure (needed later for demonstrating the UnrelatedFilter).

[ ]:
np.random.seed(42)

n = 2000   # number of founders
m = 500    # number of variants

founder_hap = founder_haplotypes_uniform_AFs(n=n, m=m)
eff = AdditiveEffects.from_h2(h2=0.5, m=m, seed=123)

formula = """
Y.G ~ genetic(eff)
Y.E ~ noise(0.5)
Y ~ Y.G + Y.E
"""

arch = Architecture(formula=formula, effects={'eff': eff})
mating = RandomMating(offspring_per_pair=2)
recomb = RecombinationMap.constant_map(m=m, p=0.5)

sim = Simulation(
    founder_haplotypes=founder_hap,
    architecture=arch,
    mating_regime=mating,
    recombination_map=recomb,
    retain_haplotypes=2,
    retain_phenotypes=3,
    statistics=[SampleStatistics()],
    seed=42,
)

sim.run(3)
print(f"Simulation completed: generation {sim.generation}")
print(f"Phenotype keys: {list(sim.phenotypes.keys)}")
print(f"n = {sim.haplotypes.n}, m = {sim.haplotypes.m}")

3. Run GWAS on the Latest Generation

GWAS(haplotypes, phenotypes) performs per-variant OLS regression of each phenotype on genotype dosage. It returns a dict of GWASResult objects, one per phenotype key.

Each GWASResult contains: - beta: regression coefficients (m,) - se: standard errors (m,) - t_stat: t-statistics (m,) - p_value: two-sided p-values (m,) - af: allele frequencies (m,) - n: sample size

[ ]:
hap = sim.haplotypes
pheno = sim.phenotypes

gwas = GWAS(hap, pheno)
results = gwas.run(['Y'])  # run GWAS for the total phenotype Y

r = results['Y']
print(f"GWAS sample size: n = {r.n}")
print(f"Number of variants tested: m = {len(r.beta)}")
print(f"Mean |beta|: {np.mean(np.abs(r.beta)):.4f}")
print(f"Median p-value: {np.median(r.p_value):.4f}")
print(f"Number of genome-wide significant hits (p < 0.05/{m}): "
      f"{np.sum(r.p_value < 0.05 / m)}")

4. Manhattan-Style Output

A Manhattan plot shows -log10(p-value) vs variant index. Since all variants are causal in this simulation (no null variants), we expect a relatively uniform distribution of signals, with larger-effect variants showing smaller p-values.

We print the top 20 most significant variants and their effect sizes.

[ ]:
neg_log10_p = -np.log10(np.maximum(r.p_value, 1e-300))  # avoid log(0)

# Top 20 most significant variants
top_idx = np.argsort(r.p_value)[:20]

print("Top 20 most significant variants:")
print(f"{'Variant':>8s}  {'Beta_hat':>10s}  {'SE':>10s}  {'-log10(p)':>10s}  {'AF':>6s}")
print("-" * 50)
for i in top_idx:
    print(f"{i:>8d}  {r.beta[i]:>10.4f}  {r.se[i]:>10.4f}  "
          f"{neg_log10_p[i]:>10.2f}  {r.af[i]:>6.3f}")
[ ]:
# Distribution summary of -log10(p)
bonferroni_threshold = -np.log10(0.05 / m)

print(f"\n-log10(p) distribution:")
print(f"  Min:    {np.min(neg_log10_p):.2f}")
print(f"  Median: {np.median(neg_log10_p):.2f}")
print(f"  Max:    {np.max(neg_log10_p):.2f}")
print(f"  Bonferroni threshold (-log10(0.05/{m})): {bonferroni_threshold:.2f}")
print(f"  Variants exceeding threshold: {np.sum(neg_log10_p > bonferroni_threshold)} / {m}")

5. Beta Recovery: True vs Estimated Effects

Since we know the true effect sizes from AdditiveEffects, we can compare them to the GWAS-estimated betas. The correlation between true and estimated effects indicates how well GWAS recovers the genetic signal.

Note: the true effects are defined on standardized genotypes, but the GWAS estimates are on raw dosage (0/1/2). To compare fairly, we need to account for the scaling difference. For variant j with allele frequency p_j, the relationship is: beta_raw = beta_std / sqrt(2 * p_j * (1 - p_j)).

[ ]:
true_effects = eff.effects  # (m,) standardized effects

# Scale true effects to raw dosage scale for comparison
af = r.af
scale = np.sqrt(2 * af * (1 - af))
scale = np.where(scale > 0, scale, 1.0)  # guard against monomorphic variants
true_effects_raw = true_effects / scale

# Correlation between true and estimated effects
beta_corr = np.corrcoef(true_effects_raw, r.beta)[0, 1]
print(f"Correlation between true and estimated effects: {beta_corr:.3f}")

# Summary statistics
print(f"\nTrue effects (raw scale): mean = {np.mean(true_effects_raw):.4f}, "
      f"std = {np.std(true_effects_raw):.4f}")
print(f"GWAS estimates:           mean = {np.mean(r.beta):.4f}, "
      f"std = {np.std(r.beta):.4f}")

6. Compute PGS with True Weights

PGS(haplotypes, weights) computes polygenic scores = haplotypes @ weights.

When standardized=True, the genotypes are centered by 2*AF before the matrix-vector product (matching the standardized effect definition). When standardized=False (default), raw dosage is used.

PGS with true (oracle) weights should correlate with the true genetic component Y.G. The R-squared of PGS vs Y should approximate h2.

[ ]:
# PGS with true standardized weights
pgs_true = PGS(hap, weights=true_effects, standardized=True)
scores_true = pgs_true.score()

# Verify: PGS should be identical to Y.G (the genetic component)
yg = pheno['Y.G']
y = pheno['Y']

r2_pgs_yg = np.corrcoef(scores_true, yg)[0, 1] ** 2
r2_pgs_y = np.corrcoef(scores_true, y)[0, 1] ** 2

print(f"PGS (true weights) vs Y.G:  R-squared = {r2_pgs_yg:.4f}  (should be ~1.0)")
print(f"PGS (true weights) vs Y:    R-squared = {r2_pgs_y:.4f}  (should be ~h2 = 0.5)")

7. Compute PGS with GWAS-Estimated Weights

In practice, we do not know the true effects. Instead, we use GWAS estimates as PGS weights. This introduces estimation noise, so the PGS R-squared will be lower than with true weights.

When GWAS is run on the same sample that PGS is computed on (in-sample), the R-squared is upward biased. In a proper evaluation, GWAS is run on a discovery sample and PGS is evaluated in a held-out target sample.

[ ]:
# PGS with GWAS-estimated weights (raw dosage scale)
pgs_gwas = PGS(hap, weights=r.beta, standardized=False)
scores_gwas = pgs_gwas.score()

r2_gwas_y = np.corrcoef(scores_gwas, y)[0, 1] ** 2
r2_gwas_yg = np.corrcoef(scores_gwas, yg)[0, 1] ** 2

print(f"PGS (GWAS weights) vs Y:    R-squared = {r2_gwas_y:.4f}  (in-sample, biased upward)")
print(f"PGS (GWAS weights) vs Y.G:  R-squared = {r2_gwas_yg:.4f}")
print(f"\nComparison:")
print(f"  True weights R2(Y):  {r2_pgs_y:.4f}")
print(f"  GWAS weights R2(Y):  {r2_gwas_y:.4f}")
print(f"  Design h2:           0.500")

8. Using UnrelatedFilter for GWAS

After multiple generations of mating, individuals share family structure (siblings have the same FID). Running GWAS on related individuals inflates test statistics. The UnrelatedFilter selects one individual per family, yielding an independent subsample.

We apply the filter manually and pass the resulting indices to GWAS via the sample_indices parameter.

[ ]:
# Apply UnrelatedFilter to get one individual per family
unrelated_filter = UnrelatedFilter()
unrelated_view = unrelated_filter.apply(
    sim.generation,
    sim.phenotype_history,
    sim.pedigree_history,
)

print(f"Total individuals: {hap.n}")
print(f"Unrelated subsample: {len(unrelated_view.indices)}")
print(f"Reduction: {1 - len(unrelated_view.indices) / hap.n:.1%}")
[ ]:
# Run GWAS on unrelated individuals only
gwas_unrelated = GWAS(hap, pheno, sample_indices=unrelated_view.indices)
results_unrelated = gwas_unrelated.run(['Y'])

r_unrel = results_unrelated['Y']
print(f"Unrelated GWAS sample size: n = {r_unrel.n}")
print(f"Number of significant hits (p < 0.05/{m}): "
      f"{np.sum(r_unrel.p_value < 0.05 / m)}")

# Compare effect estimates
beta_corr_unrel = np.corrcoef(true_effects_raw, r_unrel.beta)[0, 1]
print(f"\nCorrelation(true beta, estimated beta):")
print(f"  Full sample:     {beta_corr:.3f}  (n = {r.n})")
print(f"  Unrelated only:  {beta_corr_unrel:.3f}  (n = {r_unrel.n})")

9. Out-of-Sample PGS Evaluation

A fair PGS evaluation uses weights estimated in one sample (discovery) and scores computed in a different sample (target). We can approximate this by running GWAS on the unrelated subsample and evaluating PGS on the remaining (related) individuals.

[ ]:
# Target sample: everyone NOT in the unrelated discovery set
all_indices = np.arange(hap.n)
target_mask = np.ones(hap.n, dtype=bool)
target_mask[unrelated_view.indices] = False
target_indices = all_indices[target_mask]

print(f"Discovery sample (unrelated): n = {len(unrelated_view.indices)}")
print(f"Target sample (remaining):    n = {len(target_indices)}")

# Compute PGS on target sample using discovery GWAS weights
target_hap = hap[target_indices]
pgs_oos = PGS(target_hap, weights=r_unrel.beta, standardized=False)
scores_oos = pgs_oos.score()

target_y = pheno['Y'][target_indices]
target_yg = pheno['Y.G'][target_indices]

r2_oos = np.corrcoef(scores_oos, target_y)[0, 1] ** 2
r2_oos_g = np.corrcoef(scores_oos, target_yg)[0, 1] ** 2

print(f"\nOut-of-sample PGS R-squared:")
print(f"  vs Y:    {r2_oos:.4f}")
print(f"  vs Y.G:  {r2_oos_g:.4f}")
print(f"\nDesign h2: 0.500")

10. Running GWAS on All Phenotype Components

By default, gwas.run() with no arguments runs GWAS on all phenotype keys. This lets us verify that the genetic component Y.G has the strongest association signal, while the noise component Y.E should show no signal.

[ ]:
# Run GWAS on all components
all_results = gwas.run()  # no keys argument -> all keys

print(f"GWAS run on keys: {list(all_results.keys())}\n")

for key in all_results:
    res = all_results[key]
    n_sig = np.sum(res.p_value < 0.05 / m)
    median_p = np.median(res.p_value)
    print(f"  {key:>5s}: {n_sig:>4d} significant variants, median p = {median_p:.4f}")

print("\nAs expected, Y.G (genetic value) has the strongest signal,")
print("Y.E (noise) has near-null signal, and Y (total) is intermediate.")

Summary

This notebook demonstrated:

  1. GWAS via GWAS(haplotypes, phenotypes).run() – vectorized per-variant OLS

  2. GWASResult fields: beta, se, t_stat, p_value, af, n

  3. Beta recovery – comparing true and estimated effect sizes

  4. PGS via PGS(haplotypes, weights).score() – with true or estimated weights

  5. Standardized vs raw weights and the standardized=True option

  6. UnrelatedFilter for extracting independent individuals from family-structured data

  7. Out-of-sample evaluation – discovery GWAS in one sample, PGS in another

Key observations: - PGS with true weights gives R-squared approximately equal to h2 - PGS with GWAS weights is noisier, especially with smaller sample sizes - In-sample PGS R-squared is upward biased; out-of-sample is more honest - UnrelatedFilter selects one individual per family for proper GWAS inference