Quickstart

In this brief tutorial, we’ll run a quick simulation that illustrates the impact of cross-trait assortative mating on the the Haseman-Elston regression genetic correlation estimator. In the process, we’ll learning about the essential components of an xftsim simulation.

Setting up a simulation

[1]:
import xftsim as xft
from xftsim.sim import Simulation
import numpy as np
import pandas as pd
import seaborn as sns

xft.config.print_durations_threshold=10. ## reduce verbosity
np.random.seed(123) ## set random seed for reproducibility

Simulation objects require the following arguments:

  • founder_haplotypes: xarray.core.dataarray.DataArray,

  • architecture: xftsim.arch.Architecture,

  • recombination_map: xftsim.reproduce.RecombinationMap,

  • mating_regime: xftsim.mate.MatingRegime,

Together, these comprise our initial generation’s haplotypes (founder_haplotypes), a phenogenetic architecture mapping genotypes to phenotypes (architecture), a method for assigning mates (mating_regime), and a recombination map for meiosis (recombination_map).

We can also supply the following iterables:

  • statistics: Iterable = [],

  • post_processors: Iterable = [],

which correspond to any estimators (statistics) or other procedures (post_processors) we’d like to run every generation.

For our founder haplotypes, we’ll simply generate independent binomial random variants for n=8000 individuals at m=1000 diploid sites:

[2]:
founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=8000,
                                                                 m=1000)
founder_haplotypes
[2]:
<xarray.DataArray 'HaplotypeArray' (sample: 8000, variant: 2000)>
array([[1, 0, 0, ..., 0, 1, 1],
       [1, 1, 0, ..., 1, 0, 0],
       [1, 1, 0, ..., 1, 0, 1],
       ...,
       [1, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 1, 0]], dtype=int8)
Coordinates: (12/13)
  * sample       (sample) <U16 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_7999.0_7999'
    iid          (sample) object '0_0' '0_1' '0_2' ... '0_7998' '0_7999'
    fid          (sample) object '0_0' '0_1' '0_2' ... '0_7998' '0_7999'
    sex          (sample) int64 0 1 0 1 0 1 0 1 0 1 0 ... 1 0 1 0 1 0 1 0 1 0 1
  * variant      (variant) <U5 '0.0' '0.1' '1.0' ... '998.1' '999.0' '999.1'
    vid          (variant) object '0' '0' '1' '1' ... '998' '998' '999' '999'
    ...           ...
    hcopy        (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
    zero_allele  (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
    one_allele   (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G'
    af           (variant) float64 0.6572 0.6572 0.3289 ... 0.1033 0.3359 0.3359
    pos_bp       (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
    pos_cM       (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
Attributes:
    generation:  0

For our genetic architecture, we will use the independent GCTA model, where for each individual we have

\[y = m^{-1/2}X\beta + e\]
\[\beta\overset{iid}{\sim}N(0, h^2)\]
\[e\sim N(0, 1-h^2)\]

for heritability \(h^2 \in [0,1]\). This can be implemented in xftsim via the arch.GCTA_Architecture class:

[3]:
architecture = xft.arch.GCTA_Architecture(
    h2=[.5,.5],
    phenotype_name=['pheno_1', 'pheno_2'],
    haplotypes=founder_haplotypes)

To keep things simple, we will use an exchangleable recombination map wherein recombination occurs between contiguous loci with independent probability p=.1:

[4]:
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes,
                                                                                p =.1)

Finally, we choose our mating regime to reflect cross-trait assortative mating on a linear combination of two phenotypes with an exchangeable correlation of 0.5. Concretely, given female and male phenotypes \(Y,\tilde{Y}\in R^{n\times2}\), respectively, we will find a mating permutation \(P\)

such that the empirical correlation is

\[\text{corr}(Y,P\tilde{Y})=0.5\cdot1_{2×2}.\]

We will have each couple generate two offspring, maintaining a constant population size. We implement this regime below:

[5]:
mating_regime = xft.mate.LinearAssortativeMatingRegime(r = .5,
    component_index = xft.index.ComponentIndex.from_product(['pheno_1', 'pheno_2'],
                                                            ['phenotype']),
    offspring_per_pair=2)

At each generation, we will request sample statistics and HE regression estimates. We will also use the xft.proc.LimitMemory post-processor to erase all but the most recent generation’s haplotypes in order to conserve memory:

[6]:
statistics = [xft.stats.MatingStatistics(),
              xft.stats.SampleStatistics(),
              xft.stats.HasemanElstonEstimator()]

post_processors = [xft.proc.LimitMemory(n_haplotype_generations=1)]

Running a simulation

Our simulation is constructed by supplying each of the above to the Simulation constructor:

[7]:
sim = Simulation(founder_haplotypes=founder_haplotypes,
                 mating_regime=mating_regime,
                 recombination_map=recombination_map,
                 architecture=architecture,
                 statistics=statistics,
                 post_processors=post_processors,)

We will run the simulation for 5 generations, which takes just a few seconds on a laptop:

[8]:
%time sim.run(5)
CPU times: user 18.3 s, sys: 5.36 s, total: 23.7 s
Wall time: 8.19 s

The most recent generations are accessible in as a dict object in sim.results, and all generations’ results are stored in sim.results_store. First let’s check that mating proceded as expected:

[9]:
## generation 0
sim.results_store[0]['mating_statistics']['mate_correlations']
[9]:
component pheno_1.phenotype.mother pheno_2.phenotype.mother pheno_1.phenotype.father pheno_2.phenotype.father
component
pheno_1.phenotype.mother 1.000000 -0.047681 0.463318 0.448655
pheno_2.phenotype.mother -0.047681 1.000000 0.449102 0.462603
pheno_1.phenotype.father 0.463318 0.449102 1.000000 -0.043387
pheno_2.phenotype.father 0.448655 0.462603 -0.043387 1.000000
[10]:
## generation 4
sim.results_store[4]['mating_statistics']['mate_correlations']
## same as
sim.results['mating_statistics']['mate_correlations']

[10]:
component pheno_1.phenotype.mother pheno_2.phenotype.mother pheno_1.phenotype.father pheno_2.phenotype.father
component
pheno_1.phenotype.mother 1.000000 0.094649 0.479426 0.511044
pheno_2.phenotype.mother 0.094649 1.000000 0.508266 0.498006
pheno_1.phenotype.father 0.479426 0.508266 1.000000 0.132462
pheno_2.phenotype.father 0.511044 0.498006 0.132462 1.000000

Next, let’s compare three quanitites across generations: - rho_beta_true, the correlation between true genetic effects for pheno_1 and pheno_2, which doesn’t vary over time - rho_score_true, the correlation between true polygenic scores for pheno_1 and pheno_2, which reflects changes in genetic architecture induced by xAM - rho_beta_HE, the HE-regression estimate of rho_beta_true, which is biased upwards under xAM

[11]:
results = pd.DataFrame.from_records([{'generation':key,
  'rho_beta_HE':value['HE_regression']['corr_HE'].iloc[1,0],
  'rho_score_true':value['sample_statistics']['vcov'].iloc[1,0],
  'rho_beta_true':sim.architecture.components[0].true_rho_beta[1,0]} for key,value in sim.results_store.items()]
                         )

pdat = pd.melt(results, id_vars='generation', var_name='quantity',
               value_name='genetic correlation measure')
sns.lineplot(data=pdat,
           x='generation',
           y='genetic correlation measure',
           hue='quantity',)
sns.scatterplot(data=pdat,
           x='generation',
           y='genetic correlation measure',
           hue='quantity',legend=False)
[11]:
<Axes: xlabel='generation', ylabel='genetic correlation measure'>
../_images/gettingstarted_quickstart_20_1.png

Writing output to disk

We save the generation 4 genotypes to disk. Here, we write to zarr, an efficient parallel format ideal for cases where we’d like to reload the data into python, but also to the plink binary format:

[12]:
xft.io.write_to_plink1(sim.haplotypes, '/tmp/plink_output')
/home/rsb/Dropbox/ftsim/xftsim/xftsim/io.py:194: UserWarning: Writing to the plink bfile format removes phase information
  warnings.warn("Writing to the plink bfile format removes phase information")
Writing BED: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.34it/s]
Writing FAM... done.
Writing BIM... done.

/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_write.py:261: FutureWarning: the 'line_terminator'' keyword is deprecated, use 'lineterminator' instead.
  df.to_csv(
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_write.py:288: FutureWarning: the 'line_terminator'' keyword is deprecated, use 'lineterminator' instead.
  df.to_csv(
[13]:
xft.io.save_haplotype_zarr(sim.haplotypes, '/tmp/zarr_output', mode='w')

We can convert the phenotypes to Pandas dataframes, which we can then save using any method we like:

[14]:
pheno_df = sim.phenotypes.xft.as_pd()
pheno_df
[14]:
phenotype_name pheno_1 pheno_2 pheno_1 pheno_2 pheno_1 pheno_2
component_name additiveGenetic additiveGenetic additiveNoise additiveNoise phenotype phenotype
vorigin_relative proband proband proband proband proband proband
iid fid sex
4_0 4_0 0 -1.966416 -1.934732 -0.002914 -0.362107 -1.969329 -2.296839
4_1 4_0 0 -2.926725 -1.776690 0.189355 0.207299 -2.737370 -1.569391
4_2 4_1 0 -1.092239 -1.597979 1.325921 0.199160 0.233683 -1.398819
4_3 4_1 0 -0.711714 -1.140154 -0.652625 1.048390 -1.364339 -0.091765
4_4 4_2 0 -0.574885 -0.999057 -0.593752 -0.475246 -1.168637 -1.474302
... ... ... ... ... ... ... ... ...
4_7995 4_3997 1 1.663567 1.608789 1.210872 0.930285 2.874439 2.539074
4_7996 4_3998 0 1.173014 0.511302 1.246304 -0.215005 2.419319 0.296297
4_7997 4_3998 1 1.179811 1.610954 -0.950225 0.467568 0.229586 2.078522
4_7998 4_3999 0 0.418514 1.636576 -0.764532 -0.758268 -0.346019 0.878308
4_7999 4_3999 0 2.185312 1.953812 0.093433 0.410642 2.278744 2.364453

8000 rows × 6 columns

[15]:
pheno_df.to_csv('/tmp/pheno_out.csv')