Statistics

Overview

We provide a number of built in procedures for computing sample statistics and running various common estimation procedures (e.g., variance-component estimation, association studies, and the like). During each generation of a simulation, each procedure provided as a statistics attribute of a sim.Simulation object will be run. Of course, you can also configure xftsim to write genotypes and phenotypes to disk each generation and can thereby use any external software you like, as is detailed in the post-processing section.

As a starting point, we’ll initialize a small simulation that incorporates three Statistic objects:

[1]:
import xftsim as xft
from xftsim import stats
xft.config.print_durations_threshold = 10
n=2000; m=800

founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=n, m=m)
architecture = xft.arch.GCTA_Architecture(h2=[.5,.4], phenotype_name=['height', 'BMD'],
                                          haplotypes=founder_haplotypes)
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes,
                                                                                p =.1)
mating_regime = xft.mate.RandomMatingRegime(mates_per_female=1,
                                            offspring_per_pair=2)

sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
                         architecture=architecture,
                         recombination_map=recombination_map,
                         mating_regime=mating_regime,
                         statistics = [])
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  comp_type[component_name==key] = value
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  comp_type[component_name==key] = value

We didn’t specify any statistics, so the simulation will just proceed through phenotype construction and mating each generation without saving any additional results.

Accessing results

After adding a Statistic, we’ll begin to accumulate results in the results_store attribute as we iterate. The results_store is just a dict of dicts that we can print nicely using xft.utils.print_tree():

[2]:
sim.statistics = [stats.SampleStatistics(means=False,
                                         variances=False,
                                         variance_components = False)]
sim.run(2)
xft.utils.print_tree(sim.results_store)
0:
|__sample_statistics:
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
1:
|__sample_statistics:
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>

The results_store is a dict of dicts of dicts indexed by generation, than statistic name, than statistic component. For convenience, calling sim.results returns the most recent generation only, and is equivalent to calling sim.results_store[sim.generation]:

[3]:
sim.results
[3]:
{'sample_statistics': {'vcov': phenotype_name                                           height  \
  component_name                                  additiveGenetic
  vorigin_relative                                        proband
  phenotype_name component_name  vorigin_relative
  height         additiveGenetic proband                 0.490407
  BMD            additiveGenetic proband                -0.014808
  height         additiveNoise   proband                -0.004141
  BMD            additiveNoise   proband                 0.000860
  height         phenotype       proband                 0.486266
  BMD            phenotype       proband                -0.013948

  phenotype_name                                              BMD        height  \
  component_name                                  additiveGenetic additiveNoise
  vorigin_relative                                        proband       proband
  phenotype_name component_name  vorigin_relative
  height         additiveGenetic proband                -0.014808     -0.004141
  BMD            additiveGenetic proband                 0.398832     -0.008932
  height         additiveNoise   proband                -0.008932      0.494048
  BMD            additiveNoise   proband                 0.003188      0.018640
  height         phenotype       proband                -0.023740      0.489907
  BMD            phenotype       proband                 0.402021      0.009708

  phenotype_name                                            BMD    height  \
  component_name                                  additiveNoise phenotype
  vorigin_relative                                      proband   proband
  phenotype_name component_name  vorigin_relative
  height         additiveGenetic proband               0.000860  0.486266
  BMD            additiveGenetic proband               0.003188 -0.023740
  height         additiveNoise   proband               0.018640  0.489907
  BMD            additiveNoise   proband               0.598778  0.019500
  height         phenotype       proband               0.019500  0.976173
  BMD            phenotype       proband               0.601967 -0.004240

  phenotype_name                                        BMD
  component_name                                  phenotype
  vorigin_relative                                  proband
  phenotype_name component_name  vorigin_relative
  height         additiveGenetic proband          -0.013948
  BMD            additiveGenetic proband           0.402021
  height         additiveNoise   proband           0.009708
  BMD            additiveNoise   proband           0.601967
  height         phenotype       proband          -0.004240
  BMD            phenotype       proband           1.003987  ,
  'corr': phenotype_name                                           height  \
  component_name                                  additiveGenetic
  vorigin_relative                                        proband
  phenotype_name component_name  vorigin_relative
  height         additiveGenetic proband                 1.000000
  BMD            additiveGenetic proband                -0.033483
  height         additiveNoise   proband                -0.008413
  BMD            additiveNoise   proband                 0.001587
  height         phenotype       proband                 0.702801
  BMD            phenotype       proband                -0.019878

  phenotype_name                                              BMD        height  \
  component_name                                  additiveGenetic additiveNoise
  vorigin_relative                                        proband       proband
  phenotype_name component_name  vorigin_relative
  height         additiveGenetic proband                -0.033483     -0.008413
  BMD            additiveGenetic proband                 1.000000     -0.020121
  height         additiveNoise   proband                -0.020121      1.000000
  BMD            additiveNoise   proband                 0.006524      0.034271
  height         phenotype       proband                -0.038047      0.705449
  BMD            phenotype       proband                 0.635315      0.013785

  phenotype_name                                            BMD    height  \
  component_name                                  additiveNoise phenotype
  vorigin_relative                                      proband   proband
  phenotype_name component_name  vorigin_relative
  height         additiveGenetic proband               0.001587  0.702801
  BMD            additiveGenetic proband               0.006524 -0.038047
  height         additiveNoise   proband               0.034271  0.705449
  BMD            additiveNoise   proband               1.000000  0.025506
  height         phenotype       proband               0.025506  1.000000
  BMD            phenotype       proband               0.776382 -0.004283

  phenotype_name                                        BMD
  component_name                                  phenotype
  vorigin_relative                                  proband
  phenotype_name component_name  vorigin_relative
  height         additiveGenetic proband          -0.019878
  BMD            additiveGenetic proband           0.635315
  height         additiveNoise   proband           0.013785
  BMD            additiveNoise   proband           0.776382
  height         phenotype       proband          -0.004283
  BMD            phenotype       proband           1.000000  }}

Skipping generations

We dont always care to run each statistic each generation. We can pick and choose by altering the statistics attribute of a simulation object on the fly. Here we only compute sample statistics for generation zero and four:

[4]:
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
                         architecture=architecture,
                         recombination_map=recombination_map,
                         mating_regime=mating_regime,
                         statistics = [stats.SampleStatistics()])
sim.run(1)
sim.statistics = []
sim.run(3)
sim.statistics = [stats.SampleStatistics()]
sim.run(1)
xft.utils.print_tree(sim.results_store)
0:
|__sample_statistics:
|____means: <class 'pandas.core.series.Series'>
|____variances: <class 'pandas.core.series.Series'>
|____variance_components: <class 'pandas.core.series.Series'>
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
4:
|__sample_statistics:
|____means: <class 'pandas.core.series.Series'>
|____variances: <class 'pandas.core.series.Series'>
|____variance_components: <class 'pandas.core.series.Series'>
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>

Sample and mating statistics

Two of the helpful sets of statistics are provided by stats.SampleStatistics and stats.MatingStatistics. We’ve seen plenty of the former already, but the latter will produce statistics at the level of the mate pairing (rather than the offspring, for which there could be multiple for a give pairing). By default, stats.MatingStatistics will generate information about reproduction rates and cross-mate cross-trait phenotypic correlation matrices:

[5]:
sim.statistics = [stats.MatingStatistics()]
sim.run(1)
sim.results
[5]:
{'mating_statistics': {'n_reproducing_pairs': 1000,
  'n_total_offspring': 2000,
  'mean_n_offspring_per_pair': 2.0,
  'mean_n_female_offspring_per_pair': 1.056,
  'mate_correlations': component                height.phenotype.mother  BMD.phenotype.mother  \
  component
  height.phenotype.mother                 1.000000             -0.004248
  BMD.phenotype.mother                   -0.004248              1.000000
  height.phenotype.father                -0.010330             -0.012517
  BMD.phenotype.father                    0.021719              0.025569

  component                height.phenotype.father  BMD.phenotype.father
  component
  height.phenotype.mother                -0.010330              0.021719
  BMD.phenotype.mother                   -0.012517              0.025569
  height.phenotype.father                 1.000000             -0.013564
  BMD.phenotype.father                   -0.013564              1.000000  }}

We can expand this to include all phenotype components (e.g., heritable and non-heritable portions) by supplying the full flag:

[6]:
sim.statistics = [stats.MatingStatistics(full=True)]
sim.run(1)
sim.results['mating_statistics']['mate_correlations']
[6]:
component height.additiveGenetic.mother BMD.additiveGenetic.mother height.additiveNoise.mother BMD.additiveNoise.mother height.phenotype.mother BMD.phenotype.mother height.additiveGenetic.father BMD.additiveGenetic.father height.additiveNoise.father BMD.additiveNoise.father height.phenotype.father BMD.phenotype.father
component
height.additiveGenetic.mother 1.000000 -0.021017 -0.007734 0.049782 0.697232 0.026017 -0.054339 -0.007990 0.023681 -0.058008 -0.019167 -0.050465
BMD.additiveGenetic.mother -0.021017 1.000000 -0.022737 -0.000432 -0.031070 0.620076 -0.021841 -0.028459 0.034324 0.007174 0.010244 -0.012142
height.additiveNoise.mother -0.007734 -0.022737 1.000000 0.005671 0.711432 -0.009657 -0.024672 0.031432 0.010509 -0.028199 -0.008878 -0.002485
BMD.additiveNoise.mother 0.049782 -0.000432 0.005671 1.000000 0.039051 0.784274 0.023771 -0.037270 0.007536 -0.023036 0.021333 -0.041328
height.phenotype.mother 0.697232 -0.031070 0.711432 0.039051 1.000000 0.011361 -0.055875 0.016917 0.024176 -0.060981 -0.019834 -0.037247
BMD.phenotype.mother 0.026017 0.620076 -0.009657 0.784274 0.011361 1.000000 0.005098 -0.046896 0.027208 -0.013622 0.023093 -0.039956
height.additiveGenetic.father -0.054339 -0.021841 -0.024672 0.023771 -0.055875 0.005098 1.000000 -0.049287 0.031182 -0.020988 0.690617 -0.047224
BMD.additiveGenetic.father -0.007990 -0.028459 0.031432 -0.037270 0.016917 -0.046896 -0.049287 1.000000 -0.030599 -0.004466 -0.055067 0.620793
height.additiveNoise.father 0.023681 0.034324 0.010509 0.007536 0.024176 0.027208 0.031182 -0.030599 1.000000 0.046876 0.744404 0.017647
BMD.additiveNoise.father -0.058008 0.007174 -0.028199 -0.023036 -0.060981 -0.013622 -0.020988 -0.004466 0.046876 1.000000 0.019897 0.781194
height.phenotype.father -0.019167 0.010244 -0.008878 0.021333 -0.019834 0.023093 0.690617 -0.055067 0.744404 0.019897 1.000000 -0.018779
BMD.phenotype.father -0.050465 -0.012142 -0.002485 -0.041328 -0.037247 -0.039956 -0.047224 0.620793 0.017647 0.781194 -0.018779 1.000000

Estimators

So far, xftsim includes routines for genome-wide association studies and Haseman Elston regression for estimating heritability and genetic correlations, which we demonstrate briefly below. For further details see the API documentation: {ref}stats <xftsim:xftsim.stats module>.

[7]:
sim.statistics = [stats.GWAS_Estimator(), stats.HasemanElstonEstimator(randomized=True)]
sim.run(1)

GWAS

The GWAS estimator returns a 3-dimensional array consisting of slopes, standard errors, test statistics and p-values:

[8]:
sim.results['GWAS']['estimates']
[8]:
<xarray.DataArray (variant: 800, statistic: 4, component: 2)>
array([[[-1.0781049e-02, -3.9665982e-02],
        [ 2.2370547e-02,  2.2354241e-02],
        [-4.8193049e-01, -1.7744275e+00],
        [ 1.3700919e+00,  1.9238553e+00]],

       [[ 2.2661731e-02, -1.7306263e-02],
        [ 2.2366120e-02,  2.2368513e-02],
        [ 1.0132170e+00, -7.7368855e-01],
        [ 3.1107920e-01,  1.5607935e+00]],

       [[ 4.8385037e-04, -2.2465256e-03],
        [ 2.2371849e-02,  2.2371795e-02],
        [ 2.1627642e-02, -1.0041776e-01],
        [ 9.8274714e-01,  1.0799773e+00]],

       ...,

       [[-4.0265251e-02,  1.1330526e-02],
        [ 2.2353718e-02,  2.2370426e-02],
        [-1.8012775e+00,  5.0649577e-01],
        [ 1.9281901e+00,  6.1256456e-01]],

       [[ 3.0709656e-02,  3.2169621e-02],
        [ 2.2361310e-02,  2.2360282e-02],
        [ 1.3733388e+00,  1.4386948e+00],
        [ 1.6980113e-01,  1.5039364e-01]],

       [[-2.5879662e-02,  5.0908484e-02],
        [ 2.2364387e-02,  2.2342870e-02],
        [-1.1571819e+00,  2.2785113e+00],
        [ 1.7526636e+00,  2.2801254e-02]]], dtype=float32)
Coordinates: (12/15)
  * component         (component) <U24 'height.phenotype.proband' 'BMD.phenot...
    phenotype_name    (component) object 'height' 'BMD'
    component_name    (component) object 'phenotype' 'phenotype'
    vorigin_relative  (component) int64 -1 -1
    comp_type         (component) object 'outcome' 'outcome'
  * variant           (variant) <U5 '0.d' '1.d' '2.d' ... '798.d' '799.d'
    ...                ...
    zero_allele       (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A'
    one_allele        (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G'
    af                (variant) float64 0.1277 0.4497 0.4859 ... 0.5554 0.7163
    pos_bp            (variant) float64 nan nan nan nan nan ... nan nan nan nan
    pos_cM            (variant) float64 nan nan nan nan nan ... nan nan nan nan
  * statistic         (statistic) <U4 'beta' 'se' 't' 'p'

HE Regression

The HE regression estimator produces genetic (co)variance and correlation estimates. We highly recommend setting the randomized flag to True (as we have done above) for all but the smallest problems.

[9]:
sim.results['HE_regression']
[9]:
{'cov_HE': phenotype_name                                    height       BMD
 component_name                                 phenotype phenotype
 vorigin_relative                                 proband   proband
 phenotype_name component_name vorigin_relative
 height         phenotype      proband           0.478390  0.021629
 BMD            phenotype      proband           0.021629  0.420101,
 'corr_HE': phenotype_name                                    height       BMD
 component_name                                 phenotype phenotype
 vorigin_relative                                 proband   proband
 phenotype_name component_name vorigin_relative
 height         phenotype      proband           1.000000  0.048248
 BMD            phenotype      proband           0.048248  1.000000}

Coming soon

We plan to implement the following additional estimation procedures in the near-term:

  • polygenic score computation

  • partitioned HE-regression

  • LD score regression

  • a general cross-validation wrapper for exmaining out-of-sample performance of all estimators

Creating custom estimators

Tutorial coming soon.