Statistics
Overview
We provide a number of built in procedures for computing sample statistics and running various common estimation procedures (e.g., variance-component estimation, association studies, and the like). During each generation of a simulation, each procedure provided as a statistics
attribute of a sim.Simulation
object will be run. Of course, you can also configure xftsim
to write genotypes and phenotypes to disk each generation and can thereby use any external software you like, as is
detailed in the post-processing section.
As a starting point, we’ll initialize a small simulation that incorporates three Statistic
objects:
[1]:
import xftsim as xft
from xftsim import stats
xft.config.print_durations_threshold = 10
n=2000; m=800
founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=n, m=m)
architecture = xft.arch.GCTA_Architecture(h2=[.5,.4], phenotype_name=['height', 'BMD'],
haplotypes=founder_haplotypes)
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes,
p =.1)
mating_regime = xft.mate.RandomMatingRegime(mates_per_female=1,
offspring_per_pair=2)
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
architecture=architecture,
recombination_map=recombination_map,
mating_regime=mating_regime,
statistics = [])
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
comp_type[component_name==key] = value
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
comp_type[component_name==key] = value
We didn’t specify any statistics, so the simulation will just proceed through phenotype construction and mating each generation without saving any additional results.
Accessing results
After adding a Statistic
, we’ll begin to accumulate results in the results_store
attribute as we iterate. The results_store
is just a dict of dicts that we can print nicely using xft.utils.print_tree()
:
[2]:
sim.statistics = [stats.SampleStatistics(means=False,
variances=False,
variance_components = False)]
sim.run(2)
xft.utils.print_tree(sim.results_store)
0:
|__sample_statistics:
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
1:
|__sample_statistics:
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
The results_store
is a dict of dicts of dicts indexed by generation, than statistic name, than statistic component. For convenience, calling sim.results
returns the most recent generation only, and is equivalent to calling sim.results_store[sim.generation]
:
[3]:
sim.results
[3]:
{'sample_statistics': {'vcov': phenotype_name height \
component_name additiveGenetic
vorigin_relative proband
phenotype_name component_name vorigin_relative
height additiveGenetic proband 0.490407
BMD additiveGenetic proband -0.014808
height additiveNoise proband -0.004141
BMD additiveNoise proband 0.000860
height phenotype proband 0.486266
BMD phenotype proband -0.013948
phenotype_name BMD height \
component_name additiveGenetic additiveNoise
vorigin_relative proband proband
phenotype_name component_name vorigin_relative
height additiveGenetic proband -0.014808 -0.004141
BMD additiveGenetic proband 0.398832 -0.008932
height additiveNoise proband -0.008932 0.494048
BMD additiveNoise proband 0.003188 0.018640
height phenotype proband -0.023740 0.489907
BMD phenotype proband 0.402021 0.009708
phenotype_name BMD height \
component_name additiveNoise phenotype
vorigin_relative proband proband
phenotype_name component_name vorigin_relative
height additiveGenetic proband 0.000860 0.486266
BMD additiveGenetic proband 0.003188 -0.023740
height additiveNoise proband 0.018640 0.489907
BMD additiveNoise proband 0.598778 0.019500
height phenotype proband 0.019500 0.976173
BMD phenotype proband 0.601967 -0.004240
phenotype_name BMD
component_name phenotype
vorigin_relative proband
phenotype_name component_name vorigin_relative
height additiveGenetic proband -0.013948
BMD additiveGenetic proband 0.402021
height additiveNoise proband 0.009708
BMD additiveNoise proband 0.601967
height phenotype proband -0.004240
BMD phenotype proband 1.003987 ,
'corr': phenotype_name height \
component_name additiveGenetic
vorigin_relative proband
phenotype_name component_name vorigin_relative
height additiveGenetic proband 1.000000
BMD additiveGenetic proband -0.033483
height additiveNoise proband -0.008413
BMD additiveNoise proband 0.001587
height phenotype proband 0.702801
BMD phenotype proband -0.019878
phenotype_name BMD height \
component_name additiveGenetic additiveNoise
vorigin_relative proband proband
phenotype_name component_name vorigin_relative
height additiveGenetic proband -0.033483 -0.008413
BMD additiveGenetic proband 1.000000 -0.020121
height additiveNoise proband -0.020121 1.000000
BMD additiveNoise proband 0.006524 0.034271
height phenotype proband -0.038047 0.705449
BMD phenotype proband 0.635315 0.013785
phenotype_name BMD height \
component_name additiveNoise phenotype
vorigin_relative proband proband
phenotype_name component_name vorigin_relative
height additiveGenetic proband 0.001587 0.702801
BMD additiveGenetic proband 0.006524 -0.038047
height additiveNoise proband 0.034271 0.705449
BMD additiveNoise proband 1.000000 0.025506
height phenotype proband 0.025506 1.000000
BMD phenotype proband 0.776382 -0.004283
phenotype_name BMD
component_name phenotype
vorigin_relative proband
phenotype_name component_name vorigin_relative
height additiveGenetic proband -0.019878
BMD additiveGenetic proband 0.635315
height additiveNoise proband 0.013785
BMD additiveNoise proband 0.776382
height phenotype proband -0.004283
BMD phenotype proband 1.000000 }}
Skipping generations
We dont always care to run each statistic each generation. We can pick and choose by altering the statistics
attribute of a simulation object on the fly. Here we only compute sample statistics for generation zero and four:
[4]:
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
architecture=architecture,
recombination_map=recombination_map,
mating_regime=mating_regime,
statistics = [stats.SampleStatistics()])
sim.run(1)
sim.statistics = []
sim.run(3)
sim.statistics = [stats.SampleStatistics()]
sim.run(1)
xft.utils.print_tree(sim.results_store)
0:
|__sample_statistics:
|____means: <class 'pandas.core.series.Series'>
|____variances: <class 'pandas.core.series.Series'>
|____variance_components: <class 'pandas.core.series.Series'>
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
4:
|__sample_statistics:
|____means: <class 'pandas.core.series.Series'>
|____variances: <class 'pandas.core.series.Series'>
|____variance_components: <class 'pandas.core.series.Series'>
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
Sample and mating statistics
Two of the helpful sets of statistics are provided by stats.SampleStatistics
and stats.MatingStatistics
. We’ve seen plenty of the former already, but the latter will produce statistics at the level of the mate pairing (rather than the offspring, for which there could be multiple for a give pairing). By default, stats.MatingStatistics
will generate information about reproduction rates and cross-mate cross-trait phenotypic correlation matrices:
[5]:
sim.statistics = [stats.MatingStatistics()]
sim.run(1)
sim.results
[5]:
{'mating_statistics': {'n_reproducing_pairs': 1000,
'n_total_offspring': 2000,
'mean_n_offspring_per_pair': 2.0,
'mean_n_female_offspring_per_pair': 1.056,
'mate_correlations': component height.phenotype.mother BMD.phenotype.mother \
component
height.phenotype.mother 1.000000 -0.004248
BMD.phenotype.mother -0.004248 1.000000
height.phenotype.father -0.010330 -0.012517
BMD.phenotype.father 0.021719 0.025569
component height.phenotype.father BMD.phenotype.father
component
height.phenotype.mother -0.010330 0.021719
BMD.phenotype.mother -0.012517 0.025569
height.phenotype.father 1.000000 -0.013564
BMD.phenotype.father -0.013564 1.000000 }}
We can expand this to include all phenotype components (e.g., heritable and non-heritable portions) by supplying the full
flag:
[6]:
sim.statistics = [stats.MatingStatistics(full=True)]
sim.run(1)
sim.results['mating_statistics']['mate_correlations']
[6]:
component | height.additiveGenetic.mother | BMD.additiveGenetic.mother | height.additiveNoise.mother | BMD.additiveNoise.mother | height.phenotype.mother | BMD.phenotype.mother | height.additiveGenetic.father | BMD.additiveGenetic.father | height.additiveNoise.father | BMD.additiveNoise.father | height.phenotype.father | BMD.phenotype.father |
---|---|---|---|---|---|---|---|---|---|---|---|---|
component | ||||||||||||
height.additiveGenetic.mother | 1.000000 | -0.021017 | -0.007734 | 0.049782 | 0.697232 | 0.026017 | -0.054339 | -0.007990 | 0.023681 | -0.058008 | -0.019167 | -0.050465 |
BMD.additiveGenetic.mother | -0.021017 | 1.000000 | -0.022737 | -0.000432 | -0.031070 | 0.620076 | -0.021841 | -0.028459 | 0.034324 | 0.007174 | 0.010244 | -0.012142 |
height.additiveNoise.mother | -0.007734 | -0.022737 | 1.000000 | 0.005671 | 0.711432 | -0.009657 | -0.024672 | 0.031432 | 0.010509 | -0.028199 | -0.008878 | -0.002485 |
BMD.additiveNoise.mother | 0.049782 | -0.000432 | 0.005671 | 1.000000 | 0.039051 | 0.784274 | 0.023771 | -0.037270 | 0.007536 | -0.023036 | 0.021333 | -0.041328 |
height.phenotype.mother | 0.697232 | -0.031070 | 0.711432 | 0.039051 | 1.000000 | 0.011361 | -0.055875 | 0.016917 | 0.024176 | -0.060981 | -0.019834 | -0.037247 |
BMD.phenotype.mother | 0.026017 | 0.620076 | -0.009657 | 0.784274 | 0.011361 | 1.000000 | 0.005098 | -0.046896 | 0.027208 | -0.013622 | 0.023093 | -0.039956 |
height.additiveGenetic.father | -0.054339 | -0.021841 | -0.024672 | 0.023771 | -0.055875 | 0.005098 | 1.000000 | -0.049287 | 0.031182 | -0.020988 | 0.690617 | -0.047224 |
BMD.additiveGenetic.father | -0.007990 | -0.028459 | 0.031432 | -0.037270 | 0.016917 | -0.046896 | -0.049287 | 1.000000 | -0.030599 | -0.004466 | -0.055067 | 0.620793 |
height.additiveNoise.father | 0.023681 | 0.034324 | 0.010509 | 0.007536 | 0.024176 | 0.027208 | 0.031182 | -0.030599 | 1.000000 | 0.046876 | 0.744404 | 0.017647 |
BMD.additiveNoise.father | -0.058008 | 0.007174 | -0.028199 | -0.023036 | -0.060981 | -0.013622 | -0.020988 | -0.004466 | 0.046876 | 1.000000 | 0.019897 | 0.781194 |
height.phenotype.father | -0.019167 | 0.010244 | -0.008878 | 0.021333 | -0.019834 | 0.023093 | 0.690617 | -0.055067 | 0.744404 | 0.019897 | 1.000000 | -0.018779 |
BMD.phenotype.father | -0.050465 | -0.012142 | -0.002485 | -0.041328 | -0.037247 | -0.039956 | -0.047224 | 0.620793 | 0.017647 | 0.781194 | -0.018779 | 1.000000 |
Estimators
So far, xftsim
includes routines for genome-wide association studies and Haseman Elston regression for estimating heritability and genetic correlations, which we demonstrate briefly below. For further details see the API documentation: {ref}stats <xftsim:xftsim.stats module>
.
[7]:
sim.statistics = [stats.GWAS_Estimator(), stats.HasemanElstonEstimator(randomized=True)]
sim.run(1)
GWAS
The GWAS estimator returns a 3-dimensional array consisting of slopes, standard errors, test statistics and p-values:
[8]:
sim.results['GWAS']['estimates']
[8]:
<xarray.DataArray (variant: 800, statistic: 4, component: 2)> array([[[-1.0781049e-02, -3.9665982e-02], [ 2.2370547e-02, 2.2354241e-02], [-4.8193049e-01, -1.7744275e+00], [ 1.3700919e+00, 1.9238553e+00]], [[ 2.2661731e-02, -1.7306263e-02], [ 2.2366120e-02, 2.2368513e-02], [ 1.0132170e+00, -7.7368855e-01], [ 3.1107920e-01, 1.5607935e+00]], [[ 4.8385037e-04, -2.2465256e-03], [ 2.2371849e-02, 2.2371795e-02], [ 2.1627642e-02, -1.0041776e-01], [ 9.8274714e-01, 1.0799773e+00]], ..., [[-4.0265251e-02, 1.1330526e-02], [ 2.2353718e-02, 2.2370426e-02], [-1.8012775e+00, 5.0649577e-01], [ 1.9281901e+00, 6.1256456e-01]], [[ 3.0709656e-02, 3.2169621e-02], [ 2.2361310e-02, 2.2360282e-02], [ 1.3733388e+00, 1.4386948e+00], [ 1.6980113e-01, 1.5039364e-01]], [[-2.5879662e-02, 5.0908484e-02], [ 2.2364387e-02, 2.2342870e-02], [-1.1571819e+00, 2.2785113e+00], [ 1.7526636e+00, 2.2801254e-02]]], dtype=float32) Coordinates: (12/15) * component (component) <U24 'height.phenotype.proband' 'BMD.phenot... phenotype_name (component) object 'height' 'BMD' component_name (component) object 'phenotype' 'phenotype' vorigin_relative (component) int64 -1 -1 comp_type (component) object 'outcome' 'outcome' * variant (variant) <U5 '0.d' '1.d' '2.d' ... '798.d' '799.d' ... ... zero_allele (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' one_allele (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' af (variant) float64 0.1277 0.4497 0.4859 ... 0.5554 0.7163 pos_bp (variant) float64 nan nan nan nan nan ... nan nan nan nan pos_cM (variant) float64 nan nan nan nan nan ... nan nan nan nan * statistic (statistic) <U4 'beta' 'se' 't' 'p'
HE Regression
The HE regression estimator produces genetic (co)variance and correlation estimates. We highly recommend setting the randomized
flag to True
(as we have done above) for all but the smallest problems.
[9]:
sim.results['HE_regression']
[9]:
{'cov_HE': phenotype_name height BMD
component_name phenotype phenotype
vorigin_relative proband proband
phenotype_name component_name vorigin_relative
height phenotype proband 0.478390 0.021629
BMD phenotype proband 0.021629 0.420101,
'corr_HE': phenotype_name height BMD
component_name phenotype phenotype
vorigin_relative proband proband
phenotype_name component_name vorigin_relative
height phenotype proband 1.000000 0.048248
BMD phenotype proband 0.048248 1.000000}
Coming soon
We plan to implement the following additional estimation procedures in the near-term:
polygenic score computation
partitioned HE-regression
LD score regression
a general cross-validation wrapper for exmaining out-of-sample performance of all estimators
Creating custom estimators
Tutorial coming soon.