Data structures
[1]:
import numpy as np
import pandas as pd
import xftsim as xft
from xftsim import index, struct
xft.config.print_durations_threshold=10. ## reduce verbosity
np.random.seed(123) ## set random seed for reproducibility
Here we introduce HaplotypeArray
and PhenotypeArray
objects, the two primary data objects xftsim
operates on. These objects are indexed as follows:
Object |
Row Index |
Column Index |
---|---|---|
|
|
|
|
|
|
It will be challenging to to understand this information if you haven’t read the tutorial on indexing!
Counterintuively, Neither HaplotypeArray
are PhenotypeArray
actual classes. Rather, both construct instances of xarray.DataArray
with an extended API available through the xft
accessor. If you’re confused, don’t worry–we’ll go through all of this step by step.
The xarray documentation can be very helpful if you haven’t used xarray
before!
First, we’ll run a small simulation (the details of which aren’t important for now) so that we have some arrays to work with:
[2]:
founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=800, m=100)
architecture = xft.arch.GCTA_Architecture(h2=[.5,.5], phenotype_name=['height', 'BMD'], haplotypes=founder_haplotypes)
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes, p =.1)
mating_regime = xft.mate.LinearAssortativeMatingRegime(r = .5, offspring_per_pair=2,
component_index = xft.index.ComponentIndex.from_product(['height', 'BMD'], ['phenotype']))
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
mating_regime=mating_regime,
recombination_map=recombination_map,
architecture=architecture)
sim.run(2)
haplo = sim.haplotypes
pheno = sim.phenotypes
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
comp_type[component_name==key] = value
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
comp_type[component_name==key] = value
Phenotype arrays
A phenotype array’s rows correspond to individuals and even columns correspond to phenotypic components:
[3]:
pheno
[3]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 6)> array([[-0.17213982, -2.90022769, 0.29332102, 0.13028801, 0.1211812 , -2.76993968], [-0.22037698, -2.6943213 , -1.18505 , -1.23524417, -1.40542698, -3.92956547], [-1.74917885, 0.77081186, -0.24744861, 0.02997934, -1.99662746, 0.80079119], ..., [ 0.57641418, 1.73169971, -0.68418291, -0.10919439, -0.10776873, 1.62250533], [ 1.04558349, 0.78416918, 0.60137529, -0.44608102, 1.64695877, 0.33808816], [ 0.91799803, 0.75944346, -1.11363213, -0.00429597, -0.1956341 , 0.75514749]]) Coordinates: * sample (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399' iid (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799' fid (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399' sex (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1 * component (component) <U30 'height.additiveGenetic.proband' ... '... phenotype_name (component) object 'height' 'BMD' ... 'height' 'BMD' component_name (component) object 'additiveGenetic' ... 'phenotype' vorigin_relative (component) int64 -1 -1 -1 -1 -1 -1 comp_type (component) object 'intermediate' ... 'outcome' Attributes: generation: 1
Retrieving index objects
We can exConversiontract component and sample index objects as follows:
[4]:
pheno.xft.get_sample_indexer() ## same as .get_row_indexer()
pheno.xft.get_component_indexer() ## same as .get_column_indexer()
[4]:
<ComponentIndex>
3 components of 2 phenotypes spanning 1 generation
phenotype_name component_name \
component
height.additiveGenetic.proband height additiveGenetic
BMD.additiveGenetic.proband BMD additiveGenetic
height.additiveNoise.proband height additiveNoise
BMD.additiveNoise.proband BMD additiveNoise
height.phenotype.proband height phenotype
BMD.phenotype.proband BMD phenotype
vorigin_relative comp_type
component
height.additiveGenetic.proband -1 intermediate
BMD.additiveGenetic.proband -1 intermediate
height.additiveNoise.proband -1 intermediate
BMD.additiveNoise.proband -1 intermediate
height.phenotype.proband -1 outcome
BMD.phenotype.proband -1 outcome
The six columns of pheno
are those indexed above. In confirmation, we observe that the 5th column (height phenotype) is the sum of the first and third columns (height genetic, noise, respectively):
[5]:
np.all(pheno[:,[0,2]].sum(axis=1) == pheno[:,4])
[5]:
<xarray.DataArray 'PhenotypeArray' ()> array(True) Coordinates: component <U30 'height.phenotype.proband' phenotype_name object 'height' component_name object 'phenotype' vorigin_relative int64 -1 comp_type object 'outcome'
Subsetting phenotype arrays
In addition to the standard xarray.DataArray
indexing shown above, we can subset according to phenotype_name, component_name, and/or vorigin_relative by supplying a dict
to xft.__getitem__
:
[6]:
pheno.xft[{'phenotype_name':'height'}]
[6]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 3)> array([[-0.17213982, 0.29332102, 0.1211812 ], [-0.22037698, -1.18505 , -1.40542698], [-1.74917885, -0.24744861, -1.99662746], ..., [ 0.57641418, -0.68418291, -0.10776873], [ 1.04558349, 0.60137529, 1.64695877], [ 0.91799803, -1.11363213, -0.1956341 ]]) Coordinates: * sample (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399' iid (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799' fid (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399' sex (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1 * component (component) <U30 'height.additiveGenetic.proband' ... '... phenotype_name (component) object 'height' 'height' 'height' component_name (component) object 'additiveGenetic' ... 'phenotype' vorigin_relative (component) int64 -1 -1 -1 comp_type (component) object 'intermediate' 'intermediate' 'outcome' Attributes: generation: 1
[7]:
pheno.xft[{'component_name':'phenotype'}]
[7]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 2)> array([[ 0.1211812 , -2.76993968], [-1.40542698, -3.92956547], [-1.99662746, 0.80079119], ..., [-0.10776873, 1.62250533], [ 1.64695877, 0.33808816], [-0.1956341 , 0.75514749]]) Coordinates: * sample (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399' iid (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799' fid (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399' sex (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1 * component (component) <U30 'height.phenotype.proband' 'BMD.phenot... phenotype_name (component) object 'height' 'BMD' component_name (component) object 'phenotype' 'phenotype' vorigin_relative (component) int64 -1 -1 comp_type (component) object 'outcome' 'outcome' Attributes: generation: 1
[8]:
pheno.xft[{'component_name':['phenotype','additiveNoise']}]
[8]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 4)> array([[ 0.29332102, 0.13028801, 0.1211812 , -2.76993968], [-1.18505 , -1.23524417, -1.40542698, -3.92956547], [-0.24744861, 0.02997934, -1.99662746, 0.80079119], ..., [-0.68418291, -0.10919439, -0.10776873, 1.62250533], [ 0.60137529, -0.44608102, 1.64695877, 0.33808816], [-1.11363213, -0.00429597, -0.1956341 , 0.75514749]]) Coordinates: * sample (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399' iid (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799' fid (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399' sex (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1 * component (component) <U30 'height.additiveNoise.proband' ... 'BM... phenotype_name (component) object 'height' 'BMD' 'height' 'BMD' component_name (component) object 'additiveNoise' ... 'phenotype' vorigin_relative (component) int64 -1 -1 -1 -1 comp_type (component) object 'intermediate' ... 'outcome' Attributes: generation: 1
We can also use this method to alter values of underlying DataArray:
[9]:
pheno.xft[{'component_name':['phenotype','additiveNoise']}] *= 2
pheno.xft[{'component_name':['phenotype','additiveNoise']}]
[9]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 4)> array([[ 0.58664204, 0.26057601, 0.2423624 , -5.53987937], [-2.37010001, -2.47048833, -2.81085396, -7.85913093], [-0.49489721, 0.05995868, -3.99325492, 1.60158239], ..., [-1.36836582, -0.21838877, -0.21553746, 3.24501066], [ 1.20275057, -0.89216205, 3.29391755, 0.67617632], [-2.22726426, -0.00859195, -0.39126819, 1.51029498]]) Coordinates: * sample (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399' iid (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799' fid (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399' sex (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1 * component (component) <U30 'height.additiveNoise.proband' ... 'BM... phenotype_name (component) object 'height' 'BMD' 'height' 'BMD' component_name (component) object 'additiveNoise' ... 'phenotype' vorigin_relative (component) int64 -1 -1 -1 -1 comp_type (component) object 'intermediate' ... 'outcome' Attributes: generation: 1
Conversion to Pandas
Finally, we can convert from a phenotype DataArray to a multiindexed Pandas DataFrame using the DataArray.xft.as_pd()
method:
[10]:
pheno.xft.as_pd()
[10]:
phenotype_name | height | BMD | height | BMD | height | BMD | ||
---|---|---|---|---|---|---|---|---|
component_name | additiveGenetic | additiveGenetic | additiveNoise | additiveNoise | phenotype | phenotype | ||
vorigin_relative | proband | proband | proband | proband | proband | proband | ||
iid | fid | sex | ||||||
1_0 | 1_0 | 0 | -0.172140 | -2.900228 | 0.586642 | 0.260576 | 0.242362 | -5.539879 |
1_1 | 1_0 | 0 | -0.220377 | -2.694321 | -2.370100 | -2.470488 | -2.810854 | -7.859131 |
1_2 | 1_1 | 1 | -1.749179 | 0.770812 | -0.494897 | 0.059959 | -3.993255 | 1.601582 |
1_3 | 1_1 | 1 | -0.452184 | -0.654971 | -0.873004 | -1.964959 | -1.777373 | -3.274901 |
1_4 | 1_2 | 0 | -1.181860 | -1.471807 | 0.849380 | 0.655862 | -1.514341 | -2.287753 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1_795 | 1_397 | 0 | 0.890462 | 0.752347 | -0.316503 | 1.258179 | 1.464421 | 2.762873 |
1_796 | 1_398 | 0 | 0.435559 | 0.841148 | -0.780206 | -0.057162 | 0.090912 | 1.625135 |
1_797 | 1_398 | 1 | 0.576414 | 1.731700 | -1.368366 | -0.218389 | -0.215537 | 3.245011 |
1_798 | 1_399 | 1 | 1.045583 | 0.784169 | 1.202751 | -0.892162 | 3.293918 | 0.676176 |
1_799 | 1_399 | 1 | 0.917998 | 0.759443 | -2.227264 | -0.008592 | -0.391268 | 1.510295 |
800 rows × 6 columns
Haplotype arrays
For \(n\) individuals and \(m\) diploid loci, a haplotype array is an \(n\times 2m\) xr.DataArray
of 8-bit integers with rows corresponding to individuals, even columns (indexed starting at zero) corresponding to maternal haplotypes, and odd columns corresponding to paternal haplotypes:
[15]:
haplo
[15]:
<xarray.DataArray 'HaplotypeArray' (sample: 800, variant: 200)> array([[1, 1, 0, ..., 1, 0, 0], [1, 1, 0, ..., 1, 0, 0], [0, 0, 1, ..., 0, 0, 0], ..., [1, 1, 0, ..., 0, 1, 1], [1, 1, 0, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0]], dtype=int8) Coordinates: (12/13) * sample (sample) <U14 '0..1_0.1_0' '0..1_1.1_0' ... '0..1_799.1_399' iid (sample) object '1_0' '1_1' '1_2' ... '1_797' '1_798' '1_799' fid (sample) object '1_0' '1_0' '1_1' ... '1_398' '1_399' '1_399' sex (sample) int64 0 0 1 1 0 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 1 1 1 * variant (variant) <U4 '0.0' '0.1' '1.0' '1.1' ... '98.1' '99.0' '99.1' vid (variant) object '0' '0' '1' '1' '2' ... '98' '98' '99' '99' ... ... hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1' zero_allele (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A' one_allele (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G' af (variant) float64 0.6572 0.6572 0.3289 ... 0.2927 0.3748 0.3748 pos_bp (variant) float64 nan nan nan nan nan ... nan nan nan nan nan pos_cM (variant) float64 nan nan nan nan nan ... nan nan nan nan nan Attributes: generation: 1
Retrieving index objects
We can extract variant and sample index objects as follows:
[16]:
haplo.xft.get_sample_indexer() ## same as haplo.xft.get_row_indexer()
[16]:
<SampleIndex>
Generation 0
800 indviduals from 400 families
399 biological females
401 biological males
iid fid sex
sample
0..1_0.1_0 1_0 1_0 0
...
0..1_799.1_399 1_799 1_399 1
[800 rows x 3 columns]
[17]:
haplo.xft.get_variant_indexer() ## same as haplo.xft.get_column_indexer()
[17]:
<HaploidVariantIndex>
100 diploid variants on 20 chromosome(s)
MAF ranges from nan to nan
0 annotation(s)
vid chrom zero_allele one_allele af hcopy pos_bp pos_cM
variant
0.0 0 0 A G 0.657175 0 NaN NaN
0.1 0 0 A G 0.657175 1 NaN NaN
1.0 1 0 A G 0.328911 0 NaN NaN
1.1 1 0 A G 0.328911 1 NaN NaN
2.0 2 0 A G 0.281481 0 NaN NaN
... .. ... ... ... ... ... ... ...
97.1 97 19 A G 0.419101 1 NaN NaN
98.0 98 19 A G 0.292685 0 NaN NaN
98.1 98 19 A G 0.292685 1 NaN NaN
99.0 99 19 A G 0.374765 0 NaN NaN
99.1 99 19 A G 0.374765 1 NaN NaN
[200 rows x 8 columns]
Computing empirical allele frequencies
Empirical diploid allele frequences are available via the af_empirical
property. Here we compare ancestral and empirical allele frequencies:
[18]:
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
empirical = haplo.xft.af_empirical)).T
[18]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ancestral | 0.657175 | 0.328911 | 0.281481 | 0.541052 | 0.675575 | 0.438485 | 0.884611 | 0.647864 | 0.484746 | 0.413694 | ... | 0.663967 | 0.896287 | 0.384732 | 0.710038 | 0.574542 | 0.653361 | 0.220902 | 0.419101 | 0.292685 | 0.374765 |
empirical | 0.642500 | 0.330000 | 0.265000 | 0.511250 | 0.678750 | 0.450625 | 0.913125 | 0.647500 | 0.498125 | 0.435625 | ... | 0.654375 | 0.898125 | 0.388125 | 0.733750 | 0.573125 | 0.663750 | 0.233750 | 0.415625 | 0.325625 | 0.375000 |
2 rows × 100 columns
We can use the use_empirical_afs()
method to replace ancestral allele frequences with empirical allele frequences. This is useful when ancestral allele frequences are unknown, as is often the case with real data.
[19]:
haplo.xft.use_empirical_afs()
## now ancestral == empirical
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
empirical = haplo.xft.af_empirical)).T
[19]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ancestral | 0.6425 | 0.33 | 0.265 | 0.51125 | 0.67875 | 0.450625 | 0.913125 | 0.6475 | 0.498125 | 0.435625 | ... | 0.654375 | 0.898125 | 0.388125 | 0.73375 | 0.573125 | 0.66375 | 0.23375 | 0.415625 | 0.325625 | 0.375 |
empirical | 0.6425 | 0.33 | 0.265 | 0.51125 | 0.67875 | 0.450625 | 0.913125 | 0.6475 | 0.498125 | 0.435625 | ... | 0.654375 | 0.898125 | 0.388125 | 0.73375 | 0.573125 | 0.66375 | 0.23375 | 0.415625 | 0.325625 | 0.375 |
2 rows × 100 columns
Interpolating genetic distances
Given physical positions and a genetic map, centiMorgan distances can interpolated using the interpolate_cM()
method.
Example coming soon!