Data structures

[1]:
import numpy as np
import pandas as pd
import xftsim as xft
from xftsim import index, struct

xft.config.print_durations_threshold=10. ## reduce verbosity
np.random.seed(123) ## set random seed for reproducibility

Here we introduce HaplotypeArray and PhenotypeArray objects, the two primary data objects xftsim operates on. These objects are indexed as follows:

Object

Row Index

Column Index

struct.HaplotypeArray

index.SampleIndex

index.HaploidVariantIndex

struct.PhenotypeArray

index.SampleIndex

index.ComponentIndex

It will be challenging to to understand this information if you haven’t read the tutorial on indexing!

Counterintuively, Neither HaplotypeArray are PhenotypeArray actual classes. Rather, both construct instances of xarray.DataArray with an extended API available through the xft accessor. If you’re confused, don’t worry–we’ll go through all of this step by step.

The xarray documentation can be very helpful if you haven’t used xarray before!

First, we’ll run a small simulation (the details of which aren’t important for now) so that we have some arrays to work with:

[2]:
founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=800, m=100)
architecture = xft.arch.GCTA_Architecture(h2=[.5,.5], phenotype_name=['height', 'BMD'], haplotypes=founder_haplotypes)
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes, p =.1)
mating_regime = xft.mate.LinearAssortativeMatingRegime(r = .5, offspring_per_pair=2,
                                                       component_index = xft.index.ComponentIndex.from_product(['height', 'BMD'], ['phenotype']))
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
                         mating_regime=mating_regime,
                         recombination_map=recombination_map,
                         architecture=architecture)
sim.run(2)

haplo = sim.haplotypes
pheno = sim.phenotypes
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  comp_type[component_name==key] = value
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  comp_type[component_name==key] = value

Phenotype arrays

A phenotype array’s rows correspond to individuals and even columns correspond to phenotypic components:

[3]:
pheno
[3]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 6)>
array([[-0.17213982, -2.90022769,  0.29332102,  0.13028801,  0.1211812 ,
        -2.76993968],
       [-0.22037698, -2.6943213 , -1.18505   , -1.23524417, -1.40542698,
        -3.92956547],
       [-1.74917885,  0.77081186, -0.24744861,  0.02997934, -1.99662746,
         0.80079119],
       ...,
       [ 0.57641418,  1.73169971, -0.68418291, -0.10919439, -0.10776873,
         1.62250533],
       [ 1.04558349,  0.78416918,  0.60137529, -0.44608102,  1.64695877,
         0.33808816],
       [ 0.91799803,  0.75944346, -1.11363213, -0.00429597, -0.1956341 ,
         0.75514749]])
Coordinates:
  * sample            (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399'
    iid               (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799'
    fid               (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399'
    sex               (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1
  * component         (component) <U30 'height.additiveGenetic.proband' ... '...
    phenotype_name    (component) object 'height' 'BMD' ... 'height' 'BMD'
    component_name    (component) object 'additiveGenetic' ... 'phenotype'
    vorigin_relative  (component) int64 -1 -1 -1 -1 -1 -1
    comp_type         (component) object 'intermediate' ... 'outcome'
Attributes:
    generation:  1

Retrieving index objects

We can exConversiontract component and sample index objects as follows:

[4]:
pheno.xft.get_sample_indexer() ## same as .get_row_indexer()
pheno.xft.get_component_indexer() ## same as .get_column_indexer()

[4]:
<ComponentIndex>
  3 components of 2 phenotypes spanning 1 generation
                               phenotype_name   component_name  \
component
height.additiveGenetic.proband         height  additiveGenetic
BMD.additiveGenetic.proband               BMD  additiveGenetic
height.additiveNoise.proband           height    additiveNoise
BMD.additiveNoise.proband                 BMD    additiveNoise
height.phenotype.proband               height        phenotype
BMD.phenotype.proband                     BMD        phenotype

                                vorigin_relative     comp_type
component
height.additiveGenetic.proband                -1  intermediate
BMD.additiveGenetic.proband                   -1  intermediate
height.additiveNoise.proband                  -1  intermediate
BMD.additiveNoise.proband                     -1  intermediate
height.phenotype.proband                      -1       outcome
BMD.phenotype.proband                         -1       outcome

The six columns of pheno are those indexed above. In confirmation, we observe that the 5th column (height phenotype) is the sum of the first and third columns (height genetic, noise, respectively):

[5]:
np.all(pheno[:,[0,2]].sum(axis=1) == pheno[:,4])
[5]:
<xarray.DataArray 'PhenotypeArray' ()>
array(True)
Coordinates:
    component         <U30 'height.phenotype.proband'
    phenotype_name    object 'height'
    component_name    object 'phenotype'
    vorigin_relative  int64 -1
    comp_type         object 'outcome'

Subsetting phenotype arrays

In addition to the standard xarray.DataArray indexing shown above, we can subset according to phenotype_name, component_name, and/or vorigin_relative by supplying a dict to xft.__getitem__:

[6]:
pheno.xft[{'phenotype_name':'height'}]
[6]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 3)>
array([[-0.17213982,  0.29332102,  0.1211812 ],
       [-0.22037698, -1.18505   , -1.40542698],
       [-1.74917885, -0.24744861, -1.99662746],
       ...,
       [ 0.57641418, -0.68418291, -0.10776873],
       [ 1.04558349,  0.60137529,  1.64695877],
       [ 0.91799803, -1.11363213, -0.1956341 ]])
Coordinates:
  * sample            (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399'
    iid               (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799'
    fid               (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399'
    sex               (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1
  * component         (component) <U30 'height.additiveGenetic.proband' ... '...
    phenotype_name    (component) object 'height' 'height' 'height'
    component_name    (component) object 'additiveGenetic' ... 'phenotype'
    vorigin_relative  (component) int64 -1 -1 -1
    comp_type         (component) object 'intermediate' 'intermediate' 'outcome'
Attributes:
    generation:  1
[7]:
pheno.xft[{'component_name':'phenotype'}]
[7]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 2)>
array([[ 0.1211812 , -2.76993968],
       [-1.40542698, -3.92956547],
       [-1.99662746,  0.80079119],
       ...,
       [-0.10776873,  1.62250533],
       [ 1.64695877,  0.33808816],
       [-0.1956341 ,  0.75514749]])
Coordinates:
  * sample            (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399'
    iid               (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799'
    fid               (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399'
    sex               (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1
  * component         (component) <U30 'height.phenotype.proband' 'BMD.phenot...
    phenotype_name    (component) object 'height' 'BMD'
    component_name    (component) object 'phenotype' 'phenotype'
    vorigin_relative  (component) int64 -1 -1
    comp_type         (component) object 'outcome' 'outcome'
Attributes:
    generation:  1
[8]:
pheno.xft[{'component_name':['phenotype','additiveNoise']}]
[8]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 4)>
array([[ 0.29332102,  0.13028801,  0.1211812 , -2.76993968],
       [-1.18505   , -1.23524417, -1.40542698, -3.92956547],
       [-0.24744861,  0.02997934, -1.99662746,  0.80079119],
       ...,
       [-0.68418291, -0.10919439, -0.10776873,  1.62250533],
       [ 0.60137529, -0.44608102,  1.64695877,  0.33808816],
       [-1.11363213, -0.00429597, -0.1956341 ,  0.75514749]])
Coordinates:
  * sample            (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399'
    iid               (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799'
    fid               (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399'
    sex               (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1
  * component         (component) <U30 'height.additiveNoise.proband' ... 'BM...
    phenotype_name    (component) object 'height' 'BMD' 'height' 'BMD'
    component_name    (component) object 'additiveNoise' ... 'phenotype'
    vorigin_relative  (component) int64 -1 -1 -1 -1
    comp_type         (component) object 'intermediate' ... 'outcome'
Attributes:
    generation:  1

We can also use this method to alter values of underlying DataArray:

[9]:
pheno.xft[{'component_name':['phenotype','additiveNoise']}] *= 2
pheno.xft[{'component_name':['phenotype','additiveNoise']}]
[9]:
<xarray.DataArray 'PhenotypeArray' (sample: 800, component: 4)>
array([[ 0.58664204,  0.26057601,  0.2423624 , -5.53987937],
       [-2.37010001, -2.47048833, -2.81085396, -7.85913093],
       [-0.49489721,  0.05995868, -3.99325492,  1.60158239],
       ...,
       [-1.36836582, -0.21838877, -0.21553746,  3.24501066],
       [ 1.20275057, -0.89216205,  3.29391755,  0.67617632],
       [-2.22726426, -0.00859195, -0.39126819,  1.51029498]])
Coordinates:
  * sample            (sample) <U14 '0..1_0.1_0' ... '0..1_799.1_399'
    iid               (sample) object '1_0' '1_1' '1_2' ... '1_798' '1_799'
    fid               (sample) object '1_0' '1_0' '1_1' ... '1_399' '1_399'
    sex               (sample) int64 0 0 1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 1 1
  * component         (component) <U30 'height.additiveNoise.proband' ... 'BM...
    phenotype_name    (component) object 'height' 'BMD' 'height' 'BMD'
    component_name    (component) object 'additiveNoise' ... 'phenotype'
    vorigin_relative  (component) int64 -1 -1 -1 -1
    comp_type         (component) object 'intermediate' ... 'outcome'
Attributes:
    generation:  1

Conversion to Pandas

Finally, we can convert from a phenotype DataArray to a multiindexed Pandas DataFrame using the DataArray.xft.as_pd() method:

[10]:
pheno.xft.as_pd()
[10]:
phenotype_name height BMD height BMD height BMD
component_name additiveGenetic additiveGenetic additiveNoise additiveNoise phenotype phenotype
vorigin_relative proband proband proband proband proband proband
iid fid sex
1_0 1_0 0 -0.172140 -2.900228 0.586642 0.260576 0.242362 -5.539879
1_1 1_0 0 -0.220377 -2.694321 -2.370100 -2.470488 -2.810854 -7.859131
1_2 1_1 1 -1.749179 0.770812 -0.494897 0.059959 -3.993255 1.601582
1_3 1_1 1 -0.452184 -0.654971 -0.873004 -1.964959 -1.777373 -3.274901
1_4 1_2 0 -1.181860 -1.471807 0.849380 0.655862 -1.514341 -2.287753
... ... ... ... ... ... ... ... ...
1_795 1_397 0 0.890462 0.752347 -0.316503 1.258179 1.464421 2.762873
1_796 1_398 0 0.435559 0.841148 -0.780206 -0.057162 0.090912 1.625135
1_797 1_398 1 0.576414 1.731700 -1.368366 -0.218389 -0.215537 3.245011
1_798 1_399 1 1.045583 0.784169 1.202751 -0.892162 3.293918 0.676176
1_799 1_399 1 0.917998 0.759443 -2.227264 -0.008592 -0.391268 1.510295

800 rows × 6 columns

Haplotype arrays

For \(n\) individuals and \(m\) diploid loci, a haplotype array is an \(n\times 2m\) xr.DataArray of 8-bit integers with rows corresponding to individuals, even columns (indexed starting at zero) corresponding to maternal haplotypes, and odd columns corresponding to paternal haplotypes:

[15]:
haplo
[15]:
<xarray.DataArray 'HaplotypeArray' (sample: 800, variant: 200)>
array([[1, 1, 0, ..., 1, 0, 0],
       [1, 1, 0, ..., 1, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 0, ..., 0, 1, 1],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int8)
Coordinates: (12/13)
  * sample       (sample) <U14 '0..1_0.1_0' '0..1_1.1_0' ... '0..1_799.1_399'
    iid          (sample) object '1_0' '1_1' '1_2' ... '1_797' '1_798' '1_799'
    fid          (sample) object '1_0' '1_0' '1_1' ... '1_398' '1_399' '1_399'
    sex          (sample) int64 0 0 1 1 0 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 1 1 1
  * variant      (variant) <U4 '0.0' '0.1' '1.0' '1.1' ... '98.1' '99.0' '99.1'
    vid          (variant) object '0' '0' '1' '1' '2' ... '98' '98' '99' '99'
    ...           ...
    hcopy        (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
    zero_allele  (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
    one_allele   (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G'
    af           (variant) float64 0.6572 0.6572 0.3289 ... 0.2927 0.3748 0.3748
    pos_bp       (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
    pos_cM       (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
Attributes:
    generation:  1

Retrieving index objects

We can extract variant and sample index objects as follows:

[16]:
haplo.xft.get_sample_indexer()  ## same as haplo.xft.get_row_indexer()
[16]:
<SampleIndex>
  Generation 0
  800 indviduals from 400 families
  399 biological females
  401 biological males
                  iid    fid  sex
sample
0..1_0.1_0        1_0    1_0    0
            ...
0..1_799.1_399  1_799  1_399    1

[800 rows x 3 columns]
[17]:
haplo.xft.get_variant_indexer()  ## same as haplo.xft.get_column_indexer()
[17]:
<HaploidVariantIndex>
  100 diploid variants on 20 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s)
        vid  chrom zero_allele one_allele        af hcopy  pos_bp  pos_cM
variant
0.0       0      0           A          G  0.657175     0     NaN     NaN
0.1       0      0           A          G  0.657175     1     NaN     NaN
1.0       1      0           A          G  0.328911     0     NaN     NaN
1.1       1      0           A          G  0.328911     1     NaN     NaN
2.0       2      0           A          G  0.281481     0     NaN     NaN
...      ..    ...         ...        ...       ...   ...     ...     ...
97.1     97     19           A          G  0.419101     1     NaN     NaN
98.0     98     19           A          G  0.292685     0     NaN     NaN
98.1     98     19           A          G  0.292685     1     NaN     NaN
99.0     99     19           A          G  0.374765     0     NaN     NaN
99.1     99     19           A          G  0.374765     1     NaN     NaN

[200 rows x 8 columns]

Computing empirical allele frequencies

Empirical diploid allele frequences are available via the af_empirical property. Here we compare ancestral and empirical allele frequencies:

[18]:
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
                            empirical = haplo.xft.af_empirical)).T
[18]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
ancestral 0.657175 0.328911 0.281481 0.541052 0.675575 0.438485 0.884611 0.647864 0.484746 0.413694 ... 0.663967 0.896287 0.384732 0.710038 0.574542 0.653361 0.220902 0.419101 0.292685 0.374765
empirical 0.642500 0.330000 0.265000 0.511250 0.678750 0.450625 0.913125 0.647500 0.498125 0.435625 ... 0.654375 0.898125 0.388125 0.733750 0.573125 0.663750 0.233750 0.415625 0.325625 0.375000

2 rows × 100 columns

We can use the use_empirical_afs() method to replace ancestral allele frequences with empirical allele frequences. This is useful when ancestral allele frequences are unknown, as is often the case with real data.

[19]:
haplo.xft.use_empirical_afs()
## now ancestral == empirical
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
                            empirical = haplo.xft.af_empirical)).T
[19]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
ancestral 0.6425 0.33 0.265 0.51125 0.67875 0.450625 0.913125 0.6475 0.498125 0.435625 ... 0.654375 0.898125 0.388125 0.73375 0.573125 0.66375 0.23375 0.415625 0.325625 0.375
empirical 0.6425 0.33 0.265 0.51125 0.67875 0.450625 0.913125 0.6475 0.498125 0.435625 ... 0.654375 0.898125 0.388125 0.73375 0.573125 0.66375 0.23375 0.415625 0.325625 0.375

2 rows × 100 columns

Interpolating genetic distances

Given physical positions and a genetic map, centiMorgan distances can interpolated using the interpolate_cM() method.

Example coming soon!