Data structures

[1]:

import numpy as np
import pandas as pd
import xftsim as xft
from xftsim import index, struct

xft.config.print_durations_threshold=10. ## reduce verbosity
np.random.seed(123) ## set random seed for reproducibility

Here we introduce HaplotypeArray and PhenotypeArray objects, the two primary data objects xftsim operates on. These objects are indexed as follows:

Object	Row Index	Column Index
`struct.HaplotypeArray`	`index.SampleIndex`	`index.HaploidVariantIndex`
`struct.PhenotypeArray`	`index.SampleIndex`	`index.ComponentIndex`

It will be challenging to to understand this information if you haven’t read the tutorial on indexing!

Counterintuively, Neither HaplotypeArray are PhenotypeArray actual classes. Rather, both construct instances of xarray.DataArray with an extended API available through the xft accessor. If you’re confused, don’t worry–we’ll go through all of this step by step.

The xarray documentation can be very helpful if you haven’t used xarray before!

First, we’ll run a small simulation (the details of which aren’t important for now) so that we have some arrays to work with:

[2]:

founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=800, m=100)
architecture = xft.arch.GCTA_Architecture(h2=[.5,.5], phenotype_name=['height', 'BMD'], haplotypes=founder_haplotypes)
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes, p =.1)
mating_regime = xft.mate.LinearAssortativeMatingRegime(r = .5, offspring_per_pair=2,
                                                       component_index = xft.index.ComponentIndex.from_product(['height', 'BMD'], ['phenotype']))
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
                         mating_regime=mating_regime,
                         recombination_map=recombination_map,
                         architecture=architecture)
sim.run(2)

haplo = sim.haplotypes
pheno = sim.phenotypes

/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  comp_type[component_name==key] = value
/home/rsb/Dropbox/ftsim/xftsim/xftsim/index.py:957: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  comp_type[component_name==key] = value

Phenotype arrays

A phenotype array’s rows correspond to individuals and even columns correspond to phenotypic components:

[3]:

pheno

Conversion to Pandas

Finally, we can convert from a phenotype DataArray to a multiindexed Pandas DataFrame using the DataArray.xft.as_pd() method:

[10]:

pheno.xft.as_pd()

[10]:

		phenotype_name	height	BMD	height	BMD	height	BMD
		component_name	additiveGenetic	additiveGenetic	additiveNoise	additiveNoise	phenotype	phenotype
		vorigin_relative	proband	proband	proband	proband	proband	proband
iid	fid	sex
1_0	1_0	0	-0.172140	-2.900228	0.586642	0.260576	0.242362	-5.539879
1_1	1_0	0	-0.220377	-2.694321	-2.370100	-2.470488	-2.810854	-7.859131
1_2	1_1	1	-1.749179	0.770812	-0.494897	0.059959	-3.993255	1.601582
1_3	1_1	1	-0.452184	-0.654971	-0.873004	-1.964959	-1.777373	-3.274901
1_4	1_2	0	-1.181860	-1.471807	0.849380	0.655862	-1.514341	-2.287753
...	...	...	...	...	...	...	...	...
1_795	1_397	0	0.890462	0.752347	-0.316503	1.258179	1.464421	2.762873
1_796	1_398	0	0.435559	0.841148	-0.780206	-0.057162	0.090912	1.625135
1_797	1_398	1	0.576414	1.731700	-1.368366	-0.218389	-0.215537	3.245011
1_798	1_399	1	1.045583	0.784169	1.202751	-0.892162	3.293918	0.676176
1_799	1_399	1	0.917998	0.759443	-2.227264	-0.008592	-0.391268	1.510295

800 rows × 6 columns

Haplotype arrays

For \(n\) individuals and \(m\) diploid loci, a haplotype array is an \(n\times 2m\) xr.DataArray of 8-bit integers with rows corresponding to individuals, even columns (indexed starting at zero) corresponding to maternal haplotypes, and odd columns corresponding to paternal haplotypes:

[15]:

haplo

Retrieving index objects

We can extract variant and sample index objects as follows:

[16]:

haplo.xft.get_sample_indexer()  ## same as haplo.xft.get_row_indexer()

[16]:

<SampleIndex>
  Generation 0
  800 indviduals from 400 families
  399 biological females
  401 biological males
                  iid    fid  sex
sample
0..1_0.1_0        1_0    1_0    0
            ...
0..1_799.1_399  1_799  1_399    1

[800 rows x 3 columns]

[17]:

haplo.xft.get_variant_indexer()  ## same as haplo.xft.get_column_indexer()

[17]:

<HaploidVariantIndex>
  100 diploid variants on 20 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s)
        vid  chrom zero_allele one_allele        af hcopy  pos_bp  pos_cM
variant
0.0       0      0           A          G  0.657175     0     NaN     NaN
0.1       0      0           A          G  0.657175     1     NaN     NaN
1.0       1      0           A          G  0.328911     0     NaN     NaN
1.1       1      0           A          G  0.328911     1     NaN     NaN
2.0       2      0           A          G  0.281481     0     NaN     NaN
...      ..    ...         ...        ...       ...   ...     ...     ...
97.1     97     19           A          G  0.419101     1     NaN     NaN
98.0     98     19           A          G  0.292685     0     NaN     NaN
98.1     98     19           A          G  0.292685     1     NaN     NaN
99.0     99     19           A          G  0.374765     0     NaN     NaN
99.1     99     19           A          G  0.374765     1     NaN     NaN

[200 rows x 8 columns]

Computing empirical allele frequencies

Empirical diploid allele frequences are available via the af_empirical property. Here we compare ancestral and empirical allele frequencies:

[18]:

pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
                            empirical = haplo.xft.af_empirical)).T

[18]:

	0	1	2	3	4	5	6	7	8	9	...	90	91	92	93	94	95	96	97	98	99
ancestral	0.657175	0.328911	0.281481	0.541052	0.675575	0.438485	0.884611	0.647864	0.484746	0.413694	...	0.663967	0.896287	0.384732	0.710038	0.574542	0.653361	0.220902	0.419101	0.292685	0.374765
empirical	0.642500	0.330000	0.265000	0.511250	0.678750	0.450625	0.913125	0.647500	0.498125	0.435625	...	0.654375	0.898125	0.388125	0.733750	0.573125	0.663750	0.233750	0.415625	0.325625	0.375000

2 rows × 100 columns

We can use the use_empirical_afs() method to replace ancestral allele frequences with empirical allele frequences. This is useful when ancestral allele frequences are unknown, as is often the case with real data.

[19]:

haplo.xft.use_empirical_afs()
## now ancestral == empirical
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
                            empirical = haplo.xft.af_empirical)).T

[19]:

	0	1	2	3	4	5	6	7	8	9	...	90	91	92	93	94	95	96	97	98	99
ancestral	0.6425	0.33	0.265	0.51125	0.67875	0.450625	0.913125	0.6475	0.498125	0.435625	...	0.654375	0.898125	0.388125	0.73375	0.573125	0.66375	0.23375	0.415625	0.325625	0.375
empirical	0.6425	0.33	0.265	0.51125	0.67875	0.450625	0.913125	0.6475	0.498125	0.435625	...	0.654375	0.898125	0.388125	0.73375	0.573125	0.66375	0.23375	0.415625	0.325625	0.375

2 rows × 100 columns

Interpolating genetic distances

Given physical positions and a genetic map, centiMorgan distances can interpolated using the interpolate_cM() method.

Example coming soon!

Data structures

Phenotype arrays

Retrieving index objects

Subsetting phenotype arrays

Conversion to Pandas

Haplotype arrays

Retrieving index objects

Computing empirical allele frequencies

Interpolating genetic distances