Indexing
When constructing simulations, we frequently need to refer to specific individuals, specific variants, or specific phenotype components. For example, when we create a haplotype array, rows correspond to individuals and columns correspond to haploid sites.
The xftsim.index submodule implements classes corresponding to these specific cases:
xft.index.SampleIndex indexes individuals
xft.index.HaploidVariantIndex and xft.index.DiploidVariantIndex index genetic variants
xft.index.ComponentIndex indexes phenotypes and components of phenotypes
All of the above are subclasses of xft.index.XftIndex and can be represented as Pandas DataFrames using the .frame property:
[1]:
import xftsim as xft
## Here's a ComponentIndex instance. Ignore the details for now.
cindex = xft.index.ComponentIndex.from_product(('phenotype_1', 'phenotype_2'),
('additiveGenetic', 'additiveNoise', 'phenotype'))
## We can turn this into a Pandas DataFrame
cframe = cindex.frame
cframe
[1]:
component | phenotype_name | component_name | vorigin_relative | comp_type |
---|---|---|---|---|
phenotype_1.additiveGenetic.proband | phenotype_1 | additiveGenetic | -1 | intermediate |
phenotype_1.additiveNoise.proband | phenotype_1 | additiveNoise | -1 | intermediate |
phenotype_1.phenotype.proband | phenotype_1 | phenotype | -1 | outcome |
phenotype_2.additiveGenetic.proband | phenotype_2 | additiveGenetic | -1 | intermediate |
phenotype_2.additiveNoise.proband | phenotype_2 | additiveNoise | -1 | intermediate |
phenotype_2.phenotype.proband | phenotype_2 | phenotype | -1 | outcome |
Component indexing
Overview
The most important indexer to understand is the xft.index.ComponentIndex. In xftsim, individual-level data with respect to a given sample (the “proband”) is indexed as follows:
phenotype_name, phenotypes
component_name, components of phenotypes
vorigin_relative, relationship of the originator of the component to the proband (see below)
comp_type, vector indicating whether the component is a phenotype (‘outcome’) or a subcomponent thereof (‘intermediate’)
Components are essentially sub-phenotypes and include the phenotype itself. For example, suppose we assume that the phenotype height is composed of two components: an additive genetic component and an additive, individual-specific noise component. The corresponding component index looks like this:
[2]:
xft.index.ComponentIndex.from_product(phenotype_name=('height',),
                                      component_name=('additiveGenetic', 'additiveNoise', 'phenotype'),
                                      vorigin_relative=-1)
[2]:
<ComponentIndex>
3 components of 1 phenotype spanning 1 generation
phenotype_name component_name \
component
height.additiveGenetic.proband height additiveGenetic
height.additiveNoise.proband height additiveNoise
height.phenotype.proband height phenotype
vorigin_relative comp_type
component
height.additiveGenetic.proband -1 intermediate
height.additiveNoise.proband -1 intermediate
height.phenotype.proband -1 outcome
Note that we have added a third component, phenotype, to represent the sum of the first two components. Components can be as general or as specific as we like.
In many cases, we can ignore vorigin_relative
, but it’s essential if we want to model intergenerational phenotypic effects such as vertical transmission. It works like this:
vorigin_relative
is a binary encoding that represents relationship to a proband as follows:
vorigin_relative | relationship to proband |
---|---|
-1 | self |
0 | mother |
1 | father |
00 | maternal grandmother |
10 | maternal grandfather |
01 | paternal grandmother |
11 | paternal grandfather |
000 | maternal grandmother’s mother |
100 | maternal grandmother’s father |
… | … |
and so forth. In most cases, considering maternal or paternal effects is sufficient. Suppose we’d like to jointly analyze the heights and bone mineral density (BMD) of trios, as well as the heritable and nonheritable components thereof. In this case, our component index would look like this:
[3]:
xft.index.ComponentIndex.from_product(phenotype_name=('height','BMD'),
component_name=('genetic', 'noise', 'phenotype'),
vorigin_relative=(-1,0,1))
[3]:
<ComponentIndex>
3 components of 2 phenotypes spanning 2 generations
phenotype_name component_name vorigin_relative \
component
height.genetic.proband height genetic -1
height.genetic.mother height genetic 0
height.genetic.father height genetic 1
height.noise.proband height noise -1
height.noise.mother height noise 0
height.noise.father height noise 1
height.phenotype.proband height phenotype -1
height.phenotype.mother height phenotype 0
height.phenotype.father height phenotype 1
BMD.genetic.proband BMD genetic -1
BMD.genetic.mother BMD genetic 0
BMD.genetic.father BMD genetic 1
BMD.noise.proband BMD noise -1
BMD.noise.mother BMD noise 0
BMD.noise.father BMD noise 1
BMD.phenotype.proband BMD phenotype -1
BMD.phenotype.mother BMD phenotype 0
BMD.phenotype.father BMD phenotype 1
comp_type
component
height.genetic.proband intermediate
height.genetic.mother intermediate
height.genetic.father intermediate
height.noise.proband intermediate
height.noise.mother intermediate
height.noise.father intermediate
height.phenotype.proband outcome
height.phenotype.mother outcome
height.phenotype.father outcome
BMD.genetic.proband intermediate
BMD.genetic.mother intermediate
BMD.genetic.father intermediate
BMD.noise.proband intermediate
BMD.noise.mother intermediate
BMD.noise.father intermediate
BMD.phenotype.proband outcome
BMD.phenotype.mother outcome
BMD.phenotype.father outcome
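The vorigin_relative encoding tabulated earlier reads from right to left: the last digit names the proband's parent (0 for mother, 1 for father) and each digit further left steps one generation back. The helper below is a minimal sketch of that reading. describe_vorigin is a hypothetical name and is not part of xftsim, and it assumes codes are passed as strings so that leading zeros such as "00" survive.

```python
def describe_vorigin(code: str) -> str:
    """Hypothetical helper (not part of xftsim): describe a vorigin_relative code."""
    if code == "-1":
        return "self"
    if not code or set(code) - {"0", "1"}:
        raise ValueError(f"not a binary relationship code: {code!r}")

    parent = {"0": "mother", "1": "father"}
    # Reverse the string so that the relative closest to the proband comes first.
    steps = code[::-1]
    if len(steps) == 1:
        return parent[steps]

    # The first step picks the side of the family, the second the grandparent.
    side = {"0": "maternal", "1": "paternal"}[steps[0]]
    label = f"{side} grand{parent[steps[1]]}"
    # Any remaining steps climb further up that grandparent's line.
    for step in steps[2:]:
        label += f"'s {parent[step]}"
    return label


for code in ("-1", "0", "1", "10", "01", "100"):
    print(code, "->", describe_vorigin(code))
```

For the codes listed in the table, this reproduces the same descriptions, for example "10" gives "maternal grandfather" and "100" gives "maternal grandmother's father".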
There are several ways to construct a component index, the first of which we’ve used several times already:
Constructing a component index from a Cartesian product
To generate a component index by expanding combinations of phenotype_name
, component_name
, and vorigin_relative
, we can use the .from_product()
method:
[4]:
xft.index.ComponentIndex.from_product(phenotype_name=('height',),
                                      component_name=('phenotype',),
                                      vorigin_relative=(-1,1))
[4]:
<ComponentIndex>
1 component of 1 phenotype spanning 2 generations
phenotype_name component_name vorigin_relative \
component
height.phenotype.proband height phenotype -1
height.phenotype.father height phenotype 1
comp_type
component
height.phenotype.proband outcome
height.phenotype.father outcome
See above for further examples.
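To make the expansion concrete, here is a plain-Python sketch of the combinations that a Cartesian product over the three arguments yields, using only itertools. It is not xftsim's implementation: expand_components and RELATIVE_LABELS are hypothetical names, and the rule that a component named 'phenotype' is an 'outcome' while everything else is 'intermediate' is an assumption inferred from the printed outputs above.

```python
from itertools import product

# Assumed from the printed outputs: label suffixes for the three relatives shown.
RELATIVE_LABELS = {-1: "proband", 0: "mother", 1: "father"}


def expand_components(phenotype_names, component_names, vorigin_relatives):
    """Hypothetical sketch: one row per (phenotype, component, relative) combination."""
    rows = []
    for phen, comp, vo in product(phenotype_names, component_names, vorigin_relatives):
        rows.append({
            "component": f"{phen}.{comp}.{RELATIVE_LABELS[vo]}",
            "phenotype_name": phen,
            "component_name": comp,
            "vorigin_relative": vo,
            # Assumption: only the component named 'phenotype' is an outcome.
            "comp_type": "outcome" if comp == "phenotype" else "intermediate",
        })
    return rows


rows = expand_components(("height", "BMD"),
                         ("genetic", "noise", "phenotype"),
                         (-1, 0, 1))
print(len(rows))             # 18 rows: 2 phenotypes x 3 components x 3 relatives
print(rows[0]["component"])  # height.genetic.proband
```

The row order matches the trio example above because the last argument varies fastest, which is the iteration order of itertools.product.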
Constructing a component index for specific components
We can also manually specify arguments:
[5]:
cindex = xft.index.ComponentIndex(phenotype_name=('height','BMD'),
component_name=('phenotype','genetic'),
vorigin_relative=(-1,1))
cindex
[5]:
<ComponentIndex>
2 components of 2 phenotypes spanning 2 generations
phenotype_name component_name vorigin_relative \
component
height.phenotype.proband height phenotype -1
BMD.genetic.father BMD genetic 1
comp_type
component
height.phenotype.proband outcome
BMD.genetic.father intermediate
or provide a Pandas DataFrame with the same information:
[6]:
xft.index.ComponentIndex(frame = cindex.frame)
[6]:
<ComponentIndex>
2 components of 2 phenotypes spanning 2 generations
phenotype_name component_name vorigin_relative \
component
height.phenotype.proband height phenotype -1
BMD.genetic.father BMD genetic 1
comp_type
component
height.phenotype.proband outcome
BMD.genetic.father intermediate
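Judging by the output above, the direct constructor pairs its arguments element by element instead of expanding every combination. The following sketch contrasts the two behaviours with the standard library only; it illustrates the idea and does not call xftsim.

```python
from itertools import product

phenotype_name = ("height", "BMD")
component_name = ("phenotype", "genetic")
vorigin_relative = (-1, 1)

# Element-wise pairing, as the direct constructor output above suggests.
paired = list(zip(phenotype_name, component_name, vorigin_relative))

# Full expansion, as from_product does.
expanded = list(product(phenotype_name, component_name, vorigin_relative))

print(paired)         # [('height', 'phenotype', -1), ('BMD', 'genetic', 1)]
print(len(expanded))  # 8
```

Element-wise pairing is the better fit when only a few specific components are needed, since the product grows multiplicatively with each argument.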
Constructing a generic component index
Finally, we can construct a generic component index by providing the number of phenotypes, k_total
:
[7]:
xft.index.ComponentIndex(k_total=3)
[7]:
<ComponentIndex>
1 component of 3 phenotypes spanning 1 generation
phenotype_name component_name vorigin_relative \
component
0.generic.proband 0 generic -1
1.generic.proband 1 generic -1
2.generic.proband 2 generic -1
comp_type
component
0.generic.proband intermediate
1.generic.proband intermediate
2.generic.proband intermediate
Variant indexing
Genetic data can be indexed with either xft.index.HaploidVariantIndex or xft.index.DiploidVariantIndex objects. The former, which indexes haploid sites, is used when distinguishing between homologous sites is necessary, such as during meiosis. It is trivial to switch between these indices, so we will focus our introduction on xft.index.DiploidVariantIndex.
A DiploidVariantIndex tracks the following variant-level information:
vid, a vector of variant IDs
chrom, a vector of chromosome IDs
zero_allele, the allele corresponding to zeros
one_allele, the allele corresponding to ones
af, a vector of ancestral allele frequencies
annotation_array, additional variant-level annotations
h_copy, haplotype copy (always ‘d’ for diploid data, ‘0’ or ‘1’ for haploid data)
pos_bp, physical position measured in base pairs
pos_cM, position as measured in centiMorgans
There are several ways to construct variant indices, though you’ll rarely need to do this in practice. Typically, you’ll use an automatically generated variant index, regardless of whether you’re using real or synthetic founder data.
Constructing a variant index
To construct a generic variant index, you only need to provide the number of diploid variants m
and the number of chromosomes n_chrom
:
[8]:
vind = xft.index.DiploidVariantIndex(m=500,n_chrom=22)
vind
[8]:
<DiploidVariantIndex>
500 diploid variants on 22 chromosome(s)
MAF ranges from nan to nan
0 annotation(s)
vid chrom zero_allele one_allele af hcopy pos_bp pos_cM
variant
0.d 0 0 A G NaN d NaN NaN
1.d 1 0 A G NaN d NaN NaN
2.d 2 0 A G NaN d NaN NaN
3.d 3 0 A G NaN d NaN NaN
4.d 4 0 A G NaN d NaN NaN
... ... ... ... ... .. ... ... ...
495.d 495 21 A G NaN d NaN NaN
496.d 496 21 A G NaN d NaN NaN
497.d 497 21 A G NaN d NaN NaN
498.d 498 21 A G NaN d NaN NaN
499.d 499 21 A G NaN d NaN NaN
[500 rows x 8 columns]
Alternatively we can supply the above arguments individually (only vid
is strictly necessary) or via a pandas DataFrame:
[9]:
xft.index.DiploidVariantIndex(vid=vind.vid)
[9]:
<DiploidVariantIndex>
500 diploid variants on 1 chromosome(s)
MAF ranges from nan to nan
0 annotation(s)
vid chrom zero_allele one_allele af hcopy pos_bp pos_cM
variant
0.d 0 0 A G NaN d NaN NaN
1.d 1 0 A G NaN d NaN NaN
2.d 2 0 A G NaN d NaN NaN
3.d 3 0 A G NaN d NaN NaN
4.d 4 0 A G NaN d NaN NaN
... ... ... ... ... .. ... ... ...
495.d 495 0 A G NaN d NaN NaN
496.d 496 0 A G NaN d NaN NaN
497.d 497 0 A G NaN d NaN NaN
498.d 498 0 A G NaN d NaN NaN
499.d 499 0 A G NaN d NaN NaN
[500 rows x 8 columns]
[10]:
xft.index.DiploidVariantIndex(frame=vind.frame)
[10]:
<DiploidVariantIndex>
500 diploid variants on 22 chromosome(s)
MAF ranges from nan to nan
0 annotation(s)
vid chrom zero_allele one_allele af hcopy pos_bp pos_cM
variant
0.d 0 0 A G NaN d NaN NaN
1.d 1 0 A G NaN d NaN NaN
2.d 2 0 A G NaN d NaN NaN
3.d 3 0 A G NaN d NaN NaN
4.d 4 0 A G NaN d NaN NaN
... ... ... ... ... .. ... ... ...
495.d 495 21 A G NaN d NaN NaN
496.d 496 21 A G NaN d NaN NaN
497.d 497 21 A G NaN d NaN NaN
498.d 498 21 A G NaN d NaN NaN
499.d 499 21 A G NaN d NaN NaN
[500 rows x 8 columns]
Switching between haploid and diploid indices
A DiploidVariantIndex
can be converted to a haploid index and back via the xft.index.DiploidVariantIndex.to_haploid()
and xft.index.HaploidVariantIndex.to_diploid()
methods respectively:
[11]:
hvind = vind.to_haploid()
hvind
[11]:
<HaploidVariantIndex>
500 diploid variants on 22 chromosome(s)
MAF ranges from nan to nan
0 annotation(s)
vid chrom zero_allele one_allele af hcopy pos_bp pos_cM
variant
0.0 0 0 A G NaN 0 NaN NaN
0.1 0 0 A G NaN 1 NaN NaN
1.0 1 0 A G NaN 0 NaN NaN
1.1 1 0 A G NaN 1 NaN NaN
2.0 2 0 A G NaN 0 NaN NaN
... ... ... ... ... .. ... ... ...
497.1 497 21 A G NaN 1 NaN NaN
498.0 498 21 A G NaN 0 NaN NaN
498.1 498 21 A G NaN 1 NaN NaN
499.0 499 21 A G NaN 0 NaN NaN
499.1 499 21 A G NaN 1 NaN NaN
[1000 rows x 8 columns]
[12]:
hvind.to_diploid()
[12]:
<DiploidVariantIndex>
500 diploid variants on 22 chromosome(s)
MAF ranges from nan to nan
0 annotation(s)
vid chrom zero_allele one_allele af hcopy pos_bp pos_cM
variant
0.d 0 0 A G NaN d NaN NaN
1.d 1 0 A G NaN d NaN NaN
2.d 2 0 A G NaN d NaN NaN
3.d 3 0 A G NaN d NaN NaN
4.d 4 0 A G NaN d NaN NaN
... ... ... ... ... .. ... ... ...
495.d 495 21 A G NaN d NaN NaN
496.d 496 21 A G NaN d NaN NaN
497.d 497 21 A G NaN d NaN NaN
498.d 498 21 A G NaN d NaN NaN
499.d 499 21 A G NaN d NaN NaN
[500 rows x 8 columns]
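The row labels in the outputs above suggest a simple naming scheme: a diploid variant is labelled vid.d, and its two haploid copies are labelled vid.0 and vid.1, interleaved. The sketch below mimics that round trip on labels alone. Both functions are hypothetical illustrations, not the xftsim methods, and they ignore every other column.

```python
def to_haploid_labels(vids):
    """Hypothetical sketch: two haploid labels per variant ID, interleaved."""
    return [f"{vid}.{copy}" for vid in vids for copy in (0, 1)]


def to_diploid_labels(haploid_labels):
    """Hypothetical sketch: collapse haploid labels back to one label per variant."""
    # Strip the trailing copy suffix; dict.fromkeys removes duplicates in order.
    vids = dict.fromkeys(label.rsplit(".", 1)[0] for label in haploid_labels)
    return [f"{vid}.d" for vid in vids]


haploid = to_haploid_labels(["0", "1", "2"])
print(haploid)                     # ['0.0', '0.1', '1.0', '1.1', '2.0', '2.1']
print(to_diploid_labels(haploid))  # ['0.d', '1.d', '2.d']
```

The haploid version has twice as many rows as the diploid one, which is consistent with the 1000-row and 500-row frames shown above.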
Sample indexing
We reference specific individuals using the xft.index.SampleIndex class, which includes three individual-level pieces of information:
iid, a vector of individual IDs
fid, a vector of family IDs
sex, a vector of biological sexes, with 0 and 1 encoding female and male respectively
as well as the generation of the sample. Sample indexes are always specific to a single generation.
There are several ways to construct a SampleIndex
, all of which will automatically construct unique identifiers:
Constructing a generic sample index
If all we want is a sample of arbitrarily-named, unrelated, sex-balanced individuals, we can simply provide the number of individuals n
and the generation
:
[13]:
xft.index.SampleIndex(n=5, generation=1)
[13]:
<SampleIndex>
Generation 1
5 indviduals from 5 families
3 biological females
2 biological males
iid fid sex
sample
1..1_0.1_0 1_0 1_0 0
1..1_1.1_1 1_1 1_1 1
1..1_2.1_2 1_2 1_2 0
1..1_3.1_3 1_3 1_3 1
1..1_4.1_4 1_4 1_4 0
Constructing a sample index for specific iids and fids
We can alternatively provide specific iids, fids, and sexes. Here we create a sample index comprising two families: three sisters in the first and two brothers in the second:
[14]:
sind = xft.index.SampleIndex(iid = ['0_sister1','0_sister2','0_sister3','0_brother1', '0_brother2'],
fid = [0,0,0,1,1],
sex = [0,0,0,1,1], generation = 0)
sind
[14]:
<SampleIndex>
Generation 0
5 indviduals from 2 families
3 biological females
2 biological males
iid fid sex
sample
0..0_sister1.0 0_sister1 0 0
0..0_sister2.0 0_sister2 0 0
0..0_sister3.0 0_sister3 0 0
0..0_brother1.1 0_brother1 1 1
0..0_brother2.1 0_brother2 1 1
Constructing a sample index with a DataFrame
Finally, we can simply provide a Pandas DataFrame using the frame
argument. In this case we also need to specify the generation.
[15]:
sind.frame
[15]:
sample | iid | fid | sex |
---|---|---|---|
0..0_sister1.0 | 0_sister1 | 0 | 0 |
0..0_sister2.0 | 0_sister2 | 0 | 0 |
0..0_sister3.0 | 0_sister3 | 0 | 0 |
0..0_brother1.1 | 0_brother1 | 1 | 1 |
0..0_brother2.1 | 0_brother2 | 1 | 1 |
[16]:
xft.index.SampleIndex(frame=sind.frame, generation = 1)
[16]:
<SampleIndex>
Generation 1
5 indviduals from 2 families
3 biological females
2 biological males
iid fid sex
sample
1..0_sister1.0 0_sister1 0 0
1..0_sister2.0 0_sister2 0 0
1..0_sister3.0 0_sister3 0 0
1..0_brother1.1 0_brother1 1 1
1..0_brother2.1 0_brother2 1 1
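The sample labels in the outputs above appear to follow the pattern generation..iid.fid, and the generic index appears to name individuals generation_i, give each their own family, and alternate sexes starting with female. The sketch below reproduces those patterns in plain Python. sample_labels and generic_sample are hypothetical names, and the naming and sex-alternation rules are assumptions read off the printed examples rather than documented behaviour.

```python
def sample_labels(iids, fids, generation):
    """Hypothetical sketch: build row labels of the form generation..iid.fid."""
    return [f"{generation}..{iid}.{fid}" for iid, fid in zip(iids, fids)]


def generic_sample(n, generation):
    """Hypothetical sketch: n unrelated individuals with alternating sexes."""
    iids = [f"{generation}_{i}" for i in range(n)]
    fids = list(iids)                  # assumed: one family per individual
    sexes = [i % 2 for i in range(n)]  # assumed: 0 (female), 1 (male), 0, ...
    return iids, fids, sexes


iids, fids, sexes = generic_sample(n=5, generation=1)
print(sample_labels(iids, fids, generation=1)[0])  # 1..1_0.1_0
print(sexes)                                       # [0, 1, 0, 1, 0]

# Re-labelling the sisters-and-brothers example for a different generation
# changes only the prefix, as in the last output above.
print(sample_labels(["0_sister1", "0_brother1"], [0, 1], generation=1))
```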