Indexing

In the course of constructing simulations, we frequently need to refer to specific individuals, specific variants, or specific phenotype components. For example, when we create a haplotype array, rows correspond to individuals and columns correspond to haploid sites.

The xftsim.index submodule implements classes corresponding to these specific cases:

  • xft.index.SampleIndex indexes individuals

  • xft.index.HaploidVariantIndex and xft.index.DiploidVariantIndex index genetic variants

  • xft.index.ComponentIndex indexes phenotypes and components of phenotypes

All of the above are subclasses of the xft.index.XftIndex superclass and can be represented as Pandas DataFrames using the .frame property:

[1]:
import xftsim as xft

## Here's a ComponentIndex instance. Ignore the details for now.
cindex = xft.index.ComponentIndex.from_product(('phenotype_1', 'phenotype_2'),
                                               ('additiveGenetic', 'additiveNoise', 'phenotype'))

## We can turn this into a Pandas DataFrame
cframe = cindex.frame
cframe
[1]:
                                     phenotype_name   component_name  vorigin_relative     comp_type
component
phenotype_1.additiveGenetic.proband     phenotype_1  additiveGenetic                -1  intermediate
phenotype_1.additiveNoise.proband       phenotype_1    additiveNoise                -1  intermediate
phenotype_1.phenotype.proband           phenotype_1        phenotype                -1       outcome
phenotype_2.additiveGenetic.proband     phenotype_2  additiveGenetic                -1  intermediate
phenotype_2.additiveNoise.proband       phenotype_2    additiveNoise                -1  intermediate
phenotype_2.phenotype.proband           phenotype_2        phenotype                -1       outcome

Component indexing

Overview

The most important indexer to understand is xft.index.ComponentIndex. In xftsim, individual-level data with respect to a given sample (the “proband”) is indexed by the following fields:

  • phenotype_name, the name of the phenotype,

  • component_name, the name of a component of that phenotype,

  • vorigin_relative, the relationship of the originator of the component to the proband (see below),

  • comp_type, a vector indicating whether the component is a phenotype (‘outcome’) or a subcomponent thereof (‘intermediate’).

Components are essentially sub-phenotypes and include the phenotype itself. For example, suppose we assume that the phenotype height is composed of two components: an additive genetic component and an additive, individual-specific noise component. The corresponding component index looks like this:

[2]:
xft.index.ComponentIndex.from_product(phenotype_name=('height',),
                                      component_name=('additiveGenetic', 'additiveNoise', 'phenotype'),
                                      vorigin_relative=-1)
[2]:
<ComponentIndex>
  3 components of 1 phenotype spanning 1 generation
                               phenotype_name   component_name  \
component
height.additiveGenetic.proband         height  additiveGenetic
height.additiveNoise.proband           height    additiveNoise
height.phenotype.proband               height        phenotype

                                vorigin_relative     comp_type
component
height.additiveGenetic.proband                -1  intermediate
height.additiveNoise.proband                  -1  intermediate
height.phenotype.proband                      -1       outcome

Note that we have added a third component, phenotype, to represent the sum of the first two components. Components can be as general or as specific as we like.

In many cases, we can ignore vorigin_relative, but it’s essential if we want to model intergenerational phenotypic effects such as vertical transmission.

vorigin_relative is a binary encoding that represents the originator’s relationship to the proband as follows:

  vorigin_relative    relationship to proband

  -1                  self
  0                   mother
  1                   father
  00                  maternal grandmother
  10                  maternal grandfather
  01                  paternal grandmother
  11                  paternal grandfather
  000                 maternal grandmother’s mother
  100                 maternal grandmother’s father

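and so forth: each digit prepended to a code moves one generation further back, with 0 selecting the mother and 1 the father of the individual encoded by the remaining digits. In most cases, considering maternal or paternal effects is sufficient. Suppose we’d like to jointly analyze trios’ height and bone mineral density (BMD), as well as the heritable and nonheritable components thereof. In this case, our component index would look like this: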

[3]:
xft.index.ComponentIndex.from_product(phenotype_name=('height','BMD'),
                                      component_name=('genetic', 'noise', 'phenotype'),
                                      vorigin_relative=(-1,0,1))
[3]:
<ComponentIndex>
  3 components of 2 phenotypes spanning 2 generations
                         phenotype_name component_name  vorigin_relative  \
component
height.genetic.proband           height        genetic                -1
height.genetic.mother            height        genetic                 0
height.genetic.father            height        genetic                 1
height.noise.proband             height          noise                -1
height.noise.mother              height          noise                 0
height.noise.father              height          noise                 1
height.phenotype.proband         height      phenotype                -1
height.phenotype.mother          height      phenotype                 0
height.phenotype.father          height      phenotype                 1
BMD.genetic.proband                 BMD        genetic                -1
BMD.genetic.mother                  BMD        genetic                 0
BMD.genetic.father                  BMD        genetic                 1
BMD.noise.proband                   BMD          noise                -1
BMD.noise.mother                    BMD          noise                 0
BMD.noise.father                    BMD          noise                 1
BMD.phenotype.proband               BMD      phenotype                -1
BMD.phenotype.mother                BMD      phenotype                 0
BMD.phenotype.father                BMD      phenotype                 1

                             comp_type
component
height.genetic.proband    intermediate
height.genetic.mother     intermediate
height.genetic.father     intermediate
height.noise.proband      intermediate
height.noise.mother       intermediate
height.noise.father       intermediate
height.phenotype.proband       outcome
height.phenotype.mother        outcome
height.phenotype.father        outcome
BMD.genetic.proband       intermediate
BMD.genetic.mother        intermediate
BMD.genetic.father        intermediate
BMD.noise.proband         intermediate
BMD.noise.mother          intermediate
BMD.noise.father          intermediate
BMD.phenotype.proband          outcome
BMD.phenotype.mother           outcome
BMD.phenotype.father           outcome
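
The vorigin_relative encoding tabulated above can be decoded mechanically. The following is a minimal sketch in plain Python, not part of xftsim: the helper name describe_vorigin is hypothetical, and it assumes codes are passed as strings (or the integer -1) so that leading zeros such as '00' are preserved. It reads the rightmost digit as the proband's parent, which is the pattern the table implies.

```python
def describe_vorigin(code):
    """Translate a vorigin_relative code into a relationship to the proband.

    Hypothetical helper for illustration only; not an xftsim function.
    """
    code = str(code)
    if code == "-1":
        return "self"
    if not code or any(digit not in "01" for digit in code):
        raise ValueError(f"not a valid vorigin_relative code: {code!r}")
    parent = {"0": "mother", "1": "father"}
    # The rightmost digit is the proband's parent; each digit to its left
    # moves one generation further back.
    lineage = [parent[digit] for digit in reversed(code)]
    return "proband's " + "'s ".join(lineage)


for code in (-1, 0, 1, "00", "10", "01", "11", "100"):
    print(code, "->", describe_vorigin(code))
```

For example, '10' decodes to "proband's mother's father", i.e. the maternal grandfather, matching the table.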

There are several ways to construct a component index, the first of which we’ve used several times already:

Constructing a component index from a Cartesian product

To generate a component index by expanding combinations of phenotype_name, component_name, and vorigin_relative, we can use the .from_product() method:

[4]:
xft.index.ComponentIndex.from_product(phenotype_name=('height',),
                                      component_name=('phenotype'),
                                      vorigin_relative=(-1,1))
[4]:
<ComponentIndex>
  1 component of 1 phenotype spanning 2 generations
                         phenotype_name component_name  vorigin_relative  \
component
height.phenotype.proband         height      phenotype                -1
height.phenotype.father          height      phenotype                 1

                         comp_type
component
height.phenotype.proband   outcome
height.phenotype.father    outcome

See above for further examples.
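
To make the expansion concrete, here is a rough sketch of a Cartesian product over the three arguments using only the standard library. It is not xftsim's implementation: expand_components and RELATIVE_LABEL are hypothetical names, and the label format and the proband/mother/father suffixes are simply read off the outputs shown earlier.

```python
from itertools import product

# Suffixes as they appear in the outputs above for the three codes used there.
RELATIVE_LABEL = {-1: "proband", 0: "mother", 1: "father"}


def expand_components(phenotype_name, component_name, vorigin_relative):
    """Hypothetical stand-in for the expansion step; not xftsim code."""
    rows = []
    # product() varies the last argument fastest, which reproduces the
    # row order of the trio example above.
    for pheno, comp, vorigin in product(phenotype_name, component_name, vorigin_relative):
        label = f"{pheno}.{comp}.{RELATIVE_LABEL[vorigin]}"
        rows.append((label, pheno, comp, vorigin))
    return rows


rows = expand_components(("height", "BMD"),
                         ("genetic", "noise", "phenotype"),
                         (-1, 0, 1))
print(len(rows))  # 2 phenotypes * 3 components * 3 relatives = 18
for row in rows[:4]:
    print(row)
```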

Constructing a component index for specific components

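We can also specify the arguments manually. In this case they are matched element-wise rather than expanded, so each position across the three arguments defines one component: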

[5]:
cindex = xft.index.ComponentIndex(phenotype_name=('height','BMD'),
                                  component_name=('phenotype','genetic'),
                                  vorigin_relative=(-1,1))
cindex
[5]:
<ComponentIndex>
  2 components of 2 phenotypes spanning 2 generations
                         phenotype_name component_name  vorigin_relative  \
component
height.phenotype.proband         height      phenotype                -1
BMD.genetic.father                  BMD        genetic                 1

                             comp_type
component
height.phenotype.proband       outcome
BMD.genetic.father        intermediate

or provide a Pandas DataFrame with the same information:

[6]:
xft.index.ComponentIndex(frame=cindex.frame)
[6]:
<ComponentIndex>
  2 components of 2 phenotypes spanning 2 generations
                         phenotype_name component_name  vorigin_relative  \
component
height.phenotype.proband         height      phenotype                -1
BMD.genetic.father                  BMD        genetic                 1

                             comp_type
component
height.phenotype.proband       outcome
BMD.genetic.father        intermediate
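
The difference from the product-based constructor is that manually specified arguments are paired position by position. A minimal sketch of that pairing in plain Python follows. The helper pair_components is hypothetical, and the equal-length check is an assumption of this sketch rather than documented xftsim behavior.

```python
# Suffixes as they appear in the outputs above.
RELATIVE_LABEL = {-1: "proband", 0: "mother", 1: "father"}


def pair_components(phenotype_name, component_name, vorigin_relative):
    """Hypothetical illustration of element-wise pairing; not xftsim code."""
    if not (len(phenotype_name) == len(component_name) == len(vorigin_relative)):
        # Assumption of this sketch: all three arguments have equal length.
        raise ValueError("arguments must have the same length")
    return [f"{pheno}.{comp}.{RELATIVE_LABEL[vorigin]}"
            for pheno, comp, vorigin in zip(phenotype_name, component_name, vorigin_relative)]


labels = pair_components(("height", "BMD"), ("phenotype", "genetic"), (-1, 1))
print(labels)
```

With these inputs the sketch yields two labels, height.phenotype.proband and BMD.genetic.father, rather than the eight combinations a Cartesian product would give.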

Constructing a generic component index

Finally, we can construct a generic component index by providing the number of phenotypes, k_total:

[7]:
xft.index.ComponentIndex(k_total=3)
[7]:
<ComponentIndex>
  1 component of 3 phenotypes spanning 1 generation
                  phenotype_name component_name  vorigin_relative  \
component
0.generic.proband              0        generic                -1
1.generic.proband              1        generic                -1
2.generic.proband              2        generic                -1

                      comp_type
component
0.generic.proband  intermediate
1.generic.proband  intermediate
2.generic.proband  intermediate

Variant indexing

Genetic variants can be indexed with either xft.index.HaploidVariantIndex or xft.index.DiploidVariantIndex objects. The former, which indexes haploid sites, is used when distinguishing between homologous sites is necessary, such as during meiosis. It is trivial to switch between these indices, so we will focus our introduction on xft.index.DiploidVariantIndex.

A DiploidVariantIndex tracks the following variant-level information:

  • vid, a vector of variant IDs

  • chrom, a vector of chromosome IDs

  • zero_allele, the allele corresponding to zeros

  • one_allele, the allele corresponding to ones

  • af, a vector of ancestral allele frequencies.

  • annotation_array, additional variant level annotations

  • h_copy, haplotype copy (always ‘d’ for diploid data, ‘0’ or ‘1’ for haploid data); shown as the hcopy column in the outputs below

  • pos_bp, physical position measured in basepairs

  • pos_cM, position as measured in centiMorgans

There are several ways to construct variant indices, though you’ll rarely need to do this in practice. Typically, you’ll use an automatically generated variant index, regardless of whether you’re using real or synthetic founder data.

Constructing a variant index

To construct a generic variant index, you only need to provide the number of diploid variants m and the number of chromosomes n_chrom:

[8]:
vind = xft.index.DiploidVariantIndex(m=500,n_chrom=22)
vind
[8]:
<DiploidVariantIndex>
  500 diploid variants on 22 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s)
         vid  chrom zero_allele one_allele  af hcopy  pos_bp  pos_cM
variant
0.d        0      0           A          G NaN     d     NaN     NaN
1.d        1      0           A          G NaN     d     NaN     NaN
2.d        2      0           A          G NaN     d     NaN     NaN
3.d        3      0           A          G NaN     d     NaN     NaN
4.d        4      0           A          G NaN     d     NaN     NaN
...      ...    ...         ...        ...  ..   ...     ...     ...
495.d    495     21           A          G NaN     d     NaN     NaN
496.d    496     21           A          G NaN     d     NaN     NaN
497.d    497     21           A          G NaN     d     NaN     NaN
498.d    498     21           A          G NaN     d     NaN     NaN
499.d    499     21           A          G NaN     d     NaN     NaN

[500 rows x 8 columns]

Alternatively, we can supply the above arguments individually (only vid is strictly necessary, in which case all variants are placed on a single chromosome) or via a Pandas DataFrame:

[9]:
xft.index.DiploidVariantIndex(vid=vind.vid)
[9]:
<DiploidVariantIndex>
  500 diploid variants on 1 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s)
         vid  chrom zero_allele one_allele  af hcopy  pos_bp  pos_cM
variant
0.d        0      0           A          G NaN     d     NaN     NaN
1.d        1      0           A          G NaN     d     NaN     NaN
2.d        2      0           A          G NaN     d     NaN     NaN
3.d        3      0           A          G NaN     d     NaN     NaN
4.d        4      0           A          G NaN     d     NaN     NaN
...      ...    ...         ...        ...  ..   ...     ...     ...
495.d    495      0           A          G NaN     d     NaN     NaN
496.d    496      0           A          G NaN     d     NaN     NaN
497.d    497      0           A          G NaN     d     NaN     NaN
498.d    498      0           A          G NaN     d     NaN     NaN
499.d    499      0           A          G NaN     d     NaN     NaN

[500 rows x 8 columns]
[10]:
xft.index.DiploidVariantIndex(frame=vind.frame)
[10]:
<DiploidVariantIndex>
  500 diploid variants on 22 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s)
         vid  chrom zero_allele one_allele  af hcopy  pos_bp  pos_cM
variant
0.d        0      0           A          G NaN     d     NaN     NaN
1.d        1      0           A          G NaN     d     NaN     NaN
2.d        2      0           A          G NaN     d     NaN     NaN
3.d        3      0           A          G NaN     d     NaN     NaN
4.d        4      0           A          G NaN     d     NaN     NaN
...      ...    ...         ...        ...  ..   ...     ...     ...
495.d    495     21           A          G NaN     d     NaN     NaN
496.d    496     21           A          G NaN     d     NaN     NaN
497.d    497     21           A          G NaN     d     NaN     NaN
498.d    498     21           A          G NaN     d     NaN     NaN
499.d    499     21           A          G NaN     d     NaN     NaN

[500 rows x 8 columns]

Switching between haploid and diploid indices

A DiploidVariantIndex can be converted to a haploid index and back via the xft.index.DiploidVariantIndex.to_haploid() and xft.index.HaploidVariantIndex.to_diploid() methods respectively:

[11]:
hvind = vind.to_haploid()
hvind
[11]:
<HaploidVariantIndex>
  500 diploid variants on 22 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s)
         vid  chrom zero_allele one_allele  af hcopy  pos_bp  pos_cM
variant
0.0        0      0           A          G NaN     0     NaN     NaN
0.1        0      0           A          G NaN     1     NaN     NaN
1.0        1      0           A          G NaN     0     NaN     NaN
1.1        1      0           A          G NaN     1     NaN     NaN
2.0        2      0           A          G NaN     0     NaN     NaN
...      ...    ...         ...        ...  ..   ...     ...     ...
497.1    497     21           A          G NaN     1     NaN     NaN
498.0    498     21           A          G NaN     0     NaN     NaN
498.1    498     21           A          G NaN     1     NaN     NaN
499.0    499     21           A          G NaN     0     NaN     NaN
499.1    499     21           A          G NaN     1     NaN     NaN

[1000 rows x 8 columns]
[12]:
hvind.to_diploid()
[12]:
<DiploidVariantIndex>
  500 diploid variants on 22 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s)
         vid  chrom zero_allele one_allele  af hcopy  pos_bp  pos_cM
variant
0.d        0      0           A          G NaN     d     NaN     NaN
1.d        1      0           A          G NaN     d     NaN     NaN
2.d        2      0           A          G NaN     d     NaN     NaN
3.d        3      0           A          G NaN     d     NaN     NaN
4.d        4      0           A          G NaN     d     NaN     NaN
...      ...    ...         ...        ...  ..   ...     ...     ...
495.d    495     21           A          G NaN     d     NaN     NaN
496.d    496     21           A          G NaN     d     NaN     NaN
497.d    497     21           A          G NaN     d     NaN     NaN
498.d    498     21           A          G NaN     d     NaN     NaN
499.d    499     21           A          G NaN     d     NaN     NaN

[500 rows x 8 columns]
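
The round trip above can be pictured purely in terms of the row labels: each diploid label vid.d corresponds to the two haploid labels vid.0 and vid.1. The sketch below mimics that relabeling in plain Python. The function names are hypothetical, and it says nothing about how xftsim converts the remaining columns.

```python
def to_haploid_labels(diploid_labels):
    """Expand 'vid.d' labels into 'vid.0', 'vid.1' pairs (illustrative only)."""
    haploid = []
    for label in diploid_labels:
        vid, copy = label.rsplit(".", 1)
        if copy != "d":
            raise ValueError(f"not a diploid label: {label!r}")
        haploid.extend([f"{vid}.0", f"{vid}.1"])
    return haploid


def to_diploid_labels(haploid_labels):
    """Collapse 'vid.0' / 'vid.1' labels back to one 'vid.d' per variant."""
    collapsed = {}  # dict keys keep first-seen order
    for label in haploid_labels:
        vid, _copy = label.rsplit(".", 1)
        collapsed.setdefault(f"{vid}.d", None)
    return list(collapsed)


diploid = [f"{vid}.d" for vid in range(3)]
haploid = to_haploid_labels(diploid)
print(haploid)
print(to_diploid_labels(haploid) == diploid)
```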

Sample indexing

We reference specific individuals using the xft.index.SampleIndex class, which includes three individual-level pieces of information:

  • iid, a vector of individual IDs

  • fid, a vector of family IDs

  • sex, a vector of biological sexes, with 0 and 1 encoding female and male respectively,

as well as the generation of the sample. Sample indexes are always specific to a single generation.

There are several ways to construct a SampleIndex, all of which will automatically construct unique identifiers:

Constructing a generic sample index

If all we want is a sample of arbitrarily-named, unrelated, sex-balanced individuals, we can simply provide the number of individuals n and the generation:

[13]:
xft.index.SampleIndex(n=5, generation=1)
[13]:
<SampleIndex>
  Generation 1
  5 indviduals from 5 families
  3 biological females
  2 biological males
            iid  fid  sex
sample
1..1_0.1_0  1_0  1_0    0
1..1_1.1_1  1_1  1_1    1
1..1_2.1_2  1_2  1_2    0
1..1_3.1_3  1_3  1_3    1
1..1_4.1_4  1_4  1_4    0

Constructing a sample index for specific iids and fids

We can alternatively provide specific iids, fids, and sexes. Here we create a sample index comprised of two families: three sisters in the first and two brothers in the second:

[14]:
sind = xft.index.SampleIndex(iid = ['0_sister1','0_sister2','0_sister3','0_brother1', '0_brother2'],
                             fid = [0,0,0,1,1],
                             sex = [0,0,0,1,1], generation = 0)
sind
[14]:
<SampleIndex>
  Generation 0
  5 indviduals from 2 families
  3 biological females
  2 biological males
                        iid  fid  sex
sample
0..0_sister1.0    0_sister1    0    0
0..0_sister2.0    0_sister2    0    0
0..0_sister3.0    0_sister3    0    0
0..0_brother1.1  0_brother1    1    1
0..0_brother2.1  0_brother2    1    1

Constructing a sample index with a DataFrame

Finally, we can simply provide a Pandas DataFrame using the frame argument. In this case we also need to specify the generation.

[15]:
sind.frame
[15]:
                        iid  fid  sex
sample
0..0_sister1.0    0_sister1    0    0
0..0_sister2.0    0_sister2    0    0
0..0_sister3.0    0_sister3    0    0
0..0_brother1.1  0_brother1    1    1
0..0_brother2.1  0_brother2    1    1
[16]:
xft.index.SampleIndex(frame=sind.frame, generation = 1)
[16]:
<SampleIndex>
  Generation 1
  5 indviduals from 2 families
  3 biological females
  2 biological males
                        iid  fid  sex
sample
1..0_sister1.0    0_sister1    0    0
1..0_sister2.0    0_sister2    0    0
1..0_sister3.0    0_sister3    0    0
1..0_brother1.1  0_brother1    1    1
1..0_brother2.1  0_brother2    1    1