Recombination

xftsim provides methods for simulating recombination under simplistic and realistic models. In practice, we’ve observed that both procedures tend to yield highly similar results, but we are open to counterexamples. We therefore try to make both simplistic and complex recombination simulations painless.

[1]:
import xftsim as xft
from xftsim.reproduce import RecombinationMap

A RecombinationMap object maps ordered collections of \(m\) diploid variants per chromosome to a vector of \(m\) probabilities \(p\). The first element of \(p\) in each chromsome is always 0.5, reflecting the randomness of Mendelian inheritance, and the others reflect the independent probabilities of recombination events occuring between contiguous loci. There are currently no mechanisms for interference or subpopulation-specific recombination maps.

Simple recombination schemes

The simplest recombination scheme sets all elements of \(p\) (except those on chromosome boundaries) to a single value. 0.5 would then corresponded to fully unlinked loci, with the strength of local LD increasing as we approach zero. If we provide a single value as the \(p\) argument, this is what we get:

[2]:
demo = xft.sim.DemoSimulation(m=80)
haplotypes = demo.haplotypes

rmap = RecombinationMap(p=.1, vindex=haplotypes.xft.get_variant_indexer())
rmap
[2]:
   vid  chrom    p
0    0      0  0.5
1    1      0  0.1
2    2      0  0.1
3    3      0  0.1
4    4      1  0.5
..  ..    ...  ...
75  75     18  0.1
76  76     19  0.5
77  77     19  0.1
78  78     19  0.1
79  79     19  0.1

[80 rows x 3 columns]

We can achieve the same outcome providing haplotypes (instead of a variant index) directly using the constant_map_from_haplotypes method:

[3]:
RecombinationMap.constant_map_from_haplotypes(p=.1, haplotypes=haplotypes)
[3]:
   vid  chrom    p
0    0      0  0.5
1    1      0  0.1
2    2      0  0.1
3    3      0  0.1
4    4      1  0.5
..  ..    ...  ...
75  75     18  0.1
76  76     19  0.5
77  77     19  0.1
78  78     19  0.1
79  79     19  0.1

[80 rows x 3 columns]

We can also construct a fully arbitrary recombination map by supplying a vector of probabilities rather than a single value:

[4]:
import numpy as np
RecombinationMap(p=np.random.uniform(size=haplotypes.xft.m),
                 vindex=haplotypes.xft.get_variant_indexer())
[4]:
   vid  chrom         p
0    0      0  0.500000
1    1      0  0.916344
2    2      0  0.969233
3    3      0  0.488298
4    4      1  0.500000
..  ..    ...       ...
75  75     18  0.832223
76  76     19  0.500000
77  77     19  0.530739
78  78     19  0.401417
79  79     19  0.368004

[80 rows x 3 columns]

Realistic recombination schemes

Given a haplotype array with genetic distances stored in the pos_cM coordinate of it’s variant index, we can construct the corresponding recombination map. If the haplotype array doesn’t have genetic distance information, we can interpolate it using a genetic map

We demonstrate this below using the example genotype data for parts of chromosome 11 and 12 from the pandas-plink library and a genetic map downloaded from the pyrho project:

[5]:
from pandas_plink import get_data_folder
from os.path import join

pdat = xft.founders.founder_haplotypes_from_plink_bfile(join(get_data_folder(), "chr*.bed")).compute()

/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_read.py:288: UserWarning: More than one FAM file has been specified. Only the first one will be considered.
  warnings.warn(msg, UserWarning)
Mapping files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 396.08it/s]
Mapping files:  83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                          | 5/6 [00:00<00:00, 483.60it/s]
Multiple files read in this order: ['chr11', 'chr12']

Note that this data doesn’t include genetic distances:

[6]:
pdat.xft.get_variant_indexer().pos_cM
[6]:
array([nan, nan, nan, ..., nan, nan, nan])

We can read the pyrho maps in using xft.struct.GeneticMap.from_pyrho_maps and then use xft.interpolate_cM DataArray method to interpolate the cM distances:

[7]:
paths = ['/tmp/hg19/YRI/YRI_recombination_map_hapmap_format_hg19_chr_'+str(i)+'.txt' for i in range(1,23)]

gmap = xft.struct.GeneticMap.from_pyrho_maps(paths)

pdat.xft.interpolate_cM(gmap)
pdat.xft.get_variant_indexer().pos_cM
[7]:
array([ 0.08585632,  0.08585632,  0.14418587, ..., 28.92352049,
       28.93298451, 28.93298451])

Finally we can construct a recombination map from these haplotypes with genetic distance information using the variable_map_from_haplotypes_with_cM method:

[8]:
RecombinationMap.variable_map_from_haplotypes_with_cM(pdat)
[8]:
            vid chrom         p
0     316849996    11  0.500000
1     316874359    11  0.000583
2     316941526    11  0.000953
3     317137620    11  0.006634
4     317534352    11  0.002456
...         ...   ...       ...
1247  372752160    12  0.000014
1248  372819760    12  0.000299
1249  372877404    12  0.000245
1250  372918788    12  0.000137
1251  373081507    12  0.000095

[1252 rows x 3 columns]