Recombination
xftsim
provides methods for simulating recombination under simplistic and realistic models. In practice, we’ve observed that both procedures tend to yield highly similar results, but we are open to counterexamples. We therefore try to make both simplistic and complex recombination simulations painless.
[1]:
import xftsim as xft
from xftsim.reproduce import RecombinationMap
A RecombinationMap
object maps ordered collections of \(m\) diploid variants per chromosome to a vector of \(m\) probabilities \(p\). The first element of \(p\) in each chromsome is always 0.5, reflecting the randomness of Mendelian inheritance, and the others reflect the independent probabilities of recombination events occuring between contiguous loci. There are currently no mechanisms for interference or subpopulation-specific recombination maps.
Simple recombination schemes
The simplest recombination scheme sets all elements of \(p\) (except those on chromosome boundaries) to a single value. 0.5 would then corresponded to fully unlinked loci, with the strength of local LD increasing as we approach zero. If we provide a single value as the \(p\) argument, this is what we get:
[2]:
demo = xft.sim.DemoSimulation(m=80)
haplotypes = demo.haplotypes
rmap = RecombinationMap(p=.1, vindex=haplotypes.xft.get_variant_indexer())
rmap
[2]:
vid chrom p
0 0 0 0.5
1 1 0 0.1
2 2 0 0.1
3 3 0 0.1
4 4 1 0.5
.. .. ... ...
75 75 18 0.1
76 76 19 0.5
77 77 19 0.1
78 78 19 0.1
79 79 19 0.1
[80 rows x 3 columns]
We can achieve the same outcome providing haplotypes (instead of a variant index) directly using the constant_map_from_haplotypes
method:
[3]:
RecombinationMap.constant_map_from_haplotypes(p=.1, haplotypes=haplotypes)
[3]:
vid chrom p
0 0 0 0.5
1 1 0 0.1
2 2 0 0.1
3 3 0 0.1
4 4 1 0.5
.. .. ... ...
75 75 18 0.1
76 76 19 0.5
77 77 19 0.1
78 78 19 0.1
79 79 19 0.1
[80 rows x 3 columns]
We can also construct a fully arbitrary recombination map by supplying a vector of probabilities rather than a single value:
[4]:
import numpy as np
RecombinationMap(p=np.random.uniform(size=haplotypes.xft.m),
vindex=haplotypes.xft.get_variant_indexer())
[4]:
vid chrom p
0 0 0 0.500000
1 1 0 0.916344
2 2 0 0.969233
3 3 0 0.488298
4 4 1 0.500000
.. .. ... ...
75 75 18 0.832223
76 76 19 0.500000
77 77 19 0.530739
78 78 19 0.401417
79 79 19 0.368004
[80 rows x 3 columns]
Realistic recombination schemes
Given a haplotype array with genetic distances stored in the pos_cM
coordinate of it’s variant index, we can construct the corresponding recombination map. If the haplotype array doesn’t have genetic distance information, we can interpolate it using a genetic map
We demonstrate this below using the example genotype data for parts of chromosome 11 and 12 from the pandas-plink
library and a genetic map downloaded from the pyrho project:
[5]:
from pandas_plink import get_data_folder
from os.path import join
pdat = xft.founders.founder_haplotypes_from_plink_bfile(join(get_data_folder(), "chr*.bed")).compute()
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_read.py:288: UserWarning: More than one FAM file has been specified. Only the first one will be considered.
warnings.warn(msg, UserWarning)
Mapping files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 396.08it/s]
Mapping files: 83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 5/6 [00:00<00:00, 483.60it/s]
Multiple files read in this order: ['chr11', 'chr12']
Note that this data doesn’t include genetic distances:
[6]:
pdat.xft.get_variant_indexer().pos_cM
[6]:
array([nan, nan, nan, ..., nan, nan, nan])
We can read the pyrho maps in using xft.struct.GeneticMap.from_pyrho_maps
and then use xft.interpolate_cM
DataArray method to interpolate the cM distances:
[7]:
paths = ['/tmp/hg19/YRI/YRI_recombination_map_hapmap_format_hg19_chr_'+str(i)+'.txt' for i in range(1,23)]
gmap = xft.struct.GeneticMap.from_pyrho_maps(paths)
pdat.xft.interpolate_cM(gmap)
pdat.xft.get_variant_indexer().pos_cM
[7]:
array([ 0.08585632, 0.08585632, 0.14418587, ..., 28.92352049,
28.93298451, 28.93298451])
Finally we can construct a recombination map from these haplotypes with genetic distance information using the variable_map_from_haplotypes_with_cM
method:
[8]:
RecombinationMap.variable_map_from_haplotypes_with_cM(pdat)
[8]:
vid chrom p
0 316849996 11 0.500000
1 316874359 11 0.000583
2 316941526 11 0.000953
3 317137620 11 0.006634
4 317534352 11 0.002456
... ... ... ...
1247 372752160 12 0.000014
1248 372819760 12 0.000299
1249 372877404 12 0.000245
1250 372918788 12 0.000137
1251 373081507 12 0.000095
[1252 rows x 3 columns]