Founder haplotypes
Overview
xftsim
simulations can be highly realistic, but they can also be quite simplistic. Founder data is one place we see this: we can use real phased haplotype data or simulate haplotypes as independent Bernoulli trials. As fully synthetic data is extremely convenient to work with, we recommend at the very least using such data to prototype or debug simulations.
In what follows, we first introduce tools for generating haplotypes from scratch, then cover tools for importing external haplotype data.
Haplotypes from scratch
The simplest methods for generating fully synthetic haplotype data involve generating independent Bernoulli trials. Given a set of allele frequencies, founders.founder_haplotypes_from_AFs()
will randomly generate a haplotype array using this method:
[1]:
import xftsim as xft
from xftsim import founders
afs = [0,1,.5]
founders.founder_haplotypes_from_AFs(n=10, afs=afs)
[1]:
<xarray.DataArray 'HaplotypeArray' (sample: 10, variant: 6)> array([[0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 1], [0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 1, 0]], dtype=int8) Coordinates: (12/13) * sample (sample) <U10 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_9.0_9' iid (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9' fid (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9' sex (sample) int64 0 1 0 1 0 1 0 1 0 1 * variant (variant) <U3 '0.0' '0.1' '1.0' '1.1' '2.0' '2.1' vid (variant) object '0' '0' '1' '1' '2' '2' ... ... hcopy (variant) object '0' '1' '0' '1' '0' '1' zero_allele (variant) object 'A' 'A' 'A' 'A' 'A' 'A' one_allele (variant) object 'G' 'G' 'G' 'G' 'G' 'G' af (variant) float64 0.0 0.0 1.0 1.0 0.5 0.5 pos_bp (variant) float64 nan nan nan nan nan nan pos_cM (variant) float64 nan nan nan nan nan nan Attributes: generation: 0
For convenience, the function founders.founder_haplotypes_uniform_AFs()
will uniformly sample m
allele frequencies between minMAF
and 1 - minMAF
:
[2]:
founders.founder_haplotypes_uniform_AFs(n=10,m=10,minMAF=.05)
[2]:
<xarray.DataArray 'HaplotypeArray' (sample: 10, variant: 20)> array([[1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0], [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1], [0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0], [0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0], [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0], [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]], dtype=int8) Coordinates: (12/13) * sample (sample) <U10 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_9.0_9' iid (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9' fid (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9' sex (sample) int64 0 1 0 1 0 1 0 1 0 1 * variant (variant) <U3 '0.0' '0.1' '1.0' '1.1' ... '8.1' '9.0' '9.1' vid (variant) object '0' '0' '1' '1' '2' ... '7' '8' '8' '9' '9' ... ... hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1' zero_allele (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A' one_allele (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G' af (variant) float64 0.5635 0.5635 0.7249 ... 0.1328 0.1955 0.1955 pos_bp (variant) float64 nan nan nan nan nan ... nan nan nan nan nan pos_cM (variant) float64 nan nan nan nan nan ... nan nan nan nan nan Attributes: generation: 0
When creating a generic xft.Index.VariantIndex
(to assign variant IDs etc) as in the above examples, xftsim
will attempt to evenly divide variants between (up to) 22 chromosomes.
Of course, we can always directly provide haplotypes to the xft.struct.HaplotypeArray
constructor:
[3]:
import numpy as np
arbitrary_haplotypes = np.random.binomial(1,.3, (100,200))
xft.struct.HaplotypeArray(arbitrary_haplotypes)
/home/rsb/Dropbox/ftsim/xftsim/xftsim/struct.py:1280: UserWarning: Using empirical allele frequencies as variant_indexer not provided
warnings.warn(
[3]:
<xarray.DataArray 'HaplotypeArray' (sample: 100, variant: 200)> array([[0, 0, 0, ..., 0, 1, 1], [1, 1, 0, ..., 0, 0, 0], [1, 0, 0, ..., 0, 0, 0], ..., [1, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 1], [0, 0, 0, ..., 0, 0, 1]], dtype=int8) Coordinates: (12/13) * sample (sample) <U12 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_99.0_99' iid (sample) object '0_0' '0_1' '0_2' ... '0_97' '0_98' '0_99' fid (sample) object '0_0' '0_1' '0_2' ... '0_97' '0_98' '0_99' sex (sample) int64 0 1 0 1 0 1 0 1 0 1 0 ... 1 0 1 0 1 0 1 0 1 0 1 * variant (variant) <U4 '0.0' '0.1' '1.0' '1.1' ... '98.1' '99.0' '99.1' vid (variant) object '0' '0' '1' '1' '2' ... '98' '98' '99' '99' ... ... hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1' zero_allele (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A' one_allele (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G' af (variant) float64 0.315 0.315 0.265 0.265 ... 0.295 0.34 0.34 pos_bp (variant) float64 nan nan nan nan nan ... nan nan nan nan nan pos_cM (variant) float64 nan nan nan nan nan ... nan nan nan nan nan Attributes: generation: 0
Haplotypes from external datasets
We currently support the plink binary (bfile) and variant call (VCF) formats by interfacing with the pandas-plink and sgkit libraries, respectively. We hope to add support for bgen
and plink2 (pfile)
formats in the future.
plink bfiles
The plink bfile format is inherantly diploid and will break phasing. Thus, when reading in plink data, haplotypes at heterozygous loci are assignmened randomly.
For this example, we’ll use the example data that’s packaged with the pandas-plink
library:
[4]:
from pandas_plink import get_data_folder
from os.path import join
pdat = founders.founder_haplotypes_from_plink_bfile(join(get_data_folder(), "chr*.bed"))
pdat
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_read.py:288: UserWarning: More than one FAM file has been specified. Only the first one will be considered.
warnings.warn(msg, UserWarning)
Mapping files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 344.31it/s]
Mapping files: 83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 5/6 [00:00<00:00, 345.12it/s]
Multiple files read in this order: ['chr11', 'chr12']
[4]:
<xarray.DataArray 'HaplotypeArray' (sample: 14, variant: 2504)> dask.array<_genotypes_to_pseudo_haplotypes, shape=(14, 2504), dtype=int8, chunksize=(14, 1558), chunktype=numpy.ndarray> Coordinates: (12/13) * sample (sample) <U12 '0..B001.B001' '0..B002.B002' ... '0..B014.B014' iid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014' fid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014' sex (sample) int64 2 2 2 2 2 2 2 2 2 2 2 2 2 2 * variant (variant) <U11 '316849996.0' '316849996.1' ... '373081507.1' vid (variant) object '316849996' '316849996' ... '373081507' ... ... hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1' zero_allele (variant) object 'C' 'C' 'G' 'G' 'G' ... 'A' 'T' 'T' 'G' 'G' one_allele (variant) object 'T' 'T' 'C' 'C' 'C' ... 'T' 'C' 'C' 'A' 'A' af (variant) float64 nan nan nan nan nan ... nan nan nan nan nan pos_bp (variant) int32 157439 157439 181802 ... 27367844 27367844 pos_cM (variant) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 Attributes: generation: 0
As pandas-plink
will generate a lazy dask array, we need to use the .compute()
method to explicitly load it into memory:
[5]:
pdat.compute()
[5]:
<xarray.DataArray 'HaplotypeArray' (sample: 14, variant: 2504)> array([[0, 0, 0, ..., 1, 0, 0], [0, 0, 1, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [1, 1, 1, ..., 1, 1, 0], [1, 1, 0, ..., 1, 1, 1], [0, 0, 0, ..., 0, 0, 0]], dtype=int8) Coordinates: (12/13) * sample (sample) <U12 '0..B001.B001' '0..B002.B002' ... '0..B014.B014' iid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014' fid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014' sex (sample) int64 2 2 2 2 2 2 2 2 2 2 2 2 2 2 * variant (variant) <U11 '316849996.0' '316849996.1' ... '373081507.1' vid (variant) object '316849996' '316849996' ... '373081507' ... ... hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1' zero_allele (variant) object 'C' 'C' 'G' 'G' 'G' ... 'A' 'T' 'T' 'G' 'G' one_allele (variant) object 'T' 'T' 'C' 'C' 'C' ... 'T' 'C' 'C' 'A' 'A' af (variant) float64 nan nan nan nan nan ... nan nan nan nan nan pos_bp (variant) int32 157439 157439 181802 ... 27367844 27367844 pos_cM (variant) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 Attributes: generation: 0
We can write haplotype arrays out to plink bfiles using xft.io.write_to_plink1()
:
[6]:
xft.io.write_to_plink1(pdat, '/tmp/plink')
/home/rsb/Dropbox/ftsim/xftsim/xftsim/io.py:194: UserWarning: Writing to the plink bfile format removes phase information
warnings.warn("Writing to the plink bfile format removes phase information")
Writing BED: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 202.75it/s]
Writing FAM... done.
Writing BIM... done.
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_write.py:261: FutureWarning: the 'line_terminator'' keyword is deprecated, use 'lineterminator' instead.
df.to_csv(
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_write.py:288: FutureWarning: the 'line_terminator'' keyword is deprecated, use 'lineterminator' instead.
df.to_csv(
VCF / sgkit data
Support for VCF files is provided via compatability the sgkit
library. We recommend the following procedure:
Convert the VCF file (or a subset there of) to a zarr store on disk using
sgkit.io.vcf.vcf_to_zarr()
Lazy load the sgkit data using
sgkit.load_dataset()
Convert the sgkit data to an
xftsim
compatable DataArray usingxft.founders.founder_haplotypes_from_sgkit_dataset()
Save the founder data to zarr for future use using
xft.io.save_haplotype_zarr()
Haplotypes from zarr stores
Zarr stores are the preferred file format for saving and loading haplotype array data in xftsim
. Haplotype data can be read from zarr using xft.io.load_haplotype_zarr()
and written using xft.io.save_haplotype_zarr()
.