Founder haplotypes
Overview
xftsim simulations can be highly realistic, but they can also be quite simplistic. Founder data is one place we see this: we can use real phased haplotype data or simulate haplotypes as independent Bernoulli trials. As fully synthetic data is extremely convenient to work with, we recommend at the very least using such data to prototype or debug simulations.
In what follows, we first introduce tools for generating haplotypes from scratch, then cover tools for importing external haplotype data.
Haplotypes from scratch
The simplest methods for generating fully synthetic haplotype data involve generating independent Bernoulli trials. Given a set of allele frequencies, founders.founder_haplotypes_from_AFs() will randomly generate a haplotype array using this method:
[1]:
import xftsim as xft
from xftsim import founders
afs = [0,1,.5]
founders.founder_haplotypes_from_AFs(n=10, afs=afs)
[1]:
<xarray.DataArray 'HaplotypeArray' (sample: 10, variant: 6)>
array([[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 1],
[0, 0, 1, 1, 0, 0],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0],
[0, 0, 1, 1, 1, 0]], dtype=int8)
Coordinates: (12/13)
* sample (sample) <U10 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_9.0_9'
iid (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9'
fid (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9'
sex (sample) int64 0 1 0 1 0 1 0 1 0 1
* variant (variant) <U3 '0.0' '0.1' '1.0' '1.1' '2.0' '2.1'
vid (variant) object '0' '0' '1' '1' '2' '2'
... ...
hcopy (variant) object '0' '1' '0' '1' '0' '1'
zero_allele (variant) object 'A' 'A' 'A' 'A' 'A' 'A'
one_allele (variant) object 'G' 'G' 'G' 'G' 'G' 'G'
af (variant) float64 0.0 0.0 1.0 1.0 0.5 0.5
pos_bp (variant) float64 nan nan nan nan nan nan
pos_cM (variant) float64 nan nan nan nan nan nan
Attributes:
generation: 0For convenience, the function founders.founder_haplotypes_uniform_AFs() will uniformly sample m allele frequencies between minMAF and 1 - minMAF:
[2]:
founders.founder_haplotypes_uniform_AFs(n=10,m=10,minMAF=.05)
[2]:
<xarray.DataArray 'HaplotypeArray' (sample: 10, variant: 20)>
array([[1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1],
[0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
[0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0],
[1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
[1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]],
dtype=int8)
Coordinates: (12/13)
* sample (sample) <U10 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_9.0_9'
iid (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9'
fid (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9'
sex (sample) int64 0 1 0 1 0 1 0 1 0 1
* variant (variant) <U3 '0.0' '0.1' '1.0' '1.1' ... '8.1' '9.0' '9.1'
vid (variant) object '0' '0' '1' '1' '2' ... '7' '8' '8' '9' '9'
... ...
hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
zero_allele (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
one_allele (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G'
af (variant) float64 0.5635 0.5635 0.7249 ... 0.1328 0.1955 0.1955
pos_bp (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
pos_cM (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
Attributes:
generation: 0When creating a generic xft.Index.VariantIndex (to assign variant IDs etc) as in the above examples, xftsim will attempt to evenly divide variants between (up to) 22 chromosomes.
Of course, we can always directly provide haplotypes to the xft.struct.HaplotypeArray constructor:
[3]:
import numpy as np
arbitrary_haplotypes = np.random.binomial(1,.3, (100,200))
xft.struct.HaplotypeArray(arbitrary_haplotypes)
/home/rsb/Dropbox/ftsim/xftsim/xftsim/struct.py:1280: UserWarning: Using empirical allele frequencies as variant_indexer not provided
warnings.warn(
[3]:
<xarray.DataArray 'HaplotypeArray' (sample: 100, variant: 200)>
array([[0, 0, 0, ..., 0, 1, 1],
[1, 1, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
...,
[1, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 1],
[0, 0, 0, ..., 0, 0, 1]], dtype=int8)
Coordinates: (12/13)
* sample (sample) <U12 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_99.0_99'
iid (sample) object '0_0' '0_1' '0_2' ... '0_97' '0_98' '0_99'
fid (sample) object '0_0' '0_1' '0_2' ... '0_97' '0_98' '0_99'
sex (sample) int64 0 1 0 1 0 1 0 1 0 1 0 ... 1 0 1 0 1 0 1 0 1 0 1
* variant (variant) <U4 '0.0' '0.1' '1.0' '1.1' ... '98.1' '99.0' '99.1'
vid (variant) object '0' '0' '1' '1' '2' ... '98' '98' '99' '99'
... ...
hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
zero_allele (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
one_allele (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G'
af (variant) float64 0.315 0.315 0.265 0.265 ... 0.295 0.34 0.34
pos_bp (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
pos_cM (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
Attributes:
generation: 0Haplotypes from external datasets
We currently support the plink binary (bfile) and variant call (VCF) formats by interfacing with the pandas-plink and sgkit libraries, respectively. We hope to add support for bgen and plink2 (pfile) formats in the future.
plink bfiles
The plink bfile format is inherantly diploid and will break phasing. Thus, when reading in plink data, haplotypes at heterozygous loci are assignmened randomly.
For this example, we’ll use the example data that’s packaged with the pandas-plink library:
[4]:
from pandas_plink import get_data_folder
from os.path import join
pdat = founders.founder_haplotypes_from_plink_bfile(join(get_data_folder(), "chr*.bed"))
pdat
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_read.py:288: UserWarning: More than one FAM file has been specified. Only the first one will be considered.
warnings.warn(msg, UserWarning)
Mapping files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 344.31it/s]
Mapping files: 83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 5/6 [00:00<00:00, 345.12it/s]
Multiple files read in this order: ['chr11', 'chr12']
[4]:
<xarray.DataArray 'HaplotypeArray' (sample: 14, variant: 2504)>
dask.array<_genotypes_to_pseudo_haplotypes, shape=(14, 2504), dtype=int8, chunksize=(14, 1558), chunktype=numpy.ndarray>
Coordinates: (12/13)
* sample (sample) <U12 '0..B001.B001' '0..B002.B002' ... '0..B014.B014'
iid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
fid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
sex (sample) int64 2 2 2 2 2 2 2 2 2 2 2 2 2 2
* variant (variant) <U11 '316849996.0' '316849996.1' ... '373081507.1'
vid (variant) object '316849996' '316849996' ... '373081507'
... ...
hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
zero_allele (variant) object 'C' 'C' 'G' 'G' 'G' ... 'A' 'T' 'T' 'G' 'G'
one_allele (variant) object 'T' 'T' 'C' 'C' 'C' ... 'T' 'C' 'C' 'A' 'A'
af (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
pos_bp (variant) int32 157439 157439 181802 ... 27367844 27367844
pos_cM (variant) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
Attributes:
generation: 0As pandas-plink will generate a lazy dask array, we need to use the .compute() method to explicitly load it into memory:
[5]:
pdat.compute()
[5]:
<xarray.DataArray 'HaplotypeArray' (sample: 14, variant: 2504)>
array([[0, 0, 0, ..., 1, 0, 0],
[0, 0, 1, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 1, 1, 0],
[1, 1, 0, ..., 1, 1, 1],
[0, 0, 0, ..., 0, 0, 0]], dtype=int8)
Coordinates: (12/13)
* sample (sample) <U12 '0..B001.B001' '0..B002.B002' ... '0..B014.B014'
iid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
fid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
sex (sample) int64 2 2 2 2 2 2 2 2 2 2 2 2 2 2
* variant (variant) <U11 '316849996.0' '316849996.1' ... '373081507.1'
vid (variant) object '316849996' '316849996' ... '373081507'
... ...
hcopy (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
zero_allele (variant) object 'C' 'C' 'G' 'G' 'G' ... 'A' 'T' 'T' 'G' 'G'
one_allele (variant) object 'T' 'T' 'C' 'C' 'C' ... 'T' 'C' 'C' 'A' 'A'
af (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
pos_bp (variant) int32 157439 157439 181802 ... 27367844 27367844
pos_cM (variant) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
Attributes:
generation: 0We can write haplotype arrays out to plink bfiles using xft.io.write_to_plink1():
[6]:
xft.io.write_to_plink1(pdat, '/tmp/plink')
/home/rsb/Dropbox/ftsim/xftsim/xftsim/io.py:194: UserWarning: Writing to the plink bfile format removes phase information
warnings.warn("Writing to the plink bfile format removes phase information")
Writing BED: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 202.75it/s]
Writing FAM... done.
Writing BIM... done.
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_write.py:261: FutureWarning: the 'line_terminator'' keyword is deprecated, use 'lineterminator' instead.
df.to_csv(
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_write.py:288: FutureWarning: the 'line_terminator'' keyword is deprecated, use 'lineterminator' instead.
df.to_csv(
VCF / sgkit data
Support for VCF files is provided via compatability the sgkit library. We recommend the following procedure:
Convert the VCF file (or a subset there of) to a zarr store on disk using
sgkit.io.vcf.vcf_to_zarr()Lazy load the sgkit data using
sgkit.load_dataset()Convert the sgkit data to an
xftsimcompatable DataArray usingxft.founders.founder_haplotypes_from_sgkit_dataset()Save the founder data to zarr for future use using
xft.io.save_haplotype_zarr()
Haplotypes from zarr stores
Zarr stores are the preferred file format for saving and loading haplotype array data in xftsim. Haplotype data can be read from zarr using xft.io.load_haplotype_zarr() and written using xft.io.save_haplotype_zarr().