Founder haplotypes

Overview

xftsim simulations can be highly realistic, but they can also be quite simplistic. Founder data is one place we see this: we can use real phased haplotype data or simulate haplotypes as independent Bernoulli trials. As fully synthetic data is extremely convenient to work with, we recommend at the very least using such data to prototype or debug simulations.

In what follows, we first introduce tools for generating haplotypes from scratch, then cover tools for importing external haplotype data.

Haplotypes from external datasets

We currently support the plink binary (bfile) and variant call (VCF) formats by interfacing with the pandas-plink and sgkit libraries, respectively. We hope to add support for bgen and plink2 (pfile) formats in the future.

plink bfiles

The plink bfile format is inherantly diploid and will break phasing. Thus, when reading in plink data, haplotypes at heterozygous loci are assignmened randomly.

For this example, we’ll use the example data that’s packaged with the pandas-plink library:

[4]:

from pandas_plink import get_data_folder
from os.path import join

pdat = founders.founder_haplotypes_from_plink_bfile(join(get_data_folder(), "chr*.bed"))
pdat

/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_read.py:288: UserWarning: More than one FAM file has been specified. Only the first one will be considered.
  warnings.warn(msg, UserWarning)
Mapping files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 344.31it/s]
Mapping files:  83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                          | 5/6 [00:00<00:00, 345.12it/s]

Multiple files read in this order: ['chr11', 'chr12']

[4]:

<xarray.DataArray 'HaplotypeArray' (sample: 14, variant: 2504)>
dask.array<_genotypes_to_pseudo_haplotypes, shape=(14, 2504), dtype=int8, chunksize=(14, 1558), chunktype=numpy.ndarray>
Coordinates: (12/13)
  * sample       (sample) <U12 '0..B001.B001' '0..B002.B002' ... '0..B014.B014'
    iid          (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
    fid          (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
    sex          (sample) int64 2 2 2 2 2 2 2 2 2 2 2 2 2 2
  * variant      (variant) <U11 '316849996.0' '316849996.1' ... '373081507.1'
    vid          (variant) object '316849996' '316849996' ... '373081507'
    ...           ...
    hcopy        (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
    zero_allele  (variant) object 'C' 'C' 'G' 'G' 'G' ... 'A' 'T' 'T' 'G' 'G'
    one_allele   (variant) object 'T' 'T' 'C' 'C' 'C' ... 'T' 'C' 'C' 'A' 'A'
    af           (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
    pos_bp       (variant) int32 157439 157439 181802 ... 27367844 27367844
    pos_cM       (variant) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
Attributes:
    generation:  0

As pandas-plink will generate a lazy dask array, we need to use the .compute() method to explicitly load it into memory:

[5]:

pdat.compute()

We can write haplotype arrays out to plink bfiles using xft.io.write_to_plink1():

[6]:

xft.io.write_to_plink1(pdat, '/tmp/plink')

/home/rsb/Dropbox/ftsim/xftsim/xftsim/io.py:194: UserWarning: Writing to the plink bfile format removes phase information
  warnings.warn("Writing to the plink bfile format removes phase information")
Writing BED: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 202.75it/s]

Writing FAM... done.
Writing BIM... done.


/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_write.py:261: FutureWarning: the 'line_terminator'' keyword is deprecated, use 'lineterminator' instead.
  df.to_csv(
/home/rsb/miniconda3/lib/python3.9/site-packages/pandas_plink/_write.py:288: FutureWarning: the 'line_terminator'' keyword is deprecated, use 'lineterminator' instead.
  df.to_csv(

VCF / sgkit data

Support for VCF files is provided via compatability the sgkit library. We recommend the following procedure:

Convert the VCF file (or a subset there of) to a zarr store on disk using sgkit.io.vcf.vcf_to_zarr()
Lazy load the sgkit data using sgkit.load_dataset()
Convert the sgkit data to an xftsim compatable DataArray using xft.founders.founder_haplotypes_from_sgkit_dataset()
Save the founder data to zarr for future use using xft.io.save_haplotype_zarr()

Haplotypes from zarr stores

Zarr stores are the preferred file format for saving and loading haplotype array data in xftsim. Haplotype data can be read from zarr using xft.io.load_haplotype_zarr() and written using xft.io.save_haplotype_zarr().