Founder haplotypes

Overview

xftsim simulations can be highly realistic, but they can also be quite simplistic. Founder data is one place we see this: we can use real phased haplotype data or simulate haplotypes as independent Bernoulli trials. As fully synthetic data is extremely convenient to work with, we recommend at the very least using such data to prototype or debug simulations.

In what follows, we first introduce tools for generating haplotypes from scratch, then cover tools for importing external haplotype data.

Haplotypes from scratch

The simplest methods for generating fully synthetic haplotype data involve generating independent Bernoulli trials. Given a set of allele frequencies, founders.founder_haplotypes_from_AFs() will randomly generate a haplotype array using this method:

[1]:
import xftsim as xft
from xftsim import founders


afs = [0,1,.5]
founders.founder_haplotypes_from_AFs(n=10, afs=afs)
[1]:
<xarray.DataArray 'HaplotypeArray' (sample: 10, variant: 6)>
array([[0, 0, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 1, 1, 0, 1],
       [0, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0]], dtype=int8)
Coordinates: (12/13)
  * sample       (sample) <U10 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_9.0_9'
    iid          (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9'
    fid          (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9'
    sex          (sample) int64 0 1 0 1 0 1 0 1 0 1
  * variant      (variant) <U3 '0.0' '0.1' '1.0' '1.1' '2.0' '2.1'
    vid          (variant) object '0' '0' '1' '1' '2' '2'
    ...           ...
    hcopy        (variant) object '0' '1' '0' '1' '0' '1'
    zero_allele  (variant) object 'A' 'A' 'A' 'A' 'A' 'A'
    one_allele   (variant) object 'G' 'G' 'G' 'G' 'G' 'G'
    af           (variant) float64 0.0 0.0 1.0 1.0 0.5 0.5
    pos_bp       (variant) float64 nan nan nan nan nan nan
    pos_cM       (variant) float64 nan nan nan nan nan nan
Attributes:
    generation:  0

For convenience, the function founders.founder_haplotypes_uniform_AFs() will uniformly sample m allele frequencies between minMAF and 1 - minMAF:

[2]:
founders.founder_haplotypes_uniform_AFs(n=10,m=10,minMAF=.05)
[2]:
<xarray.DataArray 'HaplotypeArray' (sample: 10, variant: 20)>
array([[1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1],
       [0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
       [0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
       [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]],
      dtype=int8)
Coordinates: (12/13)
  * sample       (sample) <U10 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_9.0_9'
    iid          (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9'
    fid          (sample) object '0_0' '0_1' '0_2' '0_3' ... '0_7' '0_8' '0_9'
    sex          (sample) int64 0 1 0 1 0 1 0 1 0 1
  * variant      (variant) <U3 '0.0' '0.1' '1.0' '1.1' ... '8.1' '9.0' '9.1'
    vid          (variant) object '0' '0' '1' '1' '2' ... '7' '8' '8' '9' '9'
    ...           ...
    hcopy        (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
    zero_allele  (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
    one_allele   (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G'
    af           (variant) float64 0.5635 0.5635 0.7249 ... 0.1328 0.1955 0.1955
    pos_bp       (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
    pos_cM       (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
Attributes:
    generation:  0

When creating a generic xft.Index.VariantIndex (to assign variant IDs etc) as in the above examples, xftsim will attempt to evenly divide variants between (up to) 22 chromosomes.

Of course, we can always directly provide haplotypes to the xft.struct.HaplotypeArray constructor:

[3]:
import numpy as np

arbitrary_haplotypes = np.random.binomial(1,.3, (100,200))
xft.struct.HaplotypeArray(arbitrary_haplotypes)
/home/rsb/Dropbox/ftsim/xftsim/xftsim/struct.py:1280: UserWarning: Using empirical allele frequencies as variant_indexer not provided
  warnings.warn(
[3]:
<xarray.DataArray 'HaplotypeArray' (sample: 100, variant: 200)>
array([[0, 0, 0, ..., 0, 1, 1],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]], dtype=int8)
Coordinates: (12/13)
  * sample       (sample) <U12 '0..0_0.0_0' '0..0_1.0_1' ... '0..0_99.0_99'
    iid          (sample) object '0_0' '0_1' '0_2' ... '0_97' '0_98' '0_99'
    fid          (sample) object '0_0' '0_1' '0_2' ... '0_97' '0_98' '0_99'
    sex          (sample) int64 0 1 0 1 0 1 0 1 0 1 0 ... 1 0 1 0 1 0 1 0 1 0 1
  * variant      (variant) <U4 '0.0' '0.1' '1.0' '1.1' ... '98.1' '99.0' '99.1'
    vid          (variant) object '0' '0' '1' '1' '2' ... '98' '98' '99' '99'
    ...           ...
    hcopy        (variant) object '0' '1' '0' '1' '0' ... '1' '0' '1' '0' '1'
    zero_allele  (variant) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
    one_allele   (variant) object 'G' 'G' 'G' 'G' 'G' ... 'G' 'G' 'G' 'G' 'G'
    af           (variant) float64 0.315 0.315 0.265 0.265 ... 0.295 0.34 0.34
    pos_bp       (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
    pos_cM       (variant) float64 nan nan nan nan nan ... nan nan nan nan nan
Attributes:
    generation:  0

Haplotypes from external datasets

We currently support the plink binary (bfile) and variant call (VCF) formats by interfacing with the pandas-plink and sgkit libraries, respectively. We hope to add support for bgen and plink2 (pfile) formats in the future.

VCF / sgkit data

Support for VCF files is provided via compatability the sgkit library. We recommend the following procedure:

  1. Convert the VCF file (or a subset there of) to a zarr store on disk using sgkit.io.vcf.vcf_to_zarr()

  2. Lazy load the sgkit data using sgkit.load_dataset()

  3. Convert the sgkit data to an xftsim compatable DataArray using xft.founders.founder_haplotypes_from_sgkit_dataset()

  4. Save the founder data to zarr for future use using xft.io.save_haplotype_zarr()

Haplotypes from zarr stores

Zarr stores are the preferred file format for saving and loading haplotype array data in xftsim. Haplotype data can be read from zarr using xft.io.load_haplotype_zarr() and written using xft.io.save_haplotype_zarr().