Skip to content

Commit

Permalink
docs: creating an .hanc file from a .bp file (#260)
Browse files Browse the repository at this point in the history
  • Loading branch information
aryarm authored Nov 5, 2024
1 parent cf4ccb2 commit b6dcd0c
Show file tree
Hide file tree
Showing 5 changed files with 87 additions and 1 deletion.
2 changes: 2 additions & 0 deletions docs/api/data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -636,6 +636,8 @@ You'll have to call ``__iter()__`` manually if you want to specify any function
for sample, blocks in breakpoints.__iter__(samples={"HG00097", "HG00099"}):
print(sample, blocks)
.. _api-data-bp2anc:

Obtaining ancestral labels for a list of positions
**************************************************
In the end, we're usually only interested in the ancestral labels of a set of variant positions, as a matrix of values. The ``population_array()`` method generates a numpy array denoting the ancestral label of each sample for each variant you specify.
Expand Down
45 changes: 45 additions & 0 deletions docs/api/examples.rst
Original file line number Diff line number Diff line change
Expand Up @@ -108,3 +108,48 @@ You can use the :ref:`data API <api-data>` and the :ref:`simphenotype API <api-h
hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=allele),)
hp.write()
.. _api-examples-bp2anc:

Converting a ``.bp`` file into a ``.hanc`` per-site ancestry file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can obtain the ancestry of a list of variants directly from a ``.bp`` file using the :ref:`data API <api-data-bp2anc>`.

**Input:**

* Breakpoints in a :ref:`.bp file <formats-breakpoints>`
* A list of variants in a :ref:`PLINK2 PVAR file <formats-genotypesplink>`

**Output:**

* An ``.hanc`` per-site ancestry file as described in `the admix-simu documentation <https://github.com/williamslab/admix-simu/tree/master?tab=readme-ov-file#per-site-ancestry-values>`_:

.. include:: ../../tests/data/simple.hanc
:literal:

.. code-block:: python
import numpy as np
from pathlib import Path
from haptools import data
output = Path("output.hanc")
# load breakpoints from the bp file and encode each population label as an int
breakpoints = data.Breakpoints.load("tests/data/simple.bp")
breakpoints.encode()
print(breakpoints.labels)
# load the SNPs array from a PVAR file
snps = data.GenotypesPLINK("tests/data/simple.pgen")
snps.read_variants()
snps = snps.variants[["chrom", "pos"]]
# create array of per-site ancestry values
arr = breakpoints.population_array(variants=snps)
# reshape from n x p x 2 to n*2 x p
# so rows are haplotypes and columns are variants
arr = arr.transpose((0, 2, 1)).reshape(-1, arr.shape[1])
# write to haplotype ancestry file
np.savetxt(output, arr, fmt="%i", delimiter="")
3 changes: 2 additions & 1 deletion haptools/data/genotypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -1236,7 +1236,8 @@ def read_variants(
if variants is not None:
max_variants = len(variants)
if max_variants is None:
raise ValueError("Provide either the variants or max_variants parameter!")
p = pgenlib.PvarReader(bytes(str(self.fname.with_suffix(".pvar")), "utf8"))
max_variants = p.get_variant_ct()
# first, preallocate the array and the indices of each variant
self.variants = np.empty((max_variants,), dtype=self.variants.dtype)
indices = np.empty((max_variants,), dtype=np.uint32)
Expand Down
10 changes: 10 additions & 0 deletions tests/data/simple.hanc
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
0000
1020
0000
0100
0010
0000
2101
0100
0002
0102
28 changes: 28 additions & 0 deletions tests/test_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -2221,3 +2221,31 @@ def test_gts2hap(self):
with open(DATADIR / "apoe.hap") as expected:
assert hp_file.read() == expected.read()
hp.fname.unlink()

def test_bp2anc(self):
output = Path("output.hanc")

# load breakpoints from the bp file and encode each population label as an int
breakpoints = Breakpoints.load(DATADIR / "simple.bp")
breakpoints.encode()
# print(breakpoints.labels)

# load the SNPs array from a PVAR file
snps = GenotypesPLINK(DATADIR / "simple.pgen")
snps.read_variants()
snps = snps.variants[["chrom", "pos"]]

# create array of per-site ancestry values
arr = breakpoints.population_array(variants=snps)
# reshape from n x p x 2 to n*2 x p
# so rows are haplotypes and columns are variants
arr = arr.transpose((0, 2, 1)).reshape(-1, arr.shape[1])

# write to haplotype ancestry file
np.savetxt(output, arr, fmt="%i", delimiter="")

# validate the output and clean up afterwards
with open(output) as anc_file:
with open(DATADIR / "simple.hanc") as expected:
assert anc_file.read() == expected.read()
output.unlink()

0 comments on commit b6dcd0c

Please sign in to comment.