feat: implement a new validate command #220

Open
wants to merge 49 commits into
base: main
49 commits
2b9f708
Create base for hapfile validation
Jul 14, 2023
6bd77fc
Solidify and improve validator base
Jul 20, 2023
79698fc
Raise error on type IDs which match chromosome IDs
Jul 20, 2023
ffc7935
Report errors for column additions on non-existent types
Jul 20, 2023
beabc98
Recognize extra columns and cast validation for extra column types
Jul 20, 2023
6b45589
Corrected bug where float values were unrecognized
Jul 20, 2023
2233b09
Allow parsing & reordering of extra columns
Jul 21, 2023
0cb6586
Append feature to cli
Jul 24, 2023
ba5f9d8
Fix bug where the validator would break if no repeats were provided
Jul 24, 2023
af7dbb5
Complete first working instance of the validator.
Jul 24, 2023
4c42e69
Add a pair of test files to the hapfile directory. Corrected a hapfile.
Jul 24, 2023
2693470
Format files with Black
Jul 24, 2023
79a845b
Create test for the validate command
Jul 26, 2023
76176c5
Create tests for validation command
Jul 26, 2023
c6f1f56
Remove debugging print statements
Jul 27, 2023
85a298b
fix pgenlib import issue
aryarm Jul 27, 2023
777114e
Add doc base for the valhap command
Jul 27, 2023
81d38f8
Merge branch 'impl-validate-command' of github.com:CAST-genomics/hapt…
Jul 27, 2023
747b43b
Clean up docs. Add further information.
Jul 27, 2023
f28b902
Fix indentation
Jul 27, 2023
dbe6d87
Fix format.
Jul 27, 2023
32468f9
rename from val_hapfile to to 'validate'
aryarm Jul 30, 2023
390eaeb
implement some suggestions from PR
aryarm Sep 14, 2023
8b324ac
Use relative import for logging module
aryarm Sep 14, 2023
57c81f8
accept pvar instead of pgen
aryarm Sep 15, 2023
61ac08c
change up logging to be silent by default when called from command line
aryarm Sep 16, 2023
c4ecaec
reformat test_validate.py for concision
aryarm Sep 16, 2023
e7efcf6
Merge branch 'main' into impl-validate-command
aryarm Sep 17, 2023
6bbee4b
rename test data dir and remove valhap prefix
aryarm Sep 17, 2023
4b95834
remove test code import prefix
aryarm Sep 17, 2023
1290c7b
Merge branch 'impl-validate-command' of github.com:CAST-genomics/hapt…
aryarm Sep 17, 2023
5614004
add tests for command line and add non zero exit code
aryarm Sep 17, 2023
474f9fc
clarify how sorting works
aryarm Sep 17, 2023
6b7942c
change behavior of sorting parameter
aryarm Sep 18, 2023
d16f7bd
do not skip pytest for pgenlib
aryarm Oct 1, 2023
9234bef
Merge branch 'main' into impl-validate-command
aryarm Oct 1, 2023
fc71adf
refmt with black
aryarm Oct 2, 2023
04ab0e3
Merge branch 'main' into impl-validate-command
aryarm Oct 14, 2023
6065862
remove extra files outside of test dir
aryarm Oct 14, 2023
50d5cb3
rename valhap test dir to validate
aryarm Oct 14, 2023
46ac080
add descriptions to all test commands
aryarm Oct 14, 2023
3db4522
fail validation if any lines are blank
aryarm Oct 14, 2023
6288b8d
add test for whitespace
aryarm Oct 14, 2023
0b0932c
add test for indexed hap file
aryarm Oct 14, 2023
6d81e26
start adding docstrings
aryarm Oct 14, 2023
189eed0
remove max_variants which we will instead infer from the hap file
aryarm Oct 14, 2023
c042b82
start HapFileValidator class commenting
aryarm Oct 29, 2023
3558764
add more comments to validate command
aryarm Nov 11, 2023
d91b2a3
document metadata line handling code
aryarm Feb 23, 2024
76 changes: 76 additions & 0 deletions docs/commands/validate.rst
@@ -0,0 +1,76 @@
.. _commands-validate:


validate
========

Validate the formatting of a :doc:`.hap file </formats/haplotypes>`. Output warnings/errors explaining how the formatting of your ``.hap`` file may be improved.

If a :ref:`.pvar file <formats-genotypesplink>` is provided, the SNPs and TRs present in the ``.hap`` file will be checked for existence in the ``.pvar`` file.

.. note::

    This command will not check that your ``.hap`` file is properly sorted. It only checks formatting.

Usage
~~~~~
.. code-block:: bash

    haptools validate \
    --sorted \
    --genotypes PATH \
    --verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
    HAPFILE

Examples
~~~~~~~~
.. code-block:: bash

    haptools validate tests/data/validate/basic.hap

Outputs a message specifying the number of errors and warnings.

.. code-block::

    [ INFO] Completed .hap file validation with 0 errors and 0 warnings.

All warnings and errors will be logged if there are any.

.. code-block:: bash

    haptools validate tests/data/validate/no_version.hap

.. code-block::

    [ WARNING] No version declaration found. Assuming to use the latest version.
    [ INFO] Completed .hap file validation with 0 errors and 1 warnings.
    Error: Found several warnings and / or errors in the .hap file

All ``.hap`` files must be sorted before they can be validated, so we try our best to sort your ``.hap`` file internally before performing any validation checks.
If your ``.hap`` file is already sorted, you should use the ``--sorted`` parameter. It will speed things up a bit by skipping the sorting step. If your ``.hap`` file is indexed, it will be assumed to be sorted regardless.

.. code-block:: bash

    haptools validate --sorted tests/data/simple.hap

As mentioned before, one can use the ``--genotypes`` flag to provide a ``.pvar`` file with which to compare the existence of variant IDs.
The following will check if all of the variant IDs in the ``.hap`` file appear in the ``.pvar`` file.

.. code-block:: bash

    haptools validate --genotypes tests/data/simple.pvar tests/data/simple.hap

.. note::

    We accept a PVAR file instead of a VCF in order to avoid reading lots of information
    which is not relevant to the validation process. However, any VCF subsetted to just
    its first 8 fields is a valid PVAR file. So you can easily create a PVAR file from a
    VCF using ``cut -f -8`` or ``plink2 --make-just-pvar``.

Detailed Usage
~~~~~~~~~~~~~~

.. click:: haptools.__main__:main
    :prog: haptools
    :show-nested:
    :commands: validate
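
The PVAR note in ``validate.rst`` says that a VCF cut down to its first 8 fields is a valid PVAR file. For readers without ``cut`` or ``plink2`` at hand, the same idea could be sketched in Python roughly as follows. This is only an illustration: ``vcf_to_pvar_lines`` is a hypothetical helper that is not part of haptools, and it assumes an uncompressed, tab-delimited VCF.

```python
from typing import Iterable, Iterator


def vcf_to_pvar_lines(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield PVAR-style lines from the lines of an uncompressed VCF.

    Hypothetical helper for illustration only (not part of haptools).
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # meta-information lines are passed through unchanged
            yield line
        else:
            # the #CHROM header and every record keep only their first
            # 8 tab-separated fields (CHROM through INFO)
            yield "\t".join(line.split("\t")[:8])


example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "1\t10114\trs1\tT\tC\t.\t.\t.\tGT\t0|1\n",
]
pvar_lines = list(vcf_to_pvar_lines(example_vcf))
```

The FORMAT and per-sample columns are dropped, which is the part of a VCF that the validator never needs to read.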
2 changes: 1 addition & 1 deletion docs/formats/genotypes.rst
@@ -18,7 +18,7 @@ Genotype files must be specified as VCF or BCF files. They can be bgzip-compress
PLINK2 PGEN
~~~~~~~~~~~

There is also experimental support for `PLINK2 PGEN <https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf>`_ files in some commands. These files can be loaded and created much more quickly than VCFs, so we highly recommend using them if you're working with large datasets. See the documentation for the :class:`GenotypesPLINK` class in :ref:`the API docs <api-data-genotypesplink>` for more information.
There is also experimental support for `PLINK2 PGEN <https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf>`_ files (accompanied by PVAR and PSAM files) in some commands. These files can be loaded and created much more quickly than VCFs, so we highly recommend using them if you're working with large datasets. See the documentation for the :class:`GenotypesPLINK` class in :ref:`the API docs <api-data-genotypesplink>` for more information.

If you run out of memory when using PGEN files, consider reading/writing variants from the file in chunks via the ``--chunk-size`` parameter.

3 changes: 3 additions & 0 deletions docs/index.rst
@@ -20,6 +20,8 @@ Commands

* :doc:`haptools transform </commands/transform>`: Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

* :doc:`haptools validate </commands/validate>`: Validate the formatting of a haplotype file.

* :doc:`haptools index </commands/index>`: Sort, compress, and index our custom file format for haplotypes.

* :doc:`haptools clump </commands/clump>`: Convert variants in LD with one another into clumps.
@@ -95,6 +97,7 @@ There is an option to *Cite this repository* on the right sidebar of `the reposi
commands/simphenotype.rst
commands/karyogram.rst
commands/transform.rst
commands/validate.rst
commands/index.rst
commands/clump.rst
commands/ld.rst
55 changes: 55 additions & 0 deletions haptools/__main__.py
@@ -1025,6 +1025,61 @@ def clump(
)


@main.command(short_help="Validate the structure of a .hap file")
@click.argument("filename", type=click.Path(exists=True, path_type=Path))
@click.option(
    "--sorted/--not-sorted",
    is_flag=True,
    default=False,
    show_default=True,
    help="Has the file been sorted already?",
)
@click.option(
    "--genotypes",
    type=click.Path(path_type=Path),
    default=None,
    show_default="optional .pvar file to compare against",
    help=(
        "A .pvar file containing variant IDs in order to compare them to the .hap file"
    ),
)
@click.option(
    "-v",
    "--verbosity",
    type=click.Choice(["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]),
    default="INFO",
    show_default=True,
    help="The level of verbosity desired",
)
def validate(
    filename: Path,
    sorted: bool = False,
    genotypes: Path | None = None,
    verbosity: str = "INFO",
):
    """
    Validate the formatting of a .hap file

    Output warnings/errors explaining how the formatting of your .hap file may
    be improved.
    """
    from .logging import getLogger
    from .validate import is_hapfile_valid

    log = getLogger(name="validate", level=verbosity)

    # if the hap file is compressed and a .tbi index exists for it, assume it is sorted
    if filename.suffix == ".gz" and filename.with_suffix(".gz.tbi").exists():
        sorted = True

    is_valid = is_hapfile_valid(filename, sorted=sorted, log=log, pvar=genotypes)

    if not is_valid:
        raise click.ClickException(
            "Found several warnings and / or errors in the .hap file"
        )


if __name__ == "__main__":
    # run the CLI if someone tries 'python -m haptools' on the command line
    main(prog_name="haptools")
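
The index check in `validate` depends on `Path.with_suffix` replacing only the final suffix, so a path ending in `.hap.gz` maps to one ending in `.hap.gz.tbi`. Below is a minimal standalone sketch of that check for anyone who wants to try it outside the CLI. The name `looks_indexed` is hypothetical and does not appear in this PR, and the temporary files are empty placeholders rather than real hap files.

```python
import tempfile
from pathlib import Path


def looks_indexed(hapfile: Path) -> bool:
    """Return True if hapfile ends in .gz and a sibling .tbi file exists.

    Hypothetical helper mirroring the condition used in the validate command.
    """
    # with_suffix swaps only the last suffix: x.hap.gz -> x.hap.gz.tbi
    return hapfile.suffix == ".gz" and hapfile.with_suffix(".gz.tbi").exists()


with tempfile.TemporaryDirectory() as tmp:
    hap = Path(tmp) / "simple.hap.gz"
    hap.touch()
    # no index next to the compressed file yet
    before = looks_indexed(hap)
    (Path(tmp) / "simple.hap.gz.tbi").touch()
    # the index now exists, so the file would be assumed sorted
    after = looks_indexed(hap)
    # an uncompressed file is never treated as indexed
    plain = looks_indexed(Path(tmp) / "simple.hap")
```

Note that the check only looks at file names. It does not confirm that the `.tbi` file is a valid or up-to-date index.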
2 changes: 1 addition & 1 deletion haptools/transform.py
@@ -175,7 +175,7 @@ class GenotypesAncestry(data.GenotypesVCF):
        See documentation for :py:attr:`~.Genotypes.log`
    """

    def __init__(self, fname: Path | str, log: Logger = None):
    def __init__(self, fname: Path | str, log: logging.Logger = None):
        super().__init__(fname, log)
        self.ancestry = None
        self.valid_labels = None