Skip to content

Commit

Permalink
Add option to input a VCF, instead of calling variants with ntEdit (#19)
Browse files Browse the repository at this point in the history
* Initial commit for third-party VCF input feature

* Bugfixes

* Allow either input VCF to be gzipped

* Add feature to strip INFO from input VCF, add to demo, update documentation

* --strip_info option mostly to be used for ntEdit input VCFs that were already cross-referenced

* Add new demo file

* Add bedtools dependency

* Update expected output file for demo

* Remove any alt alleles in the -l VCF that are in the format <*>

* bedtools intersect chokes on them, and they'd never be cross-referenced anyway

* Use awk instead of perl

* Update --custom_vcf help entry

* Consistent capitalization of VCF
  • Loading branch information
lcoombe authored Jun 19, 2024
1 parent a45b9f5 commit bb6c497
Show file tree
Hide file tree
Showing 7 changed files with 446 additions and 20 deletions.
12 changes: 10 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,12 +49,13 @@ No compilation is required for ntRoot (only the dependencies), so simply add the
- [ntEdit 2.0.2+](https://github.com/bcgsc/ntEdit)
- [snakemake](https://snakemake.readthedocs.io/en/stable/)
- [samtools](https://www.htslib.org/)

- [bedtools](https://bedtools.readthedocs.io/en/latest/)


## Usage <a name=usage></a>
```
usage: ntroot [-h] -r REFERENCE [--reads READS] [--genome GENOME [GENOME ...]] -l L -k K [--tile TILE] [--lai] [-t T] [-z Z] [-j J] [-Y Y] [-v] [-V] [-n] [-f]
usage: ntroot [-h] [-r REFERENCE] [--reads READS] [--genome GENOME [GENOME ...]] -l L [-k K] [--tile TILE] [--lai] [-t T] [-z Z] [-j J] [-Y Y] [--custom_vcf CUSTOM_VCF]
[--strip_info] [-v] [-V] [-n] [-f]
ntRoot: Ancestry inference from genomic data
Expand All @@ -73,6 +74,9 @@ optional arguments:
-z Z Minimum contig length [default=100]
-j J controls size of k-mer subset. When checking subset of k-mers, check every jth k-mer [default=3]
-Y Y Ratio of number of k-mers in the k subset that should be present to accept an edit (higher=stringent) [default=0.55]
--custom_vcf CUSTOM_VCF
Input VCF for computing ancestry. When specified, ntRoot will skip the ntEdit step, and predict ancestry from the provided VCF.
--strip_info When using --custom_vcf, strip the existing INFO field from the input VCF.
-v, --verbose Verbose mode [default=False]
-V, --version show program's version number and exit
-n, --dry-run Print out the commands that will be executed
Expand Down Expand Up @@ -122,6 +126,10 @@ Example command:
ntroot -k 55 --reference GRCh38.fa.gz --reads ERR3242308_ -t 48 -Y 0.55 -l 1000GP_integrated_snv_v2a_27022019.GRCh38.phased_gt1.vcf.gz
</pre>

If you would like to infer ancestry from a pre-existing VCF file:
<pre>
ntroot -r GRCh38.fa.gz --custom_vcf third_party.vcf -l 1000GP_integrated_snv_v2a_27022019.GRCh38.phased_gt1.vcf.gz
</pre>

Note: For more advanced users, and for ancestry predictions on organisms other than human, please contact us.

Expand Down
4 changes: 2 additions & 2 deletions azure-pipelines.yml
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ jobs:
- script: |
source activate ntroot_CI
conda install --yes --name ntroot_CI -c conda-forge -c bioconda python=3.9 mamba
mamba install --yes -c conda-forge -c bioconda meson ninja snakemake perl 'ntedit>=2.0.1' samtools
mamba install --yes -c conda-forge -c bioconda meson ninja snakemake perl 'ntedit>=2.0.1' samtools bedtools
displayName: Install Conda packages
- script: |
Expand Down Expand Up @@ -48,7 +48,7 @@ jobs:
- script: |
source activate ntroot_CI
conda install --yes --name ntroot_CI -c conda-forge -c bioconda python=3.9 mamba
mamba install --yes -c conda-forge -c bioconda snakemake perl 'ntedit>=2.0.1' samtools
mamba install --yes -c conda-forge -c bioconda snakemake perl 'ntedit>=2.0.1' samtools bedtools
displayName: Install Conda packages
- script: |
Expand Down
Binary file added demo/HG002.chr21.vcf.gz
Binary file not shown.
12 changes: 12 additions & 0 deletions demo/run_ntroot_demo.sh
Original file line number Diff line number Diff line change
Expand Up @@ -32,4 +32,16 @@ else
exit 1
fi

echo "Running ntRoot input VCF demo..."

ntroot --reference chr21.fa --custom_vcf HG002.chr21.vcf.gz -l pop-spec-snp_chr21.vcf.gz
prediction=$(cat HG002.chr21.vcf.gz.cross-ref.vcf_ancestry-predictions_tile5000000.tsv | awk '{print $1}' |head -n 2 |tail -n 1)
if [ ${prediction} == "EUR" ]; then
echo "ntRoot input VCF test successful!"
else
echo "ntRoot input VCF test failed.. Please check your installation."
exit 1
fi


echo "Done ntRoot tests!"
48 changes: 35 additions & 13 deletions ntroot
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ def set_up_parser():

parser.add_argument("-k",
help="k-mer size",
required=True, type=int)
required=False, type=int)
parser.add_argument("--tile", help="Tile size for ancestry fraction inference (bp) [default=5000000]",
default=5000000, type=int)
parser.add_argument("--lai", help="Output ancestry predictons per tile in a separate output file",
Expand All @@ -54,6 +54,12 @@ def set_up_parser():
help="Ratio of number of k-mers in the k subset that should be present to accept "
"an edit (higher=stringent) "
"[default=0.55]", default=0.55, type=float)
parser.add_argument("--custom_vcf", help="Input VCF for computing ancestry. "
"When specified, ntRoot will skip the ntEdit step, and "
"predict ancestry from the provided VCF.",
type=str)
parser.add_argument("--strip_info", help="When using --custom_vcf, strip the existing INFO field from the input VCF.",
action="store_true")
parser.add_argument("-v", "--verbose",
help="Verbose mode [default=False]", action="store_true", default=False)
parser.add_argument("-V", "--version", action="version", version=NTROOT_VERSION)
Expand All @@ -71,14 +77,18 @@ def main():

args = parser.parse_args()

if (args.reads and args.genome) or (not args.reads and not args.genome):
if ((args.reads and args.genome) or (not args.reads and not args.genome)) and not args.custom_vcf:
raise argparse.ArgumentTypeError("Please specify --reads OR --genome")

if not args.draft and not args.reference:
raise argparse.ArgumentTypeError("Please specify the reference genome with --reference")
if args.draft:
args.reference = args.draft

if args.reads or args.genome:
if not args.k:
raise argparse.ArgumentTypeError("Please specify the k-mer size with -k")

base_dir = os.path.dirname(os.path.realpath(__file__))

intro_string = ["Running ntRoot...",
Expand All @@ -94,26 +104,38 @@ def main():
smk_rule = "ntroot_genome_lai"
else:
smk_rule = "ntroot_genome"
if args.custom_vcf:
if args.lai:
smk_rule = "ntroot_input_vcf_lai"
else:
smk_rule = "ntroot_input_vcf"

command = f"snakemake -s {base_dir}/ntroot_run_pipeline.smk {smk_rule} --nolock -p --cores {args.t} " \
f"--config reference={args.reference} k={args.k} t={args.t} " \
f"z={args.z} tile_size={args.tile} " \
f"j={args.j} Y={args.Y} "
f"--config reference={args.reference} t={args.t} " \
f"tile_size={args.tile} "

if args.genome:
intro_string.append(f"\t--genome {args.genome}")
command += f"genomes={args.genome} "
command += f"genomes={args.genome}"
elif args.custom_vcf:
intro_string.append(f"\t--custom_vcf {args.custom_vcf}")
command += f" input_vcf={args.custom_vcf}"
if args.strip_info:
intro_string.append("\t--strip_info")
command += " strip_info=True"
else:
intro_string.append(f"\t--reads {args.reads}")
command += f"reads={args.reads}"

intro_string.extend([f"\t-k {args.k}",
f"\t--reference {args.reference}",
f"\t-t {args.t}",
f"\t-z {args.z}",
f"\t-j {args.j}",
f"\t-Y {args.Y}"
])
intro_string.extend([f"\t--reference {args.reference}",
f"\t-t {args.t}"])

if args.genome or args.reads:
intro_string.extend([f"\t-k {args.k}",
f"\t-z {args.z}",
f"\t-j {args.j}",
f"\t-Y {args.Y}"])
command += f" k={args.k} z={args.z} j={args.j} Y={args.Y}"

if args.lai:
intro_string.append("\t--lai")
Expand Down
Loading

0 comments on commit bb6c497

Please sign in to comment.