Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

👨‍🔬 Beta Workflow to Modify Ploidy #145

Merged
merged 11 commits into from
Oct 21, 2024
34 changes: 34 additions & 0 deletions docs/KFDRC_GATK_HC_MOD_PLOIDY_README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# Kids First DRC GATK HaplotypeCaller Modified Ploidy BETA Workflow
This is a research workflow for users wishing to modify the ploidy of certain
regions of their existing GVCF calls.

## Inputs
### Ploidy-related
- input_gvcf: GVCF generated in standard workflow
- region: Specific region to pull, in format 'chr21' or 'chr3:1-1000'
- re_calling_interval_list: Interval list to re-call __in GATK Picard-style Interval List format__
- sample_ploidy: If sample/interval is expected to not have ploidy=2, enter expected ploidy
> For example, for trisomy 21, one might do:
>- region = chr21
>- sample_ploidy = 3
>- re_calling_interval_list = the regions of chr21 that are expected to be ploidy = 3
### Typical haplotype calling
- input_cram: Input CRAM file
- biospecimen_name: String name of biospcimen
- output_basename: String to use as the base for output filenames
- reference_fasta: FASTA file that was used during alignment. Also need
corresponding `.fai` and `.dict` files.
- dbsnp_vcf: dbSNP vcf file
- dbsnp_idx: dbSNP vcf index file
- contamination: Precalculated contamination value. Providing the value here
will skip the run of VerifyBAMID and use the provided value as ground truth.
- contamination_sites_bed: .Bed file for markers used in this
analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)
- contamination_sites_mu: .mu matrix file of genotype matrix
- contamination_sites_ud: .UD matrix file from SVD result of genotype matrix
- wgs_evaluation_interval_list: Target intervals to restrict GVCF metric
analysis (for VariantCallingMetrics)

## Outputs

- mixed_ploidy_gvcf: Updated complete GVCF in which the desired region has had its ploidy updated
2 changes: 1 addition & 1 deletion docs/dockers_gatk_cramtogvcf.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ gatk_indexfeaturefile.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.1.7.0R
picard_collectgvcfcallingmetrics.cwl|pgc-images.sbgenomics.com/d3b-bixu/picard:2.18.9R
picard_intervallisttools.cwl|pgc-images.sbgenomics.com/d3b-bixu/picard:2.18.9R
picard_mergevcfs_python_renamesample.cwl|pgc-images.sbgenomics.com/d3b-bixu/picard:2.18.9R
samtools_cram_to_bam.cwl|pgc-images.sbgenomics.com/d3b-bixu/samtools:1.8-dev
samtools_cram_to_bam.cwl|pgc-images.sbgenomics.com/d3b-bixu/samtools:1.9
samtools_idxstats_xy_ratio.cwl|pgc-images.sbgenomics.com/d3b-bixu/samtools:1.9
untar_indexed_reference_2.cwl|None
verifybamid_contamination_conditional.cwl|pgc-images.sbgenomics.com/d3b-bixu/verifybamid:1.0.2
6 changes: 4 additions & 2 deletions subworkflows/kfdrc_bam_to_gvcf.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -12,15 +12,16 @@ inputs:
contamination_sites_bed: File
contamination_sites_mu: File
contamination_sites_ud: File
input_bam: File
indexed_reference_fasta: File
input_bam: { type: File, secondaryFiles: ['^.bai'] }
indexed_reference_fasta: { type: File, secondaryFiles: ['.fai', '^.dict'] }
output_basename: string
wgs_calling_interval_list: File
dbsnp_vcf: File
dbsnp_idx: File?
reference_dict: File
wgs_evaluation_interval_list: File
conditional_run: int
sample_ploidy: { type: 'int?', doc: "If sample/interval is expected to not have ploidy=2, enter expected ploidy" }

outputs:
verifybamid_output: {type: File, outputSource: verifybamid_checkcontam_conditional/output}
Expand Down Expand Up @@ -63,6 +64,7 @@ steps:
input_bam: input_bam
interval_list: picard_intervallisttools/output
reference: indexed_reference_fasta
sample_ploidy: sample_ploidy
scatter: [interval_list]
out: [output]

Expand Down
46 changes: 46 additions & 0 deletions tools/bcftools_amend_vcf_header.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
cwlVersion: v1.2
class: CommandLineTool
id: bcftools-amend-vcf-header
doc: "A specialized tool to make clear that GATK HC was run again"
requirements:
- class: ShellCommandRequirement
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/bcftools:1.20'
- class: ResourceRequirement
ramMin: 16000
coresMin: 8
- class: InlineJavascriptRequirement
- class: InitialWorkDirRequirement
listing:
- entryname: "amend_header.sh"
entry: |
#!/usr/bin/env bash
set -xeo pipefail

awk 'NR == FNR && $line ~ /^##GATKCommandLine.HaplotypeCaller/ {sub(/Caller,/, "Caller_rpt_subset,"); newcmd=$line}; NR != FNR; NR != FNR && $line ~ /^##GATKCommandLine.HaplotypeCaller/ {print newcmd}' <(bcftools head $2) <(bcftools head $1) > header_build.txt

baseCommand: []
arguments:
- position: 0
shellQuote: false
valueFrom: >-
/bin/bash amend_header.sh
- position: 3
shellQuote: false
valueFrom: >-
&& bcftools reheader --threads $(inputs.threads) -h header_build.txt $(inputs.input_vcf.path) > $(inputs.output_basename).ploidy_mod.g.vcf.gz
&& tabix $(inputs.output_basename).ploidy_mod.g.vcf.gz

inputs:
input_vcf: { type: File, secondaryFiles: ['.tbi'], doc: "Reconstituted VCF with modified ploidy region calls",
inputBinding: { position: 1 } }
mod_vcf: { type: File, secondaryFiles: ['.tbi'], doc: "VCF with modified region cals only. Header will be used to modift input_vcf",
inputBinding: { position: 2 } }
threads: { type: 'int?', default: 4 }
output_basename: string
outputs:
header_amended_vcf:
type: File
outputBinding:
glob: "*.{v,b}cf{,.gz}"
secondaryFiles: ['.tbi?']
43 changes: 43 additions & 0 deletions tools/bedtools_intersect.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
cwlVersion: v1.0
class: CommandLineTool
id: bedtools_intersect
doc: "Subset VCF with bedtools intersect. Can add -v flag for negative selection"
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: ResourceRequirement
ramMin: 8000
coresMin: 4
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/vcfutils:latest'

baseCommand: [/bin/bash -c]
arguments:
- position: 1
shellQuote: false
valueFrom: >-
'set -eo pipefail; bedtools intersect -wa -header
- position: 3
shellQuote: false
valueFrom: >-
| bgzip -c -@ 4 > $(inputs.output_basename).bed_intersect.vcf.gz'
- position: 4
shellQuote: false
valueFrom: >-
&& tabix $(inputs.output_basename).bed_intersect.vcf.gz

inputs:
input_vcf: { type: File, secondaryFiles: ['.tbi'], doc: "Input VCF file.",
inputBinding: { position: 2, prefix: "-a" } }
input_bed_file: { type: File, doc: "bed intervals to intersect with.",
inputBinding: { position: 2, prefix: "-b" } }
output_basename: string
inverse: {type: 'boolean?', doc: "Select whatever is NOT in the interval bed file",
inputBinding: { position: 2, prefix: "-v"} }

outputs:
intersected_vcf:
type: File
outputBinding:
glob: '*.bed_intersect.vcf.gz'
secondaryFiles: ['.tbi']
2 changes: 2 additions & 0 deletions tools/gatk_haplotypecaller.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -40,5 +40,7 @@ inputs:
input_bam: { type: File, secondaryFiles: [^.bai], doc: "Input bam file" }
interval_list: { type: File, doc: "File containing one or more genomic intervals over which to operate" }
contamination: { type: float, doc: "Fraction of contamination in sequencing data (for all samples) to aggressively remove" }
sample_ploidy: { type: 'int?', doc: "If sample/interval is expected to not have ploidy=2, enter expected ploidy",
inputBinding: {position: 1, prefix: "--sample_ploidy" } }
outputs:
output: { type: File, outputBinding: { glob: '*.vcf.gz' }, secondaryFiles: [.tbi] }
26 changes: 26 additions & 0 deletions tools/gatk_intervallist_to_bed.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
cwlVersion: v1.0
class: CommandLineTool
id: gatk4_intervallist2bed
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/gatk:4.1.1.0'
- class: ResourceRequirement
ramMin: 2000

baseCommand: [/gatk, IntervalListToBed, --java-options, -Xmx100m]
arguments:
- position: 1
shellQuote: false
valueFrom: >-
-O $(inputs.interval_list.nameroot).bed

inputs:
interval_list: { type: File, doc: "Interval list to convert",
inputBinding: { position: 0, prefix: "-I"} }
outputs:
output:
type: File
outputBinding:
glob: '*.bed'
19 changes: 12 additions & 7 deletions tools/samtools_cram_to_bam.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -8,19 +8,24 @@ requirements:
ramMin: 16000
coresMin: $(inputs.threads)
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/samtools:1.8-dev'
baseCommand: [samtools, view]
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/samtools:1.9'
baseCommand: [samtools, view, -b]
arguments:
- position: 1
- position: 4
shellQuote: false
valueFrom: >-
-b -T $(inputs.reference.path) -@ $(inputs.threads) $(inputs.input_cram.path) > $(inputs.output_basename).bam
> $(inputs.output_basename).bam
&& samtools index -@ $(inputs.threads) $(inputs.output_basename).bam $(inputs.output_basename).bai
inputs:
input_cram: File
input_cram: { type: File, secondaryFiles: ['.crai?'], doc: "cram file to convert",
inputBinding: { position: 2 } }
region: { type: 'string?', doc: "Specific region to pull, in format 'chr21' or 'chr3:1-1000'",
inputBinding: { position: 3 } }
output_basename: string
threads: { type: 'int?', doc: "num threads to use", default: 8}
reference: {type: File, secondaryFiles: [.fai]}
threads: { type: 'int?', doc: "num threads to use", default: 8,
inputBinding: { position: 1, prefix: "-@"} }
reference: {type: File, secondaryFiles: [.fai], doc: "reference file use for cram",
inputBinding: { position: 1, prefix: "--reference" } }
outputs:
output:
type: File
Expand Down
147 changes: 147 additions & 0 deletions workflows/kfdrc-gatk-haplotypecaller-ploidy-mod-wf.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,147 @@
cwlVersion: v1.2
class: Workflow
id: kfdrc-gatk-haplotypecaller-ploidy-mod-workflow
label: Kids First DRC GATK HaplotypeCaller Modified Ploidy BETA Workflow
doc: |
# Kids First DRC GATK HaplotypeCaller Modified Ploidy BETA Workflow
This is a research workflow for users wishing to modify the ploidy of certain
regions of their existing GVCF calls.

## Inputs
### Ploidy-related
- input_gvcf: GVCF generated in standard workflow
- region: Specific region to pull, in format 'chr21' or 'chr3:1-1000'
- re_calling_interval_list: Interval list to re-call
- sample_ploidy: If sample/interval is expected to not have ploidy=2, enter expected ploidy
> For example, for trisomy 21, one might do:
>- region = chr21
>- sample_ploidy = 3
>- re_calling_interval_list = the regions of chr21 that are expected to be ploidy = 3
### Typical haplotype calling
- input_cram: Input CRAM file
- biospecimen_name: String name of biospcimen
- output_basename: String to use as the base for output filenames
- reference_fasta: FASTA file that was used during alignment. Also need
corresponding `.fai` and `.dict` files.
- dbsnp_vcf: dbSNP vcf file
- dbsnp_idx: dbSNP vcf index file
- contamination: Precalculated contamination value. Providing the value here
will skip the run of VerifyBAMID and use the provided value as ground truth.
- contamination_sites_bed: .Bed file for markers used in this
analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)
- contamination_sites_mu: .mu matrix file of genotype matrix
- contamination_sites_ud: .UD matrix file from SVD result of genotype matrix
- wgs_evaluation_interval_list: Target intervals to restrict GVCF metric
analysis (for VariantCallingMetrics)

## Outputs

- mixed_ploidy_gvcf: Updated complete GVCF in which the desired region has had its ploidy updated


requirements:
- class: ScatterFeatureRequirement
- class: MultipleInputFeatureRequirement
- class: SubworkflowFeatureRequirement
migbro marked this conversation as resolved.
Show resolved Hide resolved
- class: InlineJavascriptRequirement
- class: StepInputExpressionRequirement
inputs:
input_cram: {type: 'File', doc: "Input CRAM file"}
input_gvcf: {type: File, secondaryFiles: ['.tbi'], doc: "gVCF generated in standard workflow"}
biospecimen_name: {type: 'string', doc: "String name of biospecimen"}
output_basename: {type: 'string', doc: "String to use as the base for output filenames"}
reference_fasta: {type: 'File', "sbg:suggestedValue": {class: File, path: 60639014357c3a53540ca7a3, name: Homo_sapiens_assembly38.fasta,
secondaryFiles: [{class: File, path: 60639016357c3a53540ca7af, name: Homo_sapiens_assembly38.fasta.fai}, {class: File, path: 60639019357c3a53540ca7e7,
name: Homo_sapiens_assembly38.dict}]}, secondaryFiles: ['.fai', '^.dict']}
region: {type: 'string?', doc: "Specific region to pull, in format 'chr21' or 'chr3:1-1000'"}
dbsnp_vcf: {type: 'File', doc: "dbSNP vcf file", "sbg:suggestedValue": {class: File, path: 6063901f357c3a53540ca84b, name: Homo_sapiens_assembly38.dbsnp138.vcf}}
dbsnp_idx: {type: 'File?', doc: "dbSNP vcf index file", "sbg:suggestedValue": {class: File, path: 6063901e357c3a53540ca834, name: Homo_sapiens_assembly38.dbsnp138.vcf.idx}}
contamination: {type: 'float?', doc: "Precalculated contamination value. Providing the value here will skip the run of VerifyBAMID
and use the provided value as ground truth."}
contamination_sites_bed: {type: 'File?', doc: ".Bed file for markers used in this analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)",
"sbg:suggestedValue": {class: File, path: 6063901e357c3a53540ca833, name: Homo_sapiens_assembly38.contam.bed}}
contamination_sites_mu: {type: 'File?', doc: ".mu matrix file of genotype matrix", "sbg:suggestedValue": {class: File, path: 60639017357c3a53540ca7cd,
name: Homo_sapiens_assembly38.contam.mu}}
contamination_sites_ud: {type: 'File?', doc: ".UD matrix file from SVD result of genotype matrix", "sbg:suggestedValue": {class: File,
path: 6063901f357c3a53540ca84f, name: Homo_sapiens_assembly38.contam.UD}}
re_calling_interval_list: {type: 'File', doc: "Interval list to re-call"}
wgs_evaluation_interval_list: {type: 'File', doc: "Target intervals to restrict gvcf metric analysis (for VariantCallingMetrics)",
"sbg:suggestedValue": {class: File, path: 60639017357c3a53540ca7d3, name: wgs_evaluation_regions.hg38.interval_list}}
sample_ploidy: {type: 'int?', doc: "If sample/interval is expected to not have ploidy=2, enter expected ploidy"}

outputs:
mixed_ploidy_gvcf: {type: File, outputSource: bcftools_amend_header/header_amended_vcf}

steps:
gatk_intervallist_to_bed:
run: ../tools/gatk_intervallist_to_bed.cwl
in:
interval_list: re_calling_interval_list
out: [output]
bedtools_intersect:
run: ../tools/bedtools_intersect.cwl
in:
input_vcf: input_gvcf
input_bed_file: gatk_intervallist_to_bed/output
output_basename: output_basename
inverse:
valueFrom: ${ return 1==1 }
out: [intersected_vcf]
samtools_cram_to_bam:
run: ../tools/samtools_cram_to_bam.cwl
in:
input_cram: input_cram
output_basename: output_basename
reference: reference_fasta
region: region
threads:
valueFrom: ${ return 16 }
out: [output]
generate_gvcf:
run: ../subworkflows/kfdrc_bam_to_gvcf.cwl
in:
contamination_sites_ud: contamination_sites_ud
contamination_sites_mu: contamination_sites_mu
contamination_sites_bed: contamination_sites_bed
input_bam: samtools_cram_to_bam/output
indexed_reference_fasta: reference_fasta
output_basename: output_basename
dbsnp_vcf: dbsnp_vcf
dbsnp_idx: dbsnp_idx
reference_dict:
source: reference_fasta
valueFrom: "${self.secondaryFiles.filter(function(e) {return e.nameext == '.dict'})[0])}"
wgs_calling_interval_list: re_calling_interval_list
wgs_evaluation_interval_list: wgs_evaluation_interval_list
conditional_run:
valueFrom: $(1)
contamination: contamination
biospecimen_name: biospecimen_name
sample_ploidy: sample_ploidy
out: [verifybamid_output, gvcf, gvcf_calling_metrics]
picard_mergevcfs:
run: ../tools/picard_mergevcfs.cwl
in:
input_vcf:
source: [bedtools_intersect/intersected_vcf, generate_gvcf/gvcf]
output_vcf_basename: output_basename
biospecimen_name: biospecimen_name
out: [output]
bcftools_amend_header:
run: ../tools/bcftools_amend_vcf_header.cwl
hints:
- class: 'sbg:AWSInstanceType'
value: c6i.2xlarge
in:
input_vcf: picard_mergevcfs/output
mod_vcf: generate_gvcf/gvcf
output_basename: output_basename
out: [header_amended_vcf]

$namespaces:
sbg: https://sevenbridges.com
hints:
- class: 'sbg:maxNumberOfParallelInstances'
value: 4
sbg:license: Apache License 2.0
sbg:publisher: KFDRC
Loading