-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
👨🔬 Beta Workflow to Modify Ploidy #145
Changes from 5 commits
49619a1
9aae9f5
ba85e1b
ef50b87
e279d12
ec516a6
ac60a67
08b5c2f
cd8f7aa
1132157
e9861ed
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
# Kids First DRC GATK HaplotypeCaller Modified Ploidy BETA Workflow | ||
This is a research workflow for users wishing to modify the ploidy of certain | ||
regions of their existing GVCF calls. | ||
|
||
## Inputs | ||
|
||
- input_cram: Input CRAM file | ||
- input_gvcf: GVCF generated in standard workflow | ||
- biospecimen_name: String name of biospcimen | ||
- output_basename: String to use as the base for output filenames | ||
- reference_fasta: FASTA file that was used during alignment. Also need | ||
corresponding `.fai` and `.dict` files. | ||
- region: Specific region to pull, in format 'chr21' or 'chr3:1-1000' | ||
- dbsnp_vcf: dbSNP vcf file | ||
- dbsnp_idx: dbSNP vcf index file | ||
- contamination: Precalculated contamination value. Providing the value here | ||
will skip the run of VerifyBAMID and use the provided value as ground truth. | ||
- contamination_sites_bed: .Bed file for markers used in this | ||
analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele) | ||
- contamination_sites_mu: .mu matrix file of genotype matrix | ||
- contamination_sites_ud: .UD matrix file from SVD result of genotype matrix | ||
- re_calling_interval_list: Interval list to re-call | ||
- wgs_evaluation_interval_list: Target intervals to restrict GVCF metric | ||
analysis (for VariantCallingMetrics) | ||
- sample_ploidy: If sample/interval is expected to not have ploidy=2, enter expected ploidy | ||
|
||
## Outputs | ||
|
||
- mixed_ploidy_gvcf: Updated complete GVCF in which the desired region has had its ploidy updated |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
cwlVersion: v1.0 | ||
class: CommandLineTool | ||
id: bedtools_intersect | ||
doc: "Subset VCF with bedtools intersect. Can add -v flag for negative selection" | ||
requirements: | ||
- class: ShellCommandRequirement | ||
- class: InlineJavascriptRequirement | ||
- class: ResourceRequirement | ||
ramMin: 8000 | ||
coresMin: 4 | ||
- class: DockerRequirement | ||
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/vcfutils:latest' | ||
|
||
baseCommand: [bedtools, intersect] | ||
arguments: | ||
- position: 2 | ||
shellQuote: false | ||
valueFrom: >- | ||
-wa -header | bgzip -c -@ 4 > $(inputs.output_basename).bed_intersect.vcf.gz | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this can pipefail |
||
&& tabix $(inputs.output_basename).bed_intersect.vcf.gz | ||
|
||
inputs: | ||
input_vcf: { type: File, secondaryFiles: ['.tbi'], doc: "Input VCF file.", | ||
inputBinding: { position: 1, prefix: "-a" } } | ||
input_bed_file: { type: File, doc: "bed intervals to intersect with.", | ||
inputBinding: { position: 1, prefix: "-b" } } | ||
output_basename: string | ||
inverse: {type: 'boolean?', doc: "Select whatever is NOT in the interval bed file", | ||
inputBinding: { position: 1, prefix: "-v"} } | ||
|
||
outputs: | ||
intersected_vcf: | ||
type: File | ||
outputBinding: | ||
glob: '*.bed_intersect.vcf.gz' | ||
secondaryFiles: ['.tbi'] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
cwlVersion: v1.0 | ||
class: CommandLineTool | ||
id: gatk4_intervallist2bed | ||
requirements: | ||
- class: ShellCommandRequirement | ||
- class: InlineJavascriptRequirement | ||
- class: DockerRequirement | ||
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/gatk:4.1.1.0' | ||
- class: ResourceRequirement | ||
ramMin: 2000 | ||
|
||
baseCommand: [/gatk, IntervalListToBed, --java-options, -Xmx100m] | ||
arguments: | ||
- position: 1 | ||
shellQuote: false | ||
valueFrom: >- | ||
-O $(inputs.interval_list.nameroot).bed | ||
|
||
inputs: | ||
interval_list: { type: File, doc: "Interval list to convert", | ||
inputBinding: { position: 0, prefix: "-I"} } | ||
outputs: | ||
output: | ||
type: File | ||
outputBinding: | ||
glob: '*.bed' |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,19 +8,24 @@ requirements: | |
ramMin: 16000 | ||
coresMin: $(inputs.threads) | ||
- class: DockerRequirement | ||
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/samtools:1.8-dev' | ||
baseCommand: [samtools, view] | ||
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/samtools:1.9' | ||
baseCommand: [samtools, view, -b] | ||
arguments: | ||
- position: 1 | ||
- position: 3 | ||
shellQuote: false | ||
valueFrom: >- | ||
-b -T $(inputs.reference.path) -@ $(inputs.threads) $(inputs.input_cram.path) > $(inputs.output_basename).bam | ||
> $(inputs.output_basename).bam | ||
&& samtools index -@ $(inputs.threads) $(inputs.output_basename).bam $(inputs.output_basename).bai | ||
inputs: | ||
input_cram: File | ||
input_cram: { type: File, secondaryFiles: ['.crai?'], doc: "cram file to convert", | ||
inputBinding: { position: 2 } } | ||
region: { type: 'string?', doc: "Specific region to pull, in format 'chr21' or 'chr3:1-1000'", | ||
inputBinding: { position: 2 } } | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doesn't this positionally have to come after input_cram? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. technically it does, because the tie breaker of them both being position 2 is that I declared input cram first. However, that is a dangerous technicality, so I'll make it 1 |
||
output_basename: string | ||
threads: { type: 'int?', doc: "num threads to use", default: 8} | ||
reference: {type: File, secondaryFiles: [.fai]} | ||
threads: { type: 'int?', doc: "num threads to use", default: 8, | ||
inputBinding: { position: 1, prefix: "-@"} } | ||
reference: {type: File, secondaryFiles: [.fai], doc: "reference file use for cram", | ||
inputBinding: { position: 1, prefix: "--reference" } } | ||
outputs: | ||
output: | ||
type: File | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,132 @@ | ||
cwlVersion: v1.2 | ||
class: Workflow | ||
id: kfdrc-gatk-haplotypecaller-ploidy-mod-workflow | ||
label: Kids First DRC GATK HaplotypeCaller Modified Ploidy BETA Workflow | ||
doc: | | ||
# Kids First DRC GATK HaplotypeCaller Modified Ploidy BETA Workflow | ||
This is a research workflow for users wishing to modify the ploidy of certain | ||
regions of their existing GVCF calls. | ||
|
||
## Inputs | ||
|
||
- input_cram: Input CRAM file | ||
- input_gvcf: GVCF generated in standard workflow | ||
- biospecimen_name: String name of biospcimen | ||
- output_basename: String to use as the base for output filenames | ||
- reference_fasta: FASTA file that was used during alignment. Also need | ||
corresponding `.fai` and `.dict` files. | ||
- region: Specific region to pull, in format 'chr21' or 'chr3:1-1000' | ||
- dbsnp_vcf: dbSNP vcf file | ||
- dbsnp_idx: dbSNP vcf index file | ||
- contamination: Precalculated contamination value. Providing the value here | ||
will skip the run of VerifyBAMID and use the provided value as ground truth. | ||
- contamination_sites_bed: .Bed file for markers used in this | ||
analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele) | ||
- contamination_sites_mu: .mu matrix file of genotype matrix | ||
- contamination_sites_ud: .UD matrix file from SVD result of genotype matrix | ||
- re_calling_interval_list: Interval list to re-call | ||
- wgs_evaluation_interval_list: Target intervals to restrict GVCF metric | ||
analysis (for VariantCallingMetrics) | ||
- sample_ploidy: If sample/interval is expected to not have ploidy=2, enter expected ploidy | ||
|
||
## Outputs | ||
|
||
- mixed_ploidy_gvcf: Updated complete GVCF in which the desired region has had its ploidy updated | ||
|
||
|
||
requirements: | ||
- class: ScatterFeatureRequirement | ||
- class: MultipleInputFeatureRequirement | ||
- class: SubworkflowFeatureRequirement | ||
migbro marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- class: InlineJavascriptRequirement | ||
- class: StepInputExpressionRequirement | ||
inputs: | ||
input_cram: {type: 'File', doc: "Input CRAM file"} | ||
input_gvcf: {type: File, secondaryFiles: ['.tbi'], doc: "gVCF generated in standard workflow"} | ||
biospecimen_name: {type: 'string', doc: "String name of biospecimen"} | ||
output_basename: {type: 'string', doc: "String to use as the base for output filenames"} | ||
reference_fasta: {type: 'File', "sbg:suggestedValue": {class: File, path: 60639014357c3a53540ca7a3, name: Homo_sapiens_assembly38.fasta, | ||
secondaryFiles: [{class: File, path: 60639016357c3a53540ca7af, name: Homo_sapiens_assembly38.fasta.fai}, {class: File, path: 60639019357c3a53540ca7e7, | ||
name: Homo_sapiens_assembly38.dict}]}, secondaryFiles: ['.fai', '^.dict']} | ||
region: {type: 'string?', doc: "Specific region to pull, in format 'chr21' or 'chr3:1-1000'"} | ||
dbsnp_vcf: {type: 'File', doc: "dbSNP vcf file", "sbg:suggestedValue": {class: File, path: 6063901f357c3a53540ca84b, name: Homo_sapiens_assembly38.dbsnp138.vcf}} | ||
dbsnp_idx: {type: 'File?', doc: "dbSNP vcf index file", "sbg:suggestedValue": {class: File, path: 6063901e357c3a53540ca834, name: Homo_sapiens_assembly38.dbsnp138.vcf.idx}} | ||
contamination: {type: 'float?', doc: "Precalculated contamination value. Providing the value here will skip the run of VerifyBAMID | ||
and use the provided value as ground truth."} | ||
contamination_sites_bed: {type: 'File?', doc: ".Bed file for markers used in this analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)", | ||
"sbg:suggestedValue": {class: File, path: 6063901e357c3a53540ca833, name: Homo_sapiens_assembly38.contam.bed}} | ||
contamination_sites_mu: {type: 'File?', doc: ".mu matrix file of genotype matrix", "sbg:suggestedValue": {class: File, path: 60639017357c3a53540ca7cd, | ||
name: Homo_sapiens_assembly38.contam.mu}} | ||
contamination_sites_ud: {type: 'File?', doc: ".UD matrix file from SVD result of genotype matrix", "sbg:suggestedValue": {class: File, | ||
path: 6063901f357c3a53540ca84f, name: Homo_sapiens_assembly38.contam.UD}} | ||
re_calling_interval_list: {type: 'File', doc: "Interval list to re-call"} | ||
wgs_evaluation_interval_list: {type: 'File', doc: "Target intervals to restrict gvcf metric analysis (for VariantCallingMetrics)", | ||
"sbg:suggestedValue": {class: File, path: 60639017357c3a53540ca7d3, name: wgs_evaluation_regions.hg38.interval_list}} | ||
sample_ploidy: {type: 'int?', doc: "If sample/interval is expected to not have ploidy=2, enter expected ploidy"} | ||
|
||
outputs: | ||
mixed_ploidy_gvcf: {type: File, outputSource: picard_mergevcfs_python_renamesample/output} | ||
|
||
steps: | ||
gatk_intervallist_to_bed: | ||
run: ../tools/gatk_intervallist_to_bed.cwl | ||
in: | ||
interval_list: re_calling_interval_list | ||
out: [output] | ||
bedtools_intersect: | ||
run: ../tools/bedtools_intersect.cwl | ||
in: | ||
input_vcf: input_gvcf | ||
input_bed_file: gatk_intervallist_to_bed/output | ||
output_basename: output_basename | ||
inverse: | ||
valueFrom: ${ return 1==1 } | ||
out: [intersected_vcf] | ||
samtools_cram_to_bam: | ||
run: ../tools/samtools_cram_to_bam.cwl | ||
in: | ||
input_cram: input_cram | ||
output_basename: output_basename | ||
reference: reference_fasta | ||
region: region | ||
threads: | ||
valueFrom: ${ return 16 } | ||
out: [output] | ||
generate_gvcf: | ||
run: ../subworkflows/kfdrc_bam_to_gvcf.cwl | ||
in: | ||
contamination_sites_ud: contamination_sites_ud | ||
contamination_sites_mu: contamination_sites_mu | ||
contamination_sites_bed: contamination_sites_bed | ||
input_bam: samtools_cram_to_bam/output | ||
indexed_reference_fasta: reference_fasta | ||
output_basename: output_basename | ||
dbsnp_vcf: dbsnp_vcf | ||
dbsnp_idx: dbsnp_idx | ||
reference_dict: | ||
source: reference_fasta | ||
valueFrom: "${self.secondaryFiles.filter(function(e) {return e.nameext == '.dict'})[0])}" | ||
wgs_calling_interval_list: re_calling_interval_list | ||
wgs_evaluation_interval_list: wgs_evaluation_interval_list | ||
conditional_run: | ||
valueFrom: $(1) | ||
contamination: contamination | ||
biospecimen_name: biospecimen_name | ||
sample_ploidy: sample_ploidy | ||
out: [verifybamid_output, gvcf, gvcf_calling_metrics] | ||
picard_mergevcfs_python_renamesample: | ||
run: ../tools/picard_mergevcfs_python_renamesample.cwl | ||
in: | ||
input_vcf: | ||
source: [bedtools_intersect/intersected_vcf, generate_gvcf/gvcf] | ||
output_vcf_basename: output_basename | ||
biospecimen_name: biospecimen_name | ||
out: [output] | ||
|
||
$namespaces: | ||
sbg: https://sevenbridges.com | ||
hints: | ||
- class: 'sbg:maxNumberOfParallelInstances' | ||
value: 4 | ||
sbg:license: Apache License 2.0 | ||
sbg:publisher: KFDRC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just put a blurb at the top that highlights:
region
re_calling_interval_list
sample_ploidy
These are the three "active" inputs. It would be good to have an example like: