Update GatherSampleEvidence & TrainGCNV docs #681
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,25 +5,205 @@ sidebar_position: 4 | |
slug: gbe | ||
--- | ||
|
||
Runs CNV callers (cnMOPs, GATK gCNV) and combines single-sample | ||
raw evidence into a batch. See above for more information on batching. | ||
Runs CNV callers ([cn.MOPS](https://academic.oup.com/nar/article/40/9/e69/1136601), GATK gCNV) | ||
|
||
and combines single-sample raw evidence into a batch. | ||
|
||
### Prerequisites | ||
|
||
- GatherSampleEvidence | ||
- (Recommended) EvidenceQC | ||
- gCNV training. | ||
```mermaid | ||
|
||
### Inputs | ||
- PED file (updated with EvidenceQC sex assignments, including sex = 0 | ||
for sex aneuploidies. Calls will not be made on sex chromosomes | ||
when sex = 0 in order to avoid generating many confusing calls | ||
or upsetting normalized copy numbers for the batch.) | ||
- Read count, BAF, PE, SD, and SR files (GatherSampleEvidence) | ||
- Caller VCFs (GatherSampleEvidence) | ||
- Contig ploidy model and gCNV model files (gCNV training) | ||
stateDiagram | ||
direction LR | ||
|
||
classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8 | ||
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white | ||
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d | ||
|
||
### Outputs | ||
gse: GatherSampleEvidence | ||
eqc: EvidenceQC | ||
gcnv: TrainGCNV | ||
gbe: GatherBatchEvidence | ||
cbe: ClusterBatch | ||
gse --> gbe | ||
eqc --> gbe | ||
gcnv --> gbe | ||
gbe --> cbe | ||
|
||
class gbe thisModule | ||
class gse, eqc, gcnv inModules | ||
class cbe outModules | ||
``` | ||
|
||
## Inputs | ||
This workflow takes as input the read counts, BAF, PE, SD, SR, and per-caller VCF files | ||
produced in the GatherSampleEvidence workflow, and contig ploidy and gCNV models from | ||
the TrainGCNV workflow. | ||
The GatherBatchEvidence workflow takes the following inputs. | ||
|
||
|
||
#### `batch` | ||
Review comment: Is the plan to have detailed documentation for every input like this? Is that necessary? Maybe it could be collapsible so it's more approachable for users who do not need that level of detail? Most users will just use the pre-configured default inputs and will only need detailed documentation on the pipeline-level inputs and outputs, and I wouldn't want to make it more difficult for them to navigate the documentation. One other thing to consider is there are places where we do want users to be able to edit inputs as necessary, and I wouldn't want those inputs to get lost among the others - a separate category that does not collapse maybe?
Reply: The plan is to document every required input of these modules. We have discussed a few options for those required inputs that do not have values set on Terra, or set values need to be adjusted, or set values need tweaking for cohort-to-cohort, etc. One of the options is tagging/labeling such inputs (similar to labeling optional/conditional outputs) and we can think of other alternatives. However, that is beyond the scope of this PR as here we are just documenting all the required (at least leaving a placeholder for them), and we will revisit their spotlighting later. |
||
An identifier for the batch. | ||
|
||
|
||
#### `samples` | ||
Sets the list of sample IDs. | ||
|
||
|
||
#### `counts` | ||
Set to the [`GatherSampleEvidence.coverage_counts`](./gse#coverage-counts) output. | ||
|
||
|
||
#### Raw calls | ||
|
||
The following inputs set the per-caller raw SV calls, and should be set | ||
if the caller was run in the [`GatherSampleEvidence`](./gse) workflow. | ||
You may set each of the following inputs to the linked output from | ||
the GatherSampleEvidence workflow. | ||
|
||
|
||
- `manta_vcfs`: [`GatherSampleEvidence.manta_vcf`](./gse#manta-vcf); | ||
- `melt_vcfs`: [`GatherSampleEvidence.melt_vcf`](./gse#melt-vcf); | ||
- `scramble_vcfs`: [`GatherSampleEvidence.scramble_vcf`](./gse#scramble-vcf); | ||
- `wham_vcfs`: [`GatherSampleEvidence.wham_vcf`](./gse#wham-vcf). | ||
|
||
#### `PE_files` | ||
Set to the [`GatherSampleEvidence.pesr_disc`](./gse#pesr-disc) output. | ||
|
||
#### `SR_files` | ||
Set to the [`GatherSampleEvidence.pesr_split`](./gse#pesr-split) output. | ||
|
||
|
||
#### `SD_files` | ||
Set to the [`GatherSampleEvidence.pesr_sd`](./gse#pesr-sd) output. | ||
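
The mapping from per-sample GatherSampleEvidence outputs to the batch-level array inputs described above can be sketched as follows. This is a minimal illustration, not part of the pipeline: the function `build_batch_inputs`, the bucket paths, and the sample IDs are hypothetical, and the `GatherBatchEvidence.` key prefix assumes the usual WDL inputs-JSON convention of `WorkflowName.input_name`. A caller VCF array is emitted only when every sample has that caller's output, so all arrays stay in the same sample order.

```python
import json
from typing import Dict

# Batch input name -> per-sample GatherSampleEvidence output name (from the docs above).
EVIDENCE_INPUTS = {
    "counts": "coverage_counts",
    "PE_files": "pesr_disc",
    "SR_files": "pesr_split",
    "SD_files": "pesr_sd",
}
CALLER_INPUTS = {
    "manta_vcfs": "manta_vcf",
    "melt_vcfs": "melt_vcf",
    "scramble_vcfs": "scramble_vcf",
    "wham_vcfs": "wham_vcf",
}


def build_batch_inputs(batch: str, per_sample: Dict[str, Dict[str, str]]) -> Dict[str, object]:
    """Collect per-sample outputs into batch-level arrays (hypothetical helper)."""
    samples = sorted(per_sample)  # one fixed order shared by every array
    inputs: Dict[str, object] = {
        "GatherBatchEvidence.batch": batch,
        "GatherBatchEvidence.samples": samples,
    }
    for batch_input, sample_output in EVIDENCE_INPUTS.items():
        inputs[f"GatherBatchEvidence.{batch_input}"] = [
            per_sample[s][sample_output] for s in samples
        ]
    for batch_input, sample_output in CALLER_INPUTS.items():
        # Set a caller input only if that caller was run for every sample.
        if all(sample_output in per_sample[s] for s in samples):
            inputs[f"GatherBatchEvidence.{batch_input}"] = [
                per_sample[s][sample_output] for s in samples
            ]
    return inputs


# Hypothetical paths, for illustration only.
example_outputs = {
    sample: {
        "coverage_counts": f"gs://hypothetical-bucket/{sample}.counts.tsv.gz",
        "pesr_disc": f"gs://hypothetical-bucket/{sample}.pe.txt.gz",
        "pesr_split": f"gs://hypothetical-bucket/{sample}.sr.txt.gz",
        "pesr_sd": f"gs://hypothetical-bucket/{sample}.sd.txt.gz",
        "manta_vcf": f"gs://hypothetical-bucket/{sample}.manta.vcf.gz",
    }
    for sample in ("sample_b", "sample_a")
}
batch_inputs = build_batch_inputs("batch_1", example_outputs)
print(json.dumps(batch_inputs, indent=2))
```

Because only Manta outputs are present in this example, the sketch leaves `melt_vcfs`, `scramble_vcfs`, and `wham_vcfs` unset.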
|
||
|
||
#### `matrix_qc_distance` | ||
You may set it to `1000000`. | ||
Review comment: I think providing specific input values on the website is not what we want to do, as we don't want to have to update the website every time we update the inputs. I think we should refer users to the JSON templates. I also think "You may set it to" is pretty ambiguous - probably we want to refer to these as "recommended input values" or "default settings" or similar.
Reply: I changed it to reference the external files we have. The text is a placeholder, and we should document what it does and what its impact is.
Reply: I guess I prefer to finalize the text while the PR is open. The edits in the PR allow for convenient discussion of these particular lines. And in my experience TODOs often get lost once they're merged!
Reply: The external reference I mentioned above is my best resource for this; if you have other information on it, happy to extend it. |
||
|
||
|
||
#### `min_svsize` | ||
Sets the minimum size of SVs to include. | ||
You may set it to `50`. | ||
|
||
|
||
|
||
#### `ped_file` | ||
|
||
A pedigree file describing the familial relationships between the samples in the cohort. | ||
The file needs to be in the | ||
[PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format). | ||
The file should be updated with [EvidenceQC](./eqc) sex assignments, including | ||
`sex = 0` for sex aneuploidies. Calls will not be made on sex chromosomes | ||
when `sex = 0` in order to avoid generating many confusing calls | ||
or upsetting normalized copy numbers for the batch. | ||
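
As a quick check before launching the workflow, the samples that will be skipped on sex chromosomes can be listed by reading the sex column of the PED file. The sketch below is illustrative only and assumes the standard six-column, whitespace-delimited PED layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype); the function name and the example records are hypothetical.

```python
from typing import List


def samples_with_sex_zero(ped_text: str) -> List[str]:
    """Return the individual IDs whose sex column is 0 (hypothetical helper)."""
    flagged = []
    for line_number, line in enumerate(ped_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"line {line_number}: expected 6 columns, got {len(fields)}")
        sample_id, sex = fields[1], fields[4]
        if sex == "0":
            flagged.append(sample_id)
    return flagged


# Hypothetical PED content, for illustration only.
example_ped = """\
FAM1 S1 0 0 1 0
FAM1 S2 0 0 2 0
FAM2 S3 0 0 0 0
"""
no_sex_chrom_calls = samples_with_sex_zero(example_ped)
print(no_sex_chrom_calls)
```

In this example only `S3` has `sex = 0`, so it is the only sample for which calls on the sex chromosomes would be omitted.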
|
||
|
||
|
||
#### `run_matrix_qc` | ||
Enables or disables running optional QC tasks. | ||
|
||
|
||
#### `gcnv_qs_cutoff` | ||
You may set the value of this input to `30`. | ||
|
||
#### cn.MOPS files | ||
The workflow needs the following cn.MOPS files. | ||
|
||
- `cnmops_chrom_file` and `cnmops_allo_file`: FASTA index files (`.fai`) for the non-sex chromosomes (autosomes) and for chromosomes X and Y (allosomes), respectively. | ||
|
||
The content of the files may look like the following example, | ||
and the format is explained [on this page](https://www.htslib.org/doc/faidx.html). | ||
|
||
```bash | ||
|
||
chrX 156040895 2903754205 100 101 | ||
chrY 57227415 3061355656 100 101 | ||
``` | ||
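
To illustrate how such an index is structured, the following sketch reads the first two columns of a `.fai` file (contig name and length) and separates allosomes from autosomes. It is a minimal example under stated assumptions, not how the pipeline produces `autosome.fai` and `allosome.fai`: the function names are hypothetical, and the allosome names `chrX` and `chrY` are taken from the example above.

```python
from typing import Dict, Iterable, Tuple


def parse_fai(text: str) -> Dict[str, int]:
    """Map contig name -> contig length using the first two .fai columns."""
    lengths = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        lengths[fields[0]] = int(fields[1])
    return lengths


def split_contigs(
    lengths: Dict[str, int], allosomes: Iterable[str] = ("chrX", "chrY")
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split contigs into (autosomes, allosomes) by name."""
    allosome_names = set(allosomes)
    auto = {name: size for name, size in lengths.items() if name not in allosome_names}
    allo = {name: size for name, size in lengths.items() if name in allosome_names}
    return auto, allo


# The two records shown in the example above.
example_fai = """\
chrX 156040895 2903754205 100 101
chrY 57227415 3061355656 100 101
"""
contig_lengths = parse_fai(example_fai)
autosome_lengths, allosome_lengths = split_contigs(contig_lengths)
print(allosome_lengths)
```

The remaining three columns (byte offset, bases per line, and bytes per line) are ignored here because only contig names and lengths are needed to tell the two groups apart.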
|
||
You may use the following files for these fields: | ||
|
||
```json | ||
"cnmops_chrom_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/autosome.fai" | ||
Review comment: Echoing Mark's comments about not giving specific file paths in this documentation
Reply: What is a good reference for these?
Reply: I think it makes most sense to direct users to the JSONs in
Reply: I updated it to link to a specific file in the resources JSON; that is not much better than this, but we have ongoing internal discussions on how best to address such inputs. We should not point to the |
||
"cnmops_allo_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/allosome.fai" | ||
``` | ||
|
||
- `cnmops_exclude_list`: You may use the following file for this field. | ||
``` | ||
gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/GRCh38_Nmask.bed | ||
``` | ||
|
||
#### GATK-gCNV inputs | ||
|
||
The following inputs are configured based on the outputs generated in the [`TrainGCNV`](./gcnv) workflow. | ||
|
||
- `contig_ploidy_model_tar`: [`TrainGCNV.cohort_contig_ploidy_model_tar`](./gcnv#contig-ploidy-model-tarball) | ||
- `gcnv_model_tars`: [`TrainGCNV.cohort_gcnv_model_tars`](./gcnv#model-tarballs) | ||
|
||
|
||
The workflow also exposes a few optional gCNV arguments. | ||
The arguments and their default values are listed below, | ||
and each argument is documented on | ||
[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360037593411-PostprocessGermlineCNVCalls) | ||
and | ||
[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360047217671-GermlineCNVCaller). | ||
|
||
```json | ||
"gcnv_caller_internal_admixing_rate": 0.5, | ||
"gcnv_caller_update_convergence_threshold": 0.000001, | ||
"gcnv_cnv_coherence_length": 1000, | ||
"gcnv_convergence_snr_averaging_window": 100, | ||
"gcnv_convergence_snr_countdown_window": 10, | ||
"gcnv_convergence_snr_trigger_threshold": 0.2, | ||
"gcnv_copy_number_posterior_expectation_mode": "EXACT", | ||
"gcnv_depth_correction_tau": 10000, | ||
"gcnv_learning_rate": 0.03, | ||
"gcnv_log_emission_sampling_median_rel_error": 0.001, | ||
"gcnv_log_emission_sampling_rounds": 20, | ||
"gcnv_max_advi_iter_first_epoch": 1000, | ||
"gcnv_max_advi_iter_subsequent_epochs": 200, | ||
"gcnv_max_training_epochs": 5, | ||
"gcnv_min_training_epochs": 1, | ||
"gcnv_num_thermal_advi_iters": 250, | ||
"gcnv_p_alt": 0.000001, | ||
"gcnv_sample_psi_scale": 0.000001, | ||
"ref_copy_number_autosomal_contigs": 2 | ||
``` | ||
|
||
|
||
#### Docker images | ||
|
||
The workflow needs the following Docker images; links to their | ||
latest versions are listed in [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers.json). | ||
|
||
|
||
- `cnmops_docker`; | ||
- `condense_counts_docker`; | ||
- `linux_docker`; | ||
- `sv_base_docker`; | ||
- `sv_base_mini_docker`; | ||
- `sv_pipeline_docker`; | ||
- `sv_pipeline_qc_docker`; | ||
- `gcnv_gatk_docker`; | ||
- `gatk_docker`. | ||
|
||
#### Static inputs | ||
|
||
You may refer to [this reference file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json) | ||
|
||
for values of the following inputs. | ||
|
||
- `primary_contigs_fai`; | ||
- `cytoband`; | ||
- `ref_dict`; | ||
- `mei_bed`; | ||
- `genome_file`; | ||
- `sd_locs_vcf`. | ||
|
||
|
||
#### Optional Inputs | ||
The following are a few of the workflow's optional inputs, | ||
each shown with an example value. | ||
|
||
- `"allosomal_contigs": [["chrX", "chrY"]]` | ||
- `"ploidy_sample_psi_scale": 0.001` | ||
|
||
|
||
|
||
|
||
|
||
## Outputs | ||
|
||
- Combined read count matrix, SR, PE, and BAF files | ||
|
||
- Standardized call VCFs | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,20 +6,78 @@ slug: gse | |
--- | ||
|
||
Runs raw evidence collection on each sample with the following SV callers: | ||
Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, | ||
Manta, Wham, Scramble, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, | ||
refer to the Sample Exclusion section. | ||
|
||
Note: a list of sample IDs must be provided. Refer to the sample ID | ||
requirements for specifications of allowable sample IDs. | ||
The downstream dependencies of the GatherSampleEvidence workflow | ||
are illustrated in the following diagram. | ||
|
||
```mermaid | ||
|
||
stateDiagram | ||
direction LR | ||
|
||
classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8 | ||
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white | ||
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d | ||
|
||
gse: GatherSampleEvidence | ||
eqc: EvidenceQC | ||
gcnv: TrainGCNV | ||
gbe: GatherBatchEvidence | ||
gse --> eqc | ||
gse --> gcnv | ||
gse --> gbe | ||
|
||
|
||
class gse thisModule | ||
class eqc, gcnv, gbe outModules | ||
``` | ||
|
||
|
||
## Inputs | ||
|
||
|
||
#### `bam_or_cram_file` | ||
A BAM or CRAM file aligned to hg38. An index file (`.bai`) must be provided when using a BAM file. | ||
|
||
#### `sample_id` | ||
Refer to the [sample ID requirements](/docs/gs/inputs#sampleids) for specifications of allowable sample IDs. | ||
IDs that do not meet these requirements may cause errors. | ||
|
||
### Inputs | ||
#### `preprocessed_intervals` | ||
Picard interval list. | ||
|
||
#### `sd_locs_vcf` | ||
(`sd`: site depth) | ||
A VCF file containing allele counts at common SNP loci of the genome, which is used for calculating BAF. | ||
For the human genome, you may use [`dbSNP`](https://www.ncbi.nlm.nih.gov/snp/), | ||
which contains a complete list of common and clinical human single nucleotide variations, | ||
microsatellites, and small-scale insertions and deletions. | ||
You may find a link to the file in | ||
[this reference](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json). | ||
|
||
- Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs. | ||
|
||
### Outputs | ||
## Outputs | ||
|
||
- Caller VCFs (Manta, MELT, and/or Wham) | ||
- Binned read counts file | ||
- Split reads (SR) file | ||
- Discordant read pairs (PE) file | ||
|
||
#### `manta_vcf` {#manta-vcf} | ||
A VCF file containing variants called by Manta. | ||
|
||
#### `melt_vcf` {#melt-vcf} | ||
A VCF file containing variants called by MELT. | ||
|
||
#### `scramble_vcf` {#scramble-vcf} | ||
A VCF file containing variants called by Scramble. | ||
|
||
#### `wham_vcf` {#wham-vcf} | ||
A VCF file containing variants called by Wham. | ||
|
||
#### `coverage_counts` {#coverage-counts} | ||
Review comment: Are there supposed to be descriptions here? Feels inconsistent with the other sections
Reply: I don't have a description of these. We discussed leaving them as placeholders to make sure we will populate them. If you have a description, feel free to suggest one. |
||
|
||
#### `pesr_disc` {#pesr-disc} | ||
|
||
#### `pesr_split` {#pesr-split} | ||
|
||
#### `pesr_sd` {#pesr-sd} |
Review comment: This is related to the existing thread on displaying dependencies. None of the outputs of EvidenceQC are used in GatherBatchEvidence (or TrainGCNV technically). EvidenceQC is recommended to use to create batches for TrainGCNV (and the following steps) but if that's what you wanted to represent I would just exclude GatherBatchEvidence from this diagram since it follows TrainGCNV
Reply: This is resolved in the updated diagrams; please recheck.