Reviewer comments

mwalker174 committed Oct 25, 2024
1 parent 9658456 commit 6e358fb
Showing 22 changed files with 116 additions and 97 deletions.
5 changes: 4 additions & 1 deletion website/docs/best_practices.md
@@ -12,4 +12,7 @@ documentation found here.

Users should also review the [Getting Started](/docs/gs/overview) section before attempting to perform SV calling.

Recommendations for assessing the quality of completed call sets can be found on the [MainVcfQc module page](/docs/modules/mvqc).
The following sections also contain recommendations pertaining to data and call set QC:

- Preliminary sample QC in the [EvidenceQc module](/docs/modules/eqc#preliminary-sample-qc).
- Assessment of completed call sets can be found on the [MainVcfQc module page](/docs/modules/mvqc).
12 changes: 9 additions & 3 deletions website/docs/execution/joint.md
@@ -76,14 +76,20 @@ cutoff for outlier filtration in `08-FilterBatchSamples`
13. `13-ResolveComplexVariants`: Complex variant resolution
14. `14-GenotypeComplexVariants`: Complex variant re-genotyping
15. `15-CleanVcf`: VCF cleanup
16. `16-MainVcfQc`: Generates VCF QC reports
17. `17-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and
16. `16-RefineComplexVariants`: Complex variant filtering and refinement
17. `17-ApplyManualVariantFilter`: Hard filtering high-FP SV classes
18. `18-JoinRawCalls`: Raw call aggregation
19. `19-SVConcordance`: Annotate genotype concordance with raw calls
20. `20-FilterGenotypes`: Genotype filtering
21. `21-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and
AF annotation with external population callsets

Extra workflows (not part of the canonical pipeline, but included for convenience; may require manual configuration):
* `PlotSVCountsPerSample: Plot SV counts per sample per SV type
* `MainVcfQc`: Generate detailed call set QC plots
* `PlotSVCountsPerSample`: Plot SV counts per sample per SV type
* `FilterOutlierSamples`: Filter outlier samples (in terms of SV counts) from a single VCF. Recommended to run
  `PlotSVCountsPerSample` beforehand (configured with the single VCF you want to filter) to enable IQR cutoff choice.
* `VisualizeCnvs`: Plot multi-sample depth profiles for CNVs

For detailed instructions on running the pipeline in Terra, see [workflow instructions](#instructions) below.

4 changes: 2 additions & 2 deletions website/docs/execution/single.md
@@ -8,8 +8,8 @@ slug: single
## Introduction

**Extending SV detection to small datasets**
The Single Sample pipeline
is designed to facilitate running the methods developed for the cohort-mode GATK-SV pipeline on small data sets or in

The Single Sample pipeline is designed to facilitate running the methods developed for the cohort-mode GATK-SV pipeline on small data sets or in
clinical contexts where batching large numbers of samples is not an option. To do so, it uses precomputed data, SV calls,
and model parameters computed by the cohort pipeline on a reference panel composed of similar samples. The pipeline integrates this
precomputed information with signals extracted from the input CRAM file to produce a call set similar in quality to results
2 changes: 1 addition & 1 deletion website/docs/gs/calling_modes.md
@@ -19,7 +19,7 @@ use cases:
- Studies with rolling data delivery, i.e. in small batches over time

Users should also consider that the single-sample mode is provided as a single workflow and is therefore considerably
simpler to run than joint calling . However, it also has higher compute costs on a per-sample basis and will not be as sensitive
simpler to run than joint calling. However, it also has higher compute costs on a per-sample basis and will not be as sensitive
as joint calling with larger cohorts.

## Joint calling mode
6 changes: 2 additions & 4 deletions website/docs/gs/dockers.md
@@ -8,7 +8,7 @@ GATK-SV utilizes a set of [Docker](https://www.docker.com/) images for execution

### Publishing and availability

Dockers are automatically build and pushed to the `us.gcr.io/broad-dsde-methods/gatk-sv` repository under two different conditions:
Dockers are automatically built and pushed to the `us.gcr.io/broad-dsde-methods/gatk-sv` repository under two different conditions:
1. **Release**: upon releasing a new version of GATK-SV. These Dockers are made permanently available.
2. **Commit**: upon merging a new commit to the development branch. These Dockers are ephemeral and may be periodically
deleted. Any users needing to preserve access to these Docker images should copy them to their own repository. Also
@@ -41,9 +41,7 @@ Failure to localize Docker images to your region will incur significant egress c
### Versioning

All Docker images are tagged with a date and version number that must be run with the corresponding version of the
WDLs. For example, `sv-pipeline:2024-09-25-v0.29-beta-f064b2d7` must be run with the WDLs from the
[v0.29-beta release](https://github.com/broadinstitute/gatk-sv/releases/tag/v0.29-beta). Conversely, the Docker images
built with a particular version can be determined from the `dockers.json` file by checking out
WDLs. The Docker images built with a particular version can be determined by checking out
the commit or release of interest and examining `dockers.json`, e.g.
[v0.29-beta](https://github.com/broadinstitute/gatk-sv/blob/v0.29-beta/inputs/values/dockers.json).
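
As an illustration, the tag-to-release correspondence can be read programmatically from `dockers.json`. The key name `sv_pipeline_docker` below is a hypothetical example of the file's layout; confirm the actual keys in the repository:

```python
import json

# Hypothetical excerpt of inputs/values/dockers.json at a given release;
# the key name "sv_pipeline_docker" is illustrative only.
dockers_json = """
{
  "sv_pipeline_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2024-09-25-v0.29-beta-f064b2d7"
}
"""
dockers = json.loads(dockers_json)

# The tag (after the colon) encodes the date and release version that the WDLs must match
image = dockers["sv_pipeline_docker"]
tag = image.rsplit(":", 1)[1]
print(tag)
```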

2 changes: 1 addition & 1 deletion website/docs/gs/input_files.md
@@ -69,4 +69,4 @@ The PED file format is described [here](https://gatk.broadinstitute.org/hc/en-us
* All family, individual, and parental IDs must conform to the [sample ID requirements](/docs/gs/inputs#sampleids).
* Missing parental IDs should be entered as 0.
* Header lines are allowed if they begin with a # character.
To validate the PED file, you may use `src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list`.
* To validate the PED file, you may use `src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list`.
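
For reference, a minimal pedigree following these conventions might look like the sketch below (family and sample IDs are hypothetical; sex is encoded as 1=male, 2=female, missing parents as 0):

```text
#FAM_ID  IND_ID     FATHER_ID  MOTHER_ID  SEX  PHENOTYPE
fam_01   sample_01  sample_02  sample_03  1    0
fam_01   sample_02  0          0          1    0
fam_01   sample_03  0          0          2    0
```
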
5 changes: 3 additions & 2 deletions website/docs/modules/annotate_vcf.md
@@ -72,7 +72,7 @@ Family structures and sex assignments determined in [EvidenceQC](./eqc). See [PE
If provided, sex-specific allele frequencies will be annotated.

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `par_bed`
Psuedo-autosomal region (PAR) bed file. If provided, variants overlapping PARs will be annotated with the `PAR` field.
Pseudo-autosomal region (PAR) bed file. If provided, variants overlapping PARs will be annotated with the `PAR` field.

#### `sv_per_shard`
Shard sized for parallel processing. Decreasing this may help if the workflow is running too slowly.
@@ -85,7 +85,8 @@ from the reference population.
External `AF` annotation prefix. Required if providing [external_af_ref_bed](#optional-external_af_ref_bed).

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `external_af_population`
External population names, e.g. "ALL", "AFR", "AMR", "EAS", "EUR". Required if providing [external_af_ref_bed](#optional-external_af_ref_bed).
Population names in the external SV reference set, e.g. "ALL", "AFR", "AMR", "EAS", "EUR". Required if providing
[external_af_ref_bed](#optional-external_af_ref_bed) and must match the populations in the bed file.

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `use_hail`
Default: `false`. Use Hail for VCF concatenation. This should only be used for projects with over 50k samples. If enabled, the
2 changes: 1 addition & 1 deletion website/docs/modules/clean_vcf.md
@@ -44,7 +44,7 @@ stateDiagram
### Inputs

#### `cohort_name`
Cohort name. May be alphanumeric with underscores.
Cohort name. The guidelines outlined in the [sample ID requirements](/docs/gs/inputs#sampleids) section apply here.

#### `complex_genotype_vcfs`
Array of contig-sharded VCFs containing genotyped complex variants, generated in [GenotypeComplexVariants](./gcv#complex_genotype_vcfs).
7 changes: 5 additions & 2 deletions website/docs/modules/cluster_batch.md
@@ -30,6 +30,7 @@ stateDiagram
cb: ClusterBatch
gbm: GenerateBatchMetrics
jrc: JoinRawCalls
gbe --> cb
cb --> gbm
cb --> jrc
@@ -40,8 +41,10 @@ stateDiagram
class jrc outModules
```

Note that [GenerateBatchMetrics](./gbm) is the primary downstream module batch processing. [JoinRawCalls](./jrc) is
required for genotype filtering but does not need to be run until later in the pipeline.
:::note
[GenerateBatchMetrics](./gbm) is the primary downstream module in batch processing. [JoinRawCalls](./jrc) is
required for genotype filtering but does not need to be run until later in the pipeline.
:::

## Inputs

2 changes: 1 addition & 1 deletion website/docs/modules/combine_batches.md
@@ -42,7 +42,7 @@ All array inputs of batch data must match in order. For example, the order of th
:::

#### `cohort_name`
Cohort name. May be alphanumeric with underscores.
Cohort name. The guidelines outlined in the [sample ID requirements](/docs/gs/inputs#sampleids) section apply here.

#### `batches`
Array of batch identifiers. Should match the name used in [GatherBatchEvidence](./gbe#batch). Order must match that of [depth_vcfs](#depth_vcfs).
66 changes: 33 additions & 33 deletions website/docs/modules/evidence_qc.md
@@ -44,6 +44,39 @@ stateDiagram
class batching outModules
```

### Preliminary Sample QC

The purpose of sample filtering at this stage after EvidenceQC is to
prevent very poor quality samples from interfering with the results for
the rest of the callset. In general, samples that are borderline are
okay to leave in, but you should choose filtering thresholds to suit
the needs of your cohort and study. There will be future opportunities
(as part of [FilterBatch](/docs/modules/fb)) for filtering before the joint genotyping
stage if necessary. Here are a few of the basic QC checks that we recommend:

- Chromosome X and Y ploidy plots: check that sex assignments
match your expectations. If there are discrepancies, check for
sample swaps and update your PED file before proceeding.

- Whole-genome dosage score (WGD): examine distribution and check that
it is centered around 0 (the distribution of WGD for PCR-
samples is expected to be slightly lower than 0, and the distribution
of WGD for PCR+ samples is expected to be slightly greater than 0.
Refer to the gnomAD-SV paper for more information on WGD score).
Optionally filter outliers.

- Low outliers for each SV caller: these are samples with
much lower than typical numbers of SV calls per contig for
each caller. An empty low outlier file means there were
no outliers below the median and no filtering is necessary.
Check that no samples had zero calls.

- High outliers for each SV caller: optionally
filter outliers; samples with many more SV calls than average may be poor quality.

- Remove samples with autosomal aneuploidies based on
the per-batch binned coverage plots of each chromosome.
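
The WGD check above can be sketched as follows; the scores and the flagging window are illustrative placeholders, not pipeline defaults:

```python
import statistics

# Hypothetical WGD scores per sample; values are illustrative only
wgd = {"sample_01": -0.02, "sample_02": 0.01, "sample_03": -0.35, "sample_04": 0.03}

# The cohort median should sit near 0 (slightly below for PCR- data,
# slightly above for PCR+ data)
cohort_median = statistics.median(wgd.values())

# Flag samples far from 0 as candidate dosage outliers; the window here
# is a placeholder choice, not a recommended threshold
outliers = sorted(s for s, score in wgd.items() if abs(score) > 0.2)
```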


### Inputs

@@ -91,36 +124,3 @@ Outlier samples detected by call counts.

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `qc_table`
QC summary table. Enable with [run_ploidy](#optional-run_ploidy).

## Preliminary Sample QC

The purpose of sample filtering at this stage after EvidenceQC is to
prevent very poor quality samples from interfering with the results for
the rest of the callset. In general, samples that are borderline are
okay to leave in, but you should choose filtering thresholds to suit
the needs of your cohort and study. There will be future opportunities
(as part of FilterBatch) for filtering before the joint genotyping
stage if necessary. Here are a few of the basic QC checks that we recommend:

- Look at the X and Y ploidy plots, and check that sex assignments
match your expectations. If there are discrepancies, check for
sample swaps and update your PED file before proceeding.

- Look at the dosage score (WGD) distribution and check that
it is centered around 0 (the distribution of WGD for PCR-
samples is expected to be slightly lower than 0, and the distribution
of WGD for PCR+ samples is expected to be slightly greater than 0.
Refer to the gnomAD-SV paper for more information on WGD score).
Optionally filter outliers.

- Look at the low outliers for each SV caller (samples with
much lower than typical numbers of SV calls per contig for
each caller). An empty low outlier file means there were
no outliers below the median and no filtering is necessary.
Check that no samples had zero calls.

- Look at the high outliers for each SV caller and optionally
filter outliers; samples with many more SV calls than average may be poor quality.

- Remove samples with autosomal aneuploidies based on
the per-batch binned coverage plots of each chromosome.
4 changes: 2 additions & 2 deletions website/docs/modules/filter_batch.md
@@ -64,8 +64,8 @@ Common variant metrics table [GenerateBatchMetrics](./gbm#metrics_common)

#### `outlier_cutoff_nIQR`
Defines outlier sample cutoffs based on variant counts. Samples deviating from the batch median count by more than
the given multiple of the interquartile range are hard filtered from the VCF. Recommended range is between 3 and 9
depending on desired sensitivity (higher is less stringent), or disable with 999.
the given multiple of the interquartile range are hard filtered from the VCF. Recommended range is between `3` and `9`
depending on desired sensitivity (higher is less stringent), or disable with `10000`.
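
The cutoff logic can be sketched as below; the per-sample counts are illustrative values, not real data:

```python
import statistics

# Hypothetical per-sample SV counts for one batch (illustrative values)
counts = [980, 1010, 1025, 1040, 1055, 1070, 1100, 2600]

q1, median, q3 = statistics.quantiles(counts, n=4)  # quartile cut points
iqr = q3 - q1
n_iqr = 6  # an example choice within the recommended 3-9 range

# Samples deviating from the batch median by more than n_iqr * IQR are filtered
low, high = median - n_iqr * iqr, median + n_iqr * iqr
outlier_counts = [c for c in counts if not (low <= c <= high)]
```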

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `outlier_cutoff_table`
A cutoff table to set permissible nIQR ranges for each SVTYPE. If provided, overrides `outlier_cutoff_nIQR`. Expected
12 changes: 7 additions & 5 deletions website/docs/modules/filter_genotypes.md
@@ -47,7 +47,7 @@ The model uses the following features:
* Genotype properties:
* Non-reference and no-call allele counts
* Genotype quality (`GQ`)
* Supporting evidence types (EV) and respective genotype qualities (`PE_GQ`, `SR_GQ`, `RD_GQ`)
* Supporting evidence types (`EV`) and respective genotype qualities (`PE_GQ`, `SR_GQ`, `RD_GQ`)
* Raw call concordance (`CONC_ST`)
* Variant properties:
* Variant type (`SVTYPE`) and size (`SVLEN`)
Expand All @@ -69,7 +69,7 @@ For ease of use, we provide a model pre-trained on high-quality data with truth
```
gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/gatk-sv-recalibrator.aou_phase_1.v1.model
```
See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7) for further details on model training.
See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7) for further details on model training. The generation and release of this model was made possible by the All of Us program (see [here](/docs/acknowledgements)).

### SL scores

@@ -88,13 +88,15 @@ This workflow can be run in one of two modes:
Genotypes with `SL` scores less than the cutoffs are set to no-call (`./.`). The above values were taken directly from Appendix N of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7 ](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7). Users should adjust the thresholds depending on data quality and desired accuracy. Please see the arguments in [this script](https://github.com/broadinstitute/gatk-sv/blob/main/src/sv-pipeline/scripts/apply_sl_filter.py) for all available options.
2. (Advanced) The user provides truth labels for a subset of non-reference calls, and `SL` cutoffs are automatically optimized. These truth labels should be provided as a json file in the following format:
```
```json
{
"sample_1": {
"sample_1":
{
"good_variant_ids": ["variant_1", "variant_3"],
"bad_variant_ids": ["variant_5", "variant_10"]
},
"sample_2": {
"sample_2":
{
"good_variant_ids": ["variant_2", "variant_13"],
"bad_variant_ids": ["variant_8", "variant_11"]
}
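
The mode-1 thresholding can be sketched as follows. The per-SVTYPE cutoff values here are placeholders rather than the Appendix N values, and the bundled `apply_sl_filter.py` script remains the authoritative implementation:

```python
from typing import Optional

# Hypothetical per-SVTYPE SL cutoffs (placeholder values, not the published ones)
SL_CUTOFFS = {"DEL": -46.0, "DUP": -53.0, "INS": -13.0}

def filter_genotype(gt: str, sl: Optional[float], svtype: str) -> str:
    """Set a genotype to no-call (./.) when its SL score falls below the SVTYPE cutoff."""
    cutoff = SL_CUTOFFS.get(svtype)
    if cutoff is not None and sl is not None and sl < cutoff:
        return "./."
    return gt
```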
