Reviewer comments

mwalker174 committed Oct 25, 2024
1 parent 9658456 commit 6e358fb
Showing 22 changed files with 116 additions and 97 deletions.
5 changes: 4 additions & 1 deletion website/docs/best_practices.md
@@ -12,4 +12,7 @@ documentation found here.

Users should also review the [Getting Started](/docs/gs/overview) section before attempting to perform SV calling.

Recommendations for assessing the quality of completed call sets can be found on the [MainVcfQc module page](/docs/modules/mvqc).
The following sections also contain recommendations pertaining to data and call set QC:

- Preliminary sample QC in the [EvidenceQc module](/docs/modules/eqc#preliminary-sample-qc).
- Assessment of completed call sets can be found on the [MainVcfQc module page](/docs/modules/mvqc).
12 changes: 9 additions & 3 deletions website/docs/execution/joint.md
@@ -76,14 +76,20 @@ cutoff for outlier filtration in `08-FilterBatchSamples`
13. `13-ResolveComplexVariants`: Complex variant resolution
14. `14-GenotypeComplexVariants`: Complex variant re-genotyping
15. `15-CleanVcf`: VCF cleanup
16. `16-MainVcfQc`: Generates VCF QC reports
17. `17-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and
16. `16-RefineComplexVariants`: Complex variant filtering and refinement
17. `17-ApplyManualVariantFilter`: Hard filtering high-FP SV classes
18. `18-JoinRawCalls`: Raw call aggregation
19. `19-SVConcordance`: Annotate genotype concordance with raw calls
20. `20-FilterGenotypes`: Genotype filtering
21. `21-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and
AF annotation with external population callsets

Extra workflows (not part of the canonical pipeline, but included for convenience; may require manual configuration):
* `PlotSVCountsPerSample: Plot SV counts per sample per SV type
* `MainVcfQc`: Generate detailed call set QC plots
* `PlotSVCountsPerSample`: Plot SV counts per sample per SV type
* `FilterOutlierSamples`: Filter outlier samples (in terms of SV counts) from a single VCF. Recommended to run
  `PlotSVCountsPerSample` beforehand (configured with the single VCF you want to filter) to enable IQR cutoff choice.
* `VisualizeCnvs`: Plot multi-sample depth profiles for CNVs

For detailed instructions on running the pipeline in Terra, see [workflow instructions](#instructions) below.

4 changes: 2 additions & 2 deletions website/docs/execution/single.md
@@ -8,8 +8,8 @@ slug: single
## Introduction

**Extending SV detection to small datasets**
The Single Sample pipeline
is designed to facilitate running the methods developed for the cohort-mode GATK-SV pipeline on small data sets or in

The Single Sample pipeline is designed to facilitate running the methods developed for the cohort-mode GATK-SV pipeline on small data sets or in
clinical contexts where batching large numbers of samples is not an option. To do so, it uses precomputed data, SV calls,
and model parameters computed by the cohort pipeline on a reference panel composed of similar samples. The pipeline integrates this
precomputed information with signals extracted from the input CRAM file to produce a call set similar in quality to results
2 changes: 1 addition & 1 deletion website/docs/gs/calling_modes.md
@@ -19,7 +19,7 @@ use cases:
- Studies with rolling data delivery, i.e. in small batches over time

Users should also consider that the single-sample mode is provided as a single workflow and is therefore considerably
simpler to run than joint calling . However, it also has higher compute costs on a per-sample basis and will not be as sensitive
simpler to run than joint calling. However, it also has higher compute costs on a per-sample basis and will not be as sensitive
as joint calling with larger cohorts.

## Joint calling mode
6 changes: 2 additions & 4 deletions website/docs/gs/dockers.md
@@ -8,7 +8,7 @@ GATK-SV utilizes a set of [Docker](https://www.docker.com/) images for execution

### Publishing and availability

Dockers are automatically build and pushed to the `us.gcr.io/broad-dsde-methods/gatk-sv` repository under two different conditions:
Dockers are automatically built and pushed to the `us.gcr.io/broad-dsde-methods/gatk-sv` repository under two different conditions:
1. **Release**: upon releasing a new version of GATK-SV. These Dockers are made permanently available.
2. **Commit**: upon merging a new commit to the development branch. These Dockers are ephemeral and may be periodically
deleted. Any users needing to preserve access to these Docker images should copy them to their own repository. Also
@@ -41,9 +41,7 @@ Failure to localize Docker images to your region will incur significant egress c
### Versioning

All Docker images are tagged with a date and version number that must be run with the corresponding version of the
WDLs. For example, `sv-pipeline:2024-09-25-v0.29-beta-f064b2d7` must be run with the WDLs from the
[v0.29-beta release](https://github.com/broadinstitute/gatk-sv/releases/tag/v0.29-beta). Conversely, the Docker images
built with a particular version can be determined from the `dockers.json` file by checking out
WDLs. The Docker images built with a particular version can be determined by checking out
the commit or release of interest and examining `dockers.json`, e.g.
[v0.29-beta](https://github.com/broadinstitute/gatk-sv/blob/v0.29-beta/inputs/values/dockers.json).
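
As an illustration, the tag-to-release correspondence can be read programmatically from `dockers.json`. The key name `sv_pipeline_docker` below is a hypothetical example of the file's layout; confirm the actual keys in the repository:

```python
import json

# Hypothetical excerpt of inputs/values/dockers.json at a given release;
# the key name "sv_pipeline_docker" is illustrative only.
dockers_json = """
{
  "sv_pipeline_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2024-09-25-v0.29-beta-f064b2d7"
}
"""
dockers = json.loads(dockers_json)

# The tag (after the colon) encodes the date and release version that the WDLs must match
image = dockers["sv_pipeline_docker"]
tag = image.rsplit(":", 1)[1]
print(tag)
```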

2 changes: 1 addition & 1 deletion website/docs/gs/input_files.md
@@ -69,4 +69,4 @@ The PED file format is described [here](https://gatk.broadinstitute.org/hc/en-us
* All family, individual, and parental IDs must conform to the [sample ID requirements](/docs/gs/inputs#sampleids).
* Missing parental IDs should be entered as 0.
* Header lines are allowed if they begin with a # character.
To validate the PED file, you may use `src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list`.
* To validate the PED file, you may use `src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list`.
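
For reference, a minimal pedigree following these conventions might look like the sketch below (family and sample IDs are hypothetical; sex is encoded as 1=male, 2=female, missing parents as 0):

```text
#FAM_ID  IND_ID     FATHER_ID  MOTHER_ID  SEX  PHENOTYPE
fam_01   sample_01  sample_02  sample_03  1    0
fam_01   sample_02  0          0          1    0
fam_01   sample_03  0          0          2    0
```
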
5 changes: 3 additions & 2 deletions website/docs/modules/annotate_vcf.md
@@ -72,7 +72,7 @@ Family structures and sex assignments determined in [EvidenceQC](./eqc). See [PE
If provided, sex-specific allele frequencies will be annotated.

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `par_bed`
Psuedo-autosomal region (PAR) bed file. If provided, variants overlapping PARs will be annotated with the `PAR` field.
Pseudo-autosomal region (PAR) bed file. If provided, variants overlapping PARs will be annotated with the `PAR` field.

#### `sv_per_shard`
Shard sized for parallel processing. Decreasing this may help if the workflow is running too slowly.
@@ -85,7 +85,8 @@ from the reference population.
External `AF` annotation prefix. Required if providing [external_af_ref_bed](#optional-external_af_ref_bed).

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `external_af_population`
External population names, e.g. "ALL", "AFR", "AMR", "EAS", "EUR". Required if providing [external_af_ref_bed](#optional-external_af_ref_bed).
Population names in the external SV reference set, e.g. "ALL", "AFR", "AMR", "EAS", "EUR". Required if providing
[external_af_ref_bed](#optional-external_af_ref_bed) and must match the populations in the bed file.

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `use_hail`
Default: `false`. Use Hail for VCF concatenation. This should only be used for projects with over 50k samples. If enabled, the
2 changes: 1 addition & 1 deletion website/docs/modules/clean_vcf.md
@@ -44,7 +44,7 @@ stateDiagram
### Inputs

#### `cohort_name`
Cohort name. May be alphanumeric with underscores.
Cohort name. The guidelines outlined in the [sample ID requirements](/docs/gs/inputs#sampleids) section apply here.

#### `complex_genotype_vcfs`
Array of contig-sharded VCFs containing genotyped complex variants, generated in [GenotypeComplexVariants](./gcv#complex_genotype_vcfs).
7 changes: 5 additions & 2 deletions website/docs/modules/cluster_batch.md
@@ -30,6 +30,7 @@ stateDiagram
cb: ClusterBatch
gbm: GenerateBatchMetrics
jrc: JoinRawCalls
gbe --> cb
cb --> gbm
cb --> jrc
@@ -40,8 +41,10 @@ stateDiagram
class jrc outModules
```

Note that [GenerateBatchMetrics](./gbm) is the primary downstream module batch processing. [JoinRawCalls](./jrc) is
required for genotype filtering but does not need to be run until later in the pipeline.
:::note
[GenerateBatchMetrics](./gbm) is the primary downstream module in batch processing. [JoinRawCalls](./jrc) is
required for genotype filtering but does not need to be run until later in the pipeline.
:::

## Inputs

2 changes: 1 addition & 1 deletion website/docs/modules/combine_batches.md
@@ -42,7 +42,7 @@ All array inputs of batch data must match in order. For example, the order of th
:::

#### `cohort_name`
Cohort name. May be alphanumeric with underscores.
Cohort name. The guidelines outlined in the [sample ID requirements](/docs/gs/inputs#sampleids) section apply here.

#### `batches`
Array of batch identifiers. Should match the name used in [GatherBatchEvidence](./gbe#batch). Order must match that of [depth_vcfs](#depth_vcfs).
66 changes: 33 additions & 33 deletions website/docs/modules/evidence_qc.md
@@ -44,6 +44,39 @@ stateDiagram
class batching outModules
```

### Preliminary Sample QC

The purpose of sample filtering at this stage after EvidenceQC is to
prevent very poor quality samples from interfering with the results for
the rest of the callset. In general, samples that are borderline are
okay to leave in, but you should choose filtering thresholds to suit
the needs of your cohort and study. There will be future opportunities
(as part of [FilterBatch](/docs/modules/fb)) for filtering before the joint genotyping
stage if necessary. Here are a few of the basic QC checks that we recommend:

- Chromosome X and Y ploidy plots: check that sex assignments
match your expectations. If there are discrepancies, check for
sample swaps and update your PED file before proceeding.

- Whole-genome dosage score (WGD): examine distribution and check that
it is centered around 0 (the distribution of WGD for PCR-
samples is expected to be slightly lower than 0, and the distribution
of WGD for PCR+ samples is expected to be slightly greater than 0.
Refer to the gnomAD-SV paper for more information on WGD score).
Optionally filter outliers.

- Low outliers for each SV caller: these are samples with
much lower than typical numbers of SV calls per contig for
each caller. An empty low outlier file means there were
no outliers below the median and no filtering is necessary.
Check that no samples had zero calls.

- High outliers for each SV caller: optionally
filter outliers; samples with many more SV calls than average may be poor quality.

- Remove samples with autosomal aneuploidies based on
the per-batch binned coverage plots of each chromosome.
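
The WGD check above can be sketched as follows; the scores and the flagging window are illustrative placeholders, not pipeline defaults:

```python
import statistics

# Hypothetical WGD scores per sample; values are illustrative only
wgd = {"sample_01": -0.02, "sample_02": 0.01, "sample_03": -0.35, "sample_04": 0.03}

# The cohort median should sit near 0 (slightly below for PCR- data,
# slightly above for PCR+ data)
cohort_median = statistics.median(wgd.values())

# Flag samples far from 0 as candidate dosage outliers; the window here
# is a placeholder choice, not a recommended threshold
outliers = sorted(s for s, score in wgd.items() if abs(score) > 0.2)
```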


### Inputs

@@ -91,36 +124,3 @@ Outlier samples detected by call counts.

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `qc_table`
QC summary table. Enable with [run_ploidy](#optional-run_ploidy).

## Preliminary Sample QC

The purpose of sample filtering at this stage after EvidenceQC is to
prevent very poor quality samples from interfering with the results for
the rest of the callset. In general, samples that are borderline are
okay to leave in, but you should choose filtering thresholds to suit
the needs of your cohort and study. There will be future opportunities
(as part of FilterBatch) for filtering before the joint genotyping
stage if necessary. Here are a few of the basic QC checks that we recommend:

- Look at the X and Y ploidy plots, and check that sex assignments
match your expectations. If there are discrepancies, check for
sample swaps and update your PED file before proceeding.

- Look at the dosage score (WGD) distribution and check that
it is centered around 0 (the distribution of WGD for PCR-
samples is expected to be slightly lower than 0, and the distribution
of WGD for PCR+ samples is expected to be slightly greater than 0.
Refer to the gnomAD-SV paper for more information on WGD score).
Optionally filter outliers.

- Look at the low outliers for each SV caller (samples with
much lower than typical numbers of SV calls per contig for
each caller). An empty low outlier file means there were
no outliers below the median and no filtering is necessary.
Check that no samples had zero calls.

- Look at the high outliers for each SV caller and optionally
filter outliers; samples with many more SV calls than average may be poor quality.

- Remove samples with autosomal aneuploidies based on
the per-batch binned coverage plots of each chromosome.
4 changes: 2 additions & 2 deletions website/docs/modules/filter_batch.md
@@ -64,8 +64,8 @@ Common variant metrics table [GenerateBatchMetrics](./gbm#metrics_common)

#### `outlier_cutoff_nIQR`
Defines outlier sample cutoffs based on variant counts. Samples deviating from the batch median count by more than
the given multiple of the interquartile range are hard filtered from the VCF. Recommended range is between 3 and 9
depending on desired sensitivity (higher is less stringent), or disable with 999.
the given multiple of the interquartile range are hard filtered from the VCF. Recommended range is between `3` and `9`
depending on desired sensitivity (higher is less stringent), or disable with `10000`.
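
The cutoff logic can be sketched as below; the per-sample counts are illustrative values, not real data:

```python
import statistics

# Hypothetical per-sample SV counts for one batch (illustrative values)
counts = [980, 1010, 1025, 1040, 1055, 1070, 1100, 2600]

q1, median, q3 = statistics.quantiles(counts, n=4)  # quartile cut points
iqr = q3 - q1
n_iqr = 6  # an example choice within the recommended 3-9 range

# Samples deviating from the batch median by more than n_iqr * IQR are filtered
low, high = median - n_iqr * iqr, median + n_iqr * iqr
outlier_counts = [c for c in counts if not (low <= c <= high)]
```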

#### <HighlightOptionalArg>Optional</HighlightOptionalArg> `outlier_cutoff_table`
A cutoff table to set permissible nIQR ranges for each SVTYPE. If provided, overrides `outlier_cutoff_nIQR`. Expected
12 changes: 7 additions & 5 deletions website/docs/modules/filter_genotypes.md
@@ -47,7 +47,7 @@ The model uses the following features:
* Genotype properties:
* Non-reference and no-call allele counts
* Genotype quality (`GQ`)
* Supporting evidence types (EV) and respective genotype qualities (`PE_GQ`, `SR_GQ`, `RD_GQ`)
* Supporting evidence types (`EV`) and respective genotype qualities (`PE_GQ`, `SR_GQ`, `RD_GQ`)
* Raw call concordance (`CONC_ST`)
* Variant properties:
* Variant type (`SVTYPE`) and size (`SVLEN`)
Expand All @@ -69,7 +69,7 @@ For ease of use, we provide a model pre-trained on high-quality data with truth
```
gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/gatk-sv-recalibrator.aou_phase_1.v1.model
```
See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7) for further details on model training.
See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7) for further details on model training. The generation and release of this model was made possible by the All of Us program (see [here](/docs/acknowledgements)).

### SL scores

@@ -88,13 +88,15 @@ This workflow can be run in one of two modes:
Genotypes with `SL` scores less than the cutoffs are set to no-call (`./.`). The above values were taken directly from Appendix N of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7 ](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7). Users should adjust the thresholds depending on data quality and desired accuracy. Please see the arguments in [this script](https://github.com/broadinstitute/gatk-sv/blob/main/src/sv-pipeline/scripts/apply_sl_filter.py) for all available options.
2. (Advanced) The user provides truth labels for a subset of non-reference calls, and `SL` cutoffs are automatically optimized. These truth labels should be provided as a json file in the following format:
```
```json
{
"sample_1": {
"sample_1":
{
"good_variant_ids": ["variant_1", "variant_3"],
"bad_variant_ids": ["variant_5", "variant_10"]
},
"sample_2": {
"sample_2":
{
"good_variant_ids": ["variant_2", "variant_13"],
"bad_variant_ids": ["variant_8", "variant_11"]
}
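
The mode-1 thresholding can be sketched as follows. The per-SVTYPE cutoff values here are placeholders rather than the Appendix N values, and the bundled `apply_sl_filter.py` script remains the authoritative implementation:

```python
from typing import Optional

# Hypothetical per-SVTYPE SL cutoffs (placeholder values, not the published ones)
SL_CUTOFFS = {"DEL": -46.0, "DUP": -53.0, "INS": -13.0}

def filter_genotype(gt: str, sl: Optional[float], svtype: str) -> str:
    """Set a genotype to no-call (./.) when its SL score falls below the SVTYPE cutoff."""
    cutoff = SL_CUTOFFS.get(svtype)
    if cutoff is not None and sl is not None and sl < cutoff:
        return "./."
    return gt
```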
