Merge pull request #92 from uab-cgds-worthey/joss_manuscript
Bring master branch up to date on sample config as input and doc fixes
ManavalanG authored Oct 13, 2023
2 parents 667b786 + 3a38e46 commit 5dd211c
Showing 25 changed files with 358 additions and 273 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
sample_id bam vcf capture_bed fastqc_raw fastqc_trimmed fastq_screen dedup multiqc_rename_config
A .test/ngs-data/test_project/analysis/A/bam/A.bam .test/ngs-data/test_project/analysis/A/vcf/A.vcf.gz .test/ngs-data/test_project/analysis/A/configs/small_variant_caller/capture_regions.bed .test/ngs-data/test_project/analysis/A/qc/fastqc-raw/A-1-R1_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-raw/A-1-R2_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-raw/A-2-R1_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-raw/A-2-R2_fastqc.zip .test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed/A-1-R1_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed/A-1-R2_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed/A-2-R1_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed/A-2-R2_fastqc.zip .test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed/A-1-R1_screen.txt,.test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed/A-1-R2_screen.txt,.test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed/A-2-R1_screen.txt,.test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed/A-2-R2_screen.txt .test/ngs-data/test_project/analysis/A/qc/dedup/A-1.metrics.txt,.test/ngs-data/test_project/analysis/A/qc/dedup/A-2.metrics.txt .test/ngs-data/test_project/analysis/A/qc/multiqc_initial_pass/multiqc_sample_rename_config/A_rename_config.tsv
B .test/ngs-data/test_project/analysis/B/bam/B.bam .test/ngs-data/test_project/analysis/B/vcf/B.vcf.gz .test/ngs-data/test_project/analysis/B/configs/small_variant_caller/capture_regions.bed .test/ngs-data/test_project/analysis/B/qc/fastqc-raw/B-1-R1_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-raw/B-1-R2_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-raw/B-2-R1_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-raw/B-2-R2_fastqc.zip .test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed/B-1-R1_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed/B-1-R2_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed/B-2-R1_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed/B-2-R2_fastqc.zip .test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed/B-1-R1_screen.txt,.test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed/B-1-R2_screen.txt,.test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed/B-2-R1_screen.txt,.test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed/B-2-R2_screen.txt .test/ngs-data/test_project/analysis/B/qc/dedup/B-1.metrics.txt,.test/ngs-data/test_project/analysis/B/qc/dedup/B-2.metrics.txt .test/ngs-data/test_project/analysis/B/qc/multiqc_initial_pass/multiqc_sample_rename_config/B_rename_config.tsv
@@ -0,0 +1,3 @@
sample_id bam vcf fastqc_raw fastqc_trimmed fastq_screen dedup multiqc_rename_config
A .test/ngs-data/test_project/analysis/A/bam/A.bam .test/ngs-data/test_project/analysis/A/vcf/A.vcf.gz .test/ngs-data/test_project/analysis/A/qc/fastqc-raw/A-1-R1_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-raw/A-1-R2_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-raw/A-2-R1_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-raw/A-2-R2_fastqc.zip .test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed/A-1-R1_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed/A-1-R2_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed/A-2-R1_fastqc.zip,.test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed/A-2-R2_fastqc.zip .test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed/A-1-R1_screen.txt,.test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed/A-1-R2_screen.txt,.test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed/A-2-R1_screen.txt,.test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed/A-2-R2_screen.txt .test/ngs-data/test_project/analysis/A/qc/dedup/A-1.metrics.txt,.test/ngs-data/test_project/analysis/A/qc/dedup/A-2.metrics.txt .test/ngs-data/test_project/analysis/A/qc/multiqc_initial_pass/multiqc_sample_rename_config/A_rename_config.tsv
B .test/ngs-data/test_project/analysis/B/bam/B.bam .test/ngs-data/test_project/analysis/B/vcf/B.vcf.gz .test/ngs-data/test_project/analysis/B/qc/fastqc-raw/B-1-R1_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-raw/B-1-R2_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-raw/B-2-R1_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-raw/B-2-R2_fastqc.zip .test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed/B-1-R1_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed/B-1-R2_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed/B-2-R1_fastqc.zip,.test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed/B-2-R2_fastqc.zip .test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed/B-1-R1_screen.txt,.test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed/B-1-R2_screen.txt,.test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed/B-2-R1_screen.txt,.test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed/B-2-R2_screen.txt .test/ngs-data/test_project/analysis/B/qc/dedup/B-1.metrics.txt,.test/ngs-data/test_project/analysis/B/qc/dedup/B-2.metrics.txt .test/ngs-data/test_project/analysis/B/qc/multiqc_initial_pass/multiqc_sample_rename_config/B_rename_config.tsv
@@ -0,0 +1,3 @@
sample_id bam vcf capture_bed
C .test/ngs-data/test_project/analysis/C/bam/C.bam .test/ngs-data/test_project/analysis/C/vcf/C.vcf.gz .test/ngs-data/test_project/analysis/C/configs/small_variant_caller/capture_regions.bed
D .test/ngs-data/test_project/analysis/D/bam/D.bam .test/ngs-data/test_project/analysis/D/vcf/D.vcf.gz .test/ngs-data/test_project/analysis/D/configs/small_variant_caller/capture_regions.bed
@@ -0,0 +1,3 @@
sample_id bam vcf
C .test/ngs-data/test_project/analysis/C/bam/C.bam .test/ngs-data/test_project/analysis/C/vcf/C.vcf.gz
D .test/ngs-data/test_project/analysis/D/bam/D.bam .test/ngs-data/test_project/analysis/D/vcf/D.vcf.gz
38 changes: 38 additions & 0 deletions docs/Changelog.md
@@ -12,6 +12,44 @@ YYYY-MM-DD John Doe
```
---

2023-10-09 Manavalan Gajapathy

* Merges `joss_manuscript` to the `master` branch to bring it up to date.

2023-10-06 Manavalan Gajapathy

* Adds documentation on providing sample filepaths via a user-provided sample config file, following recent PRs #87,
#88, #89 and #90 (closes #86).
* Adds documentation on editing thresholds in the QuaC-Watch config file (closes #85).

2023-10-05 Manavalan Gajapathy

* Refactors to accept sample filepaths via user-provided sample config file, when `--allow_sample_renaming` is used (#86)

2023-10-05 Manavalan Gajapathy

* Refactors to accept sample filepaths via user-provided sample config file, when `--include_prior_qc` is used (#86)
* Adds a test sample config file that includes priorQC filepaths

2023-10-05 Manavalan Gajapathy

* Refactors to accept sample filepaths via a user-provided sample config file. Applies only to exome mode in its minimal
form (without `--include_prior_qc` or `--allow_sample_renaming`) (#86)
* Adds a test sample config file
* Refactors to take the capture bed file as input from the sample config file

2023-10-05 Manavalan Gajapathy

* Refactors to accept sample filepaths via a user-provided sample config file. Applies only to WGS mode in its minimal
form (without `--include_prior_qc` or `--allow_sample_renaming`) (#86)
* Adds a sample config file to use with system testing datasets -
`.test/configs/no_priorQC/sample_config/project_2samples.tsv`. It maps each sample name to its VCF and BAM
filepaths.
* Refactors use of `--sample_config` arg to work with this config file as input
* Deprecates args `--project_name` and `--projects_path`
* Modifies workflow to use the new input setup
* Updates README concerning the changes made

2023-07-17 Manavalan Gajapathy

* Minor updates to documentation.
130 changes: 40 additions & 90 deletions docs/input_output.md
@@ -2,11 +2,40 @@

## Input

### Sample config file

Sample identifiers and their required filepaths (`bam`, `vcf`, etc.) are provided to QuaC in a TSV-formatted config
file via `--sample_config`. The required columns depend on the flags supplied to `src/run_quac.py`. The table below
lists the allowed columns and when to use them.

| Column | When to use | Description |
| --------------------- | ------------------------- | ----------------------------------------------------------------------------------------------------- |
| sample_id | Always | Sample identifier |
| bam | Always | BAM filepath |
| vcf | Always | VCF filepath |
| capture_bed | `--exome` | Capture region bed filepath |
| fastqc_raw | `--include_prior_qc` | Filepath to FastQC `zip` files created from raw fastqs. Use comma as delimiter if multiple files. |
| fastqc_trimmed | `--include_prior_qc` | Filepath to FastQC `zip` files created from trimmed fastqs. Use comma as delimiter if multiple files. |
| fastq_screen | `--include_prior_qc` | Filepath to FastQ Screen `txt` files. Use comma as delimiter if multiple files. |
| dedup | `--include_prior_qc` | Filepath to Picard's MarkDuplicates `txt` files. Use comma as delimiter if multiple files. |
| multiqc_rename_config | `--allow_sample_renaming` | Filepath to label rename configfile to use with multiqc |
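
As a rough illustration of the table above, the mapping from flags to required columns could be sketched in Python as follows. The names `required_columns` and `missing_columns` are hypothetical and not part of QuaC; how `src/run_quac.py` actually validates the file is not shown in this commit, and treating the three flags as independent is an assumption.

```python
import csv
import io

# Column groups copied from the table above.
ALWAYS = ["sample_id", "bam", "vcf"]
PRIOR_QC = ["fastqc_raw", "fastqc_trimmed", "fastq_screen", "dedup"]


def required_columns(exome=False, include_prior_qc=False, allow_sample_renaming=False):
    """Return the columns the table lists for a given flag combination (hypothetical helper)."""
    columns = list(ALWAYS)
    if exome:
        columns.append("capture_bed")
    if include_prior_qc:
        columns.extend(PRIOR_QC)
    if allow_sample_renaming:
        columns.append("multiqc_rename_config")
    return columns


def missing_columns(tsv_handle, **flags):
    """List required columns that are absent from the TSV header."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    header = reader.fieldnames or []
    return [column for column in required_columns(**flags) if column not in header]


# Usage: a minimal WGS-style config lacks `capture_bed`, which `--exome` would need.
example = io.StringIO("sample_id\tbam\tvcf\nC\tC.bam\tC.vcf.gz\n")
print(missing_columns(example, exome=True))
```

A check like this could be run before launching the workflow to catch a config file that does not match the chosen flags.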

Refer to our system testing directory for example sample config files at `.test/configs`. For example:

* `.test/configs/no_priorQC/sample_config/project_2samples_wgs.tsv` - Sample config file for WGS samples and no prior
QC.
* `.test/configs/no_priorQC/sample_config/project_2samples_exome.tsv` - Sample config file for exome samples and no
prior QC. Note that WGS and exome samples can't be used in the same config file.
* `.test/configs/include_priorQC/sample_config/project_2samples_wgs.tsv` - Sample config file for WGS samples with prior
QC data available from [certain QC tools](./index.md#optional-qc-output-consumed-by-quac).
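
For readers who want to inspect such a file programmatically, here is a minimal sketch of reading a sample config with Python's standard `csv` module and splitting the comma-delimited prior-QC columns into lists. The function `load_sample_config` is an illustrative assumption, not QuaC's own parser.

```python
import csv
import io

# Columns that, per the table above, may hold several comma-delimited filepaths.
MULTI_FILE_COLUMNS = {"fastqc_raw", "fastqc_trimmed", "fastq_screen", "dedup"}


def load_sample_config(handle):
    """Map each sample_id to its filepaths (hypothetical helper, not QuaC code)."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = {}
        for column, value in row.items():
            if column == "sample_id":
                continue
            if column in MULTI_FILE_COLUMNS:
                # Split on commas and drop empty pieces and surrounding whitespace.
                entry[column] = [p.strip() for p in (value or "").split(",") if p.strip()]
            else:
                entry[column] = value
        samples[row["sample_id"]] = entry
    return samples


# Usage with an in-memory TSV; real files live under `.test/configs`.
text = "sample_id\tbam\tvcf\tdedup\nA\tA.bam\tA.vcf.gz\tA-1.metrics.txt,A-2.metrics.txt\n"
config = load_sample_config(io.StringIO(text))
print(config["A"]["dedup"])
```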

### Pedigree file

<!-- markdown-link-check-disable -->

Samples belonging to a project are provided as input via `--pedigree` to QuaC in [pedigree file
format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format). Only the samples that are
supplied in pedigree file will be processed by QuaC and all of these samples must belong to the same project.
QuaC requires a [pedigree
file](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format) as input via `--pedigree`.
Samples listed in this file must correspond to those in the sample config file (`--sample_config`).

<!-- markdown-link-check-enable -->
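
The correspondence between the pedigree file and the sample config can be checked ahead of a run. The sketch below assumes the standard PED layout, where the second whitespace-separated column is the individual ID; the helper names are hypothetical, and whether QuaC demands an exact match or only a subset is not stated here, so the function simply reports differences in both directions.

```python
def pedigree_sample_ids(ped_text):
    """Collect individual IDs (second column) from PED-formatted text."""
    ids = set()
    for line in ped_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        ids.add(line.split()[1])
    return ids


def compare_samples(ped_ids, config_ids):
    """Report sample IDs present in one input but not the other (hypothetical helper)."""
    return {
        "only_in_pedigree": sorted(ped_ids - config_ids),
        "only_in_sample_config": sorted(config_ids - ped_ids),
    }


# Usage: sample D is listed in the pedigree but missing from the sample config.
ped = "fam1\tC\t0\t0\t1\t1\nfam1\tD\t0\t0\t2\t1\n"
print(compare_samples(pedigree_sample_ids(ped), {"C"}))
```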

@@ -16,102 +45,23 @@ supplied in pedigree file will be processed by QuaC and all of these samples mus
create a dummy pedigree file, which will lack sex (unless project tracking sheet is provided), relatedness and
affected status info. See header of the script for usage instructions.


Each sample must have `BAM` and `VCF` files available in the directory structure shown below for sample `X`.

```
test_project/
└── analysis
├── X
│ ├── bam
│ │   ├── X.bam
│ │   └── X.bam.bai
│ └── vcf
│ ├── X.vcf.gz
│ └── X.vcf.gz.tbi
└── Y
└── ....
```
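
For projects that still follow this directory layout, the BAM and VCF paths for a sample config row can be derived from the layout itself. This is only a sketch: `legacy_sample_row` is a hypothetical helper, and it assumes files are named `<sample>.bam` and `<sample>.vcf.gz` exactly as in the tree above.

```python
from pathlib import PurePosixPath


def legacy_sample_row(projects_path, project_name, sample_id):
    """Build sample_id/bam/vcf values from the directory layout shown above."""
    base = PurePosixPath(projects_path) / project_name / "analysis" / sample_id
    return {
        "sample_id": sample_id,
        "bam": str(base / "bam" / f"{sample_id}.bam"),
        "vcf": str(base / "vcf" / f"{sample_id}.vcf.gz"),
    }


# Usage: print a tab-separated header and one row for sample C of the test project.
row = legacy_sample_row(".test/ngs-data/", "test_project", "C")
print("\t".join(row.keys()))
print("\t".join(row.values()))
```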

When run in exome mode using flag `--exome`, QuaC requires a capture-regions bed file at the path
`path_to_sample/configs/small_variant_caller/<capture_regions>.bed` for each sample.

```
test_project/
└── analysis
├── X
│ ├── bam
│ │   ├── X.bam
│ │   └── X.bam.bai
│ ├── configs
│ │   └── small_variant_caller
│ │   └── capture_regions.bed
│ └── vcf
│ ├── X.vcf.gz
│ └── X.vcf.gz.tbi
└── Y
└── ....
```

*Optionally*, QuaC can also utilize QC results produced by [certain
tools](./index.md#optional-qc-output-consumed-by-quac) when run with flag `--include_prior_qc`. In this case, following
directory structure is expected.

```
test_project/
└── analysis
├── X
│ ├── bam
│ │   ├── X.bam
│ │   └── X.bam.bai
│ ├── qc
│ │   ├── dedup
│ │   │   ├── X-1.metrics.txt
│ │   │   └── X-2.metrics.txt
│ │   ├── fastqc-raw
│ │   │   ├── ....
│ │   ├── fastqc-trimmed
│ │   │   ├── ....
│ │   ├── fastq_screen-trimmed
│ │   │   └── ....
│ │   └── multiqc_initial_pass <--- needed only when `--allow_sample_renaming` flag is used
│ │   └── multiqc_sample_rename_config
│ │   └── X_rename_config.tsv
│ └── vcf
│ ├── X.vcf.gz
│ └── X.vcf.gz.tbi
└── Y
└── ....
```


!!! note "CGDS users only"

Output (bam, vcf and QC output) produced by CGDS's small variant caller pipeline can be readily used as input to
QuaC with flags `--include_prior_qc` and `--allow_sample_renaming`.

### Example project structure

Refer to the system testing directory `.test/` in the repo for an example project with the above-mentioned directory
structure needed as input. In this setup, samples A and B have prior QC data included, whereas samples C and D do
not. Refer to the pedigree files under `.test/configs/` to see how these example samples were used as input to QuaC.


## Output

QuaC results are stored at the path specified via option `--outdir` (default:
`data/quac/results/test_project/analysis`). Refer to the [system testing's
output](./system_testing.md#expected-output-files) to learn more about the output directory structure.
`data/quac/results/test_project/analysis`). Refer to the [system testing's
output](./system_testing.md#expected-output-files) to learn more about the output directory structure.

QC outputs are stored at the sample level as well as the project level (i.e., all samples considered together),
depending on the type of QC run. For example, the Qualimap tool is run at the sample level whereas the Somalier tool is
run at the project level. MultiQC reports are available at both the sample and project level.

!!! tip
!!! tip

Users may primarily be interested in the aggregated QC results produced by [multiqc](https://multiqc.info/),
Users may primarily be interested in the aggregated QC results produced by [MultiQC](https://multiqc.info/),
at both the sample level and the project level. These MultiQC reports also include a summary of QuaC-Watch
results at the top.

!!! note "CGDS users only"

QuaC's output directory structure was designed based on the output structure of the [CGDS small variant caller
pipeline](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/small_variant_caller_pipeline).

5 changes: 2 additions & 3 deletions docs/installation_configuration.md
@@ -207,9 +207,8 @@ PROJECT_CONFIG="project_2samples"
PRIOR_QC_STATUS="no_priorQC"

python src/run_quac.py \
--project_name test_project \
--projects_path ".test/ngs-data/" \
--pedigree ".test/configs/${PRIOR_QC_STATUS}/${PROJECT_CONFIG}.ped" \
--sample_config ".test/configs/${PRIOR_QC_STATUS}/sample_config/${PROJECT_CONFIG}_wgs.tsv" \
--pedigree ".test/configs/${PRIOR_QC_STATUS}/pedigree/${PROJECT_CONFIG}.ped" \
--outdir "data/quac/results/test_${PROJECT_CONFIG}_wgs-${PRIOR_QC_STATUS}/analysis" \
--quac_watch_config "configs/quac_watch/wgs_quac_watch_config.yaml" \
--workflow_config "configs/workflow.yaml" \
21 changes: 10 additions & 11 deletions docs/quac_cli.md
@@ -6,10 +6,9 @@ wrapper/CLI (command line interface) tool `src/run_quac.py`.
## Command line interface

```sh
$ python src/run_quac.py -h
usage: run_quac.py [-h] --project_name PROJECT_NAME --projects_path
PROJECTS_PATH --pedigree PEDIGREE --quac_watch_config
QUAC_WATCH_CONFIG [--workflow_config]
$ python src/run_quac.py -h
usage: run_quac.py [-h] --sample_config SAMPLE_CONFIG --pedigree PEDIGREE
--quac_watch_config QUAC_WATCH_CONFIG [--workflow_config]
[--snakemake_cluster_config] [--outdir] [--tmp_dir]
[--exome] [--include_prior_qc] [--allow_sample_renaming]
[-e] [-n] [--cli_cluster_config] [--log_dir]
@@ -20,13 +19,13 @@ optional arguments:
-h, --help show this help message and exit

QuaC snakemake workflow options:
--project_name PROJECT_NAME
Project name. Required. (default: None)
--projects_path PROJECTS_PATH
Path where all projects are hosted. Do not include
project name here. Required. (default: None)
--pedigree PEDIGREE Pedigree filepath. Must correspond to the project
supplied via --project_name. Required. (default: None)
--sample_config SAMPLE_CONFIG
Sample config file in TSV format. Provides sample name
and necessary input filepaths (bam, vcf, etc.).
Required. (default: None)
--pedigree PEDIGREE Pedigree filepath. Must correspond to samples
mentioned in configfile via --sample_config. Required.
(default: None)
--quac_watch_config QUAC_WATCH_CONFIG
YAML config path specifying QC thresholds for QuaC-
Watch. See directory 'configs/quac_watch/' in quac