Latest commit: ATpoint, "change commits of versions and command lines", 457dbda, Dec 8, 2022. History: 116 commits.

atac_chip_preprocess

Introduction

atac_chip_preprocess is a containerized Nextflow pipeline for preprocessing of ATAC-seq and ChIP-seq data.

The workflow consists of these steps:

validation of the provided samplesheet
initial QC with fastqc
merging of lane replicates per sample into one fastq file per R1/R2
adapter and quality trimming with fastp
mapping with bowtie2
duplicate marking with samblaster
removal of alignments with MAPQ < 20, non-primary or supplementary alignments, reads mapped to non-primary (random/unplaced) chromosomes, mitochondrial alignments and duplicate reads with samtools
for paired-end data, collection of insert size metrics with picard
for ATAC-seq data, extraction of transposome insertion events (cutsites) using custom GNU tool combinations
peak calling with macs2 and filtering of peaks against NGS blacklists (ENCODE+mitochondrial homologs in the nuclear genome, the latter for ATAC-seq only) using bedtools
calculation of the Fraction of Reads in Peaks (FRiP) as a QC metric with featureCounts
creation of raw bigwig tracks for visual inspection of data quality with bedtools
summary report with MultiQC
output of all used software versions and the exact command lines per process step and sample using custom scripts
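As a small illustration of the FRiP metric named in the steps above: it is the number of reads falling into peaks divided by the total number of reads considered. The sketch below only shows that arithmetic. The function name frip and the two counts are made up for illustration, and how the pipeline derives both numbers from the featureCounts output is not shown here.

```shell
# Minimal sketch: FRiP = reads in peaks / total reads.
# frip is a hypothetical helper, and the counts below are invented.
frip() {
  awk -v in_peaks="$1" -v total="$2" 'BEGIN { printf "%.4f\n", in_peaks / total }'
}

# 2500 of 10000 reads fall into peaks
frip 2500 10000
```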

Run the following test profile to see all possible outputs that the pipeline produces. The default output directory is ./atac_chip_preprocess_results/. The Docker image is downloaded automatically, which may take a minute or two.

NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --keep_merge --keep_trim

An overview of current software versions and exact command lines when using default settings of the pipeline can be found in the misc directory.

Usage

The user has to provide at least these three parameters:

--samplesheet: path to a samplesheet csv file with three columns: sample (the sample name), r1 (path to R1) and r2 (path to R2). The r2 column can be empty, in which case the sample is treated as single-end.
--index: path to a folder containing a bowtie2 index with the typical *.bt2 files. Note, it is the path to the folder, not the path to the index basename, as the pipeline will find the bt2 files automatically.
--species: either mm or hs, to let the peak caller know whether mouse or human data are provided, so it gets the effective genome length right.
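A samplesheet along these lines could look as follows. This is a sketch under assumptions: the header row, the sample names and the file paths are invented for illustration, and the layout helper is not part of the pipeline. It only shows how an empty r2 column distinguishes single-end from paired-end samples.

```shell
# Write a hypothetical samplesheet with one paired-end and one single-end sample.
# Header row, sample names and paths are assumptions for illustration.
cat > samplesheet.csv <<'EOF'
sample,r1,r2
sampleA,/path/to/sampleA_R1.fastq.gz,/path/to/sampleA_R2.fastq.gz
sampleB,/path/to/sampleB_R1.fastq.gz,
EOF

# Report the layout that the r2 column implies for each data row
# (hypothetical helper, skips the header line).
layout() {
  awk -F',' 'NR > 1 { print $1, ($3 == "" ? "single-end" : "paired-end") }' "$1"
}

layout samplesheet.csv
```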

Note that the bowtie2 index must be built beforehand. We did not include this step in the pipeline, as it is simply bowtie2-build genome.fa idx.
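Since --index expects the folder rather than the index basename, a quick pre-flight check before launching the pipeline may help. The helper has_bt2_index and the folder name index_folder are hypothetical, and the pipeline does its own lookup of the bt2 files, so this is only a convenience sketch.

```shell
# Return success if the given folder contains at least one bowtie2 index file (*.bt2).
# has_bt2_index is a hypothetical helper, not part of the pipeline.
has_bt2_index() {
  for f in "$1"/*.bt2; do
    [ -e "$f" ] && return 0
  done
  return 1
}

mkdir -p index_folder
# The index itself would be built once beforehand, for example:
#   bowtie2-build genome.fa index_folder/idx
if has_bt2_index index_folder; then
  echo "bt2 files found, pass the folder to --index"
else
  echo "no *.bt2 files in index_folder"
fi
```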

On our HPC we typically use:

# Example for mouse ATAC-seq data
NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --samplesheet path/to/samplesheet.csv --index path/to/index_folder --species mm

# Example for mouse ChIP-seq data
NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --samplesheet path/to/samplesheet.csv --index path/to/index_folder --species mm --atacseq false

Options

We chose reasonable defaults for all processing steps, and these should normally be used without modification. Still, the following options exist for customization:

General options

--atacseq: logical, set to false when processing other data such as ChIP-seq, default is true for ATAC-seq data

Filtering options

--blacklist: path to a BED file to filter peaks against. By default, the provided mm10 blacklist is used when --species is mm, and the provided hg38 blacklist when it is hs.
--filter_blacklist: logical, set to false to turn off any blacklist filtering, default true.
--flag_remove: a numeric flag to be used with samtools view -F, indicating which alignments to remove. Default is 3332, which discards unmapped, not primary, supplementary and duplicate alignments. See here for details.
--chr_regex: a groovy-compatible regex to indicate which chromosomes to keep in the BAM alignments. Default is chr[1-9,X,Y], which means keep everything starting with chr followed by a number or X/Y. That in turn removes decoys (chrEBV) and unplaced/random contigs such as chrU..., therefore keeping only the primary autosomes and sex chromosomes.
--min_mapq: an integer, keep only alignments with a MAPQ of at least that value (it is passed to samtools view -q), default is 20.
--fragment_length: for single-end data, the average expected fragment length used to extend reads to fragments for bigwig creation and FRiP calculation, default is 250. This is only used if --atacseq false, as for ATAC-seq data everything is based on the transposome cutsites (that is, the 5' ends of the alignments).
--keep_merge: logical, whether to keep the merged fastq files, else they're not published to the output directory
--keep_trim: logical, whether to keep the trimmed fastq files, else they're not published to the output directory
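To see what the default --flag_remove value of 3332 stands for, the combined flag can be split into its individual SAM flag bits. The helper decode_flag below is a hypothetical sketch for illustration and is not part of the pipeline. The bit meanings in the comment follow the SAM specification.

```shell
# Print the individual SAM flag bits that are set in a combined flag value.
# decode_flag is a hypothetical helper for illustration.
decode_flag() {
  flag=$1
  bit=1
  out=""
  while [ "$bit" -le 2048 ]; do
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out }$bit"
    fi
    bit=$(( bit * 2 ))
  done
  echo "$out"
}

# 4 = unmapped, 256 = not primary, 1024 = duplicate, 2048 = supplementary
decode_flag 3332
```

The four bits sum to 4 + 256 + 1024 + 2048 = 3332, so a custom value can be assembled the same way by adding up the bits of the alignment classes to discard.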

Process options

--do_not_trim: logical, whether to skip adapter and quality trimming
--trim_additional: additional arguments for the fastp trimming process beyond what is coded in the module definition, default --dont_eval_duplication -z 6 to skip duplicate level assessment and to compress outputs
--align_additional: additional arguments for the bowtie2 alignment process beyond what is coded in the module definition, default is -X2000 --very-sensitive, see bowtie2 manual
--sort_additional: additional arguments for the samtools sorting process beyond what is coded in the module definition, default is -l 6 to compress the resulting BAM file to that level
--filter_additional: additional arguments for the samtools view filtering process beyond what is described above and given with the -q and -F flags
--macs_additional: additional arguments for macs2 callpeak. The default always includes --keep-dup=all, since the data passed to that process are already deduplicated. If ATAC-seq data are processed (default), then --nomodel --extsize 100 --shift -50 --min-length 250 is added to provide some smoothing when using the cutsites for peak calling.
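The smoothing effect of the ATAC-seq defaults --shift -50 --extsize 100 can be pictured with a little arithmetic: with --nomodel, macs2 first moves the 5' end by the shift value and then extends it by the extension size, so each cutsite ends up in the middle of a 100 bp window. The sketch below shows this for a forward-strand cutsite only. The helper cutsite_window and the position 1000 are made up for illustration.

```shell
# Compute the window built around a forward-strand cutsite:
# move the 5' end by the shift value, then extend by the extension size.
# cutsite_window is a hypothetical helper for illustration.
cutsite_window() {
  pos=$1
  shift_bp=$2
  ext=$3
  start=$(( pos + shift_bp ))
  end=$(( start + ext ))
  echo "$start $end"
}

# cutsite at position 1000 with --shift -50 --extsize 100
cutsite_window 1000 -50 100
```

For the example position this gives a window from 950 to 1050, that is 50 bp on either side of the cutsite.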

Resources

The nextflow.config file contains hardcoded resource defaults for the individual processes, suitable for use on HPC environments. The most demanding process is the alignment step, requiring 16 threads and 16GB of RAM per sample.

Schedulers

The schedulers.config file currently contains a single scheduler profile for SLURM as used on our HPC, submitting jobs (if using -profile slurm) to a queue called normal with a maximum of 8h walltime. Custom profiles should be added to this config.
