atac_chip_preprocess is a containerized Nextflow pipeline for preprocessing of ATAC-seq and ChIP-seq data.
The workflow consists of these steps:
- validation of the provided samplesheet
- initial QC with
fastqc
- merging of lane replicates per sample into one fastq file per R1/R2
- adapter and quality trimming with
fastp
- mapping with
bowtie2
- duplicate marking with
samblaster
- removal of MAPQ < 20, non-primary or supplementary, reads mapped to non-primary (random/unplaced) chromosomes, mitochondrial alignments and duplicate reads with
samtools
- for paired-end data fetching of insert size metrics with
picard
- for ATAC-seq data extraction of transposome insertion events (cutsites) using custom GNU tool combinations
- peak calling with
macs2
and filtering of peaks against NGS blacklists (ENCODE+mitochondrial homologs in the nuclear genome, the latter for ATAC-seq only) usingbedtools
- calculation of Fractions Of Reads in Peaks (FRiPs) as a QC metric with
featureCounts
- creation of raw bigwig tracks for visual inspection of data quality with
bedtools
- summary report with
MultiQC
- output of all used software versions and the exact command lines per process step and sample using custom scripts
Run the following test profile to see all possible outputs that the pipeline produces. Default output directory is ./atac_chip_preprocess_results/
).
Downloading the Docker image may take a minute or two (automated).
NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --keep_merge --keep_trim
An overview of current software versions and exact command lines when using default settings of the pipeline can be found in the misc directory.
The three minimal parameters the user has to provide are the following ones:
--samplesheet
: path to a samplesheet csv file with three columns, beingsample
(the sample name),r1
(path to R1) andr2
(path to R2), where r2 can be empty. If empty, then the sample is considered single-end.--index
: path to a folder containing abowtie2
index with the typical*.bt2
files. Note, it is the path to the folder, not the path to the index basename, as the pipeline will find the bt2 files automatically.--species
: either ofmm
orhs
to let the peak caller know whether mouse or human data are provided, so it gets the effective genome length right.
Note that the bowtie2
index must be produced beforehand, we did not include that into the pipeline as it is trivially just bowtie2-build genome.fa idx
.
On our HPC we typically use:
# Example for mouse ATAC-seq data
NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --samplesheet path/to/samplesheet.csv --index path/to/index_folder --species mm
# Example for mouse ChIP-seq data
NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --samplesheet path/to/samplesheet.csv --index path/to/index_folder --species mm --atacseq false
We used reasonable defaults for all processing steps that should be used without modifications. Still, the following options exist for customization:
General options
--atacseq
, a logical, set tofalse
if processing something like ChIP-seq data, by defaulttrue
for ATAC-seq data
Filtering options
--blacklist
: path to a BED file to filter peaks against. By default when--species
ismm
then the provided mm10 blacklist is used, forhs
the hg38 one is used.--filter_blacklist
: logical, set tofalse
to turn off any blacklist filtering, defaulttrue
.--flag_remove
: a numeric flag to be used withsamtools view -F
, so indicating which alignments to remove. Default is 3332, so discard unmapped, not primary, supplementary and duplicates . See here for details.--chr_regex
: a groovy-compatible regex to indicate which chromosomes to keep in the BAM alignments. Default ischr[1-9,X,Y]
which means keep everything starting withchr
and then a number of X/Y. That in turn removes decoys (chrEBV
) and unplaced/random contigs such aschrU...
, therefore keeping only the primary autosomes and sex chromosomes.--min_mapq
: an integer, keep only alignments with MAPQ greater than that, default is 20.--fragment_length
: for single-end data an average expected fragment length to extend reads to fragments for bigwig creation and FRiP calculation, default is 250. That is only used if--atacseq false
as for ATAC-seq data everything is based on the transposome cutsites (that is the 5' ends of the alignments).keep_merge
: logical, whether to keep the merged fastq files, else they're not published to the output directorykeep_trim
: logical, whether to keep the trimmed fastq files, else they're not published to the output directory
Process options
--do_not_trim
: logical, whether to skip adapter and quality trimming--trim_additional
: additional arguments for thefastp
trimming process beyond what is coded in the module definition, default--dont_eval_duplication -z 6
to skip duplicate level assessment and to compress outputs--align_additional
: additional arguments for thebowtie2
alignment process beyond what is coded in the module definition, default is-X2000 --very-sensitive
, seebowtie2
manual--sort_additional
: additional arguments for thesamtools
sorting process beyond what is coded in the module definition, default is-l 6
to compress the resulting BAM file to that level--filter_additional
: additional arguments for thesamtools view
filtering process beyond what is described above and given with the-q
and-F
flags--macs_additional
: additional arguments for themacs2 callpeak
, default is in any case--keep-dup=all
since we provide already deduplicated data to that process and if ATAC-seq data are processed (default) then--nomodel --extsize 100 --shift -50 --min-length 250
to provide some smoothing when using the cutsites for peak calling.
The nextflow.config files contains hardcoded defaults towards resources for the individual processes, suitable for use on HPC environments. The most demanding process is the alignment steps, requiring 16 threads and 16GB of RAM per sample.
The schedulers.config file currently contains a single scheduler profile for SLURM as used on or HPC,
submitting jobs (if using -profile slurm
) to a quere called normal
with a maximum 8h of walltime. Custom profiles should be added to this config.