Skip to content

Preprocessing workflow for ATAC-seq and ChIP-seq data

Notifications You must be signed in to change notification settings

ATpoint/atac_chip_preprocess

Folders and files

NameName
Last commit message
Last commit date

Latest commit

457dbda · Dec 8, 2022
Dec 8, 2022
Dec 7, 2022
Dec 7, 2022
Dec 7, 2022
Dec 7, 2022
Dec 8, 2022
Dec 8, 2022
Dec 7, 2022
Aug 19, 2021
Dec 7, 2022
Dec 7, 2022
Dec 7, 2022
Dec 7, 2022
Dec 8, 2022
Dec 7, 2022
Dec 8, 2022
Dec 8, 2022
Dec 8, 2022

Repository files navigation

atac_chip_preprocess

CI Nextflow run with docker run with singularity

Introduction

atac_chip_preprocess is a containerized Nextflow pipeline for preprocessing of ATAC-seq and ChIP-seq data.

The workflow consists of these steps:

  • validation of the provided samplesheet
  • initial QC with fastqc
  • merging of lane replicates per sample into one fastq file per R1/R2
  • adapter and quality trimming with fastp
  • mapping with bowtie2
  • duplicate marking with samblaster
  • removal of MAPQ < 20, non-primary or supplementary, reads mapped to non-primary (random/unplaced) chromosomes, mitochondrial alignments and duplicate reads with samtools
  • for paired-end data fetching of insert size metrics with picard
  • for ATAC-seq data extraction of transposome insertion events (cutsites) using custom GNU tool combinations
  • peak calling with macs2 and filtering of peaks against NGS blacklists (ENCODE+mitochondrial homologs in the nuclear genome, the latter for ATAC-seq only) using bedtools
  • calculation of Fractions Of Reads in Peaks (FRiPs) as a QC metric with featureCounts
  • creation of raw bigwig tracks for visual inspection of data quality with bedtools
  • summary report with MultiQC
  • output of all used software versions and the exact command lines per process step and sample using custom scripts

Run the following test profile to see all possible outputs that the pipeline produces. Default output directory is ./atac_chip_preprocess_results/). Downloading the Docker image may take a minute or two (automated).

NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --keep_merge --keep_trim

An overview of current software versions and exact command lines when using default settings of the pipeline can be found in the misc directory.

Usage

The three minimal parameters the user has to provide are the following ones:

  • --samplesheet: path to a samplesheet csv file with three columns, being sample (the sample name), r1 (path to R1) and r2 (path to R2), where r2 can be empty. If empty, then the sample is considered single-end.
  • --index: path to a folder containing a bowtie2 index with the typical *.bt2 files. Note, it is the path to the folder, not the path to the index basename, as the pipeline will find the bt2 files automatically.
  • --species: either of mm or hs to let the peak caller know whether mouse or human data are provided, so it gets the effective genome length right.

Note that the bowtie2 index must be produced beforehand, we did not include that into the pipeline as it is trivially just bowtie2-build genome.fa idx.

On our HPC we typically use:

# Example for mouse ATAC-seq data
NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --samplesheet path/to/samplesheet.csv --index path/to/index_folder --species mm

# Example for mouse ChIP-seq data
NXF_VER=21.10.6 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --samplesheet path/to/samplesheet.csv --index path/to/index_folder --species mm --atacseq false

Options

We used reasonable defaults for all processing steps that should be used without modifications. Still, the following options exist for customization:

General options

  • --atacseq, a logical, set to false if processing something like ChIP-seq data, by default true for ATAC-seq data

Filtering options

  • --blacklist: path to a BED file to filter peaks against. By default when --species is mm then the provided mm10 blacklist is used, for hs the hg38 one is used.
  • --filter_blacklist: logical, set to false to turn off any blacklist filtering, default true.
  • --flag_remove: a numeric flag to be used with samtools view -F, so indicating which alignments to remove. Default is 3332, so discard unmapped, not primary, supplementary and duplicates . See here for details.
  • --chr_regex: a groovy-compatible regex to indicate which chromosomes to keep in the BAM alignments. Default is chr[1-9,X,Y] which means keep everything starting with chr and then a number of X/Y. That in turn removes decoys (chrEBV) and unplaced/random contigs such as chrU..., therefore keeping only the primary autosomes and sex chromosomes.
  • --min_mapq: an integer, keep only alignments with MAPQ greater than that, default is 20.
  • --fragment_length: for single-end data an average expected fragment length to extend reads to fragments for bigwig creation and FRiP calculation, default is 250. That is only used if --atacseq false as for ATAC-seq data everything is based on the transposome cutsites (that is the 5' ends of the alignments).
  • keep_merge: logical, whether to keep the merged fastq files, else they're not published to the output directory
  • keep_trim: logical, whether to keep the trimmed fastq files, else they're not published to the output directory

Process options

  • --do_not_trim: logical, whether to skip adapter and quality trimming
  • --trim_additional: additional arguments for the fastp trimming process beyond what is coded in the module definition, default --dont_eval_duplication -z 6 to skip duplicate level assessment and to compress outputs
  • --align_additional: additional arguments for the bowtie2 alignment process beyond what is coded in the module definition, default is -X2000 --very-sensitive, see bowtie2 manual
  • --sort_additional: additional arguments for the samtools sorting process beyond what is coded in the module definition, default is -l 6 to compress the resulting BAM file to that level
  • --filter_additional: additional arguments for the samtools view filtering process beyond what is described above and given with the -q and -F flags
  • --macs_additional: additional arguments for the macs2 callpeak, default is in any case --keep-dup=all since we provide already deduplicated data to that process and if ATAC-seq data are processed (default) then --nomodel --extsize 100 --shift -50 --min-length 250 to provide some smoothing when using the cutsites for peak calling.

Resources

The nextflow.config files contains hardcoded defaults towards resources for the individual processes, suitable for use on HPC environments. The most demanding process is the alignment steps, requiring 16 threads and 16GB of RAM per sample.

Schedulers

The schedulers.config file currently contains a single scheduler profile for SLURM as used on or HPC, submitting jobs (if using -profile slurm) to a quere called normal with a maximum 8h of walltime. Custom profiles should be added to this config.

About

Preprocessing workflow for ATAC-seq and ChIP-seq data

Topics

Resources

Citation

Stars

Watchers

Forks

Packages

No packages published