
About WP2

WP2 uses SnapATAC for several preprocessing steps on the raw snATAC-seq .fastq files, such as barcode filtering, bin filtering, clustering and peak calling for each cluster. It is also possible to annotate the cell types of each cluster using UROPA, Panglao DB and the created peak files. Please note that the cell type annotation script is still experimental.

Workflow

This chapter gives a short overview of the workflow of our work package; the Scripts section describes each script in detail. The processing can be done manually, but we also provide a Docker image that automates the whole workflow and is described at the end of this page.

(Figure: WP2 workflow diagram)

Since the other work packages build on our data, the output files should be arranged in the following folder structure.

(Figure: output folder structure)

Create .bam and .snap file

The first step of our work package workflow is the creation of a .bam and a .snap file from the .fastq files. These .fastq files are the result of ATAC sequencing; the .fastq files we used in our project can be found here. Besides these files, an indexed genome.fa and a matching chrom.sizes file are needed.

The fasta-to-snap.sh script creates these files using SnapTools.

Work package 3 needs the .bam file in a different format; the parsebam.py script converts it accordingly.

Create a cluster assignment table and narrowPeak files with SnapATAC

The cluster assignment table and narrowPeak files are created with the preprocessing script based on SnapATAC.

Scripts

SnapTools

Requirements:

  • .fastq files tissue.R1.fastq.gz and tissue.R2.fastq.gz
  • bwa installed as aligner
  • genome.fa file matching the tissue genome
  • chrom.sizes file matching the genome

To create the .bam and .snap files we use a script based on SnapTools. SnapTools and all its requirements need to be installed, and the genome.fa file needs to be indexed as explained in the SnapTools tutorial.

./fasta-to-snap.sh \
     --fasta1="PATH" \
     --fasta2="PATH" \
     --refgenome="PATH" \
     --refname="GENOMENAME" \
     --chrom-sizes="PATH" \
     --cores=NUMCORES

The .bam format produced by SnapTools needs to be parsed for WP3: the barcode is moved from the read name into the tags section (CB tag). The Python script, which uses pysam, creates the right format:

./parsebam.py "PATH/BAMFILE.bam"

.bam example before:

AAACTACCAGAAACCCGAGATA:K00168:263:H7WNNBBXY:7:1111:7222:46750	77	*	0	0	*	*	0	0	CGGCTGTCCCTGTCCATGACACGCTCACCGCGCCTCAGATGTGTATAAGAG	<AA<A7FJFJJFJJJJJJJJJJJJJJJJJJJJJJFJJAJJFJFJJJJJJJJ	AS:i:0	XS:i:0
AAACTACCAGAAACCCGAGATA:K00168:263:H7WNNBBXY:7:1111:7222:46750	141	*	0	0	*	*	0	0	GGATACATGAGCTGCATGTGCAGCTTTGCTCCACACGGACACATCAACCAC	---77-7<--<777-A-FAJJ-----77---7-7A---7-7<J<F---<-A	AS:i:0	XS:i:0
AAACTACCAGAAACCCGAGATA:K00168:263:H7WNNBBXY:7:1227:9110:46188	77	*	0	0	*	*	0	0	GTCCCTGTCCCTCACAGCCTCACCGTCTCCGCCTCAGATGTGTATAAGAGA	AAAF<FFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<JJJJJJJJFJJJ	AS:i:0	XS:i:0

.bam example after:

K00168:263:H7WNNBBXY:7:1111:7222:46750	77	*	0	0	*	*	0	0	CGGCTGTCCCTGTCCATGACACGCTCACCGCGCCTCAGATGTGTATAAGAG	<AA<A7FJFJJFJJJJJJJJJJJJJJJJJJJJJJFJJAJJFJFJJJJJJJJ	AS:i:0	XS:i:0	CB:Z:AAACTACCAGAAACCCGAGATA
K00168:263:H7WNNBBXY:7:1111:7222:46750	141	*	0	0	*	*	0	0	GGATACATGAGCTGCATGTGCAGCTTTGCTCCACACGGACACATCAACCAC	---77-7<--<777-A-FAJJ-----77---7-7A---7-7<J<F---<-A	AS:i:0	XS:i:0	CB:Z:AAACTACCAGAAACCCGAGATA
K00168:263:H7WNNBBXY:7:1227:9110:46188	77	*	0	0	*	*	0	0	GTCCCTGTCCCTCACAGCCTCACCGTCTCCGCCTCAGATGTGTATAAGAGA	AAAF<FFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<JJJJJJJJFJJJ	AS:i:0	XS:i:0	CB:Z:AAACTACCAGAAACCCGAGATA
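
For illustration, a minimal pysam sketch of this transformation could look like the following. It is only a sketch under the assumption that the read names start with the barcode followed by a colon; the function and file names are hypothetical, and the actual parsebam.py may differ:

# illustrative sketch only: move the barcode prefix of each read name into a CB tag
import pysam

def add_barcode_tag(in_bam, out_bam):
    # check_sq=False because the example reads above are unmapped ("*" reference)
    with pysam.AlignmentFile(in_bam, "rb", check_sq=False) as infile, \
         pysam.AlignmentFile(out_bam, "wb", template=infile) as outfile:
        for read in infile:
            barcode, _, rest = read.query_name.partition(":")
            read.query_name = rest            # drop the barcode from the read name
            read.set_tag("CB", barcode, "Z")  # store it as a CB tag instead
            outfile.write(read)

add_barcode_tag("PATH/BAMFILE.bam", "PATH/BAMFILE.parsed.bam")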

SnapATAC

The following sections explain the usage of the preprocessing_WP2.R script, which is based on this SnapATAC tutorial.

Preparations

The following libraries are required:

library(GenomicRanges)
library(SnapATAC)
library(viridisLite)
library(ggplot2)
library(rtracklayer)
library(parallel)   # for mclapply used during peak calling

We assume that your working directory is called "preprocessing". We provide a blacklist file for specific regions of the hg38 genome. Add the paths to the blacklist file and the corresponding .gtf file:

blacklist_regions = "path" # path to blacklist file
gtf.gr = rtracklayer::import("path") # path to .gtf file

Load snap object and add bin matrix

Next, create a folder for the analysis and load a .snap file that was previously created with SnapTools:

s_file="path" # path to .snap file
sample_name = "sample" # sample name (folder name)

# set/getwd and create folder
setwd("preprocessing")
dir.create(file.path(getwd(), sample_name))
setwd(file.path(getwd(), sample_name))

x.sp = createSnap(
  file=s_file,
  sample=sample_name,
  num.cores=6
)
x.sp

After loading the .snap file into a snap object, add the bin matrix:

x.sp = addBmatToSnap(x.sp, bin.size=5000, num.cores=6)
x.sp = makeBinary(x.sp, mat="bmat")
x.sp

Barcode filtering

The next lines will perform the barcode filtering.

# load genes of .gtf file
gene.gr = gtf.gr[gtf.gr$type == "gene"]
# extract promoter region for each gene
promoter.gr = reduce(promoters(gene.gr, upstream=2000, downstream=0))
# find promoter overlapping bins 
ov = findOverlaps(x.sp@feature, promoter.gr)
idy = queryHits(ov)
log_cov = log10(SnapATAC::rowSums(x.sp, mat="bmat")+1)
promoter_ratio = Matrix::rowSums(x.sp@bmat[,idy]) / Matrix::rowSums(x.sp@bmat)
png(filename="barcodes.png", width=6, height=4, units="in", res=300)
plot(log_cov, promoter_ratio, cex=0.5, col="grey", xlab="log(count(bins))", ylab="Promoter ratio", ylim=c(0,1 ))
dev.off()

barcodes.png shows the promoter ratio and bin coverage of all cells:

(Figure: barcodes.png)

Choose the cutoffs to filter barcodes:

idx = which(promoter_ratio > 0.3 & promoter_ratio < 0.6 & log_cov > 2.5)
x.sp = x.sp[idx,]

Bin filtering

Bins that overlap blacklist regions or unwanted chromosomes are removed. Finally, bins with low coverage are filtered out as well:

# remove blacklist regions
black_list = read.table(blacklist_regions)
black_list.gr = GRanges(
  black_list[,1], 
  IRanges(black_list[,2], black_list[,3])
)
idy = queryHits(findOverlaps(x.sp@feature, black_list.gr))
if(length(idy) > 0){x.sp = x.sp[,-idy, mat="bmat"]}
x.sp

# remove unwanted chromosomes
chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM|chrX|chrY", seqlevels(x.sp@feature))]
idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature)
if(length(idy) > 0){x.sp = x.sp[,-idy, mat="bmat"]}
x.sp

# filter bins with low coverage
bin.cov = log10(Matrix::colSums(x.sp@bmat)+1)
bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95)
idy = which(bin.cov <= bin.cutoff & bin.cov > 0)
x.sp = x.sp[, idy, mat="bmat"];
x.sp

Dimensional reduction

Dimensionality reduction and clustering begin with running diffusion maps on the bin matrix and plotting the first 25 pairs of eigenvectors:

x.sp = runDiffusionMaps(
  obj=x.sp,
  input.mat="bmat",
  num.eigs=50
)

# show the first 25 pairs of eigenvectors
png(filename="eigenvectors.png", width=6, height=4, units="in", res=300)
plotDimReductPW(
  obj=x.sp,
  eigs.dims=1:50,
  point.size=0.3,
  point.color="grey",
  point.shape=19,
  point.alpha=0.6,
  down.sample=5000,
  pdf.file.name=NULL,
  pdf.height=7,
  pdf.width=7
)
dev.off()

(Figure: eigenvectors.png)

The SnapATAC tutorial uses an ad hoc method: include eigenvector pairs up to the first pair whose scatter plot still looks like a "blob". We achieved good results by picking 15 instead when handling snap objects with ~10,000 to 20,000 cells.

v_pair = 15

The workflow then continues with the first 15 eigenvectors by building a KNN graph:

x.sp = runKNN(
  obj=x.sp,
  eigs.dims=1:v_pair,
  k=200
)

Next, Louvain clustering is performed on the KNN graph using R-igraph; the resulting clusters are stored in the metadata and later visualized with t-SNE:

x.sp=runCluster(
  obj=x.sp,
  tmp.folder=tempdir(),
  louvain.lib="R-igraph",
  seed.use=10
)

x.sp@metaData$cluster = x.sp@cluster

t-SNE plot and cluster assignment table

The following steps create a t-SNE plot and the cluster assignment table, which contains the barcodes and the corresponding cluster numbers:

x.sp = runViz(
  obj=x.sp, 
  tmp.folder=tempdir(),
  dims=2,
  eigs.dims=1:v_pair, 
  method="Rtsne",
  seed.use=10
)

# plot t-sne
png(filename="t-sne.png", width=6, height=6, units="in", res=300)
plotViz(
  obj=x.sp,
  method="tsne", 
  main=sample_name,
  point.color=x.sp@cluster, 
  point.size=0.8, 
  point.shape=19, 
  point.alpha=0.8, 
  text.add=TRUE,
  text.size=1.5,
  text.color="black",
  text.halo.add=TRUE,
  text.halo.color="white",
  text.halo.width=0.2,
  #down.sample=10000,
  legend.add=FALSE
)
dev.off()

To save the cluster assignment table, the following code can be executed:

ca_table = do.call(rbind, Map(data.frame, cell_name=x.sp@barcode, cluster=x.sp@cluster))
write.table(ca_table,"cluster_assignment_table.txt", sep = "\t", quote = FALSE, row.names = FALSE)

Peak calling

For peak calling, a new folder is created:

dir.create(file.path(getwd(), "peaks"))
setwd(file.path(getwd(), "peaks"))

The next lines of code perform peak calling for each cluster with more than 100 cells and save a combined peak file with all peaks:

clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)]
peaks.ls = mclapply(seq(clusters.sel), function(i){
  print(clusters.sel[i])
  runMACS(
    obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], 
    output.prefix=paste0(paste(sample_name, ".", sep=""), gsub(" ", "_", clusters.sel)[i]),
    path.to.snaptools="/usr/local/bin/snaptools",
    path.to.macs="/usr/local/bin/macs2",
    gsize="hs", # mm, hs, etc
    buffer.size=500, 
    num.cores=6,
    macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
    tmp.folder=tempdir()
  )
}, mc.cores=6);
# assuming all .narrowPeak files in the current folder are generated from the clusters
peaks.names = system("ls | grep narrowPeak", intern=TRUE)
peak.gr.ls = lapply(peaks.names, function(x){
  peak.df = read.table(x)
  GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
})
peak.gr = reduce(Reduce(c, peak.gr.ls))
peak.gr

# write the combined peaks to a .bed file
peaks.df = as.data.frame(peak.gr)[,1:3]
write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
              quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", 
              row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
              fileEncoding = "")

Create .bed files for WP6 from .narrowPeak files

The Python script creates .bed files containing the open regions in the format "CHR START STOP" for all .narrowPeak files in a folder.

./narrowpeaks_to_bed.py "PATH_TO_NARROWPEAK_FOLDER/"
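
A minimal sketch of such a conversion could look like the following; the function name and details are assumptions, and the actual narrowpeaks_to_bed.py may differ:

# illustrative sketch only: write a CHR START STOP .bed file for every .narrowPeak file in a folder
import glob
import os
import sys

def narrowpeaks_to_bed(folder):
    for npeak in glob.glob(os.path.join(folder, "*.narrowPeak")):
        bed = os.path.splitext(npeak)[0] + ".bed"
        with open(npeak) as infile, open(bed, "w") as outfile:
            for line in infile:
                chrom, start, stop = line.rstrip("\n").split("\t")[:3]  # first three narrowPeak columns
                outfile.write(f"{chrom}\t{start}\t{stop}\n")

narrowpeaks_to_bed(sys.argv[1])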

Cell type annotation

Preparations

Requirements:

Download and unpack the cell type annotation folder of this repo. After that, create a new folder inside the CTI folder (for example "sample"). Inside the sample folder, create two new folders: "annot" and "npeaks". Finally, copy the .narrowPeak files into the npeaks folder.
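
For example, the folder layout can be created like this (a sketch assuming the unpacked folder is called CTI, the sample is called "sample" and the .narrowPeak files are still in the peaks folder created earlier; adjust names and paths as needed):

# illustrative sketch only: create the expected folder layout and copy the peak files
import glob
import os
import shutil

sample_dir = os.path.join("CTI", "sample")
os.makedirs(os.path.join(sample_dir, "annot"), exist_ok=True)
os.makedirs(os.path.join(sample_dir, "npeaks"), exist_ok=True)

for f in glob.glob("peaks/*.narrowPeak"):  # hypothetical source folder
    shutil.copy(f, os.path.join(sample_dir, "npeaks"))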

Annotating .narrowPeak files with UROPA

Parse all .narrowPeak files, keeping the chromosome, start, end and signal value columns and trimming the chromosome names:

gawk -i inplace '{print substr($1, 3, length($1) - 3)"\t"$2"\t"$3"\t"$7}' *narrowPeak

For example, to filter out the X, Y and M chromosomes, run the following commands:

gawk -i inplace '!/chrM/' *narrowPeak
gawk -i inplace '!/chrX/' *narrowPeak
gawk -i inplace '!/chrY/' *narrowPeak

Activate the UROPA conda environment and run the following shell script inside the sample folder to annotate all parsed .narrowPeak files (be sure to add your path to the .gtf file!):

#!/bin/bash
for FILE in npeaks/* ;
do
  uropa -g <path to your .gtf file> -b "$FILE" -o annot/
done
cd annot/
rm *json *allhits* *finalhits.txt
gawk -F "\t" -i inplace '{print $1"\t"$2"\t"$3"\t"$22}' *finalhits.bed

The created .narrowPeak and .bed files should look like the files in the provided example.

Performing cell type annotation

Simply run the Python script as follows to annotate the clusters:

python3 cell_type_annotation.py sample liver

In this case, the sample "sample" is annotated with cell types that appear in the liver.

To also include connective tissue cell types, run the script like this:

python3 cell_type_annotation.py sample liver True

Output

annotation.txt:

3	Hepatocytes
7	Hepatocytes
2	Kupffer cells
1	Adipocytes
8	Adipocyte progenitor cells
6	Hepatocytes
5	Cholangiocytes
4	Hepatocytes

annotation.txt lists, for each cluster that could be annotated, the potential cell type with the highest score.

cluster_6.txt:

Hepatocytes	20	8	444	13
Hepatoblasts	8	3	71	8
Cholangiocytes	6	2	176	16
Adipocytes	0	2	369	4
Fibroblasts	0	1	564	6
Stromal cells	0	1	106	6

For each cluster, a .txt file is created that gives a more detailed insight into the calculation. These files can be found in the ranks folder. The first column shows a potential cell type, the second the overall score, the third the number of matched marker genes, the fourth the total number of marker genes for this cell type, and the last one the UI (Ubiquitousness Index). The higher the UI, the more specific the corresponding genes should be.

The overall score is calculated like this:

score = round(sum(ranks) * ub_score * count / gene_count)

ub_score represents the mean of all UIs of the matched genes, count and gene_count correspond to the third and fourth column, and ranks is a list of the top 5 peak values of the matched genes.
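
As a small worked example with made-up numbers (not taken from the actual output), the calculation looks like this:

# illustrative score calculation with hypothetical values
ranks = [120, 95, 80, 60, 45]  # top 5 peak values of matched marker genes (made up)
ub_score = 0.4                 # mean UI of the matched genes (made up)
count = 8                      # matched marker genes (third column)
gene_count = 444               # all marker genes of this cell type (fourth column)

score = round(sum(ranks) * ub_score * count / gene_count)
print(score)  # 3 with these made-up numbers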

Dockerized version of scripts and workflow

The Dockerfile can be used to build an image that contains all libraries, dependencies and scripts.

The following command builds the image:

docker build /path/Datenanalyse-2021/WP2/docker/ -t wp2_workflow:wp2_workflow

This image runs our workflow for all .fastq files in a given path.

-v mounts a host directory into the container file system.

In this example, the host directory that contains all files is "/mnt/workspace_stud/stud3/". All other paths given as arguments are relative to this directory.

docker run -v /mnt/workspace_stud/stud3/:/mnt/ wp2_workflow:wp2_workflow \ # run docker and mount the file system
     path="WP2_OUTPUT" \                                                   # working directory for all scripts and outputs
     clear=TRUE \                                                          # if TRUE, unnecessary files are deleted
     fastq_path="data/fastq_data/renlab.sdsc.edu/kai/data/fastq/first/" \  # path to a folder with fastq files
     refgenome="WP2_OUTPUT/genome/hg38.fa" \                               # path to a reference genome .fa file
     chrom_sizes="WP2_OUTPUT/genome/hg38.chrom.sizes" \                    # path to a chrom.sizes file
     gencode_bed="WP2_OUTPUT/gencode.filtered.bed" \                       # path to a gencode.bed file
     blacklist="WP2_OUTPUT/hg38-blacklist.v2_parsed.bed" \                 # path to a blacklist file
     gtf="WP2_OUTPUT/genome/homo_sapiens.104.mainChr.gtf"                  # path to a gtf file