dCNS is a sensitive sequence alignment implementation to detect conserved non-coding sequence (CNS) by using a k-mer match free and dynamic programming sequence alignment strategy.
GNU GCC >=6.0 Cmake >= 3.0
git clone https://github.com/baoxingsong/dCNS.git
cd dCNS
cmake CMakeLists.txt
make
usage and parameters are available from the command line.
The majority of the sequence alignment parameters dependent on each other. If you want to change a single parameter, other parameters might should be adjusted accordingly.
Here explains how did we optimize the parameters for the sequence alignment of Andropogoneae non-coding sequence. If you are working with a different population, the default parameters might do not work very well.
The match score, mis-match, gap open penalty, gap extension penalty were modified from minimap2 parameters by keeping in mind the high diversity of the Andropogoneae non-coding regions.
The k and lambda are used for p-value calculation, and they were calculated by fitting the maximum smith-waterman scores of random sequence fragments in to a non-linear least square regression model. The non-linear least square model is available as an R script in this release.
Random sampling was performed between maize and sorghum, sugarcane, Miscanthus, Setaria. All the pair-wise species gave similar k and lambda estimations. The default k and lambda parameters were determined using maize as the reference against sequence extract from all the other four species randomly.
The w and xDrop values were set by referring parameters in the example code of Seqan library.
In our dataset, the significant smith-waterman score is set as ~54, and we used a smith-waterman score 40 as a minimum score of seed. The seed window size 38 was selected to make sure there is only one seed in each window. The step_size value 8 to make sure there is no seed missing for each window sliding. The openGapPenalty2 value -45 basing on the guess that the normal gap is in general <20bp and TEs are >25bp
K-mer masking, we used 20-mer, since 20 is the minimum seed size. (40/2), 40 is the minimum seed score, 2 is the match score.
**Another example for setting parameters for Brassicaceae species is available at here.
if you run it on different machine, it would be computationally efficient to recompile the code for each machine. dCNS uses hardware instructions (SSE4 and AVX2) to speed up.
docker1 run -d -i -t ubuntu /bin/bash
docker1 images
docker1 ps
docker1 exec -it 3f97507f8d9a /bin/bash
go to docker
apt update
apt install build-essential -y
apt install cmake -y
apt install xjobs -y
git clone https://github.com/baoxingsong/dCNS.git
cd /workdir/dCNS
cmake ./
make
We are using KAT to count k-mers.
kat hist -t 12 Setaria_italica.Setaria_italica_v2.0.dna.toplevel.fa -m 20 -o setaria.kat.m20.hist
Check the distribution using R
library(ggplot2)
data = read.table("setaria.kat.m20.hist")
ggplot(data=data, aes(x=V1, y=V2)) + geom_line() + xlim(3, 1000) + ylim(0, 1000000)
ggplot(data=data, aes(x=V1, y=V2)) + geom_line() + xlim(3, 200) + ylim(0, 1000000)
ggplot(data=data, aes(x=V1, y=V2)) + geom_line() + xlim(3, 100) + ylim(0, 1000000)
ggplot(data=data, aes(x=V1, y=V2)) + geom_line() + xlim(3, 50) + ylim(0, 1000000)
According to this plot, pickup a k-mer frequency threshold.
Currently, we use the secondary derivative to pickup the threshold. (It maybe not the best solution. Other ideas are welcome to test.) This value will be used for the downstream genome masking via the parameter -f of dCNS maskGenome
.
data = read.table("setaria.kat.m20.hist")
d = data.frame(x=diff(diff(data$V2)), y=1:(nrow(data)-2))
which.min(d$x^2+d$y^2)
kat_jellyfish count -m 20 -s 100M -t 12 -C Setaria_italica.Setaria_italica_v2.0.dna.toplevel.fa -o setaria_k20_count.js
kat_jellyfish dump setaria_k20_count.js > setaria_k20_count_dumps.fa
mapping the CDS sequence of reference to the query genome sequence using [minimap2] (https://github.com/lh3/minimap2)
python3 ./scripts/longestTranscript.py -g Zea_mays.AGPv4.dna.toplevel.fa -f Zea_mays.AGPv4.34.gff3 -t false -o gene.fa
minimap2 -ax splice -a -uf -C 1 -k 12 -P -t 12 --cs Setaria_italica.Setaria_italica_v2.0.dna.toplevel.fa gene.fa > setaria.sam
mask the query genome
dCNS maskGenome -i Setaria_italica.Setaria_italica_v2.0.dna.toplevel.fa -o masked_setaria_k20_33.fa -s setaria.sam -c gene.fa -k setaria_k20_count_dumps.fa -f 33
mask the reference genome
dCNS maskGenome -i Zea_mays.AGPv4.dna.toplevel.fa -o masked_B73_v4_k20_46_cds.fa -g Zea_mays.AGPv4.34.gff3 -k b73_m20_mer_counts_dumps.fa -f 46
prepare the sequence for alignment
python3 ./scripts/extractInterGeneticSequence/sequenceUpStreamGeneAndDownStreamV2.py -g Zea_mays.AGPv4.34.gff3 -r Zea_mays.AGPv4.dna.toplevel.fa -c gene.fa -q Setaria_italica.Setaria_italica_v2.0.dna.toplevel.fa -s setaria.sam -o dCNS_setaria_maize_V2
go the folder and generate commands, the genome file must be the masked genome
cd dCNS_setaria_maize_V2
ls | awk '{print("dCNS cut1Gap -ra masked_B73_v4_k20_46_cds.fa -qa masked_setaria_k20_33.fa -i "$1" -r reference -o "$1".5")}' > command1
run all the commands with in file command1
file. I use GNU parallel to run it.
check if every command finished successfully
perl ../scripts/checkUnfinishedJobsV2.pl command1 > command1_missing
goto the farther folder and combine those sam outputs
cd ../
perl ./scripts/combineCnsSamFiles.pl dCNS_setaria_maize_V2 > 5.sam
The output file is in sam format, and it works with majority functions implemented in tools compatible with sam format.
The 5th column is the sequence alignment score. There would be some information lost when converting the sam files into bam files.
The 6th column is always start with regex [0-9]+H
, which tells the coordinate where query sequence alignment starts from, the value is 1 based coordinate.
Those SAM files could be reformatted into bam using samtools
cat 5.sam| sort | uniq | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}' | samtools view -O BAM --reference masked_B73_v4_k20_46_cds.fa | samtools sort > 5.bam
samtools index 5.bam
To reformat bam file into bed format, we excluded those reference genome masked regions. We did this by using bedtools and SeqKit.
seqkit locate -F --only-positive-strand --bed -m 0 -p n masked_B73_v4_k20_46_cds.fa > ns.bed
bedtools merge -i ns.bed > ns_megered.bed
bamToBed -i 5.bam | bedtools sort -i | bedtools merge > 5.bed
bedtools subtract -a 5.bed -b ns_megered.bed > 5_nons.bed
- please NOTE. Each line in the bed file could NOT be interrupted as a CNS
If you are interested in multiple sequence alignment, Firstly, you should perform pair-wise sequence alignment for each species against reference species.
Give each sam file a unique name. And use this command to generate multiple sequence alignment.
dCNS multCns -i Zea_mays.AGPv4.dna.toplevel.fa -o msa.fasta -s sorghum.sam setaria.sam miscanthus.sam sugarcane.sam 1013.sam 1025.sam
This work is funded by NSF #1822330
Baoxing Song, Edward S. Buckler, Hai Wang, Yaoyao Wu, Evan Rees, Elizabeth A. Kellogg, Daniel J. Gates, Merritt Khaipho-Burch, Peter J. Bradbury, Jeffrey Ross-Ibarra, Matthew B. Hufford, and M. Cinta Romay. Conserved noncoding sequences provide insights into regulatory sequence and loss of gene expression in maize. Genome Res. Published in Advance May 27, 2021, doi:10.1101/gr.266528.120