A script to find regions of difference (RODs) between a query genome and reference genome(s).
blast_rod_finder_legacy.sh subject.fasta query.fasta query.[embl|gbk|fasta] 5000
or
perl blast_rod_finder.pl -q query.embl -r blastn.out -m 2000
This script is intended to identify RODs between a nucleotide query and a nucleotide subject/reference sequence. In order to do so, a blastn (http://blast.ncbi.nlm.nih.gov/Blast.cgi) needs to be performed beforehand with the query and the subject sequences (see also blast_rod_finder_legacy.sh below). blast_rod_finder.pl is mainly designed to work with bacterial genomes, while a query genome can be blasted against several subject sequences to detect RODs over a number of references. Although the results are optimized towards a complete query genome, both the reference(s) as well as the query can be used in draft form. To create artificial genomes via concatenation use cat_seq.pl or the EMBOSS application union (http://emboss.sourceforge.net/).
The blastn report file, the query sequence file (preferably in RichSeq format, see below) and a minimum size for ROD detection have to be provided. Subsequently, RODs are summarized in a tab-separated summary file, a gff3 (usable e.g. in Artemis/DNAPlotter, http://www.sanger.ac.uk/resources/software/artemis/) and a BRIG (BLAST Ring Image Generator, http://brig.sourceforge.net/) output file. Nucleotide sequences of each ROD are written to a multi-fasta file.
The query sequence can be provided in RichSeq format (embl or genbank), but has to correspond to the fasta file used in querying the BLAST database (the accession numbers have to correspond to the fasta headers). Use seq_format-converter.pl to create a corresponding fasta file from embl|genbank files for blastn if needed. With RichSeq query files additional info is given in the result summary and the amino acid sequences of all non-pseudo CDSs, which are contained or overlap a ROD, are written to a result file. Furthermore, all detected RODs are saved in individual sequence files in the corresponding query sequence format.
Run blastn and the script blast_rod_finder.pl consecutively manually or use the bash shell wrapper script blast_rod_finder_legacy.sh (see usage below) to perform the pipeline with one command. The same folder has to contain the subject fasta file(s), the query fasta file, optionally the query RichSeq file and the script blast_rod_finder.pl! blastn is run without filtering of query sequences ('-F F') and an evalue cutoff of '2e-11' is set.
formatdb -p F -i subject.fasta -n blast_db
blastall -p blastn -d blast_db -i query.fasta -o blastn.out -e 2e-11 -F F
perl blast_rod_finder.pl -q query.[embl|gbk|fasta] -r blastn.out -m 5000
blast_rod_finder_legacy.sh subject.fasta query.fasta query.[embl|gbk|fasta] 5000
- -m, -min
Minimum size of RODs that are reported
- -q, -query
Query sequence file [fasta, embl, or genbank format]
- -r, -report
blastn report/output file
- -h, -help: Help (perldoc POD)
- blastn database files for subject sequence(s)
*.nhr, *.nin, *.nsq
- blastn report
Text file named 'blastn.out'
- ./results
All output files are stored in this result folder
- rod_summary.txt
Summary of detected ROD regions (for embl/genbank queries includes annotation), tab-separated
- rod.gff
GFF3 file with ROD coordinates to use in Artemis/DNAPlotter etc.
- rod_BRIG.txt
ROD coordinates to use in BRIG (BLAST Ring Image Generator), tab-separated
- rod_seq.fasta
Nucleotide sequences of ROD regions (>ROD# size start..stop), multi-fasta
- rod_aa_fasta.txt
Only present if query is in RichSeq format. Amino acid sequences of all CDSs that are contained in or overlap a ROD region in multi-fasta format (>locus_tag gene product). RODs are seperated in the file via '~~' (~~ROD# size start..stop).
- ROD#.[embl|gbk]
Only present if query is in RichSeq format. Each identified ROD is written to an individual sequence file (in the same format as the query).
The Perl script runs under Windows and UNIX flavors, the bash-shell script of course only under UNIX.
- Legacy blast (tested version blastall 2.2.18)
- BioPerl (tested with version 1.006901)
Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
David Studholme (original code; D[dot]J[dot]Studholme[at]exeter[dot]ac[dot]uk; University of Exeter)
For citation, installation, and license information please see the repository main README.md.
- v0.4 (13.02.2013)
- included a POD
- options with Getopt::Long
- results directory for output files
- include accession number column for multi-sequence files in 'rod_summary.txt'
- include locus_tags (or alternatively gene, product, note ...) in 'rod_summary.txt'
- feature positions according to leading or lagging strand
- indicate if a primary feature overlaps ROD boundaries
- output each ROD in the query RichSeq format with BioPerl's Bio::SeqUtils
- v0.3 (23.11.2011)
- status messages with autoflush
- BRIG output file
- extended primary tag output for RODs (in addition to CDS): tRNA, rRNA, tmRNA, ncRNA, misc_RNA, repeat_region, misc_binding, and mobile_element
- v0.1 (07.11.2011)