-
Notifications
You must be signed in to change notification settings - Fork 9
7: Multiple Sequence Alignment
- Overview
- Ensuring Sequences are in Correct Orientation
- Identify and Adjust Reading Frames
- Perform Multiple Sequence Alignment
SuperCRUNCH offers a variety of options for multiple sequence alignment, and also includes two important pre-alignment steps. One of these steps includes ensuring that all sequences are written in the same direction. Even a single reversed sequence can wreak havoc during the alignment step. Details on this step can be found in the Ensuring Sequences are in Correct Orientation section. An additional pre-alignment step can be performed for coding loci. This step will identify the correct reading frame of sequences, adjust them to the first codon position, and ensure completion of the final codon. This is particularly useful for performing translation alignments, but can also be used for other purposes. The Identify and Adjust Reading Frames section contains more information about this topic. Multiple sequence alignment in SuperCRUNCH can be performed using Clustal-O, MAFFT, Muscle, and MACSE. These alignment methods should generally work well for a reasonable size dataset. More information on this topic can be found in the Perform Multiple Sequence Alignment section.
If ultra-large alignments (>5,000 sequences) need to be made then other alignment methods such as SATé-II (Liu et al. 2012), PASTA (Mirarab et al. 2015), and UPP (Nguyen et al. 2015) should be used instead. The UPP method can also be used if a large alignment is to be made from a set of full length sequences and shorter sequence fragments - most of the methods in SuperCRUNCH will not perform well under this condition.
To build a multiple sequence alignment, all sequences should be written in the same direction. Given the nature of GenBank, it should never be assumed all sequences were uploaded in the same direction. Even a single reversed sequence can cause many problems for sequence alignment. To avoid these issues, the Adjust_Direction.py
module can be used prior to alignment to ensure all sequences are written in the same direction.
The purpose of the Adjust_Direction.py
module is to ensure sequences are all written in the same direction before performing alignments. The input is an unaligned fasta file and the output is an unaligned fasta file with all sequences in the 'correct' orientation. This module relies on the --adjustdirection feature of MAFFT to identify sequences in the incorrect direction and subsequently reverse them. If the optional --accurate
flag is included, it will use the --adjustdirectionaccurately option in MAFFT, which is slower but more accurate. The number of threads can be specified using the --threads
flag (the more, the better!). The output from mafft is an interleaved fasta with sequences in all lowercase, and sequences that have been reversed are flagged with an _R_
at the beginning of the record ID. This module takes that file and converts it to a cleaner format. Sequences are written in uppercase, are ungapped (thereby stripping the alignment), and the _R_
is removed from sequence records that were reversed.
Adjust_Direction.py
is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta
or NAME.fa
. The NAME
portion should not contain any periods or spaces, but can contain underscores.
python -i <input directory> -o <output directory>
Required: The full path to a directory which contains the input fasta files.
Required: The full path to an existing directory to write output files.
Optional: Specify number of threads to use for MAFFT. Default = 1.
Optional: Use the --adjustdirectionaccurately MAFFT option, rather than --adjustdirection.
python Adjust_Direction.py -i /bin/Adjust/Input -o /bin/Adjust/Output
Above command will adjust all unaligned fasta files in directory
/bin/Adjust/Input
using --adjustdirection in MAFFT. Outputs are written to/bin/Adjust/Output
.
python Adjust_Direction.py -i /bin/Adjust/Input -o /bin/Adjust/Output --threads 6
Above command will adjust all unaligned fasta files in directory
/bin/Adjust/Input
using --adjustdirection in MAFFT with six threads. Outputs are written to/bin/Adjust/Output
.
python Adjust_Direction.py -i /bin/Adjust/Input -o /bin/Adjust/Output --accurate --threads 8
Above command will adjust all unaligned fasta files in directory
/bin/Adjust/Input
using --adjustdirectionaccurately in MAFFT with eight threads. Outputs are written to/bin/Adjust/Output
.
Several outputs are created in the output directory specified:
-
Directory
Adjusted-Fasta-Files
- For each input fasta file, this directory will contain an unaligned fasta file with correctly oriented sequences. They will be labeled asNAME_Adjusted.fasta
. These files should be used for downstream alignment. -
Directory
Log-Files
- For each input fasta file, this directory contains a file labeled asNAME_Adjusted_Name_Log.txt
. This file contains the full names of the sequences that were reversed in this particular fasta file, if any were adjusted. Here is an example of the contents of one of these files:
Sequences_Flipped
_R_AM055673.1 Rhampholeon acuminatus Voucher_MHNG_2645.002 DESCRIPTION mitochondrial partial 16s rrna gene specimen voucher mhng 2645.002
_R_AM055667.1 Rhampholeon beraduccii Voucher_MHNG_2655.019 DESCRIPTION mitochondrial partial 16s rrna gene specimen voucher mhng 2655.019
_R_AM055645.1 Rhampholeon boulengeri Voucher_ZFMK_55104 DESCRIPTION mitochondrial partial 16s rrna gene specimen voucher zfmk 55104
_R_AM055669.1 Rhampholeon moyeri Voucher_MTSN_KIHANGA DESCRIPTION mitochondrial partial 16s rrna gene specimen voucher mtsn kihanga
-
File
Sequences_Adjusted.txt
- This summary file contains three columns:- Locus - The name of the input fasta file.
- Seqs_Correct_Direction - The number of sequences in the 'correct' direction.
- Seqs_Direction_Adjusted - The number of sequences that were found to be in the 'incorrect' direction and were subsequently reversed.
Here is an example of the contents of Sequences_Adjusted.txt
:
Locus Seqs_Correct_Direction Seqs_Direction_Adjusted
12S_extracted_contfiltered_oneseq 529 1
16S_extracted_contfiltered_oneseq 566 14
ADNP_extracted_oneseq 145 0
AHR_extracted_oneseq 137 0
AKAP9_extracted_oneseq 184 0
BACH1_extracted_oneseq 187 0
BACH2_extracted_oneseq 31 0
BDNF_extracted_oneseq 394 0
Sequences for protein-coding loci are often uploaded to GenBank without a consistent reading frame. This means sequences can start in forward reading frame 1, 2, or 3, or even in a reverse frame. Most translation aligners require that 1) all sequences begin in the same reading frame, and 2) all sequences have a complete final codon (e.g. the sequence length is divisible by three). Despite my best efforts, I could not find a tool that automates either of these tasks, so I created a new one called Coding_Translation_Tests.py
. This module can be used to adjust all sequences to ensure they start in the same reading frame, and to complete the final codon position. Although this is perhaps most relevant for performing translation alignment, the reading frame adjustment is also useful for other tasks that require sequences to be in the same reading frame. For example, this module can be used to prepare sequences for dN/dS analyses, or to prepare unpublished sequences for NCBI submission to BankIt.
The Coding_Translation_Tests.py
module will adjust reading frames for sequences present in an unaligned fasta file. Sequences are translated in all forward frames to check for the presence of stop codons. The translation table should be specified with the --table
flag. If stop codons are detected in all frames, the sequence will be checked to see if there is only one stop codon. If a single stop codon is found, it must be located in the final two codon positions of the sequence to be allowed. If the --rc
flag is included the translation will also be performed for the reverse complement. However, if the sequences are all correctly oriented (e.g., the Adjust_Direction.py
module was used previously) then this is not recommended. If necessary, sequences with be padded with N
's to complete the final codon for the reading frame found. The final sequence is written such that the first base represents the first codon position (based on the detected reading frame) and the sequence is divisible by three (to ensure a complete final codon position). Sequences that have more than one stop codon detected for all the reading frames investigated are marked as failing the translation test. For each input fasta file, up to three output files are created. These files contain 1) only sequences passing translation (correctly adjusted), 2) only sequences failing translation (no adjustments made), and 3) both sequences passing translation (correctly adjusted) and sequences failing translation (no adjustments made). More details are explained in the Outputs section below.
If your dataset contains a mix of loci that need to be translated differently (e.g., coding mtDNA vs. coding nucDNA), the --onlyinclude
flag can be used. This requires the full path to a text file containing the names of the fasta files (one file name per line) to be processed for a particular run. This allows you to run this module with particular settings for a subset of the files in the input directory.
Coding_Translation_Tests.py
is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta
or NAME.fa
. The NAME
portion should not contain any periods or spaces, but can contain underscores.
python Coding_Translation_Tests.py -i <input directory> -o <output directory> --table <translation table choice>
Required: The full path to a directory which contains the input fasta files.
Required: The full path to an existing directory to write output files.
Required: Specifies translation table to use for all files. Choices = standard, vertmtdna, invertmtdna, yeastmtdna, plastid, or any integer 1-31.
Optional: In addition to all forward reading frames, examine all reverse reading frames during translation tests.
Optional: The full path to a text file containing the names of the fasta files to process for this run.
Optional: Show less output while running (useful when filtering many loci).
python Coding_Translation_Tests.py -i bin/Coding-Adjust/Input -o bin/Coding-Adjust/Output --table standard
Above command will perform translation tests for each unaligned fasta file in the directory
bin/Coding-Adjust/Input/
using the standard code. It will examine only forward reading frames. Outputs are written tobin/Coding-Adjust/Output
.
python Coding_Translation_Tests.py -i bin/Coding-Adjust/Input -o bin/Coding-Adjust/Output --table standard --rc --onlyinclude bin/Coding-Adjust/mtDNA_File_List.txt
Above command will perform translation tests using the vert mitochondrial code for all files included in the
mtDNA_File_List.txt
. It will examine all forward and reverse reading frames. Outputs are written tobin/Coding-Adjust/Output
.
Several outputs are created in the output directory specified:
-
Directory
Translation-All-Seqs
- This directory will be populated with files labeled asNAME_Adjusted_All.fasta
. This file will contain all of the starting sequences. It will include all sequences that passed translation (correctly adjusted), and all sequences that failed translation (no adjustments made). -
Directory
Translation-Failed-Seqs
- This directory will be populated with files labeled asNAME_Adjusted_Failed.fasta
. This file will contain only sequences that failed translation (no adjustments made). If no sequences fail translation for a given locus (e.g., all the sequences for that locus pass translation), there will be no corresponding file written to this directory. -
Directory
Translation-Passed-Seqs
- This directory will be populated with files labeled asNAME_Adjusted_All.fasta
. This file will contain only sequences that passed translation (correctly adjusted). If no sequences pass translation for a given locus (e.g., all the sequences for that locus fail translation), there will be no corresponding file written to this directory. -
File
Log_Sequences_Filtered.txt
: This summary file contains three columns:- Locus - The name of the input fasta file.
- Seqs_Passed - The number of sequences that passed translation.
- Seqs_Failed - The number of sequences that failed translation.
Here is an example of the contents of Log_Sequences_Filtered.txt
:
Locus Seqs_Passed Seqs_Failed
CO1 479 5
CYTB 507 6
ND1 192 10
ND2 1005 0
ND4 548 3
python
Required: The full path to a directory which contains the input fasta files.
Required: The full path to an existing directory to write output files.
Required: Specify whether alignment is by mafft, macse, muscle, or clustalo. If macse must provide flags
--mpath
and--table
. Selectingall
will run mafft, muscle, and clustalo sequentially. Choices = mafft, macse, muscle, clustalo, all.
Optional: Specifies mafft, clustalo, or macse to use more thorough search settings.
Optional: Specify number of threads to use for mafft and/or clustalo.
Required for MACSE: Full path to a
macse.jar
file.
Required for MACSE: Specifies translation table for macse. Choices = standard, vertmtdna, invertmtdna, yeastmtdna, plastid, or any integer 1-6, 9-16, 21-23.
Optional for MACSE: An integer to assign additional memory to macse (in GB). Default = 1.
Optional for MACSE: Specifies macse to use two fasta files for dual file alignment. See documentation for details.
python
Above command will .
Several outputs are created in the output directory specified.
Last updated: September, 2019
For SuperCRUNCH v1.2