7: Multiple Sequence Alignment

Multiple Sequence Alignment

Overview
Ensuring Sequences are in Correct Orientation
- Adjust_Direction.py
Identify and Adjust Reading Frames
- Coding_Translation_Tests.py
Perform Multiple Sequence Alignment
- Align.py

Overview

SuperCRUNCH offers a variety of options for multiple sequence alignment, and also includes two important pre-alignment steps. One of these steps includes ensuring that all sequences are written in the same direction. Even a single reversed sequence can wreak havoc during the alignment step. Details on this step can be found in the Ensuring Sequences are in Correct Orientation section. An additional pre-alignment step can be performed for coding loci. This step will identify the correct reading frame of sequences, adjust them to the first codon position, and ensure completion of the final codon. This is particularly useful for performing translation alignments, but can also be used for other purposes. The Identify and Adjust Reading Frames section contains more information about this topic. Multiple sequence alignment in SuperCRUNCH can be performed using Clustal-O, MAFFT, Muscle, and MACSE. These alignment methods should generally work well for a reasonable size dataset. More information on this topic can be found in the Perform Multiple Sequence Alignment section.

If ultra-large alignments (>5,000 sequences) need to be made then other alignment methods such as SATé-II (Liu et al. 2012), PASTA (Mirarab et al. 2015), and UPP (Nguyen et al. 2015) should be used instead. The UPP method can also be used if a large alignment is to be made from a set of full length sequences and shorter sequence fragments - most of the methods in SuperCRUNCH will not perform well under this condition.

Ensuring Sequences are in Correct Orientation

To build a multiple sequence alignment, all sequences should be written in the same direction. Given the nature of GenBank, it should never be assumed all sequences were uploaded in the same direction. Even a single reversed sequence can cause many problems for sequence alignment. To avoid these issues, the Adjust_Direction.py module can be used prior to alignment to ensure all sequences are written in the same direction.

Back to top

Adjust_Direction.py

The purpose of the Adjust_Direction.py module is to ensure sequences are all written in the same direction before performing alignments. The input is an unaligned fasta file and the output is an unaligned fasta file with all sequences in the 'correct' orientation. This module relies on the --adjustdirection feature of MAFFT to identify sequences in the incorrect direction and subsequently reverse them. If the optional --accurate flag is included, it will use the --adjustdirectionaccurately option in MAFFT, which is slower but more accurate. The number of threads can be specified using the --threads flag (the more, the better!). The output from mafft is an interleaved fasta with sequences in all lowercase, and sequences that have been reversed are flagged with an _R_ at the beginning of the record ID. This module takes that file and converts it to a cleaner format. Sequences are written in uppercase, are ungapped (thereby stripping the alignment), and the _R_ is removed from sequence records that were reversed.

Adjust_Direction.py is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores.

Basic Usage:

python -i <input directory> -o <output directory>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the input fasta files.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`--threads <integer>`

Optional: Specify number of threads to use for MAFFT. Default = 1.

`--accurate`

Optional: Use the --adjustdirectionaccurately MAFFT option, rather than --adjustdirection.

Example Usage:

python Adjust_Direction.py -i /bin/Adjust/Input -o /bin/Adjust/Output

Above command will adjust all unaligned fasta files in directory /bin/Adjust/Input using --adjustdirection in MAFFT. Outputs are written to /bin/Adjust/Output.

python Adjust_Direction.py -i /bin/Adjust/Input -o /bin/Adjust/Output --threads 6

Above command will adjust all unaligned fasta files in directory /bin/Adjust/Input using --adjustdirection in MAFFT with six threads. Outputs are written to /bin/Adjust/Output.

python Adjust_Direction.py -i /bin/Adjust/Input -o /bin/Adjust/Output --accurate --threads 8

Above command will adjust all unaligned fasta files in directory /bin/Adjust/Input using --adjustdirectionaccurately in MAFFT with eight threads. Outputs are written to /bin/Adjust/Output.

Outputs:

Several outputs are created in the output directory specified:

Directory Adjusted-Fasta-Files - For each input fasta file, this directory will contain an unaligned fasta file with correctly oriented sequences. They will be labeled as NAME_Adjusted.fasta. These files should be used for downstream alignment.
Directory Log-Files - For each input fasta file, this directory contains a file labeled as NAME_Adjusted_Name_Log.txt. This file contains the full names of the sequences that were reversed in this particular fasta file, if any were adjusted. Here is an example of the contents of one of these files:

Sequences_Flipped
_R_AM055673.1 Rhampholeon acuminatus Voucher_MHNG_2645.002 DESCRIPTION mitochondrial partial 16s rrna gene specimen voucher mhng 2645.002
_R_AM055667.1 Rhampholeon beraduccii Voucher_MHNG_2655.019 DESCRIPTION mitochondrial partial 16s rrna gene specimen voucher mhng 2655.019
_R_AM055645.1 Rhampholeon boulengeri Voucher_ZFMK_55104 DESCRIPTION mitochondrial partial 16s rrna gene specimen voucher zfmk 55104
_R_AM055669.1 Rhampholeon moyeri Voucher_MTSN_KIHANGA DESCRIPTION mitochondrial partial 16s rrna gene specimen voucher mtsn kihanga

File Sequences_Adjusted.txt - This summary file contains three columns:
- Locus - The name of the input fasta file.
- Seqs_Correct_Direction - The number of sequences in the 'correct' direction.
- Seqs_Direction_Adjusted - The number of sequences that were found to be in the 'incorrect' direction and were subsequently reversed.

Here is an example of the contents of Sequences_Adjusted.txt:

Locus	Seqs_Correct_Direction	Seqs_Direction_Adjusted
12S_extracted_contfiltered_oneseq	529	1
16S_extracted_contfiltered_oneseq	566	14
ADNP_extracted_oneseq	145	0
AHR_extracted_oneseq	137	0
AKAP9_extracted_oneseq	184	0
BACH1_extracted_oneseq	187	0
BACH2_extracted_oneseq	31	0
BDNF_extracted_oneseq	394	0

Back to top

Identify and Adjust Reading Frames

Sequences for protein-coding loci are often uploaded to GenBank without a consistent reading frame. This means sequences can start in forward reading frame 1, 2, or 3, or even in a reverse frame. Most translation aligners require that 1) all sequences begin in the same reading frame, and 2) all sequences have a complete final codon (e.g. the sequence length is divisible by three). Despite my best efforts, I could not find a tool that automates either of these tasks, so I created a new one called Coding_Translation_Tests.py . This module can be used to adjust all sequences to ensure they start in the same reading frame, and to complete the final codon position. Although this is perhaps most relevant for performing translation alignment, the reading frame adjustment is also useful for other tasks that require sequences to be in the same reading frame. For example, this module can be used to prepare sequences for dN/dS analyses, or to prepare unpublished sequences for NCBI submission to BankIt.

Back to top

Coding_Translation_Tests.py

The Coding_Translation_Tests.py module will attempt to adjust reading frames for all coding sequences present in an unaligned fasta file. Sequences are translated in all forward frames to check for the presence of stop codons. The translation table can be specified with the --table flag, and includes all options available on NCBI. If stop codons are detected in all frames, the sequence will next be checked to see if there is only one stop codon. If a single stop codon is found, it must be located in the final two codon positions of the sequence to be allowed. If the --rc flag is included then all reverse frames will be searched too. However, if all the sequences are correctly oriented (e.g., the Adjust_Direction.py module was used previously) then this is not recommended. If necessary, sequences will be padded with N's to complete the final codon, based on the correct reading frame found. The final sequence is written such that the first base represents the first codon position (based on the inferred reading frame) and the sequence is divisible by three (guaranteeing a complete final codon). Sequences that have more than one stop codon detected for all the reading frames investigated are marked as failing the translation test. For each input fasta file, up to three output files are created. These files contain:

only sequences passing translation (correctly adjusted)
only sequences failing translation (no adjustments made)
a combination of sequences passing translation (correctly adjusted) and sequences failing translation (no adjustments made).

More details concerning these output files are provided in the outputs section below.

If your dataset contains a mix of loci that need to be translated differently (e.g., coding mtDNA vs. coding nucDNA), the --onlyinclude flag can be used. This requires the full path to a text file containing the names of the fasta files (one file name per line) to be processed for a particular run. This allows you to run this module with particular settings for a subset of the files in the input directory. In this case, a separate output directory (-o ) should be specified for each distinct run.

Coding_Translation_Tests.py is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores.

Basic Usage:

python Coding_Translation_Tests.py -i <input directory> -o <output directory> --table <translation table choice>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the input fasta files.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`--table <choice>`

Required: Specifies translation table to use for all files. Choices = standard, vertmtdna, invertmtdna, yeastmtdna, plastid, or any integer 1-31.

`--rc`

Optional: In addition to all forward reading frames, examine all reverse reading frames during translation tests.

`--onlyinclude <path-to-file>`

Optional: The full path to a text file containing the names of the fasta files to process for this run.

`--quiet`

Optional: Show less output while running (useful when filtering many loci).

Example Usage:

python Coding_Translation_Tests.py -i bin/Coding-Adjust/Input -o bin/Coding-Adjust/Output --table standard

Above command will perform translation tests for each unaligned fasta file in the directory bin/Coding-Adjust/Input/ using the standard code. It will examine only forward reading frames. Outputs are written to bin/Coding-Adjust/Output.

python Coding_Translation_Tests.py -i bin/Coding-Adjust/Input -o bin/Coding-Adjust/Output --table standard --rc --onlyinclude bin/Coding-Adjust/mtDNA_File_List.txt

Above command will perform translation tests using the vert mitochondrial code for all files included in the mtDNA_File_List.txt. It will examine all forward and reverse reading frames. Outputs are written to bin/Coding-Adjust/Output.

Outputs:

Several outputs are created in the output directory specified:

Directory Translation-All-Seqs - This directory will be populated with files labeled as NAME_Adjusted_All.fasta. This is the only output file that is guaranteed to contain all of the starting sequences. It will include all sequences that passed translation (correctly adjusted), and all sequences that failed translation (no adjustments made).
Directory Translation-Failed-Seqs - This directory will be populated with files labeled as NAME_Adjusted_Failed.fasta. This file will contain only sequences that failed translation (no adjustments made). If no sequences fail translation for a given locus (e.g., all the sequences for that locus pass translation), there will be no corresponding file written to this directory.
Directory Translation-Passed-Seqs - This directory will be populated with files labeled as NAME_Adjusted_Passed.fasta. This file will contain only sequences that passed translation (correctly adjusted). If no sequences pass translation for a given locus (e.g., all the sequences for that locus fail translation), there will be no corresponding file written to this directory.
File Log_Sequences_Filtered.txt - This summary file contains three columns:
- Locus - The name of the input fasta file.
- Seqs_Passed - The number of sequences that passed translation.
- Seqs_Failed - The number of sequences that failed translation.

Here is an example of the contents of Log_Sequences_Filtered.txt:

Locus	Seqs_Passed	Seqs_Failed
CO1	479	5
CYTB	507	6
ND1	192	10
ND2	1005	0
ND4	548	3

Back to top

Perform Multiple Sequence Alignment

Multiple sequence alignment in SuperCRUNCH can be performed using Clustal-O, MAFFT, Muscle, and MACSE. These alignment methods should generally work well for a reasonable size dataset. It is possible to run a single alignment method for all loci, or try out all the alignment methods. The Align.py makes it easy to run each of the alignment methods for any number of unaligned fasta files.

If ultra-large alignments (>5,000 sequences) need to be made then other alignment methods such as SATé-II (Liu et al. 2012), PASTA (Mirarab et al. 2015), and UPP (Nguyen et al. 2015) should be used instead. The UPP method can also be used if a large alignment is to be made from a set of full length sequences and shorter sequence fragments - most of the methods in SuperCRUNCH will not perform well under this condition.

Back to top

Align.py

The Align.py module can be used to perform multiple sequence alignment for a directory of unaligned fasta files using Clustal-O, MAFFT, Muscle, and MACSE. Align.py is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta or NAME.fa. The NAME portion should not contain any periods or spaces, but can contain underscores (but see special case for paired MACSE alignments below).

The -a flag determines which alignment method(s) are used:

-a clustalo - Runs Clustal-O. An executable called clustalo must be installed in path. See the Clustal-O Usage section for implementation details.
-a mafft - Runs MAFFT. An executable called mafft must be installed in path. See the MAFFT Usage section for implementation details.
-a muscle - Runs Muscle. An executable called muscle must be installed in path. See the Muscle Usage section for implementation details.
-a all - Runs Clustal-O, MAFFT, and Muscle sequentially. Executables called clustalo, mafft, and muscle must be installed in path.
-a macse - Runs MACSE. The full path to a MACSE jar file must be specified with the --mpath flag, and the translation table must be specified using the --table flag. The optional --mem, and --pass_fail flags can also be used with the MACSE alignment method. See the MACSE Usage section for implementation details.

Basic Usage:

python -i <input directory> -o <output directory> -a <alignment method>

Argument Explanations:

`-i <path-to-directory>` or `--indir <path-to-directory>`

Required: The full path to a directory which contains the input fasta files.

`-o <path-to-directory>` or `--outdir <path-to-directory>`

Required: The full path to an existing directory to write output files.

`-a <choice>` or `--aln <choice>`

Required: Specify whether alignment is by mafft, macse, muscle, or clustalo. If macse must provide flags --mpath and --table. Selecting all will run mafft, muscle, and clustalo sequentially. Choices = mafft, macse, muscle, clustalo, all.

`--accurate`

Optional: Specifies mafft, clustalo, or macse to use more thorough search settings.

`--threads <integer>`

Optional: Specify number of threads to use for mafft and/or clustalo.

`--mpath <path-to-file>`

Required for MACSE: Full path to a macse.jar file.

`--table <choice>`

Required for MACSE: Specifies translation table for macse. Choices = standard, vertmtdna, invertmtdna, yeastmtdna, plastid, or any integer 1-6, 9-16, 21-23.

`--mem <integer>`

Optional for MACSE: An integer to assign additional memory to macse (in GB). Default = 1.

`--pass_fail`

Optional for MACSE: Specifies macse to use two fasta files for dual file alignment. See documentation for details.

Example Usage:

The usage of each alignment method and relevant details are provided in the Alignment Method Implementations section below, rather than here.

Outputs:

Outputs are created in the output directory specified, and depend on the alignment method selected. The outputs can include:

Directory /Alignments-MAFFT - A directory containing a corresponding MAFFT alignment for each unaligned input file, labeled as NAME_MAFFT_Aligned.fasta.
Directory /Alignments-MUSCLE - A directory containing a corresponding Muscle alignment for each unaligned input file, labeled as NAME_MUSCLE_Aligned.fasta.
Directory /Alignments-CLUSTALO - A directory containing a corresponding Clustal-O alignment for each unaligned input file, labeled as NAME_CLUSTALO_Aligned.fasta.
Directory /Alignments-MACSE - A directory containing the following subdirectories:
- Directory /Cleaned-Alignments - Reformatted MACSE alignments that should be used for any downstream steps, labeled as NAME_MACSE_Aligned.fasta.
- Directory /Additional-Outputs - Contains original MACSE output files, including NAME_AA.fasta and NAME_NT.fasta. These should not be used for downstream steps.

Alignment Methods Implementation

Unless changed, the aligment methods will either run under the default settings (Muscle) or the auto select settings (MAFFT, Clustal-O). The optional --threads and --accurate flags can be used for MAFFT and Clustal-O, but not for Muscle. These alignment method implementations, along with all the MACSE options, are explained in detail below.

Muscle Usage

The simplest execution of Muscle results from the following usage:

python Align.py -i bin/Align/Input -o bin/Align/Output -a muscle

When the -a muscle option is used as above, MAFFT is executed using the following command-line equivalent:

muscle -in [input fasta] -out [output alignment]

And that's literally it for Muscle. The --threads and --accurate flags have no effect on Muscle.

MAFFT Usage

The simplest execution of MAFFT results from the following usage:

python Align.py -i bin/Align/Input -o bin/Align/Output -a mafft

When the -a mafft option is used as above, MAFFT is executed using the following command-line equivalent:

mafft --thread 1 --auto [input fasta] > [output alignment]

This uses the auto select setting of MAFFT.

The `--threads` flag can be used to increase the number of threads for MAFFT.

For example:

python Align.py -i bin/Align/Input -o bin/Align/Output -a mafft --threads 8

The above Align.py command would result in MAFFT being executed using the following command-line equivalent:

mafft --thread 8 --auto [input fasta] > [output alignment]

Using the `--accurate` flag changes the implementation of MAFFT to the FFT-NS-i algorithm.

For example:

python Align.py -i bin/Align/Input -o bin/Align/Output -a mafft --threads 8 --accurate

The above Align.py command would result in MAFFT being executed using the following command-line equivalent:

mafft --thread 8 --retree 2 --maxiterate 1000 [input fasta] > [output alignment]

And that's it for MAFFT options (so far!).

Clustal-O Usage

The simplest execution of MAFFT results from the following usage:

python Align.py -i bin/Align/Input -o bin/Align/Output -a clustalo

When the -a clustalo option is used as above, MAFFT is executed using the following command-line equivalent:

clustalo -i [input fasta] -o [output alignment] --auto -v --threads=1 --output-order=tree-order --force

This uses the auto select setting of Clustal-O.

The `--threads` flag can be used to increase the number of threads for Clustal-O.

For example:

python Align.py -i bin/Align/Input -o bin/Align/Output -a clustalo --threads 8

The above Align.py command would result in Clustal-O being executed using the following command-line equivalent:

clustalo -i [input fasta] -o [output alignment] --auto -v --threads=8 --output-order=tree-order --force

Using the `--accurate` flag changes the implementation of Clustal-O to allow the guide-tree and HMM to each undergo 5 iterations. It also changes the cluster-size.

For example:

python Align.py -i bin/Align/Input -o bin/Align/Output -a clustalo --threads 8 --accurate

The above Align.py command would result in Clustal-O being executed using the following command-line equivalent:

clustalo -i [input fasta] -o [output alignment] --full --full-iter --iter=5 -v --threads=8 --cluster-size=500 --output-order=tree-order --force

And that's it for Clustal-O options (so far!).

MACSE Usage

MACSE is a translation aligner and can be using with any set of coding sequences. One of the major benefits of using MACSE is that it aligns coding nucleotide sequences with respect to their amino-acid translation while allowing those sequences to contain multiple frameshifts and/or stop codons. There are two main ways that MACSE can be run using the Align.py module:

Single alignment - Takes a single set of sequences and performs translation alignment. This requires a single input fasta file.
Dual alignment - Takes a set of high quality sequences and a set of lower quality sequences (containing frameshifts and/or stop codons) and performs a translation alignment using different parameter settings. This requires two input fasta files, one containing the high quality sequences and the other containing the lower quality sequences. Although it is possible to include a mix of higher and lower quality sequences and run the single alignment option, it will apply the same set of parameters to all sequences. This can result in lower quality alignments if there are many sequences with errors. The dual alignment performs much better in this circumstance.

For both options, all sequences must begin in the same reading frame and must have complete final codons. For the dual alignment option, this requires identifying higher and lower quality sequences and putting them in two different fasta files. The Coding_Translation_Tests.py module automates all of these tasks, preparing all of the necessary input files for running MACSE.

When running MACSE (-a macse), the --mpath and --table flags are required. The --mpath flag is the full path the MACSE jar file, whereas the --table flag specifies the translation table. Please note that not all tables are available in MACSE - it is a subset of those available on NCBI (see the flag details in the argument explanations section above). In addition to these required flags, the optional --mem flag can be used to assign additional memory, and the optional --accurate flag can be used to invoke different parameter settings (see below).

The single alignment option

The single alignment option is the default when selecting MACSE for alignment. For example:

python Align.py -i bin/Align/Input -o bin/Align/Output -a macse --mpath bin/programs/macse_v2.03.jar --table standard

The above Align.py command would result in MACSE being executed using the following command-line equivalent:

java -jar -Xmx1g bin/programs/macse_v2.03.jar -prog alignSequences -gc_def 1 -seq [input fasta]

This would use the standard code for translation, run alignments for all fasta files found in the input directory, and write the outputs to the output directory. Note the default memory assigned is 1GB, this can be changed using the --mem flag. For example, including --mem 5 would change the above command to use 5GB - java -jar -Xmx5g .

The --accurate flag can be included to change the parameter settings. For example:

python Align.py -i bin/Align/Input -o bin/Align/Output -a macse --mpath bin/programs/macse_v2.03.jar --table standard --accurate

The above Align.py command would result in MACSE being executed using the following command-line equivalent:

java -jar -Xmx1g bin/programs/macse_v2.03.jar -prog alignSequences -gc_def 1 -seq [input fasta] -local_realign_init 0.9 -local_realign_dec 0.9

These settings make MACSE v2 more similar to v1 - increasing potential accuracy at the cost of computational time.

The dual alignment option

The dual alignment option is invoked by using the --pass_fail flag. In order to run dual alignment, there must two different files for each locus. They must be named NAME_Passed.fasta and NAME_Failed.fasta, where the NAME component cannot contain any periods or spaces, but can contain underscores. The NAME component must be identical for the passed and failed files. This is the exact file naming scheme that results from the Coding_Translation_Tests.py module. To run the dual alignment, you can copy/paste the file pairs from the output directories created by Coding_Translation_Tests.py. Here is an example of an input directory and the file pairs:

Align
│
├── macse-inputs
│	├── CO1_extracted_oneseq_Adjusted_Failed.fasta
│	├── CO1_extracted_oneseq_Adjusted_Passed.fasta
│	├── CYTB_extracted_oneseq_Adjusted_Failed.fasta
│	└── CYTB_extracted_oneseq_Adjusted_Passed.fasta
├── macse-outputs
│

The file pairs for dual alignment are present for CO1 and for CYTB. The following command can be used to invoke the dual alignment method:

python Align.py -i bin/Align/macse-inputs -o bin/Align/macse-outputs -a macse --mpath bin/programs/macse_v2.03.jar --table vertmtdna --pass_fail

Notice that the vertebrate mtDNA code is used. The above Align.py command would result in MACSE being executed using the following command-line equivalent:

java -jar -Xmx1g bin/programs/macse_v2.03.jar -prog alignSequences -gc_def 2 -seq [NAME_Passed.fasta] -seq_lr [NAME_Failed.fasta]

Dual alignment can also use the --mem and --accurate flags, such as the following:

python Align.py -i bin/Align/macse-inputs -o bin/Align/macse-outputs -a macse --mpath bin/programs/macse_v2.03.jar --table vertmtdna --pass_fail --mem 10 --accurate

The above Align.py command would result in MACSE being executed using the following command-line equivalent:

java -jar -Xmx10g bin/programs/macse_v2.03.jar -prog alignSequences -gc_def 2 -seq [NAME_Passed.fasta] -seq_lr [NAME_Failed.fasta] -local_realign_init 0.9 -local_realign_dec 0.9

Automatically detect single or dual alignment

The Coding_Translation_Tests.py module produces NAME_Passed.fasta and NAME_Failed.fasta files in separate directories, and will only create a NAME_Failed.fasta file for a locus if there are sequences that fail translation. Thus, not all loci will produce a corresponding NAME_Passed.fasta and NAME_Failed.fasta file, many may only produce a NAME_Passed.fastafile (meaning no sequences failed translation). The latter results in a NAME_Passed.fasta file in the Translation-Passed-Seqs directory, but no NAME_Failed.fasta file in the Translation-Failed-Seqs directory. To identify loci requiring dual alignment, one could look in each directory to see if both files were written. With a handful of loci, it is relatively easy to do this and determine which loci would require dual alignment (the locus has a corresponding NAME_Passed.fasta AND NAME_Failed.fasta file), and which should be run using single alignment (the locus ONLY has a corresponding NAME_Passed.fasta file). However, with dozens or hundreds of loci, this task becomes burdensome or impossible. To overcome this issue, the --pass_fail option can automatically determine whether a single alignment or dual alignment should be carried out for a given locus. To use this feature, you can simply copy and paste all of the fasta files from the Translation-Failed-Seqs directory and the Translation-Passed-Seqs directory to a new input directory for Align.py.

To illustrate how this works, let's first look at an example output from the Coding_Translation_Tests.py module:

Coding-tests
│
├── input
│	├── ADNP_extracted.fasta
│	├── BDNF_extracted.fasta
│	├── EXPH5_extracted.fasta
│	├── FSTL5_extracted.fasta
│	└── KIF24_extracted.fasta
├── output
│	│ 
│	├── Log_Sequences_Filtered.txt
│	│ 
│	├── Translation-All-Seqs
│	│ 
│	├── Translation-Failed-Seqs
│	│	├── BDNF_extracted_oneseq_Adjusted_Failed.fasta
│	│	├── EXPH5_extracted_oneseq_Adjusted_Failed.fasta
│	│	└── FSTL5_extracted_oneseq_Adjusted_Failed.fasta
│	│ 
│	└─ Translation-Passed-Seqs
│		├── ADNP_extracted_oneseq_Adjusted_Passed.fasta
│		├── BDNF_extracted_oneseq_Adjusted_Passed.fasta
│		├── EXPH5_extracted_oneseq_Adjusted_Passed.fasta
│		├── FSTL5_extracted_oneseq_Adjusted_Passed.fasta
│		└── KIF24_extracted_oneseq_Adjusted_Passed.fasta

In the above example, there were five loci used to perform translation tests. Of those five, two of the loci (ADNP, KIF24) did not have any sequences that failed translation. These two loci have a NAME_Passed.fasta file in the Translation-Passed-Seqs directory, but do not have a NAME_Failed.fasta file in the Translation-Failed-Seqs directory. Three of the loci (BDNF, EXPH5, FSTL5) had some sequences that failed translation, and therefore each of these loci has a NAME_Passed.fasta and a NAME_Failed.fasta file in the relevant directories.

You can copy all of the files from the Translation-Passed-Seqs directory and the Translation-Failed-Seqs directory into a new input directory for Align.py:

Align
├── macse-input
│	├── ADNP_extracted_oneseq_Adjusted_Passed.fasta   <- Pass only!
│	├── BDNF_extracted_oneseq_Adjusted_Failed.fasta	  <- Pass/Fail combo
│	├── BDNF_extracted_oneseq_Adjusted_Passed.fasta   <- Pass/Fail combo
│	├── EXPH5_extracted_oneseq_Adjusted_Failed.fasta  <- Pass/Fail combo
│	├── EXPH5_extracted_oneseq_Adjusted_Passed.fasta  <- Pass/Fail combo
│	├── FSTL5_extracted_oneseq_Adjusted_Failed.fasta  <- Pass/Fail combo
│	├── FSTL5_extracted_oneseq_Adjusted_Passed.fasta  <- Pass/Fail combo
│	└── KIF24_extracted_oneseq_Adjusted_Passed.fasta  <- Pass only!

The NAME_Passed.fasta and NAME_Failed.fasta file naming scheme must be used for the automatic detection feature to work. Then the following command can be used:

python Align.py -i bin/Align/macse-input -o bin/Align/macse-output -a macse --mpath bin/programs/macse_v2.03.jar --table standard --pass_fail

The program will automatically identify which loci are represented by only NAME_Passed.fasta files, and which loci are represented by both NAME_Passed.fasta and NAME_Failed.fasta files. For loci that are found with pass only files, the following will be executed:

java -jar -Xmx1g bin/programs/macse_v2.03.jar -prog alignSequences -gc_def 1 -seq [NAME_Passed.fasta]

For loci that are found with pass and fail files, the following will be executed:

java -jar -Xmx10g bin/programs/macse_v2.03.jar -prog alignSequences -gc_def 1 -seq [NAME_Passed.fasta] -seq_lr [NAME_Failed.fasta]

This feature will remove any need to manually identify file pairs. You can simply copy/paste all the outputs in the Translation-Failed-Seqs directory and the Translation-Passed-Seqs directory to a new directory, and let the auto-detect feature figure it out for you. The only requirement is adherence to the NAME_Passed.fasta and NAME_Failed.fasta file naming scheme. The optional --mem and --accurate flags will work the same as described above for the single and dual alignments, and can be used with this auto-detect feature.

The combination of the Coding_Translation_Tests.py module and the MACSE implementation of Align.py should make performing translation alignments much, much easier!

Back to top

Last updated: September, 2019

For SuperCRUNCH v1.2

7: Multiple Sequence Alignment

Multiple Sequence Alignment

Overview

Ensuring Sequences are in Correct Orientation

Adjust_Direction.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

--threads <integer>

--accurate

Example Usage:

Outputs:

Identify and Adjust Reading Frames

Coding_Translation_Tests.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

--table <choice>

--rc

--onlyinclude <path-to-file>

--quiet

Example Usage:

Outputs:

Perform Multiple Sequence Alignment

Align.py

Basic Usage:

Argument Explanations:

-i <path-to-directory> or --indir <path-to-directory>

-o <path-to-directory> or --outdir <path-to-directory>

-a <choice> or --aln <choice>

--accurate

--threads <integer>

--mpath <path-to-file>

--table <choice>

--mem <integer>

--pass_fail

Example Usage:

Outputs:

Alignment Methods Implementation

Muscle Usage

The simplest execution of Muscle results from the following usage:

MAFFT Usage

The simplest execution of MAFFT results from the following usage:

The --threads flag can be used to increase the number of threads for MAFFT.

Using the --accurate flag changes the implementation of MAFFT to the FFT-NS-i algorithm.

Clustal-O Usage

The simplest execution of MAFFT results from the following usage:

The --threads flag can be used to increase the number of threads for Clustal-O.

Using the --accurate flag changes the implementation of Clustal-O to allow the guide-tree and HMM to each undergo 5 iterations. It also changes the cluster-size.

MACSE Usage

The single alignment option

The dual alignment option

Automatically detect single or dual alignment

Clone this wiki locally

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`--threads <integer>`

`--accurate`

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`--table <choice>`

`--rc`

`--onlyinclude <path-to-file>`

`--quiet`

`-i <path-to-directory>` or `--indir <path-to-directory>`

`-o <path-to-directory>` or `--outdir <path-to-directory>`

`-a <choice>` or `--aln <choice>`

`--accurate`

`--threads <integer>`

`--mpath <path-to-file>`

`--table <choice>`

`--mem <integer>`

`--pass_fail`

The `--threads` flag can be used to increase the number of threads for MAFFT.

Using the `--accurate` flag changes the implementation of MAFFT to the FFT-NS-i algorithm.

The `--threads` flag can be used to increase the number of threads for Clustal-O.

Using the `--accurate` flag changes the implementation of Clustal-O to allow the guide-tree and HMM to each undergo 5 iterations. It also changes the cluster-size.