1: Analysis Overview
SuperCRUNCH is a Python toolkit for extracting, filtering, and manipulating nucleotide data. It is modular in design, and analyses can be performed in many ways to meet specific research goals. This page provides an overview of the major topics involved in SuperCRUNCH analyses.
SuperCRUNCH can be used to:
- Construct de novo supermatrices from GenBank and/or local sequence data.
- Construct phylogeographic datasets from GenBank and/or local sequence data, with the ability to detect voucher codes.
- Parse a large fasta file of GenBank and/or local sequences into gene-specific fasta files, based on a list of genes and list of species names.
- Perform similarity filtering for all sequences of a gene to remove mis-identified or highly divergent sequences.
- Extract specific genes from a set of mitochondrial genomes.
- Use a set of reference sequences to trim all input sequences to match the reference region.
- Adjust reading frames for coding sequences, and identify problematic coding sequences.
- Adjust sequence directions to prepare for alignment.
- Automate multiple sequence alignment using Clustal-O, MAFFT, Muscle, and MACSE.
- Generate accession tables for supermatrices, for all genes and species included.
- Relabel GenBank sequences using species names, accession numbers, and/or voucher codes.
- Trim alignments using trimAl or a custom trimming routine.
- Convert fasta format to phylip and nexus.
- Concatenate any number of fasta or phylip alignment files. Outputs a concatenated alignment, data partitions, and counts of loci for all species included.
- And many other tasks!
SuperCRUNCH is a modular workflow that contains several major steps. These are depicted in the figure below:
The list below summarizes the main topics and modules in the approximate order in which they would be used. Complete instructions for each step are provided in the individual wiki pages for each topic, which can be reached by clicking the relevant link below or in the wiki toolbar on the right.
Remember that any module will display usage information on the command line when it is run with the -h flag.
- Obtaining GenBank Sequence Data
- Using Local Sequence Data
- Resolving Duplicate Records
Remove_Duplicate_Accessions.py
- Decreasing GenBank File Size NEW
Remove_Long_Accessions.py
- Creating a Taxon List
- Getting a Taxon List Directly from Fasta Files
Fasta_Get_Taxa.py
- Identifying Taxonomy Conflicts
Taxa_Assessment.py
- Resolving Taxonomy Conflicts
Rename_Merge.py
- Obtaining Locus Search Terms
- How do locus searches work?
- NEW: Including a negative search term
- How do I know how well my search terms are performing?
- What's the best strategy for recovering mitochondrial genes?
- How do I search for UCE loci?
- Building Fasta Files for Loci Available
Parse_Loci.py
- Filtering 'Simple' Sequence Records
Cluster_Blast_Extract.py
- Filtering 'Complex' Sequence Records
Reference_Blast_Extract.py
- Identifying 'Contaminated' Sequences
Contamination_Filter.py
- Selecting Sequences
Filter_Seqs_and_Species.py
- Enforcing Minimum Sequence/Taxon Requirement
Fasta_Filter_by_Min_Seqs.py
- Creating a Table of Accession Numbers From All Loci
Make_Acc_Table.py
- Calculating the Possible Number of Supermatrix Combinations
Infer_Supermatrix_Combinations.py
- Ensuring Sequences are in Correct Orientation
Adjust_Direction.py
- Identify and Adjust Reading Frames
Coding_Translation_Tests.py
- Perform Multiple Sequence Alignment
Align.py
- Relabel Sequence Records
Fasta_Relabel_Seqs.py
- Trim Sequence Alignments
Trim_Alignments_Trimal.py
Trim_Alignments_Custom.py
- Calculate Sequence Length Heterogeneity Statistics
Seq_Length_Heterogeneity.py
- Convert to Nexus and Phylip Format
Fasta_Convert.py
- Concatenate Alignments
Concatenation.py
A typical SuperCRUNCH run executes many of these steps, but the workflow is flexible and can be tailored to a variety of goals.
You may find yourself using all of the modules or perhaps just one. Whatever your purpose for using SuperCRUNCH, there are a few suggestions to keep in mind when running your analyses:
- Organization. Consider how you will organize your starting input files, and how you will manage files generated across steps.
- File Management. Most SuperCRUNCH modules require you to specify an input directory and an output directory, with some exceptions. To keep an analysis organized, create a new directory for each step, each with its own input and output subdirectories. New users may find it simplest to copy the relevant output files from one step into the input folder of the next; advanced users may prefer a more efficient approach. Numbering the folders makes the progression across steps easy to follow. For example, one might use this folder structure for an analysis:
Analysis-Hyperoliids
│
├── 00-Starting-seqs/
│ ├── input
│ └── output
│
├── 01-Taxon-assess/
│ ├── input
│ └── output
│
├── 02-Rename-merge/
│ ├── input
│ └── output
│
├── 03-Starting-materials/
│ ├── Fasta file of GenBank/local sequences (with or without updated taxon names)
│ ├── Locus search terms file
│ └── Taxon names list
│
├── 04-Cluster-blast/
│ ├── input
│ └── output
│
├── 05-Filter-seqs/
│ ├── input
│ └── output
│
├── 06-Min-taxa/
│ ├── input
│ └── output
│
├── 07-Adjust/
│ ├── input
│ └── output
│
├── 08-Align/
│ ├── input
│ └── output
│
├── 09-Trim/
│ ├── input
│ └── output
│
├── 10-Convert/
│ ├── input
│ └── output
│
└── 11-Concatenate/
  ├── input
  └── output
This directory structure makes it clear which step is being run, and where all the relevant output files should be located. For more examples of good file management, please try one of the tutorials.
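As a rough illustration of the layout above, the following Python sketch creates the numbered step folders and copies one step's outputs into the next step's input folder. It is not part of SuperCRUNCH: the function names `make_analysis_dirs` and `stage_next_step` are hypothetical, and the step names are simply copied from the example tree.

```python
from pathlib import Path
import shutil

# Step names taken from the example tree; rename them to suit your analysis.
STEPS_WITH_IO = [
    "00-Starting-seqs", "01-Taxon-assess", "02-Rename-merge",
    "04-Cluster-blast", "05-Filter-seqs", "06-Min-taxa", "07-Adjust",
    "08-Align", "09-Trim", "10-Convert", "11-Concatenate",
]
# This folder holds files directly, without input/output subfolders.
FLAT_STEPS = ["03-Starting-materials"]


def make_analysis_dirs(root, steps=STEPS_WITH_IO, flat_steps=FLAT_STEPS):
    """Create the numbered step folders (hypothetical helper)."""
    root = Path(root)
    for step in steps:
        for sub in ("input", "output"):
            (root / step / sub).mkdir(parents=True, exist_ok=True)
    for step in flat_steps:
        (root / step).mkdir(parents=True, exist_ok=True)
    return root


def stage_next_step(root, prev_step, next_step, pattern="*"):
    """Copy files matching pattern from prev_step/output to next_step/input.

    Returns the names of the copied files (hypothetical helper).
    """
    src = Path(root) / prev_step / "output"
    dst = Path(root) / next_step / "input"
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied
```

For instance, `make_analysis_dirs("Analysis-Hyperoliids")` would build the folders, and `stage_next_step("Analysis-Hyperoliids", "08-Align", "09-Trim", "*.fasta")` would copy aligned fasta files forward. The `*.fasta` pattern is an assumption; check the wiki page for each module to see which output files the next step actually expects.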
- Input File Formats and Example Files. If you are not planning to use the full pipeline, pay careful attention to how you format and name your input files. Refer to the detailed wiki instructions for each module. For examples of various files, see the example data folder. The OSF example analyses contain all the input and output files from every step of the analysis, and can be consulted for examples of every type of input file.
- Run Errors. If a module crashes or ends with an error, delete any intermediate or output files it created before running the module again. Check both the input directory and the output directory carefully for such files. By default, SuperCRUNCH appends to existing files rather than removing or overwriting them, so re-running a script over leftover files can introduce serious errors.
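One way to find leftover files after a failed run is to record what a directory contains before running a module and compare afterwards. The sketch below assumes nothing about SuperCRUNCH internals; `snapshot` and `files_created_since` are hypothetical helper names, and the script only reports files so that you can review them before deleting anything.

```python
from pathlib import Path


def snapshot(directory):
    """Return the set of file paths (relative to directory) that exist now."""
    directory = Path(directory)
    return {p.relative_to(directory) for p in directory.rglob("*") if p.is_file()}


def files_created_since(directory, before):
    """List files present now that were absent from the earlier snapshot."""
    return sorted(snapshot(directory) - before)


# Example usage (directory names are placeholders):
#   before_in = snapshot("04-Cluster-blast/input")
#   before_out = snapshot("04-Cluster-blast/output")
#   ... run the module; if it fails ...
#   for path in files_created_since("04-Cluster-blast/input", before_in):
#       print("leftover in input:", path)
#   for path in files_created_since("04-Cluster-blast/output", before_out):
#       print("leftover in output:", path)
```

This comparison only detects newly created files. A file that already existed and was appended to will not be reported, so starting each run with an empty output directory remains the safer habit.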
- Error Reporting. If you encounter any bugs or major problems while running SuperCRUNCH, please post them to the issues page on GitHub so they can be addressed.