1: Analysis Overview

Daniel Portik edited this page Feb 14, 2021 · 28 revisions


Analysis Overview

SuperCRUNCH is a Python toolkit for extracting, filtering, and manipulating nucleotide data. It is modular in design, so analyses can be assembled in many ways to meet specific research goals. This page provides an overview of the major topics involved in SuperCRUNCH analyses.

Summary of SuperCRUNCH Components

SuperCRUNCH can be used to:

  • Construct de novo supermatrices from GenBank and/or local sequence data.
  • Construct phylogeographic datasets from GenBank and/or local sequence data, with the ability to detect voucher codes.
  • Parse a large fasta file of GenBank and/or local sequences into gene-specific fasta files, based on a list of genes and list of species names.
  • Perform similarity filtering for all sequences of a gene to remove mis-identified or highly divergent sequences.
  • Extract specific genes from a set of mitochondrial genomes.
  • Use a set of reference sequences to trim all input sequences to match the reference region.
  • Adjust reading frames for coding sequences, and identify problematic coding sequences.
  • Adjust sequence directions to prepare for alignment.
  • Automate multiple sequence alignment using Clustal-O, MAFFT, Muscle, and MACSE.
  • Generate accession tables for supermatrices, for all genes and species included.
  • Relabel GenBank sequences using species names, accession numbers, and/or voucher codes.
  • Trim alignments using trimAl or a custom trimming routine.
  • Convert fasta format to phylip and nexus.
  • Concatenate any number of fasta or phylip alignment files. Outputs a concatenated alignment, data partitions, and counts of loci for all species included.
  • And many other tasks!

SuperCRUNCH is a modular workflow that contains several major steps. These are depicted in the figure below:

Figure: SuperCRUNCH workflow

The list below summarizes the main topics and modules in the approximate order in which they would be used. Complete instructions for each step are provided in the individual wiki pages for each topic, which can be reached by clicking the relevant link below or in the wiki toolbar to the right.

Remember, helpful information can always be displayed on the command line by running any module using the -h flag.

  • Obtaining GenBank Sequence Data
  • Using Local Sequence Data
  • Resolving Duplicate Records
    • Remove_Duplicate_Accessions.py
  • Decreasing GenBank File Size NEW
    • Remove_Long_Accessions.py
  • Creating a Taxon List
  • Getting a Taxon List Directly from Fasta Files
    • Fasta_Get_Taxa.py
  • Identifying Taxonomy Conflicts
    • Taxa_Assessment.py
  • Resolving Taxonomy Conflicts
    • Rename_Merge.py
  • Obtaining Locus Search Terms
    • How do locus searches work?
    • NEW: Including a negative search term
    • How do I know how well my search terms are performing?
    • What's the best strategy for recovering mitochondrial genes?
    • How do I search for UCE loci?
  • Building Fasta Files for Loci Available
    • Parse_Loci.py
  • Filtering 'Simple' Sequence Records
    • Cluster_Blast_Extract.py
  • Filtering 'Complex' Sequence Records
    • Reference_Blast_Extract.py
  • Identifying 'Contaminated' Sequences
    • Contamination_Filter.py
  • Selecting Sequences
    • Filter_Seqs_and_Species.py
  • Enforcing Minimum Sequence/Taxon Requirement
    • Fasta_Filter_by_Min_Seqs.py
  • Creating a Table of Accession Numbers From All Loci
    • Make_Acc_Table.py
  • Calculating the Possible Number of Supermatrix Combinations
    • Infer_Supermatrix_Combinations.py
  • Ensuring Sequences are in Correct Orientation
    • Adjust_Direction.py
  • Identify and Adjust Reading Frames
    • Coding_Translation_Tests.py
  • Perform Multiple Sequence Alignment
    • Align.py
  • Relabel Sequence Records
    • Fasta_Relabel_Seqs.py
  • Trim Sequence Alignments
    • Trim_Alignments_Trimal.py
    • Trim_Alignments_Custom.py
  • Calculate Sequence Length Heterogeneity Statistics
    • Seq_Length_Heterogeneity.py
  • Convert to Nexus and Phylip Format
    • Fasta_Convert.py
  • Concatenate Alignments
    • Concatenation.py

A typical SuperCRUNCH run executes many of these steps, but the workflow is flexible and can be tailored to achieve a variety of goals.

General Analysis Tips

There are many ways to perform analyses using SuperCRUNCH, and you may find yourself using all the modules or perhaps just one. Regardless of your purpose for using SuperCRUNCH, there are a few helpful suggestions to keep in mind when running your analyses:

  • Organization. Consider how you will organize your starting input files, and how you will manage files generated across steps.

  • File Management. In general, SuperCRUNCH modules require an input directory and an output directory to be specified, with some exceptions. To keep an analysis organized, create a new directory for each step, each containing input and output subdirectories. New users may find it easiest to copy the relevant output files from the previous step into the input folder for the next step, although advanced users may find this less efficient. Number the folders so that the progression across steps is easy to follow. For example, one might use this folder structure for an analysis:

Analysis-Hyperoliids
│
├── 00-Starting-seqs/
│   ├── input
│   └── output
│
├── 01-Taxon-assess/
│   ├── input
│   └── output
│
├── 02-Rename-merge/
│   ├── input
│   └── output
│
├── 03-Starting-materials/
│   ├── Fasta file of GenBank/local sequences (with or without updated taxon names)
│   ├── Locus search terms file
│   └── Taxon names list
│
├── 04-Cluster-blast/
│   ├── input
│   └── output
│
├── 05-Filter-seqs/
│   ├── input
│   └── output
│
├── 06-Min-taxa/
│   ├── input
│   └── output
│
├── 07-Adjust/
│   ├── input
│   └── output
│
├── 08-Align/
│   ├── input
│   └── output
│
├── 09-Trim/
│   ├── input
│   └── output
│
├── 10-Convert/
│   ├── input
│   └── output
│
└── 11-Concatenate/
    ├── input
    └── output

This directory structure makes it clear which step is being run, and where all the relevant output files should be located. For more examples of good file management, please try one of the tutorials.
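The folder layout above can also be set up with a short script. The following is a minimal sketch and not part of SuperCRUNCH: the function names `make_analysis_dirs` and `stage_outputs` are hypothetical, the step names are copied from the example tree, and the `*.fasta` pattern is an assumption about how the output files are named.

```python
from pathlib import Path
import shutil

# Step names copied from the example tree; adjust them to your own analysis.
STEPS = [
    "00-Starting-seqs", "01-Taxon-assess", "02-Rename-merge",
    "03-Starting-materials", "04-Cluster-blast", "05-Filter-seqs",
    "06-Min-taxa", "07-Adjust", "08-Align", "09-Trim",
    "10-Convert", "11-Concatenate",
]
# In the example tree this folder holds files directly, without input/output.
NO_SUBDIRS = {"03-Starting-materials"}


def make_analysis_dirs(root, steps=STEPS):
    """Create numbered step folders, most with input/ and output/ subfolders."""
    root = Path(root)
    for step in steps:
        step_dir = root / step
        step_dir.mkdir(parents=True, exist_ok=True)
        if step not in NO_SUBDIRS:
            (step_dir / "input").mkdir(exist_ok=True)
            (step_dir / "output").mkdir(exist_ok=True)
    return root


def stage_outputs(root, prev_step, next_step, pattern="*.fasta"):
    """Copy files matching pattern from prev_step/output to next_step/input.

    The default pattern is an assumption; use whatever matches your files.
    Returns the names of the copied files.
    """
    root = Path(root)
    src = root / prev_step / "output"
    dst = root / next_step / "input"
    copied = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied
```

For example, `make_analysis_dirs("Analysis-Hyperoliids")` would create the folders, and `stage_outputs("Analysis-Hyperoliids", "04-Cluster-blast", "05-Filter-seqs")` would copy the files from one step into the next. Copying rather than moving leaves the output of each step intact for later reference.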

  • Input File Formats and Example Files. If you are not planning to use the full pipeline, pay careful attention to how you format and name your input files. Refer to the detailed wiki instructions for each module. For examples of various files, see the example data folder. The OSF example analyses include the input and output files from every step of the analysis and can be consulted for examples of all input file types.

  • Run Errors. If a module crashes or ends with an error, you should delete any intermediate/output files created before attempting to run the module again. You should carefully check both the input directory and output directory for any such intermediate files. By default, SuperCRUNCH will append to files rather than remove or overwrite them, which will introduce potentially serious errors if a script is run repeatedly.

  • Error Reporting. If you encounter any bugs or major problems while running SuperCRUNCH, please post them to the issues page on GitHub so they can be addressed.
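For the Run Errors tip above, a small helper can make it easier to spot and remove leftover files before re-running a module. This is a minimal sketch and not part of SuperCRUNCH: the function names `find_leftovers` and `clear_leftovers` are hypothetical, and because the files a module writes differ from module to module, the sketch simply treats every file not named in `keep` as a leftover. It defaults to a dry run so nothing is deleted until the list has been reviewed.

```python
from pathlib import Path


def find_leftovers(directory, keep=()):
    """Return all files under directory whose names are not in keep."""
    directory = Path(directory)
    keep = set(keep)
    return sorted(
        p for p in directory.rglob("*") if p.is_file() and p.name not in keep
    )


def clear_leftovers(directory, keep=(), dry_run=True):
    """Report leftover files; delete them only when dry_run is False."""
    leftovers = find_leftovers(directory, keep)
    for path in leftovers:
        if dry_run:
            print(f"would delete: {path}")
        else:
            path.unlink()
    return leftovers
```

For an output directory, `keep` can usually be left empty. For an input directory, pass the names of the original input files in `keep` so that only files created by the failed run are reported.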
