Skip to content

Commit

Permalink
updated for v1.3.0
Browse files Browse the repository at this point in the history
  • Loading branch information
dportik committed Jan 28, 2021
1 parent d839e76 commit 2518876
Showing 1 changed file with 12 additions and 12 deletions.
24 changes: 12 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,20 +28,20 @@ A visual overview of the major steps in SuperCRUNCH is shown below:

![SuperCrunch workflow](https://github.com/dportik/SuperCRUNCH/blob/master/docs/Figure-1.jpg)

SuperCRUNCH is highly modular and analyses do not require running the full pipeline. There are many entry points which can provide useful tools for manipulating your phylogenetic and phylogeographic datasets.
SuperCRUNCH is highly modular and analyses do not require running the full pipeline. There are many useful tools available for manipulating your phylogenetic and phylogeographic datasets.

## Citation

SuperCRUNCH is described in more detail in the following publication:

+ Portik, D.M., and J.J. Wiens. (2020) SuperCRUNCH: A bioinformatics toolkit for creating and manipulating supermatrices and other large phylogenetic datasets. Methods in Ecology and Evolution, 11: 763-772. https://doi.org/10.1111/2041-210X.13392

The article is available [**here**](https://github.com/dportik/SuperCRUNCH/tree/master/docs/publication), and the pre-print is available on BioRxiv [**here**](https://www.biorxiv.org/content/10.1101/538728v3).
The published article is available [**here**](https://github.com/dportik/SuperCRUNCH/tree/master/docs/publication). A pre-print was made available on BioRxiv prior to publication ([**here**](https://www.biorxiv.org/content/10.1101/538728v3)).


## Installation

There are several dependencies required to run SuperCRUNCH, including Python packages ([**Biopython**](https://biopython.org/) and numpy), as well as external tools ([**NCBI-BLAST+**](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download), [**CD-HIT-EST**](http://weizhongli-lab.org/cd-hit/), [**MAFFT**](https://mafft.cbrc.jp/alignment/software/), [**Muscle**](https://www.drive5.com/muscle/), [**Clustal-O**](http://www.clustal.org/omega/), [**MACSE**](https://bioweb.supagro.inra.fr/macse/), and [**trimAl**](http://trimal.cgenomics.org/)).
There are several dependencies required to run SuperCRUNCH, including Python packages ([**BioPython**](https://biopython.org/) and [numpy](https://numpy.org/)), as well as external tools ([**NCBI-BLAST+**](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download), [**CD-HIT-EST**](http://weizhongli-lab.org/cd-hit/), [**MAFFT**](https://mafft.cbrc.jp/alignment/software/), [**Muscle**](https://www.drive5.com/muscle/), [**Clustal-O**](http://www.clustal.org/omega/), [**MACSE**](https://bioweb.supagro.inra.fr/macse/), and [**trimAl**](http://trimal.cgenomics.org/)).

Installation of these requirements is fast and easy using `conda`. The [supercrunch-conda-env.yml](https://github.com/dportik/SuperCRUNCH/blob/master/supercrunch-conda-env.yml) file can be used to create the correct conda environment:

Expand All @@ -57,7 +57,7 @@ conda activate supercrunch

You can then run all SuperCRUNCH modules in this environment.

**NOTE** - this will install all requirements except for `MACSE`, which is a jar file that must be downloaded from [**here**](https://bioweb.supagro.inra.fr/macse/index.php?menu=releases) (get V2.05).
**NOTE** - this will install all requirements except for `MACSE`, which is a jar file that must be downloaded from [**here**](https://bioweb.supagro.inra.fr/macse/index.php?menu=releases) (get V2.05). This software is only required for the MACSE option in the alignment module (`Align.py`).

For non-conda installation of these packages, please see the [**Installation Instructions**](https://github.com/dportik/SuperCRUNCH/wiki/Installation-Instructions) wiki.

Expand All @@ -67,14 +67,14 @@ SuperCRUNCH itself consists of a set of modules written in Python (compatible wi

The current release of **SuperCRUNCH** is [**v1.3.0**](https://github.com/dportik/SuperCRUNCH/releases). Please see below for important changes.

### Changes in v1.3.0:
#### Changes in v1.3.0:
- Added a `conda` environment recipe for SuperCRUNCH, allowing easy installation of all requirements except MACSE.
- `Parse_Loci.py`: Added new feature that allows a term to be added to the loci search terms that will exclude a record if a match is found. For example, adding the negative term `pseudogene` will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file (`N/A` in this column indicates no negative term should be used). This module was made backwards-compatible with the three-column search terms file - if a fourth column is not present the `N/A` is automatically generated.
- `Filter_Seqs_and_Species.py`: Added `--accessions_include` flag. This points to a text file of accession numbers (one per line). When used with the `--seq_selection oneseq` option, if an accession included in the list is found in the available seqs for a taxon and gene, it will be selected. This is not just an allowed list, this list will override other settings for selection such as length. Also added the `--accessions_exclude` flag, which points to a text file of accession numbers (one per line). These accessions will NEVER be selected - they are removed from all searches. This is the equivalent of including a blocked list.
- `Taxa_Assessment.py`: Altered SQL search query for "unmatched" taxa to avoid sql variable limit maximum issue. Invoke `SeqIO.index_db()` method for sequence files >5GB, rather than using `SeqIO.index()` method.
- Added feature in `Cluster_Blast_Extract.py` to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all seqs that are 1.3x the length of the 95th percentile of all lengths.
- Added `Remove_Long_Accessions.py` module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This will eliminate whole genome sequencing records, which are not useful for SuperCRUNCH.
- Updated recognition for file extensions produced by updated blastn tools.
- `Parse_Loci.py`: Added new feature that allows a term to be added to the loci search terms that will exclude a record if a match is found. For example, adding the negative term `pseudogene` will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file, where the fourth column is the negative term (`N/A` in this column indicates no negative term should be used). This module was made backwards-compatible with the three-column search terms file - if a fourth column is not present the `N/A` is automatically generated.
- `Filter_Seqs_and_Species.py`: Added `--accessions_include` flag. This points to a text file of accession numbers (one per line). When used with the `--seq_selection oneseq` option, if an accession included in the list is found in the available seqs for a taxon and gene, it must be selected. This is not just an "allowed list", this list will override other settings for selection such as length. Also added the `--accessions_exclude` flag, which points to a text file of accession numbers (one per line). These accessions will NEVER be selected - they are removed from all searches. This is the equivalent of including a "blocked list".
- `Taxa_Assessment.py`: Altered SQL search query for "unmatched" taxa to avoid sql variable limit maximum issue. Also, now invokes the `SeqIO.index_db()` method for sequence files >5GB, rather than using `SeqIO.index()` method, which is much more memory efficient for big data. The `SeqIO.index_db()` method is already used in `Parse_Loci.py`.
- `Cluster_Blast_Extract.py`: Added feature to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all seqs that are 1.3x the length of the 95th percentile of all lengths.
- Added a new `Remove_Long_Accessions.py` module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This will eliminate whole genome sequencing records, which are not useful for SuperCRUNCH.
- Updated recognition for file extensions produced by updated blastn tools (`.ndb`, `.not`, `.ntf`, `.nto`).

For complete version history please see the [releases](https://github.com/dportik/SuperCRUNCH/releases).

Expand All @@ -95,7 +95,7 @@ SuperCRUNCH has extensive documentation which can be accessed through the wiki t

Several tutorials were made available as part of the original SuperCRUNCH publication, which cover the full range of analyses available. These tutorials can be found on the [OSF SuperCRUNCH project page](https://osf.io/bpt94/). Each tutorial includes all data and instructions necessary to replicate the analysis. An overview of the tutorials available can be found on the [Tutorials wiki](https://github.com/dportik/SuperCRUNCH/wiki/Tutorials).

## Got a question, need some help, or have a suggestion?
## Reporting Issues, Getting Help, and Providing Suggestions

For main analysis issues and/or bugs, please create an issue on github [here](https://github.com/dportik/SuperCRUNCH/issues). Make sure you include the details of your analysis (inputs, outputs, commands) to assist the troubleshooting. The [**SuperCRUNCH user group**](http://groups.google.com/group/supercrunch-users) can also be used to create a post.

Expand Down

0 comments on commit 2518876

Please sign in to comment.