diff --git a/README.md b/README.md index 92d4788..6c26e5b 100644 --- a/README.md +++ b/README.md @@ -28,7 +28,7 @@ A visual overview of the major steps in SuperCRUNCH is shown below: ![SuperCrunch workflow](https://github.com/dportik/SuperCRUNCH/blob/master/docs/Figure-1.jpg) -SuperCRUNCH is highly modular and analyses do not require running the full pipeline. There are many entry points which can provide useful tools for manipulating your phylogenetic and phylogeographic datasets. +SuperCRUNCH is highly modular and analyses do not require running the full pipeline. There are many useful tools available for manipulating your phylogenetic and phylogeographic datasets. ## Citation @@ -36,12 +36,12 @@ SuperCRUNCH is described in more detail in the following publication: + Portik, D.M., and J.J. Wiens. (2020) SuperCRUNCH: A bioinformatics toolkit for creating and manipulating supermatrices and other large phylogenetic datasets. Methods in Ecology and Evolution, 11: 763-772. https://doi.org/10.1111/2041-210X.13392 -The article is available [**here**](https://github.com/dportik/SuperCRUNCH/tree/master/docs/publication), and the pre-print is available on BioRxiv [**here**](https://www.biorxiv.org/content/10.1101/538728v3). +The published article is available [**here**](https://github.com/dportik/SuperCRUNCH/tree/master/docs/publication). A pre-print was made available on BioRxiv prior to publication ([**here**](https://www.biorxiv.org/content/10.1101/538728v3)). 
## Installation -There are several dependencies required to run SuperCRUNCH, including Python packages ([**Biopython**](https://biopython.org/) and numpy), as well as external tools ([**NCBI-BLAST+**](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download), [**CD-HIT-EST**](http://weizhongli-lab.org/cd-hit/), [**MAFFT**](https://mafft.cbrc.jp/alignment/software/), [**Muscle**](https://www.drive5.com/muscle/), [**Clustal-O**](http://www.clustal.org/omega/), [**MACSE**](https://bioweb.supagro.inra.fr/macse/), and [**trimAl**](http://trimal.cgenomics.org/)). +There are several dependencies required to run SuperCRUNCH, including Python packages ([**BioPython**](https://biopython.org/) and [numpy](https://numpy.org/)), as well as external tools ([**NCBI-BLAST+**](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download), [**CD-HIT-EST**](http://weizhongli-lab.org/cd-hit/), [**MAFFT**](https://mafft.cbrc.jp/alignment/software/), [**Muscle**](https://www.drive5.com/muscle/), [**Clustal-O**](http://www.clustal.org/omega/), [**MACSE**](https://bioweb.supagro.inra.fr/macse/), and [**trimAl**](http://trimal.cgenomics.org/)). Installation of these requirements is fast and easy using `conda`. The [supercrunch-conda-env.yml](https://github.com/dportik/SuperCRUNCH/blob/master/supercrunch-conda-env.yml) file can be used to create the correct conda environment: @@ -57,7 +57,7 @@ conda activate supercrunch You can then run all SuperCRUNCH modules in this environment. -**NOTE** - this will install all requirements except for `MACSE`, which is a jar file that must be downloaded from [**here**](https://bioweb.supagro.inra.fr/macse/index.php?menu=releases) (get V2.05). +**NOTE** - this will install all requirements except for `MACSE`, which is a jar file that must be downloaded from [**here**](https://bioweb.supagro.inra.fr/macse/index.php?menu=releases) (get V2.05). 
This software is only required for the MACSE option in the alignment module (`Align.py`). For non-conda installation of these packages, please see the [**Installation Instructions**](https://github.com/dportik/SuperCRUNCH/wiki/Installation-Instructions) wiki. @@ -67,14 +67,14 @@ SuperCRUNCH itself consists of a set of modules written in Python (compatible wi The current release of **SuperCRUNCH** is [**v1.3.0**](https://github.com/dportik/SuperCRUNCH/releases). Please see below for important changes. -### Changes in v1.3.0: +#### Changes in v1.3.0: - Added a `conda` environment recipe for SuperCRUNCH, allowing easy installation of all requirements except MACSE. - - `Parse_Loci.py`: Added new feature that allows a term to be added to the loci search terms that will exclude a record if a match is found. For example, adding the negative term `pseudogene` will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file (`N/A` in this column indicates no negative term should be used). This module was made backwards-compatible with the three-column search terms file - if a fourth column is not present the `N/A` is automatically generated. - - `Filter_Seqs_and_Species.py`: Added `--accessions_include` flag. This points to a text file of accession numbers (one per line). When used with the `--seq_selection oneseq` option, if an accession included in the list is found in the available seqs for a taxon and gene, it will be selected. This is not just an allowed list, this list will override other settings for selection such as length. Also added the `--accessions_exclude` flag, which points to a text file of accession numbers (one per line). These accessions will NEVER be selected - they are removed from all searches. This is the equivalent of including a blocked list. - - `Taxa_Assessment.py`: Altered SQL search query for "unmatched" taxa to avoid sql variable limit maximum issue. 
Invoke `SeqIO.index_db()` method for sequence files >5GB, rather than using `SeqIO.index()` method. - - Added feature in `Cluster_Blast_Extract.py` to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all seqs that are 1.3x the length of the 95th percentile of all lengths. - - Added `Remove_Long_Accessions.py` module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This will eliminate whole genome sequencing records, which are not useful for SuperCRUNCH. - - Updated recognition for file extensions produced by updated blastn tools. + - `Parse_Loci.py`: Added new feature that allows a term to be added to the loci search terms that will exclude a record if a match is found. For example, adding the negative term `pseudogene` will exclude all records containing that word, even if they match the other abbreviation or description terms. This requires a four-column search terms file, where the fourth column is the negative term (`N/A` in this column indicates no negative term should be used). This module was made backwards-compatible with the three-column search terms file - if a fourth column is not present the `N/A` is automatically generated. + - `Filter_Seqs_and_Species.py`: Added `--accessions_include` flag. This points to a text file of accession numbers (one per line). When used with the `--seq_selection oneseq` option, if an accession included in the list is found in the available seqs for a taxon and gene, it must be selected. This is not just an "allowed list", this list will override other settings for selection such as length. Also added the `--accessions_exclude` flag, which points to a text file of accession numbers (one per line). These accessions will NEVER be selected - they are removed from all searches. This is the equivalent of including a "blocked list". 
+ - `Taxa_Assessment.py`: Altered SQL search query for "unmatched" taxa to avoid sql variable limit maximum issue. Also, now invokes the `SeqIO.index_db()` method for sequence files >5GB, rather than using `SeqIO.index()` method, which is much more memory efficient for big data. The `SeqIO.index_db()` method is already used in `Parse_Loci.py`. + - `Cluster_Blast_Extract.py`: Added feature to remove problematic long sequences if they somehow end up in the main cluster of sequences for a gene. The new filter removes all seqs that are 1.3x the length of the 95th percentile of all lengths. + - Added a new `Remove_Long_Accessions.py` module, which can filter a downloaded GenBank fasta file to remove extremely long sequences (>150kb). This will eliminate whole genome sequencing records, which are not useful for SuperCRUNCH. + - Updated recognition for file extensions produced by updated blastn tools (`.ndb`, `.not`, `.ntf`, `.nto`). For complete version history please see the [releases](https://github.com/dportik/SuperCRUNCH/releases). @@ -95,7 +95,7 @@ SuperCRUNCH has extensive documentation which can be accessed through the wiki t Several tutorials were made available as part of the original SuperCRUNCH publication, which cover the full range of analyses available. These tutorials can be found on the [OSF SuperCRUNCH project page](https://osf.io/bpt94/). Each tutorial includes all data and instructions necessary to replicate the analysis. An overview of the tutorials available can be found on the [Tutorials wiki](https://github.com/dportik/SuperCRUNCH/wiki/Tutorials). -## Got a question, need some help, or have a suggestion? +## Reporting Issues, Getting Help, and Providing Suggestions For main analysis issues and/or bugs, please create an issue on github [here](https://github.com/dportik/SuperCRUNCH/issues). Make sure you include the details of your analysis (inputs, outputs, commands) to assist the troubleshooting. 
The [**SuperCRUNCH user group**](http://groups.google.com/group/supercrunch-users) can also be used to create a post.
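
The `Cluster_Blast_Extract.py` changelog entry above describes a length filter that removes sequences 1.3x the length of the 95th percentile of all lengths. A minimal sketch of that idea follows. It is not the module's actual code: the function names, the accession-to-sequence dictionary, the nearest-rank percentile method, and the reading of the rule as "longer than 1.3x the 95th percentile" are all assumptions made for illustration.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method.

    The nearest-rank method is an assumption; the changelog does not say
    how SuperCRUNCH computes its percentile.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Smallest rank whose cumulative share reaches pct percent (1-based).
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def drop_long_sequences(records, pct=95, factor=1.3):
    """Keep only records whose length is at most factor * percentile length.

    records is a hypothetical dict mapping accession -> sequence string.
    """
    if not records:
        return {}
    lengths = [len(seq) for seq in records.values()]
    cutoff = factor * percentile_nearest_rank(lengths, pct)
    return {acc: seq for acc, seq in records.items() if len(seq) <= cutoff}


if __name__ == "__main__":
    # Twenty typical records plus one unusually long record (made-up data).
    example = {f"ACC{i}": "A" * 100 for i in range(20)}
    example["GENOME1"] = "A" * 1000
    kept = drop_long_sequences(example)
    print(sorted(set(example) - set(kept)))
```

Taking the cutoff from a high percentile rather than the maximum keeps the threshold tied to the bulk of the length distribution, so one very long record cannot raise its own cutoff.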