Releases: B-UMMI/chewBBACA
v3.3.0 - Rwookrrorro
Added the AlleleCallEvaluator module. This module generates an interactive HTML report for the allele calling results. The report provides summary statistics to evaluate results per sample and per locus (with the possibility to provide a TSV file with loci annotations to include on a table). The report includes components to display a heatmap representing the loci presence-absence matrix, a heatmap representing the distance matrix based on allelic differences and a Neighbor-Joining tree based on the MSA of the core genome loci.
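As a rough illustration of the two matrices mentioned above, the following sketch derives a loci presence-absence matrix and a pairwise allelic-difference matrix from a small set of allelic profiles. The data layout, the function names and the choice to skip loci that are missing in either sample are assumptions made for this example; they are not taken from the AlleleCallEvaluator source.

```python
from itertools import combinations

# Hypothetical allelic profiles: sample -> {locus: allele identifier}.
# "LNF" (locus not found) is treated as a missing call in this sketch.
MISSING = {"LNF"}


def presence_absence(profiles, loci):
    """Return sample -> list of 1/0 flags, one per locus (1 = locus present)."""
    return {
        sample: [0 if calls.get(locus, "LNF") in MISSING else 1 for locus in loci]
        for sample, calls in profiles.items()
    }


def allelic_distance(calls_a, calls_b, loci):
    """Count loci where both samples have a call and the alleles differ."""
    distance = 0
    for locus in loci:
        a = calls_a.get(locus, "LNF")
        b = calls_b.get(locus, "LNF")
        # Assumption: loci missing in either sample do not add to the distance.
        if a in MISSING or b in MISSING:
            continue
        if a != b:
            distance += 1
    return distance


def distance_matrix(profiles, loci):
    """Build a symmetric sample-by-sample matrix of allelic differences."""
    samples = sorted(profiles)
    matrix = {s: {t: 0 for t in samples} for s in samples}
    for s, t in combinations(samples, 2):
        d = allelic_distance(profiles[s], profiles[t], loci)
        matrix[s][t] = matrix[t][s] = d
    return matrix


loci = ["locus1", "locus2", "locus3"]
profiles = {
    "sampleA": {"locus1": "1", "locus2": "2", "locus3": "LNF"},
    "sampleB": {"locus1": "1", "locus2": "3", "locus3": "4"},
    "sampleC": {"locus1": "2", "locus2": "3", "locus3": "4"},
}
```

Calling `distance_matrix(profiles, loci)` on the toy data gives the kind of matrix a distance heatmap would be drawn from; how the module itself handles missing data may differ.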
Additional changes
- Added pyrodigal for gene prediction. This simplified the processing of the gene prediction results and reduced runtime.
- Fixed an issue where the AlleleCall module would try to create results files for excluded inputs.
- Fixed exception capturing during multiprocessing when using Python>=3.11.
- Fixed PLOT5/3 identification when coding sequences are on the reverse strand.
- Fixed computation of the representative self-scores when performing allele calling for a subset of the loci in a schema (would only compute the self-scores for the subset of loci if the "self_scores" file had still not been created).
- Fixed issue related to the classification of single EXC/INF and single/multiple ASM/ALM (would classify some inputs as NIPH instead of EXC/INF).
- Fixed issue related to protein exact match classification when multiple pre-computed PROTEINtable files include the same protein hash.
- Changed the -i, --input-files parameter in the PrepExternalSchema and UniprotFinder modules to -g, --schema-directory and added the --gl, --genes-list parameter to enable adapting or annotating a subset of the loci in the schema.
v3.2.0 - Wroshyr
New version of the SchemaEvaluator module. The updated version fixes several issues related to outdated dependencies that led to errors in the previous version. The new version also includes new features and components. Read the docs page to know more about the latest version of the SchemaEvaluator module.
Additional changes
- Updated the link to the UniProt FTP used by the UniprotFinder module.
- Added the .fas file extension to the list of file extensions accepted by chewBBACA. chewBBACA accepts genome assemblies and external schemas with FASTA files that use any of the following file extensions: .fasta, .fna, .ffn, .fa and .fas. The FASTA files created by chewBBACA use the .fasta extension.
- Fixed an issue in the PrepExternalSchema module where it would only detect FASTA files if they ended with the .fasta extension.
- Added the --size-filter parameter to the PrepExternalSchema module to define if the adaptation process should filter out alleles based on the minimum length and size threshold values.
- Added the --output-novel parameter to the AlleleCall module. If this parameter is used, the AlleleCall module creates a FASTA file with the novel alleles inferred during the allele calling. This file is created even if the --no-inferred parameter is used and the novel alleles are not added to the schema.
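The extension rule described in this release can be pictured with a short sketch that selects FASTA files by suffix. The helper names are hypothetical, and the case-insensitive comparison is an assumption of this example, not something the release notes state.

```python
from pathlib import Path

# File extensions listed in the v3.2.0 notes as accepted by chewBBACA.
FASTA_SUFFIXES = (".fasta", ".fna", ".ffn", ".fa", ".fas")


def is_fasta_name(filename):
    """Return True if the file name ends with one of the accepted extensions."""
    # Assumption: extensions are compared case-insensitively.
    return Path(filename).suffix.lower() in FASTA_SUFFIXES


def list_fasta_files(directory):
    """Return the sorted paths of files in `directory` that look like FASTA files."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and is_fasta_name(path.name)
    )
```

A check based only on the file name says nothing about the file contents; validating that a file actually holds FASTA records is a separate step.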
v3.0.0 - Shyriiwook
New implementation of the AlleleCall process. The new implementation was developed to reduce execution time, improve accuracy and provide more detailed results. It uses available computational resources more efficiently to allow for analyses with thousands of strains on a laptop. This new version is fully compatible with schemas created with previous versions.
AlleleCall changes
- The new implementation avoids redundant comparisons through the identification of the set of distinct CDSs in the input files. The classification for a distinct CDS is propagated to classify all input genomes that contain the CDS.
- Implemented a clustering step based on minimizers to cluster the translated CDSs. This step complements the alignment-based strategy with BLASTp to increase computational efficiency and classification accuracy.
- The AlleleCall process has 4 execution modes (1: only exact matches at DNA level; 2: exact matches at DNA and Protein level; 3: exact matches and minimizer-based clustering to find similar alleles with BSR > 0.7; 4: runs the full process to find exact matches and all matches with BSR >= 0.6).
- Files with information about loci length modes (loci_modes) and the self-alignment raw score for the representative alleles (short/self_scores) are pre-computed and automatically updated (the process no longer creates and updates a file with the self-alignment raw score per locus).
- The process creates the pre_computed folder to store files with hash tables that are used to speed up exact matching and avoid running the step to translate the schema alleles in every run.
- Added the --cds parameter to accept FASTA files with CDSs (one FASTA file per genome) and skip gene prediction with Prodigal.
- Users can control the addition of novel alleles to the schema with the --no-inferred parameter.
- Added the --output-unclassified parameter to write a FASTA file (unclassified_sequences.fasta) with the distinct CDSs that were not classified in a run.
- Added the --output-missing parameter to write a FASTA file (missing_classes.fasta) and a TSV file with information about the classified sequences that led to a locus being classified as ASM, ALM, PLOT3, PLOT5, LOTSC, NIPH, NIPHEM and PAMA.
- Added the --no-cleanup parameter to keep the temporary folder with intermediate files created during a run.
- Removed the --contained, --force-reset, --store-profiles (to be reimplemented in a future release), --json and --verbose parameters.
- The --force-continue parameter no longer allows users to continue a run that was interrupted. This parameter is now used to ignore warnings and prompts about missing configuration files and the usage of multiple argument values per parameter.
- The allelic profiles in the results_alleles.tsv file can be hashed by providing the --hash-profiles parameter and a valid hash type as argument (hash algorithms available from the hashlib library and crc32 and adler32 from the zlib library).
- The process creates a TSV file, cds_coordinates.tsv, with the genomic coordinates for all CDSs identified in the input files.
- The process creates a TSV file, loci_summary_stats.tsv, with summary statistics for loci classifications.
- The process no longer creates the RepeatedLoci.txt file. It now creates the paralogous_counts.tsv and paralogous_loci.tsv files with more detailed information about the loci identified as paralogous.
- The PLNF class is attributed in modes 1, 2 and 3 to indicate that a more thorough analysis might have found a match for the loci that were not found (LNF).
- CDSs that match several loci are classified as PAMA.
- Bugfix for PLOT3, PLOT5 and LOTSC classification types. LOTSC classification was not always attributed when a contig was smaller than the matched representative allele and some PLOT5 cases were classified as LOTSC. LOTSC cases counted as exact matches in the results_statistics.tsv file.
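Two of the ideas in the AlleleCall changes above lend themselves to a small sketch: classifying each distinct CDS once and propagating the result to every genome that contains it, and hashing allele sequences with an algorithm from hashlib or with crc32/adler32 from zlib. The function names, the use of SHA-256 to identify distinct sequences, and the assumption that profile hashing is applied to the allele DNA sequence are illustrative choices, not a description of chewBBACA's internals.

```python
import hashlib
import zlib
from collections import defaultdict


def sequence_hash(sequence):
    """Return a SHA-256 hex digest used as the key for a distinct sequence."""
    return hashlib.sha256(sequence.upper().encode()).hexdigest()


def distinct_cds(genome_cds):
    """Map each distinct CDS hash to (sequence, set of genomes containing it).

    genome_cds: genome identifier -> list of CDS sequences (hypothetical layout).
    """
    distinct = {}
    for genome, sequences in genome_cds.items():
        for seq in sequences:
            key = sequence_hash(seq)
            if key not in distinct:
                distinct[key] = (seq.upper(), set())
            distinct[key][1].add(genome)
    return distinct


def propagate(distinct, classify):
    """Classify each distinct CDS once and copy the label to all its genomes."""
    results = defaultdict(dict)
    for key, (seq, genomes) in distinct.items():
        label = classify(seq)  # called once per distinct CDS, not once per genome
        for genome in genomes:
            results[genome][key] = label
    return dict(results)


def hash_allele(sequence, hash_type):
    """Hash an allele sequence with a hashlib algorithm or zlib's crc32/adler32."""
    data = sequence.encode()
    if hash_type in ("crc32", "adler32"):
        return getattr(zlib, hash_type)(data)  # integer checksum
    return hashlib.new(hash_type, data).hexdigest()  # hex digest string
```

With this layout the cost of classification scales with the number of distinct CDSs, not with the total number of CDSs across all input genomes, which is the saving the first bullet describes.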
Additional changes
- The UniprotFinder allows users to search for annotations through UniProt's SPARQL endpoint or based on matches against UniProt's reference proteomes or both.
- Bugfix for an issue in the UniprotFinder module that was leading to errors when the data returned by UniProt's SPARQL endpoint only contained one set of annotation terms.
- Bugfix for an issue in the UniprotFinder module that was preventing the annotations from being written to the output file.
- Bugfix for an issue in the map_async_parallelizer function that led to high memory usage.
- Implemented and changed several functions in the modules included in the utils folder to optimize code reusability, reduce runtime and peak memory usage, especially for large schemas and datasets (these changes affect mostly the CreateSchema and AlleleCall modules).
- Updated function docstrings and added comments.
v2.7.0 - Aarrr wwgggh waah
New implementation of the CreateSchema process. This new implementation significantly reduces execution time. It is designed to enable schema creation based on hundreds or thousands of assemblies on a laptop. The schemas generated by the new implementation are fully compatible with previous versions.
Additional changes
- Improved detection of invalid inputs (inputs that do not contain coding sequences (CDSs), that contain invalid sequences/characters, empty files, etc).
- New parameter --pm allows users to set Prodigal's execution mode. The single mode is the default mode. Use the meta mode for input files that have less than 100 kbp (e.g.: plasmids, viruses).
- CreateSchema accepts one or several FASTA files with CDSs if the --CDS option is included in the command. This option skips the gene prediction step with Prodigal and creates a schema seed based on the CDSs in the input files.
- AlleleCall can automatically detect parameter values previously used with a schema. Users only need to provide values for the -i, -g and -o parameters.
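The guidance on Prodigal's execution mode can be turned into a small helper that suggests a value for --pm from the total sequence length of an input file. The helper names are hypothetical and chewBBACA does not necessarily make this choice automatically; the 100 kbp threshold is the one given in the notes above.

```python
def total_length(fasta_text):
    """Sum the lengths of all sequence lines in FASTA-formatted text."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )


def suggest_prodigal_mode(total_bp, threshold=100_000):
    """Suggest 'meta' for inputs shorter than the threshold, else 'single'."""
    return "meta" if total_bp < threshold else "single"


# Toy input standing in for a short plasmid sequence.
plasmid = ">contig1\nATGCATGC\nATGC\n>contig2\nGGCC\n"
mode = suggest_prodigal_mode(total_length(plasmid))
```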
v2.6.0 - Yyyuurrrrrrruuunghh
v2.5.5 - Caretaker
Release with version 2.5.5
Changes since last release with version 2.1.0 include:
We've developed Chewie-NS, a Nomenclature Server that is based on the TypOn ontology and integrates with chewBBACA to provide access to gene-by-gene typing schemas and to allow a common and global allelic nomenclature to be maintained.
To allow all users to interact with Chewie-NS, we've implemented the following set of modules:
- LoadSchema: enables upload of new schemas to Chewie-NS.
- DownloadSchema: enables download of any schema from Chewie-NS.
- SyncSchema: compares local schemas, previously downloaded from Chewie-NS, with the remote versions in Chewie-NS to download and add new alleles to local schemas, submit new alleles to update remote schemas and ensure that a common allele identifier nomenclature is maintained.
- NSStats: retrieves basic information about species and schemas in Chewie-NS.
The documentation includes information about the integration with chewBBACA and how to run the new LoadSchema, DownloadSchema, SyncSchema and NSStats processes.
Chewie-NS source code is freely available and deployment of local instances can be easily achieved through Docker Compose.
This version also includes other changes:
- The AlleleCall process will detect if a schema was created with previous chewBBACA versions and ask users if they wish to convert the schema to the latest version. The conversion process will not alter your schema files; it will simply add configuration files and copy the Prodigal training file to the schema's directory. You can force schema conversion with the --fc argument.
- The Prodigal training file used to create the schema will be included in the schema's directory and can be automatically detected by the AlleleCall process.
- Schemas created with the CreateSchema process or adapted with the PrepExternalSchema process retain information about parameter values (BLAST Score Ratio, Prodigal training file, genetic code, minimum sequence length and sequence size variation threshold). Users are advised to keep performing allele calling with those parameter values to ensure consistent results and to keep open the possibility of uploading the schema to Chewie-NS. The AlleleCall process detects if a user provides parameter values that differ from the original values and requests confirmation before proceeding (you may force execution with the --fc argument).
- The AlleleCall process creates a SQLite database in the schema's directory that is used to store the allelic profiles determined with that schema.
- Further optimizations in the PrepExternalSchema process.
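The parameter check described above can be sketched as a comparison between the values stored with a schema and the values passed by the user. The dictionary keys, function names and confirmation callback below are hypothetical; the actual configuration format and prompts used by chewBBACA are not shown in these notes.

```python
def find_mismatches(stored, provided):
    """Return {parameter: (stored value, provided value)} for differing values.

    Parameters the user did not set (value None) are ignored.
    """
    return {
        name: (stored[name], value)
        for name, value in provided.items()
        if name in stored and value is not None and stored[name] != value
    }


def should_proceed(stored, provided, force_continue=False, ask=None):
    """Decide whether a run may proceed given stored and provided parameters.

    `ask` is an optional callback that receives the mismatches and returns
    True or False, standing in for an interactive confirmation prompt.
    """
    mismatches = find_mismatches(stored, provided)
    if not mismatches or force_continue:
        return True
    if ask is None:
        return False
    return bool(ask(mismatches))


# Hypothetical parameter names and values, for illustration only.
stored = {"bsr": 0.6, "translation_table": 11, "minimum_length": 201}
provided = {"bsr": 0.7, "translation_table": None, "minimum_length": 201}
```

Here `find_mismatches(stored, provided)` reports only the BLAST Score Ratio, and `should_proceed` returns True without asking when `force_continue` is set, mirroring the role of the --fc argument.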
v2.1.0 - Mimban
Release with version 2.1.0.
Changes since last release with version 2.0.5 include:
- New PrepExternalSchema process implementation;
- New argument options in the AlleleCall process (--CDS, --st) and a timestamp is added to new allele names;
- New argument option in the CreateSchema process (--CDS);
- ExtractCgMLST process optimization;
- Prodigal training files included in the package;
- Bug fixes.
v2.0.5 - Bowcaster
chewBBACA pip package using Python 3.
v1.0.0 - Kashyyyk
First release of chewBBACA software using Python 2.7.