Releases: bcgsc/NanoSim

v3.2.3

07 Feb 19:35
fa24994

General changes:

  • Updated the main README.md file to document the new --coverage option for the genome and transcriptome modes of the simulator.
  • Added the -x or --coverage flag to the simulator.py script. This option lets users specify a target coverage for the simulation without doing any calculations themselves. The target is interpreted as raw read coverage (following the Lander/Waterman equation). The simulator takes the kernel density estimation functions for aligned and unaligned read lengths, which read_analysis.py fits on empirical data and which are passed to the simulator with the --model_prefix flag, and combines them with the aligned/unaligned reads ratio to calculate the mean read length. It then divides the number of bases in the reference by the mean read length to obtain the number of reads required for 1x raw read coverage, and multiplies that number by the specified coverage to obtain the number of reads to simulate (#242).
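
The --coverage arithmetic described above can be sketched in a few lines. This is a minimal illustration, not NanoSim's code: the function name, its parameters, the weighted-mean formula for the mean read length, and the rounding up are all assumptions, and the two mean lengths stand in for the means of the fitted kernel density estimates.

```python
import math


def reads_for_coverage(reference_bases, mean_aligned_len, mean_unaligned_len,
                       aligned_fraction, coverage):
    """Estimate how many reads give the requested raw read coverage.

    All names are hypothetical; the mean lengths stand in for the means of
    the aligned and unaligned read-length distributions of a trained model.
    """
    if not 0.0 <= aligned_fraction <= 1.0:
        raise ValueError("aligned_fraction must be between 0 and 1")
    # Assumption: the mean read length is the mean of the two classes,
    # weighted by the fraction of aligned reads.
    mean_read_len = (aligned_fraction * mean_aligned_len
                     + (1.0 - aligned_fraction) * mean_unaligned_len)
    # Lander/Waterman: coverage = reads * mean_read_len / reference_bases,
    # so 1x coverage needs reference_bases / mean_read_len reads.
    reads_per_1x = reference_bases / mean_read_len
    # Assumption: round up so the target coverage is not undershot.
    return math.ceil(reads_per_1x * coverage)
```

For example, `reads_for_coverage(1_000_000, 10_000, 2_000, 0.9, 30)` works with a mean read length of 9,200 bases, needs about 108.7 reads for 1x coverage, and returns 3261 reads for 30x.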

genome mode:

  • For the genome mode of the simulator.py script, the coverage is calculated using the reference genome specified by the -rg or --ref_g flag.

transcriptome mode:

  • For the transcriptome mode of the simulator.py script, the coverage is calculated using the reference transcriptome specified by the -rt or --ref_t flag.

metagenome mode:

  • We currently do not support the --coverage option for the metagenome mode of the simulator.py script.

Notes:

  • We expect this approach to estimate the coverage with sufficient accuracy. However, if users specify a minimum, maximum, or mean read length that differs substantially from the empirical data, the calculated coverage may not match the coverage of the simulated output.

v3.2.2

09 Oct 21:29
7569bb8

General changes:

  • Fix return value when q_mode is True for function primary_and_unaligned_chimeric in get_primary_sam.py (#230)
  • Check for and skip secondary alignments in base quality characterization (#230)
  • If the homopolymer length modelling doesn't converge, detect this and raise a more informative error (#230)
  • Fixes for -min and -max parameters (#234)
  • Refactoring
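
Skipping secondary alignments, as in the base quality fix above, usually comes down to testing one bit of the SAM FLAG field. The sketch below works on plain (read name, flag) tuples. The 0x100 bit is defined by the SAM specification, while the function names and the tuple layout are illustrative assumptions and not NanoSim's actual code.

```python
SECONDARY_FLAG = 0x100  # SAM FLAG bit: "secondary alignment"


def is_secondary(flag):
    """Return True if the SAM FLAG value marks a secondary alignment."""
    return bool(flag & SECONDARY_FLAG)


def drop_secondary(records):
    """Keep only records whose flag does not have the secondary bit set.

    `records` is an iterable of (read_name, flag) tuples (hypothetical layout).
    """
    return [(name, flag) for name, flag in records if not is_secondary(flag)]


# Flags 256 and 272 (256 + 16, reverse strand) are secondary alignments.
records = [("read1", 0), ("read1", 256), ("read2", 16), ("read2", 272)]
print(drop_secondary(records))
```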

genome mode:

  • Fix to avoid IndexError being raised with some parameter combinations when simulating genomic reads (#232)

transcriptome mode:

  • Remove --chimeric from transcriptome characterization to keep consistency with simulation options (#230)

metagenome mode:

  • Fix missing aligner variable in metagenome mode (#230)
  • Remove homopolymer characterization/simulation from metagenome mode (#230)
  • Fix for simulating abundances in metagenome mode (#232)

v3.2.1

17 Sep 19:37
edc8f8b

General changes:

  • Bugfixes for --fastq mode of the characterization stage for genome and metagenome modes (#222)
  • Bugfix for base quality characterization in chimeric mode (#223)
  • Corrections to help pages
  • Updating README.md for current help pages of read_analysis.py and simulator.py
  • Add missing dependencies to requirements.txt

Enhancements:

  • Added a new pre-trained model for newest release, which includes characterization of base qualities and homopolymers (#224)

v3.2.0

17 Aug 04:42
db9b936

NanoSim Version 3.2.0 Release Notes

Overview

This release includes changes from April 2022 to August 2024. It adds a new feature to NanoSim and provides a new pre-trained model (R10 Chemistry and Dorado Basecaller) for users to choose from. It also contains several bugfixes. Details of all updates are outlined below:

New Feature

  • This release incorporates the calculation and analyses related to homopolymer length and base quality into the characterization stage, removing the dependency on hard-coded metrics as discussed in #212 (pull request #217). Thanks, @theottlo for this.

Enhancements

  • Uploaded a new pre-trained model: NA24385 - hg002 AshkenazimTrio - Son, sequenced by Kitv14 (R10 chemistry) and basecalled by dorado. Thanks @lcoombe for this. The model trained on a subset of 1M reads is uploaded to the NanoSim GitHub repository, and the model trained on the whole dataset is available through Zenodo. (79e5f92) - by @SaberHQ
  • Relaxed package requirements by @kmnip in #177

Bug Fixes

  • Fixed a bug related to read headers in metagenome simulation by @LokiLuciferase in #167
  • Fixed a bug related to potential infinite loops in metagenome mode by @kmnip in #189 (Addresses #184 and #185)
  • Fixed an infinite loop bug for very short references by @kmnip in #199 (fixes the issue reported in #130 (comment))

Documentation Updates

We made changes to NanoSim’s documentation to make it clearer.

  • Consolidated citation format, clarified installation details, and updated old content: #192
  • Added more clarification regarding the pre-trained models and included information about the newly trained model on NA24385 - hg002 with R10 chemistry and basecalled with dorado. (ef56977) - by @SaberHQ
  • Added information on how to avoid package incompatibility issues and also problems with conda installations (f7b6cea) - by @SaberHQ
  • Fixed typo by @xinehc in #169
  • Added information about cs tag in BAM files (adfd7c6)

Known Issues

We acknowledge that there is still a package dependency issue when using the older pre-trained models with the newest versions of some Python packages, such as scikit-learn. We highly recommend that everyone read the dependencies section of the README file for more information. If you train your own models (which is easy and straightforward), NanoSim should work fine. However, if you prefer to use the older pre-trained models, pay attention to the package versions installed in your environment and use the same versions indicated here.

Full Changelog: v3.1.0...v3.2.0

v3.1.0

13 Apr 07:04
23911b6

This release contains several major bug fixes and new features, as outlined below.

General changes:

  • updated requirements.txt
    • added missing package names (fixes #135)
    • updated package version (fixes #159, #120, and #131)
  • fixed bug where head/tail lengths are calculated without considering the strand of the alignment
  • fixed bug where sequence IDs in _aligned_error_profile do not match those in _aligned_reads.fastq (fixes #151)
  • set default file compression level to 1 (previously level 6)
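
The compression-level change trades file size for speed. As a rough illustration with Python's standard gzip module (whether NanoSim writes its files this way is an assumption, and the sample data is made up):

```python
import gzip

# Made-up FASTA-like payload, only for illustration.
data = b">read_1\n" + b"ACGTTGCA" * 5000 + b"\n"

fast = gzip.compress(data, compresslevel=1)   # new default: fastest compression
small = gzip.compress(data, compresslevel=6)  # previous default

# Both levels are lossless; only run time and output size differ.
assert gzip.decompress(fast) == data
assert gzip.decompress(small) == data
print(len(data), len(fast), len(small))
```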

genome mode:

  • fixes bug where -c option crashes

transcriptome mode:

  • new options for read_analysis.py:
    • -c detect chimeric reads
    • -q quantify transcript expression
    • -n normalize expression values by transcript length
  • new expression quantification algorithm based on abundance estimation in metagenome mode
  • fixed bug where identical read lengths are simulated for the same transcript (fixes #155; thanks Haoran Li)
  • fixed bug where transcripts without an ENS name prefix cannot be simulated, which may result in an infinite loop (fixes #112, #156)
  • optimized various parts of simulation (see #150, #158)
  • fixed bug where head/tail lengths are calculated without considering genome alignments in addition to transcriptome alignments (fixes #136)

metagenome mode:

  • the option --dna_type_list is not required when reference genomes are streamed from RefSeq

v3.0.2

17 Sep 21:15
f8970b3

This release is the version used to generate the Meta-NanoSim manuscript.

Changes include:

  1. Update README.md to include more information about dependencies and installation instructions.
  2. Bug fix for the pysam get cigar string function.
  3. Bug fix for simulated read length output.
  4. Included the option for EM base-level abundance quantification without chimeric reads detection.

v3.0.1

24 Jun 19:00

This release adds support for compressed files and fixes a few bugs:

  1. NanoSim now supports reading .gz sequence files and BAM files. It saves intermediate files as BAM instead of SAM to reduce disk usage.

  2. Every subprocess is re-seeded before running, to avoid repeated random sequences in simulated reads.

  3. Fixed bugs in lognormal distribution simulation and the -max_len feature (#118).

  4. Bug fix for read_analysis.py genome mode (#123).

  5. Added clarification to the README file about external programs needed to run NanoSim, including GenomeTools (gt) which is required to work with gtf/gff files for Intron Retention analysis.
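
Item 2 matters because worker processes that inherit the same random state can emit identical "random" sequences. One way to avoid this is to give every worker its own generator, seeded from a base seed plus the worker index. The sketch below uses hypothetical names, and its seed-derivation rule is an assumption, not necessarily what NanoSim does.

```python
import random


def simulate_chunk(worker_id, base_seed, n_bases=10):
    """Generate a random sequence with a generator private to this worker."""
    # Assumption: derive a distinct, reproducible seed per worker.
    rng = random.Random(base_seed + worker_id)
    return "".join(rng.choice("ACGT") for _ in range(n_bases))


# Each worker draws from its own stream, and reruns are reproducible.
chunks = [simulate_chunk(i, base_seed=42) for i in range(4)]
print(chunks)
```

Because the seed is an explicit argument, the same function could be handed to a process pool without relying on any global random state.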

V3.0.0

19 Apr 18:39

Official release of version 3.0.0

Major improvements from previous beta version:

  1. Quantification of metagenome abundance levels using EM algorithm

  2. Quantification mode now includes metagenome abundance level estimation. Parameters are a bit different now.

  3. requirements.txt now includes the joblib library, and the version numbers have been removed so that users can install the latest version of each package for best compatibility.
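
For readers unfamiliar with EM-based abundance estimation, the general idea can be sketched as follows. This is a generic textbook expectation-maximization loop over multi-mapped reads, not NanoSim's implementation: the function name and the input layout are hypothetical, and it works at read level and ignores genome length.

```python
def em_abundance(read_hits, n_iter=100):
    """Estimate relative abundances from reads that may hit several genomes.

    `read_hits` holds, for each read, the list of genome names it aligns to
    (hypothetical input layout). Returns {genome: fraction of reads}.
    """
    read_hits = [hits for hits in read_hits if hits]  # ignore unaligned reads
    genomes = sorted({g for hits in read_hits for g in hits})
    if not genomes:
        return {}
    # Start from a uniform guess.
    abundance = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        # E-step: split each read among its genomes by current abundance.
        expected = dict.fromkeys(genomes, 0.0)
        for hits in read_hits:
            total = sum(abundance[g] for g in hits)
            for g in hits:
                expected[g] += abundance[g] / total
        # M-step: abundances are the normalized expected read counts.
        abundance = {g: expected[g] / len(read_hits) for g in genomes}
    return abundance
```

With three reads unique to genome A, one unique to B, and two that align to both, the estimate converges to about 0.75 for A and 0.25 for B, because the ambiguous reads are shared in proportion to the current abundances.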

Minor changes:

  1. human_NA12878_cDNA_Bham1_guppy model is re-trained.

  2. README is updated with more info on input files

NanoSim v3.0.0 pre-release

19 Nov 14:23
Pre-release

Here we are announcing the NanoSim v3.0.0 pre-release; we will make it an official release once the manuscript is published. Please note that the attached tarball doesn't contain any pre-trained models, so it downloads much faster.

In this release, NanoSim is able to simulate metagenomes with variable abundance profiles.

Key features include:

  1. Quantify species abundance level, which is not readily available in existing abundance quantification tools

  2. Simulate multiple samples in one batch

  3. Simulate chimeric reads in metagenome mode and genome mode

  4. Simulate abundance variance deviated from expected value

Bug fixes and small improvements:

  1. Fixed the bug in fastq simulation that led to a discrepancy between quality score length and sequence length

  2. Changed the way of importing model files, allowing better compatibility

  3. Re-trained all the models to be compatible with the model importing

  4. Added 2 more pre-trained models for metagenome datasets

V2.6.0

09 Jun 04:48
a640384

In this release, there's a key update in the simulation stage. NanoSim is capable of simulating fastq files now! We characterized a few datasets and used truncated log-normal distribution models to simulate the base quality of unaligned reads, matched bases, and erroneous bases, for genome and transcriptome reads separately.
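
As a rough illustration of drawing base qualities from a truncated log-normal distribution, the sketch below uses simple rejection sampling with Python's standard library. The technique, the function name, and the parameter values are assumptions chosen for the example; real parameters would come from a trained model, and NanoSim's own sampling code may differ.

```python
import random


def sample_quality(mu, sigma, low, high, rng, max_tries=10_000):
    """Draw one Phred-like score from a log-normal truncated to [low, high]."""
    for _ in range(max_tries):
        value = rng.lognormvariate(mu, sigma)
        # Rejection step: keep only draws inside the truncation bounds.
        if low <= value <= high:
            return round(value)
    raise RuntimeError("no sample fell inside the truncation bounds")


rng = random.Random(0)
# Hypothetical parameters, chosen only for illustration.
scores = [sample_quality(2.5, 0.5, 1, 40, rng) for _ in range(20)]
# Encode the scores as a FASTQ quality string (Phred+33).
quality_string = "".join(chr(q + 33) for q in scores)
print(quality_string)
```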

Most of the changes in this release are for the simulation stage.

Other features:

  1. Perfect reads can have poly(A) tails now.

  2. Read files and error profiles for unaligned reads are separated from aligned reads now.

Bug fixes:

  1. Fixed minor bugs in IR modeling, eliminating the exon extraction biases and the read orientation problem

  2. Reversed the strandness information in simulation, which was opposite to the real orientation

  3. Solved occasional crashes when simulating unaligned reads

  4. Fixed the reversed head/tail length for reads from negative strand in transcriptome simulation

  5. Added missing file in pre-trained model