diff --git a/README.md b/README.md index 58ef317..0df9d85 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -

+

IsoSeq v3

Scalable De Novo Isoform Discovery

@@ -18,14 +18,15 @@ Latest version can be installed via bioconda package `isoseq3`. Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda) for information on Installation, Support, License, Copyright, and Disclaimer. -## Specific Version Documentation +## Workflow Documentation - * [Version 3.2, SMRT Link 8.0](README_v3.2.md) - * [Version 3.1, SMRT Link 7.0](README_v3.1.md) - * [Version 3.0, SMRT Link 6.0](README_v3.0.md) + * [Iso-Seq Clustering](isoseq-clustering.md) + * Iso-Seq Deduplication (UMIs and cell barcodes) [Future release] ## Changelog - * **3.2.2** + * **3.3.0** + * SMRT Link release 9.0.0 + * 3.2.2 * Fix `polish` not generating fasta/q output. This bug was introduced in v3.2.0 * 3.2.1 * Fix an off-by-one gff index bug in `collapse` @@ -49,6 +50,12 @@ called `refine`. Your custom `primers.fasta` is used in this step to detect concatemers. ## FAQ +### Where is the workflow that starts from unpolished CCS reads? +To simplify, unify, and future-proof Iso-Seq, we decided to remove documentation +starting from unpolished CCS reads. With the ever-increasing polymerase read +lengths and improvements to CCS, going forward, it is recommended to generate +polished CCS reads first and thus make final transcript polishing optional. + ### Why IsoSeq v3 and not the established versions 1 or 2? The ever-increasing throughput of the Sequel system gave rise to the need for a scalable software solution that can handle millions of CCS reads, while @@ -57,11 +64,11 @@ maintaining sensitivity and accuracy. Internal benchmarks have shown that [SQANTI](https://bitbucket.org/ConesaLab/sqanti) attributes a higher number of perfectly annotated isoforms to *IsoSeq v3*: - + An additional benefit: a single Linux binary that requires no dependencies. -### Why is the number of transcripts much lower with IsoSeq3? +### Why is the number of transcripts much lower with IsoSeq v3? Even though we also observe fewer polished transcripts with *IsoSeq v3*, the overall quality is much higher. Most of the low-quality transcripts are lost in the demultiplexing step. *Isoseq v1/2 classify* is too relaxed and is not filtering @@ -70,18 +77,17 @@ effectively removes most molecules that are wrongly tagged, such as two 5' or two 3' primers. Only a proper 5' and 3' primer pair allows identification of a full-length transcript and its orientation. - ### I can't find the *classify* step Starting with version 3.1, *classify* functionality has been split into two tools. Removal of (barcoded) primers is performed with PacBio's standard demultiplexing tool *lima*. *Lima* does not remove poly(A) tails, nor does it detect concatemers. -For this, `isoseq3 refine` generates FLNC reads. +For this, `isoseq refine` generates FLNC reads. For version 3.0, poly(A) tail removal and concatemer detection is performed in -`isoseq3 cluster` +`isoseq cluster`. ### My sample has poly(A) tails, how can I remove them? -Use `--require-polya` for `isoseq3 refine`. +Use `--require-polya` for `isoseq refine`. This filters for FL reads that have a poly(A) tail with at least 20 base pairs and removes the identified tail. @@ -107,7 +113,7 @@ feasible. *IsoSeq v3* deems two reads to stem from the same transcript if they meet the following criteria: - + There is no upper limit on the number of gaps. @@ -128,7 +134,7 @@ PacBio supports three different SMRTbell designs for IsoSeq libraries. In all designs, transcripts are labelled with asymmetric primers, whereas a poly(A) tail is optional. Barcodes may be optionally added.
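For the barcoded design, the barcode is carried on the primer that *lima* removes during demultiplexing. As an illustration, the snippet below simply repeats the barcoded-primer example used elsewhere in this documentation (16 bp barcodes, here named for a brain and a liver sample, followed by the Clontech 3' primer):

    >primer_5p
    AAGCAGTGGTATCAACGCAGAGTACATGGGG
    >brain_3p
    CGCACTCTGATATGTGGTACTCTGCGTTGATACCACTGCTT
    >liver_3p
    CTCACAGTCTGTGTGTGTACTCTGCGTTGATACCACTGCTT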
- + ### The binary does not work on my linux system! Binaries require **SSE4.1 CPU support**; CPUs after 2008 (Penryn) include it. diff --git a/README_v3.0.md b/README_v3.0.md deleted file mode 100644 index df7d86d..0000000 --- a/README_v3.0.md +++ /dev/null @@ -1,222 +0,0 @@ -

-

IsoSeq v3.0

-

Scalable De Novo Isoform Discovery

- -*** - -*IsoSeq v3.0* contains the newest tools to identify transcripts in -PacBio single-molecule sequencing data. -Starting in SMRT Link v6.0.0, those tools power the -*IsoSeq GUI-based analysis* application. -A composable workflow of existing tools and algorithms, combined with -a new clustering technique, allows to process the ever-increasing yield of PacBio -machines with similar performance to *IsoSeq* versions 1 and 2. - -## Availability -Latest version can be installed via bioconda package `isoseq3`. - -Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda) -for information on Installation, Support, License, Copyright, and Disclaimer. - -## Overview - - [SMRTbell Designs](README_v3.0.md#smrtbell-designs) - - [Workflow Overview](README_v3.0.md#workflow) - - [Real-World Example](README_v3.0.md#real-world-example) - - [FAQ](README.md#faq) - -## Workflow - - - -### Input -For each cell, the `.subreads.bam` and `.subreads.bam.pbi` -are needed for processing. - -### Circular Consensus Sequence calling -Each sequencing run is processed by [*ccs*](https://github.com/PacificBiosciences/ccs) -to generate one representative circular consensus sequence (CCS) for each ZMW. Only ZMWs with -at least one full pass (at least once subread with SMRT adapter on both ends) are -used for the subsequent analysis. Polishing is not necessary -in this step and is by default deactivated through. -_ccs_ can be installed with `conda install pbccs`. - - ccs movie.subreads.bam ccs.bam --noPolish --minPasses 1 - -For **CCS version ≥ 4.0.0** use this call: - - $ ccs movie.subreads.bam ccs.bam --skip-polish --min-passes 1 --draft-mode winpoa --disable-heuristics - -### Primer removal and demultiplexing -Removal of cDNA primers and identification of barcodes (if given) is performed using [*lima*](https://github.com/pacificbiosciences/barcoding), -which can be installed with `conda install lima` and offers a specialized `--isoseq` mode. - -More information about how to name input primer(+barcode) -sequences in this [FAQ](https://github.com/pacificbiosciences/barcoding#how-can-i-demultiplex-isoseq-data). - - lima --isoseq --dump-clips ccs.bam primers.fasta demux.bam - -The following is the `primer.fasta` for the Clontech SMARTer cDNA library prep, which is the officially recommended protocol: - - >primer_5p - AAGCAGTGGTATCAACGCAGAGTACATGGG - >primer_3p - GTACTCTGCGTTGATACCACTGCTT - -The following are examples for barcoded samples using a 16bp barcode followed by Clontech primer: - - >primer_5p - AAGCAGTGGTATCAACGCAGAGTACATGGGG - >brain_3p - CGCACTCTGATATGTGGTACTCTGCGTTGATACCACTGCTT - >liver_3p - CTCACAGTCTGTGTGTGTACTCTGCGTTGATACCACTGCTT - -*lima* will remove unwanted combinations and orient sequences to 5' -> 3' orientation. - -From here on, execute the following steps for each output BAM file. - -### Clustering and polishing -*IsoSeq v3* wraps all tools into one fat binary. - - $ isoseq3 - isoseq3 - De Novo Transcript Reconstruction - - Tools: - cluster - Cluster CCS reads to transcripts - polish - Polish the clustering output - summarize - Create a barcode overview CSV file - - Examples: - isoseq3 cluster movie.consensusreadset.xml unpolished.bam - isoseq3 polish unpolished.bam movie.subreadset.xml polished.bam - isoseq3 summarize polished.bam summary.csv - -#### Clustering and transcript clean up -Compared to previous IsoSeq approaches, *IsoSeq v3* performs a single clustering -technique. -Due to the nature of the algorithm, it can't be efficiently parallelized. 
It is advised to give this step as many cores -as possible. The individual steps of *cluster* are as following: - - [Trimming](https://github.com/PacificBiosciences/trim_isoseq_polyA) of polyA tails `--require-polya` - - Rapid concatmer [identification](https://github.com/jeffdaily/parasail) and removal - - Clustering using hierarchical n*log(n) [alignment](https://github.com/lh3/minimap2) and iterative cluster merging - - Unpolished [POA](https://github.com/rvaser/spoa) sequence generation - -##### Input -The input file for *cluster* is one demultiplexed CCS file: - - `` or `` - -##### Output -The following output files of *cluster* contain unpolished isoforms: - - `.bam` - - `.flnc.bam` - - `.fasta` - - `.bam.pbi` <- Only generated with `--pbi` - - `.transcriptset.xml` <- Only relevant for pbsmrtpipe - - `.consensusreadset.xml` <- Only relevant for pbsmrtpipe - -Example invocation: - - isoseq3 cluster demux.P5--P3.bam unpolished.bam -j 32 [--split-bam 24] - -#### Polishing -The algorithm behind *polish* is the *arrow* model that also used for CCS -generation and polishing of de-novo assemblies. This step can be massively -parallelized by splitting the `unpolished.bam` file. Split BAM files can be -generated by *cluster*. - -##### Input -The input files for *polish* are: - - `.bam` or `.transcriptset.xml` - - `.subreads.bam` or `.subreadset.xml` - -##### Output -The following output files of *polish* contain polished isoforms: - - `.bam` - - `.bam.pbi` <- Only generated with `--pbi` - - `.transcriptset.xml` - - `.hq.fasta.gz` with predicted accuracy ≥ 0.99 - - `.lq.fasta.gz` with predicted accuracy < 0.99 - - `.hq.fastq.gz` with predicted accuracy ≥ 0.99 - - `.lq.fastq.gz` with predicted accuracy < 0.99 - -Example invocation: - - isoseq3 polish unpolished.bam m54020_171110_2301211.subreads.bam polished.bam - - - -## Real-world example -This is an example of an end-to-end cmd-line-only workflow to get from -subreads to polished isoforms. 
- - $ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreads.bam - $ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreads.bam.pbi - - $ ccs --version - ccs 3.1.0 (commit v3.1.0) - - $ time ccs m54086_170204_081430.subreads.bam m54086_170204_081430.ccs.bam \ - --noPolish --minPasses 1 - - real 50m43.090s - user 3531m35.620s - sys 24m36.884s - - $ cat primers.fasta - >primer_5p - AAGCAGTGGTATCAACGCAGAGTACATGGGG - >primer_3p - AAGCAGTGGTATCAACGCAGAGTAC - - $ lima --version - lima 1.7.1 (commit v1.7.1) - - $ time lima m54086_170204_081430.ccs.bam primers.fasta demux.bam \ - --isoseq --dump-clips - - real 0m6.543s - user 0m51.170s - - $ ls demux* - demux.json demux.lima.counts demux.lima.report demux.lima.summary demux.primer_5p--primer_3p.bam demux.primer_5p--primer_3p.subreadset.xml - - $ time isoseq3 cluster demux.primer_5p--primer_3p.bam unpolished.bam --verbose - Read BAM : (200740) 8s 313ms - India : (197869) 9s 204ms - Save flnc file : 35s 366ms - Convert to reads : 36s 967ms - Sort Reads : 69ms 756us - Aligning Linear : 42s 620ms - Read to clusters : 7s 506ms - Aligning Linear : 37s 595ms - Merge by mapping : 37s 645ms - Consensus : 1m 47s - Merge by mapping : 8s 861ms - Consensus : 12s 633ms - Write output : 3s 265ms - Complete run time : 5m 12s - - real 5m12.888s - user 58m35.243s - - $ ls unpolished* - unpolished.bam unpolished.bam.pbi unpolished.cluster unpolished.fasta unpolished.flnc.bam unpolished.flnc.bam.pbi unpolished.flnc.consensusreadset.xml unpolished.transcriptset.xml - - $ time isoseq3 polish unpolished.bam m54086_170204_081430.subreads.bam polished.bam --verbose - 14561 - - real 60m37.564s - user 2832m8.382s - $ ls polished* - polished.bam polished.bam.pbi polished.hq.fasta.gz polished.hq.fastq.gz polished.lq.fasta.gz polished.lq.fastq.gz polished.transcriptset.xml - -If you have multiple cells, you should run `--split-bam` in the cluster step which will produce chunked cluster results. Each chunked cluster result can be run as a parallel polish job and merged at the end. The following example splits into 24 chunks. `sample.subreadset.xml` is the dataset containing all the input cells. The `isoseq3 polish` jobs can be run in parallel. - - $ isoseq3 cluster demux.primer_5p--primer_3p.bam unpolished.bam --split-bam 24 - $ isoseq3 polish unpolished.0.bam sample.subreadset.xml polished.0.bam - $ isoseq3 polish unpolished.1.bam sample.subreadset.xml polished.1.bam - $ ... - -## DISCLAIMER - -THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES. diff --git a/README_v3.1.md b/README_v3.1.md deleted file mode 100644 index 4054dd8..0000000 --- a/README_v3.1.md +++ /dev/null @@ -1,293 +0,0 @@ -

-

IsoSeq v3.1

-

Scalable De Novo Isoform Discovery

- -*** - -*IsoSeq v3.1* contains the newest tools to identify transcripts in -PacBio single-molecule sequencing data. -Starting in SMRT Link v7.0.0, those tools power the -*IsoSeq GUI-based analysis* application. -A composable workflow of existing tools and algorithms, combined with -a new clustering technique, allows to process the ever-increasing yield of PacBio -machines with similar performance to *IsoSeq* versions 1 and 2. - -## Availability -Latest version can be installed via bioconda package `isoseq3`. - -Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda) -for information on Installation, Support, License, Copyright, and Disclaimer. - -## Overview - - Workflow Overview: [high](README_v3.1.md#high-level-workflow) / [mid](README_v3.1.md#mid-level-workflow) / [low](README_v3.1.md#low-level-workflow) level - - [Real-World Example](README_v3.1.md#real-world-example) - - [FAQ](README_v3.1.md#faq) - - [SMRTbell Designs](README_v3.1.md#what-smrtbell-designs-are-possible) - -## High-level workflow - -The high-level workflow depicts files and processes: - - - -## Mid-level workflow - -The mid-level workflow schematically explains what happens at each stage: - - - -## Low-level workflow - -The low-level workflow explained via CLI calls. All necessary dependencies are -installed via bioconda. - -### Step 0 - Input -For each SMRT cell, the `movieX.subreads.bam`, `movieX.subreads.bam.pbi`, -and `movieX.subreadset.xml` are needed for processing. - -### Step 1 - Circular Consensus Sequence calling -Each sequencing run is processed by [*ccs*](https://github.com/PacificBiosciences/ccs) -to generate one representative circular consensus sequence (CCS) for each ZMW. Only ZMWs with -at least one full pass (at least one subread with SMRT adapter on both ends) are -used for the subsequent analysis. Polishing is not necessary -in this step and is by default deactivated through. -_ccs_ can be installed with `conda install pbccs`. - - $ ccs movieX.subreads.bam movieX.ccs.bam --noPolish --minPasses 1 - -For long movies and short inserts, it is advised to limit the number of subreads -used per ZMW; this can decrease run-time (only available in ccs version ≥ 3.1.0): - - $ ccs movieX.subreads.bam movieX.ccs.bam --noPolish --minPasses 1 --maxPoaCoverage 10 - -For **CCS version ≥ 4.0.0** use this call: - - $ ccs movieX.subreads.bam movieX.ccs.bam --skip-polish --min-passes 1 --draft-mode winpoa --disable-heuristics - -### Step 2 - Primer removal and demultiplexing -Removal of primers and identification of barcodes is performed using [*lima*](https://github.com/pacificbiosciences/barcoding), -which can be installed with \ -`conda install lima` and offers a specialized `--isoseq` mode. -Even in the case that your sample is not barcoded, primer removal is performed -by *lima*. -If there are more than two sequences in your `primer.fasta` file or better said -more than one pair of 5' and 3' primers, please use *lima* with `--peek-guess` -to remove spurious false positive signal. -More information about how to name input primer(+barcode) -sequences in this [FAQ](https://github.com/pacificbiosciences/barcoding#how-can-i-demultiplex-isoseq-data). 
- - $ lima movieX.ccs.bam barcoded_primers.fasta movieX.fl.bam --isoseq --peek-guess - -**Example 1:** -Following is the `primer.fasta` for the Clontech SMARTer and NEB cDNA library -prep, which are the officially recommended protocols: - - >NEB_5p - GCAATGAAGTCGCAGGGTTGGG - >Clontech_5p - AAGCAGTGGTATCAACGCAGAGTACATGGGG - >NEB_Clontech_3p - GTACTCTGCGTTGATACCACTGCTT - -**Example 2:** -Following are examples for barcoded primers using a 16bp barcode followed by -Clontech primer: - - >primer_5p - AAGCAGTGGTATCAACGCAGAGTACATGGGG - >brain_3p - CGCACTCTGATATGTGGTACTCTGCGTTGATACCACTGCTT - >liver_3p - CTCACAGTCTGTGTGTGTACTCTGCGTTGATACCACTGCTT - -*Lima* will remove unwanted combinations and orient sequences to 5' → 3' orientation. - -Output files will be called according to their primer pair. Example for -single sample libraries: - - movieX.fl.NEB_5p--NEB_Clontech_3p.bam - -If your library contains multiple samples, execute the following workflow -for each primer pair: - - movieX.fl.primer_5p--brain_3p.bam - movieX.fl.primer_5p--liver_3p.bam - -### Step 3 - Refine -Your data now contains full-length reads, but still needs to be refined by: - - [Trimming](https://github.com/PacificBiosciences/trim_isoseq_polyA) of poly(A) tails - - Rapid concatmer [identification](https://github.com/jeffdaily/parasail) and removal - -**Input** -The input file for *refine* is one demultiplexed CCS file with full-length reads -and the primer fasta file: - - `.fl.bam` or `.fl.consensusreadset.xml` - - `primers.fasta` - -**Output** -The following output files of *refine* contain full-length non-concatemer reads: - - `.flnc.bam` - - `.flnc.transcriptset.xml` - -Actual command to refine: - - $ isoseq3 refine movieX.NEB_5p--NEB_Clontech_3p.fl.bam primers.fasta movieX.flnc.bam - -If your sample has poly(A) tails, use `--require-polya`. -This filters for FL reads that have a poly(A) tail -with at least 20 base pairs and removes identified tail: - - $ isoseq3 refine movieX.NEB_5p--NEB_Clontech_3p.fl.bam movieX.flnc.bam --require-polya - -### Step 3b - Merge SMRT Cells -If you used more than one SMRT cells, use `dataset` for merging, -which can be installed with `conda install pbcoretools`. -Merge all of your `.flnc.bam` files: - - $ dataset create --type TranscriptSet merged.flnc.xml movie1.flnc.bam movie2.flnc.bam movieN.flnc.bam - -Similarly, merge all of your **source** `.subreadset.xml` files: - - $ dataset create --type SubreadSet merged.subreadset.xml movie1.subreadset.xml movie2.subreadset.xml movieN.subreadset.xml - -### Step 4 - Clustering -Compared to previous IsoSeq approaches, *IsoSeq v3* performs a single clustering -technique. -Due to the nature of the algorithm, it can't be efficiently parallelized. -It is advised to give this step as many coresas possible. -The individual steps of *cluster* are as following: - - - Clustering using hierarchical n*log(n) [alignment](https://github.com/lh3/minimap2) and iterative cluster merging - - Unpolished [POA](https://github.com/rvaser/spoa) sequence generation - -**Input** -The input file for *cluster* is one FLNC file: - - `.flnc.bam` or `merged.flnc.xml` - -**Output** -The following output files of *cluster* contain unpolished isoforms: - - `.bam` - - `.fasta` - - `.bam.pbi` - - `.transcriptset.xml` - -Example invocation: - - $ isoseq3 cluster merged.flnc.xml unpolished.bam --verbose - -### Step 5 - Serial Polishing -The algorithm behind *polish* is the *arrow* model that also used for CCS -generation and polishing of de-novo assemblies. 
- -**Input** -The input files for *polish* are: - - `.bam` or `.transcriptset.xml` - - `.subreadset.xml` or `merged.subreadset.xml` - -**Output** -The following output files of *polish* contain polished isoforms: - - `.bam` - - `.transcriptset.xml` - - `.hq.fasta.gz` with predicted accuracy ≥ 0.99 - - `.lq.fasta.gz` with predicted accuracy < 0.99 - - `.hq.fastq.gz` with predicted accuracy ≥ 0.99 - - `.lq.fastq.gz` with predicted accuracy < 0.99 - -Example invocation: - - $ isoseq3 polish unpolished.bam merged.subreadset.xml polished.bam - -### Alternative Step 4/5 - Parallel Polishing -Polishing can be massively parallelized on multiple servers by splitting -the `unpolished.bam` file. -Split BAM files can be generated by *cluster*. - - $ isoseq3 cluster merged.flnc.xml unpolished.bam --verbose --split-bam 24 - -This will create up to 24 output BAM files: - - unpolished.0.bam - unpolished.1.bam - ... - -Each of those `unpolished..bam` files can be polished in parallel: - - $ isoseq3 polish unpolished.0.bam sample.subreadset.xml polished.0.bam - $ isoseq3 polish unpolished.1.bam sample.subreadset.xml polished.1.bam - $ ... - -## Real-world example -This is an example of an end-to-end cmd-line-only workflow to get from -subreads to polished isoforms: - - $ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreads.bam - $ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreads.bam.pbi - $ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreadset.xml - - $ ccs --version - ccs 3.1.0 (commit v3.1.0) - - $ ccs m54086_170204_081430.subreads.bam m54086_170204_081430.ccs.bam \ - --noPolish --minPasses 1 --maxPoaCoverage 10 - - $ cat primers.fasta - >primer_5p - AAGCAGTGGTATCAACGCAGAGTACATGGGG - >primer_3p - AAGCAGTGGTATCAACGCAGAGTAC - - $ lima --version - lima 1.9.0 (commit v1.9.0) - - $ lima m54086_170204_081430.ccs.bam primers.fasta m54086_170204_081430.fl.bam \ - --isoseq --peek-guess - - $ ls m54086_170204_081430.fl* - m54086_170204_081430.fl.json m54086_170204_081430.fl.lima.summary - m54086_170204_081430.fl.lima.clips m54086_170204_081430.fl.primer_5p--primer_3p.bam - m54086_170204_081430.fl.lima.counts m54086_170204_081430.fl.primer_5p--primer_3p.subreadset.xml - m54086_170204_081430.fl.lima.report - - $ isoseq3 refine m54086_170204_081430.fl.primer_5p--primer_3p.bam primers.fasta m54086_170204_081430.flnc.bam - - $ ls m54086_170204_081430.flnc.* - m54086_170204_081430.flnc.bam m54086_170204_081430.flnc.filter_summary.json - m54086_170204_081430.flnc.bam.pbi m54086_170204_081430.flnc.report.csv - m54086_170204_081430.flnc.consensusreadset.xml - - $ isoseq3 cluster m54086_170204_081430.flnc.bam unpolished.bam --verbose - Read BAM : (197791) 4s 20ms - Convert to reads : 1s 431ms - Sort Reads : 56ms 947us - Aligning Linear : 2m 5s - Read to clusters : 9s 432ms - Aligning Linear : 54s 288ms - Merge by mapping : 36s 138ms - Consensus : 30s 126ms - Merge by mapping : 5s 418ms - Consensus : 3s 597ms - Write output : 1s 134ms - Complete run time : 4m 32s - - $ ls unpolished* - unpolished.bam unpolished.bam.pbi unpolished.cluster unpolished.fasta unpolished.transcriptset.xml - - $ isoseq3 polish unpolished.bam m54086_170204_081430.subreadset.xml polished.bam --verbose - 14561 - - $ ls polished* - polished.bam polished.hq.fastq.gz - polished.bam.pbi polished.lq.fasta.gz - polished.cluster_report.csv polished.lq.fastq.gz - polished.hq.fasta.gz polished.transcriptset.xml - -Or run 
*isoseq3 cluster* it in split mode and `isoseq3 polish` in parallel: - - $ isoseq3 cluster m54086_170204_081430.flnc.bam unpolished.bam --split-bam 24 - $ isoseq3 polish unpolished.0.bam m54086_170204_081430.subreadset.xml polished.0.bam - $ isoseq3 polish unpolished.1.bam m54086_170204_081430.subreadset.xml polished.1.bam - $ ... - -## DISCLAIMER - -THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES. diff --git a/doc/img/isoseq3-barcoding.png b/doc/img/isoseq-barcoding.png similarity index 100% rename from doc/img/isoseq3-barcoding.png rename to doc/img/isoseq-barcoding.png diff --git a/doc/img/isoseq-clustering-end-to-end.png b/doc/img/isoseq-clustering-end-to-end.png new file mode 100644 index 0000000..19b8b5c Binary files /dev/null and b/doc/img/isoseq-clustering-end-to-end.png differ diff --git a/doc/img/isoseq-clustering-workflow.pdf b/doc/img/isoseq-clustering-workflow.pdf new file mode 100644 index 0000000..c0401d3 Binary files /dev/null and b/doc/img/isoseq-clustering-workflow.pdf differ diff --git a/doc/img/isoseq-clustering-workflow.png b/doc/img/isoseq-clustering-workflow.png new file mode 100644 index 0000000..b0987a7 Binary files /dev/null and b/doc/img/isoseq-clustering-workflow.png differ diff --git a/doc/img/isoseq3-performance.png b/doc/img/isoseq-performance.png similarity index 100% rename from doc/img/isoseq3-performance.png rename to doc/img/isoseq-performance.png diff --git a/doc/img/isoseq3-similar-transcripts.png b/doc/img/isoseq-similar-transcripts.png similarity index 100% rename from doc/img/isoseq3-similar-transcripts.png rename to doc/img/isoseq-similar-transcripts.png diff --git a/doc/img/isoseq3.png b/doc/img/isoseq.png similarity index 100% rename from doc/img/isoseq3.png rename to doc/img/isoseq.png diff --git a/doc/img/isoseq3.0-workflow.png b/doc/img/isoseq3.0-workflow.png deleted file mode 100644 index 63a0f1f..0000000 Binary files a/doc/img/isoseq3.0-workflow.png and /dev/null differ diff --git a/doc/img/isoseq3.1-end-to-end.png b/doc/img/isoseq3.1-end-to-end.png deleted file mode 100644 index 83a0cee..0000000 Binary files a/doc/img/isoseq3.1-end-to-end.png and /dev/null differ diff --git a/doc/img/isoseq3.1-workflow.png b/doc/img/isoseq3.1-workflow.png deleted file mode 100644 index 17195d9..0000000 Binary files a/doc/img/isoseq3.1-workflow.png and /dev/null differ diff --git a/doc/img/isoseq3.2-end-to-end.png b/doc/img/isoseq3.2-end-to-end.png deleted file mode 100644 index 6b926bc..0000000 Binary files a/doc/img/isoseq3.2-end-to-end.png and /dev/null differ diff --git a/doc/img/isoseq3.2-workflow.png b/doc/img/isoseq3.2-workflow.png deleted file mode 100644 index c8e2aa9..0000000 Binary files a/doc/img/isoseq3.2-workflow.png and /dev/null differ diff --git a/README_v3.2.md b/isoseq-clustering.md similarity index 66% rename from README_v3.2.md rename to isoseq-clustering.md index 7b0dbf4..964ee25 100644 --- 
a/README_v3.2.md +++ b/isoseq-clustering.md @@ -1,48 +1,20 @@ -

-

IsoSeq v3.2

-

Scalable De Novo Isoform Discovery

+

+

IsoSeq v3

+

Generate transcripts by clustering HiFi reads

*** -*IsoSeq v3.2* contains the newest tools to identify transcripts in -PacBio single-molecule sequencing data. -Starting in SMRT Link v8.0.0, those tools power the -*IsoSeq GUI-based analysis* application. -A composable workflow of existing tools and algorithms, combined with -a new clustering technique, allows to process the ever-increasing yield of PacBio -machines with similar performance to *IsoSeq* versions 1 and 2. - -Focus of version 3.2 documentation is processing of polished CCS reads, -the latest feature of *IsoSeq v3*. Processing of unpolished CCS reads with final -transcript polishing is still supported, please refer to the -[documentation of version 3.1](README_v3.1.md). - -**Attention:** Version 3.2 dropped support of RS II data. -Please use version 3.1 for RS II data with `conda install isoseq3=3.1` - -## Availability -Latest version can be installed via bioconda package `isoseq3`. - -Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda) -for information on Installation, Support, License, Copyright, and Disclaimer. - -## Overview - - Workflow Overview: [high](README_v3.2.md#high-level-workflow) / [mid](README_v3.2.md#mid-level-workflow) / [low](README_v3.2.md#low-level-workflow) level - - [Real-World Example](README_v3.2.md#real-world-example) - - [FAQ](README_v3.2.md#faq) - - [SMRTbell Designs](README_v3.2.md#what-smrtbell-designs-are-possible) - ## High-level workflow The high-level workflow depicts files and processes: - + ## Mid-level workflow The mid-level workflow schematically explains what happens at each stage: - + ## Low-level workflow @@ -54,16 +26,13 @@ For each SMRT cell a `movieX.subreads.bam` is needed for processing. ### Step 1 - Circular Consensus Sequence calling Each sequencing run is processed by [*ccs*](https://github.com/PacificBiosciences/ccs) -to generate one representative circular consensus sequence (CCS) for each ZMW. Only ZMWs with -at least one full pass (at least one subread with SMRT adapter on both ends) are -used for the subsequent analysis. In contrast to older IsoSeq versions, -CCS polishing is required to enable skipping of the transcript polishing. -It is advised to use the latest CCS version 4.0.0 or newer. +to generate one representative circular consensus sequence (CCS) for each ZMW. +It is advised to use the latest CCS version 4.2.0 or newer. _ccs_ can be installed with `conda install pbccs`. $ ccs movieX.subreads.bam movieX.ccs.bam --min-rq 0.9 -More info how to [easily chunk ccs](https://github.com/PacificBiosciences/ccs#how-can-I-parallelize-on-multiple-servers). +You can easily parallelize _ccs_ generation by chunking, please follow [this how-to](https://github.com/PacificBiosciences/ccs#how-can-I-parallelize-on-multiple-servers). ### Step 2 - Primer removal and demultiplexing Removal of primers and identification of barcodes is performed using [*lima*](https://github.com/pacificbiosciences/barcoding), @@ -132,20 +101,19 @@ The following output files of *refine* contain full-length non-concatemer reads: Actual command to refine: - $ isoseq3 refine movieX.NEB_5p--NEB_Clontech_3p.fl.bam primers.fasta movieX.flnc.bam + $ isoseq refine movieX.NEB_5p--NEB_Clontech_3p.fl.bam primers.fasta movieX.flnc.bam If your sample has poly(A) tails, use `--require-polya`. 
This filters for FL reads that have a poly(A) tail -with at least 20 base pairs and removes identified tail: +with at least 20 base pairs (`--min-polya-length`) and removes the identified tail: - $ isoseq3 refine movieX.NEB_5p--NEB_Clontech_3p.fl.bam movieX.flnc.bam --require-polya + $ isoseq refine movieX.NEB_5p--NEB_Clontech_3p.fl.bam movieX.flnc.bam --require-polya ### Step 3b - Merge SMRT Cells -If you used more than one SMRT cells, use `dataset` for merging, -which can be installed with `conda install pbcoretools`. -Merge all of your `.flnc.bam` files: +If you used more than one SMRT cell, list all of your `.flnc.bam` files in one +`flnc.fofn`, a file of filenames: - $ dataset create --type TranscriptSet merged.flnc.xml movie1.flnc.bam movie2.flnc.bam movieN.flnc.bam + $ ls movie1.flnc.bam movie2.flnc.bam movieN.flnc.bam > flnc.fofn ### Step 4 - Clustering Compared to previous IsoSeq approaches, *IsoSeq v3* performs a single clustering @@ -159,7 +127,7 @@ The individual steps of *cluster* are as follows: **Input** The input file for *cluster* is one FLNC file: - - `.flnc.bam` or `merged.flnc.xml` + - `.flnc.bam` or `flnc.fofn` **Output** The following output files of *cluster* contain polished isoforms: @@ -171,7 +139,59 @@ Example invocation: - $ isoseq3 cluster merged.flnc.xml polished.bam --verbose --use-qvs + $ isoseq cluster flnc.fofn clustered.bam --verbose --use-qvs + +### Step 5 - Optional polishing and per-base QV calculation +In this optional step, you can generate per-base QVs for the transcript consensus +sequences and gain a small improvement in consensus accuracy. +The tool for this is called *polish*; in addition to the clustered transcripts, it requires the original subreads. +This step is very time-consuming, and you likely do not need the extra +quality or the QVs. + +If you have more than one cell's worth of data, you must merge the `subreadset.xml` +files. Please use `dataset` for merging, which can be installed with +`conda install pbcoretools`. +Merge all of your **source** `.subreadset.xml` files: + + $ dataset create --type SubreadSet merged.subreadset.xml movie1.subreadset.xml movie2.subreadset.xml movieN.subreadset.xml + +**Input** +The input files for *polish* are: + - `.bam` or `.transcriptset.xml` + - `.subreadset.xml` or `merged.subreadset.xml` + +**Output** +The following output files of *polish* contain polished isoforms: + - `.bam` + - `.transcriptset.xml` + - `.hq.fasta.gz` with predicted accuracy ≥ 0.99 + - `.lq.fasta.gz` with predicted accuracy < 0.99 + - `.hq.fastq.gz` with predicted accuracy ≥ 0.99 + - `.lq.fastq.gz` with predicted accuracy < 0.99 + +Example invocation: + + $ isoseq polish clustered.bam merged.subreadset.xml polished.bam + +### Alternative Step 4/5 - Parallel Polishing +Polishing can be massively parallelized on multiple servers by splitting +the `clustered.bam` file. +Split BAM files can be generated by *cluster*. + + $ isoseq cluster flnc.fofn clustered.bam --verbose --use-qvs --split-bam 24 + +This will create up to 24 output BAM files: + + clustered.0.bam + clustered.1.bam + ... + +Each of those chunked BAM files can be polished in parallel: + + $ isoseq polish clustered.0.bam merged.subreadset.xml polished.0.bam + $ isoseq polish clustered.1.bam merged.subreadset.xml polished.1.bam + $ ...
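If there are many chunks, the per-chunk calls can also be generated with a small shell loop rather than typed out individually. The following is only a sketch that reuses the `isoseq polish` invocation from above; it assumes 24 chunks and the `merged.subreadset.xml` from Step 5, and runs the chunks sequentially on one machine; on a cluster, submit each iteration to your job scheduler instead.

    $ # one polish job per chunk; the chunks are independent of each other
    $ for i in $(seq 0 23); do
    >     isoseq polish clustered.${i}.bam merged.subreadset.xml polished.${i}.bam
    > done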
+ ## Real-world example This is an example of an end-to-end cmd-line-only workflow to get from @@ -193,10 +213,9 @@ subreads to polished isoforms: AAGCAGTGGTATCAACGCAGAGTAC $ lima --version - lima 1.9.0 (commit v1.9.0) + lima 1.10.0 (commit v1.10.0) - $ lima m54086_170204_081430.ccs.bam primers.fasta m54086_170204_081430.fl.bam \ - --isoseq --peek-guess + $ lima m54086_170204_081430.ccs.bam primers.fasta m54086_170204_081430.fl.bam --isoseq $ ls m54086_170204_081430.fl* m54086_170204_081430.fl.json m54086_170204_081430.fl.lima.summary @@ -204,14 +223,14 @@ subreads to polished isoforms: m54086_170204_081430.fl.lima.counts m54086_170204_081430.fl.primer_5p--primer_3p.subreadset.xml m54086_170204_081430.fl.lima.report - $ isoseq3 refine m54086_170204_081430.fl.primer_5p--primer_3p.bam primers.fasta m54086_170204_081430.flnc.bam + $ isoseq refine m54086_170204_081430.fl.primer_5p--primer_3p.bam primers.fasta m54086_170204_081430.flnc.bam $ ls m54086_170204_081430.flnc.* m54086_170204_081430.flnc.bam m54086_170204_081430.flnc.filter_summary.json m54086_170204_081430.flnc.bam.pbi m54086_170204_081430.flnc.report.csv m54086_170204_081430.flnc.consensusreadset.xml - $ isoseq3 cluster m54086_170204_081430.flnc.bam polished.bam --verbose --use-qvs + $ isoseq cluster m54086_170204_081430.flnc.bam clustered.bam --verbose --use-qvs Read BAM : (197791) 4s 20ms Convert to reads : 1s 431ms Sort Reads : 56ms 947us @@ -225,10 +244,10 @@ subreads to polished isoforms: Write output : 1s 134ms Complete run time : 4m 32s - $ ls polished* - polished.bam polished.hq.fasta.gz - polished.bam.pbi polished.lq.fasta.gz - polished.cluster polished.transcriptset.xml + $ ls clustered* + clustered.bam clustered.hq.fasta.gz + clustered.bam.pbi clustered.lq.fasta.gz + clustered.cluster clustered.transcriptset.xml ## DISCLAIMER