From f8184bcc549dbe3ebadaf98c775e6660c2866b0c Mon Sep 17 00:00:00 2001
From: Daniel <104648079+PB-DB@users.noreply.github.com>
Date: Thu, 12 May 2022 13:05:08 -0700
Subject: [PATCH] Docs Update: BAM Tags, changelog. (#20)

---
 docs/changelog.md          | 23 +++++++++++---
 docs/general-faq.md        | 18 +++++++----
 docs/index.md              |  2 +-
 docs/isoseq-tags.md        | 31 ++++++++++++++++++
 docs/umi/cli-workflow.md   | 65 +++++++++++++++++++++++++++++++++++---
 docs/umi/isoseq-bcstats.md | 29 +++++++++++++++++
 docs/umi/isoseq-correct.md | 62 ++++++++++++++++++++++++++++++++++++
 7 files changed, 215 insertions(+), 15 deletions(-)
 create mode 100644 docs/isoseq-tags.md
 create mode 100644 docs/umi/isoseq-bcstats.md
 create mode 100644 docs/umi/isoseq-correct.md
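
The barcode-correction rule that this patch documents in docs/umi/cli-workflow.md
(Step 5) can be sketched in Python as below. This is a minimal illustration, not
the `isoseq correct` implementation: it assumes a brute-force scan over the
whitelist in place of the LSH index, breaks any remaining ties alphabetically, and
reports the nearest barcode even when the read fails. The names `Correction`,
`correct_barcode`, and `max_dist` are hypothetical.

```python
from typing import Iterable, NamedTuple


class Correction(NamedTuple):
    corrected: str   # analogous to the CB tag
    raw: str         # analogous to the CR tag
    distance: int    # analogous to nb (-1 when nothing could be matched)
    candidates: int  # analogous to nc
    passed: bool     # analogous to gp


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Mismatching positions, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def correct_barcode(raw: str, whitelist: Iterable[str],
                    max_dist: int = 2) -> Correction:
    known = set(whitelist)
    if raw in known:
        # Exact whitelist hit: barcode is left unchanged.
        return Correction(raw, raw, 0, 1, True)
    if not known:
        # No candidates: barcode unchanged, read marked as failing.
        return Correction(raw, raw, -1, 0, False)
    # Assumption: score every whitelist entry instead of querying an LSH index.
    scored = [(edit_distance(raw, wl), hamming(raw, wl), wl) for wl in known]
    best = min(d for d, _, _ in scored)
    ties = [s for s in scored if s[0] == best]
    # Lowest edit distance, then lowest Hamming distance, then alphabetical.
    _, _, winner = min(ties, key=lambda s: (s[1], s[2]))
    return Correction(winner, raw, best, len(ties), best <= max_dist)
```

For example, `correct_barcode("AAAACCCT", ["AAAACCCC", "AAAACCCG"])` reports an
edit distance of 1 with two equally close candidates, which is the ambiguous case
that the `nc` tag is meant to flag.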

diff --git a/docs/changelog.md b/docs/changelog.md
index 99cbc89..a023cff 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -6,7 +6,22 @@ nav_order: 99
 
 # Version changelog
 
- * **3.4.0**
+ * **3.7.0**
+   * Adding `bcstats`, `correct`, and `groupdedup` to CLI
+   * `bcstats` emits frequency statistics for 10x barcodes
+   * `correct` uses a truth-set to correct sequencing errors in cell barcodes
+   * `groupdedup` provides substantial performance improvements over dedup
+   * Support SEGMENT read type
+
+ * 3.6.0
+   * Adding `tag` and `dedup` to CLI
+
+ * 3.5.0
+   * SMRT Link release 11.0
+   * Remove support for CLR data and disable `polish` step
+   * Enable `cluster --use-qvs` as always on
+
+ * 3.4.0
    * SMRT Link release 10.0.0
    * Add support for UMI and cell barcode handling, by adding `tag` and `dedup`
    * Add `refine --min-rq` to support RQ filtering for unfiltered
@@ -22,7 +37,7 @@ nav_order: 99
  * 3.2.1
    * Fix a gff index 1-off bug in `collapse`
    * We have removed implicit dependencies from the bioconda recipe. Please
-     install `pbccs`, `lima`, and `pbcoretools` as needed.
+     install `pbccs`, `lima`, and `pbcoretools` as needed
 
  * 3.2.0
    * **`polish` dropped support for RS II datasets!**
@@ -31,7 +46,7 @@ nav_order: 99
    * Add `refine --min-polya-length`
    * Add `cluster --singletons` to output unclustered FLNCs; potential sample
      prep artifacts!
-   * Fix minimap2 bugs. Outputs might change slightly.
+   * Fix minimap2 bugs. Outputs might change slightly
 
  * 3.1.2
    * Reduce `polish` memory footprint
@@ -44,4 +59,4 @@ nav_order: 99
  * 3.1.0
    * We outsourced the poly(A) tail removal and concatemer detection into a new
      tool called `refine`. Your custom `primers.fasta` is used in this step to
-     detect concatemers.
+     detect concatemers
diff --git a/docs/general-faq.md b/docs/general-faq.md
index 630e60c..957159b 100644
--- a/docs/general-faq.md
+++ b/docs/general-faq.md
@@ -8,18 +8,24 @@ nav_order: 5
 ## BAM tags explained
 Following BAM tags are being used:
 
- - `ib` Barcode summary: triplets delimited by semicolons, each triplet contains two barcode indices and the ZMW counts, delimited by comma. Example: `0,1,20;0,3,5`
- - `ic` Sum of number of passes from all ZMWs used to create consensus
- - `im` ZMW names associated with this isoform
- - `is` Number of ZMWs associated with this isoform
+ - `ib` Barcode summary: triplets delimited by semicolons, each triplet contains two barcode indices and the read counts, delimited by comma. Example: `0,1,20;0,3,5`
+ - `ic` Number of reads used to generate consensus. If less than `is`, this means that reads were down-sampled when consensus-calling
+ - `im` Read names associated with this isoform
+ - `is` Number of reads associated with this isoform
  - `it` List of barcodes/UMIs clipped during `tag`
  - `iz` Maximum number of subreads used for polishing
  - `rq` Predicted accuracy for polished isoform
  - `XA` Order of `tag` names
- - `XC` barcode sequence `tag`
+ - `XC` Cell/group barcode sequence `tag`
+ - `CB` Cell/group barcode sequence `tag`. This is an alias for XC, but its presence indicates that the barcode has been corrected
+ - `CR` Raw cell/group barcode sequence `tag`
  - `XG` PacBio's `GGG` UMI suffix `tag`
  - `XM` UMI sequence `tag`
- - `XO` overhang sequence `tag`
+ - `XO` Overhang sequence `tag`
+ - `nb` Edit distance between corrected cell/group barcode and raw cell/group barcode
+ - `gp` Pass/fail for cell/group barcode correction using a truth-set. 1 for pass, 0 for fail
+ - `nc` Number of known cell/group barcodes in the truth-set sharing the shortest distance from the raw barcode. If this number is > 1, this indicates ambiguity in remapping
+ - `oc` Original known cell/group barcodes in the truth-set sharing the shortest distance from the raw barcode
 
  Quality values are capped at `93`.
 
diff --git a/docs/index.md b/docs/index.md
index a5305b9..070bee4 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -26,7 +26,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
 for information on Installation, Support, License, Copyright, and Disclaimer.
 
 ## Latest Version
-Version **3.4.0**: [Full changelog here](/changelog)
+Version **3.7.0**: [Full changelog here](/changelog)
 
 ## What's new!
 New documentation is up, a 1:1 port from the original GitHub docs with minor
diff --git a/docs/isoseq-tags.md b/docs/isoseq-tags.md
new file mode 100644
index 0000000..01ca356
--- /dev/null
+++ b/docs/isoseq-tags.md
@@ -0,0 +1,31 @@
+---
+layout: default
+title: BAM Tags
+nav_order: 8
+---
+
+#### Iso-Seq Tags
+
+| Tag | Type | Short Name | Relevant Executable | Value |
+| --- | ---- | ---------- | ----- | ----- |
+|CR| string  | Cell Raw | `correct` | Raw (uncorrected) barcode. |
+|CB| string  | Cell Barcode | `correct` | Corrected cell/group barcode. |
+|UR| string  | UMI Raw | None currently | Molecular/UMI barcode. |
+|UB| string  | UMI Barcode | None currently | Corrected molecular/UMI barcode. |
+|XM| string  | UMI Barcode | `tag` | Corrected molecular/UMI barcode. |
+|XC| string  | Cell Barcode | `tag`, `correct` | Original Cell barcode. |
+|XA| string  | Tag name order | `tag`, `correct` | Order of tag names. |
+|nc| int     | Number of Candidates | `correct` | Number of candidate barcodes. |
+|oc| string  | Other Choices | `correct` | String representation of other potential barcodes. |
+|gp| int     | Group Passes | `correct` | Flag specifying whether or not the barcode for the given read passes filters. 1 for passing, 0 for failing. |
+|nb| int     | Barcode Distance | `correct` | Edit distance from the barcode for the read to the barcode to which it was reassigned. This is 0 if the barcode matches exactly, -1 if the barcode could not be rescued, and the edit distance otherwise. |
+|ic| int     | input-consensus | `dedup`, `groupdedup` | Number of reads used to generate consensus. If less than `is`, this means that reads were down-sampled when consensus-calling. |
+|is| int     | input-sequences | `dedup`, `groupdedup` | Number of reads associated with isoform. |
+|XO| string  | X Overhang | `tag` | Overhang sequence tag. |
+|XG| string  | X GGG      | `tag` | PacBio's GGG UMI suffix tag |
+|rq | float | read quality | | Predicted accuracy for polished isoform |
+|iz | int   | maximum subreads used | | maximum number of subreads used for polishing |
+|it | string | trimmed | `tag` | List of barcodes/UMIs clipped during tag |
+|im | string | names | `dedup`, `groupdedup` | List of names of input reads used in generating consensus |
+
+<img src="../doc/img/isoseq.png"/>
diff --git a/docs/umi/cli-workflow.md b/docs/umi/cli-workflow.md
index b12b223..0c76b05 100644
--- a/docs/umi/cli-workflow.md
+++ b/docs/umi/cli-workflow.md
@@ -135,10 +135,54 @@ If you used more than one SMRT cells, merge all of your `<movie>.fltnc.bam` file
 
     $ ls movie1.fltnc.bam movie2.fltnc.bam movieN.fltnc.bam > fltnc.fofn
 
-## Step 5 - Deduplication
+
+## Step 5 - Cell Barcode Correction
+This step identifies 10x cell barcode errors and corrects them. The tool uses the 10x cell barcode whitelist to reassign erroneous barcodes based on edit distance.
+
+
+**Method**
+
+First, the *correct* tool builds a Locality-Sensitive Hashing (LSH) index over the 10x whitelist barcode subsequences.
+In the second step, *correct* uses the LSH index to map raw input barcodes to their nearest barcodes in the truth-set.
+
+For each input HiFi read containing a 10x cell barcode:
+ -  If the barcode is in the whitelist, it is unchanged.
+ -  If the barcode is not found in the whitelist, the index is queried for the closest match in the whitelist.
+ -  Edit distance is calculated between all retrieved whitelist cell barcodes and the input barcode.
+ -  The barcode with the lowest edit distance and lowest Hamming distance is output.
+ -  By default, if the edit distance between the cell barcode and whitelist barcode is > 2, the read is marked as failing.
+ -  If no candidates were found, the barcode is unchanged, and the read is marked as failing.
+
+**Input** The input file for *correct* is one FLTNC file:
+ -  `<movie>.fltnc.bam`
+
+**Output** The following output files of *correct* contain reads with corrected cell barcodes:
+ -  `<prefix>.bam`
+ -  `<prefix>.bam.pbi`
+
+Example invocation:
+
+    $ isoseq correct --barcodes barcode_set.txt flnc.bam flnc.corrected.bam
+
+## Step 6 - Deduplication
 This step performs PCR deduplicatation via clustering by UMI and cell barcodes (if available).
-After deduplication, *dedup* generates one consensus sequence per founder molecule,
-using a QV guided consensus approach.
+
+We provide two methods: *dedup* and *groupdedup*.
+
+They provide nearly identical functionality. The key difference is that *groupdedup* only deduplicates
+reads sharing a cell barcode, and it requires both barcode correction with the *correct* tool and sorting by cell barcode (tag `CB`).
+(A BAM file can be sorted efficiently by cell barcode with `samtools sort -t CB`.)
+
+The correction step is needed because sequencing errors introduce erroneous barcodes, yielding spurious reads.
+*dedup* allows for barcode errors through pairwise barcode alignment, but *groupdedup* assumes that barcodes are correct.
+Running *correct* first lets the faster *groupdedup* step reasonably make this assumption, while
+the index used by *correct* still allows for mismatches.
+
+*groupdedup* can provide over 200x speed-ups compared with *dedup*, and it substantially reduces RAM requirements.
+
+
+After deduplication, *dedup* and *groupdedup* generate one consensus sequence per founder molecule,
+using a QV guided consensus.
 
 **Method**
 
@@ -148,11 +192,16 @@ Perform all vs all comparison and cluster two reads if:
  * pairwise concordance is at least 97%
  * alignment starts/ends within 5 bp of the other read
  * no more than 5 bps are deleted or inserted in a window of 20 bp (like in isoseq cluster)
+ * *groupdedup* only: these reads have the same cell barcode
 
 **Input**
 The input file for *dedup* is one FLTNC file:
  - `<movie>.fltnc.bam` or `fltnc.fofn`
 
+The input file for *groupdedup* is one FLTNC file, sorted by 10x cell barcode tag:
+ - `<movie>.tagsort.bam`
+
+
 **Output**
 The following output files of *dedup* contain polished isoforms:
  - `<prefix>.bam`
@@ -161,6 +210,14 @@ The following output files of *dedup* contain polished isoforms:
  - `<prefix>.bam.pbi`
  - `<prefix>.transcriptset.xml`
 
-Example invocation:
+The following output files of *groupdedup* contain polished isoforms:
+ - `<prefix>.bam`
+ - `<prefix>.bam.pbi`
+
+Example invocation (*dedup*):
 
     $ isoseq dedup fltnc.fofn dedup.bam --verbose
+
+Example invocation (*groupdedup*):
+
+    $ isoseq groupdedup fltnc.tagsort.bam dedup.bam
diff --git a/docs/umi/isoseq-bcstats.md b/docs/umi/isoseq-bcstats.md
new file mode 100644
index 0000000..e68c34d
--- /dev/null
+++ b/docs/umi/isoseq-bcstats.md
@@ -0,0 +1,29 @@
+---
+layout: default
+parent: Single cell
+title: Barcode Statistics
+nav_order: 7
+---
+
+***
+
+`isoseq3 bcstats` emits statistics for each barcode:
+
+1. Barcode sequence
+2. Number of reads matching the barcode
+3. Frequency Rank (within barcodes)
+4. Number of unique molecular barcodes matching this barcode
+5. Whether the barcode is Group/Cell barcode or a Molecular Barcode/UMI
+
+If `--json` is unset, JSON summary information is written to stderr (`/dev/stderr`).
+Similarly, if `-o` is unset, output TSV information is written to stdout (`/dev/stdout`).
+
+```bash
+# Example:
+isoseq3 bcstats --json sample.bcstats.json -o sample.bcstats.tsv sample.bam
+```
+
+By default, the program only emits stats on group barcodes.
+Adding `--umi` causes stats for the full molecular barcodes to be emitted as well.
+
+<img src="../../doc/img/isoseq.png"/>
diff --git a/docs/umi/isoseq-correct.md b/docs/umi/isoseq-correct.md
new file mode 100644
index 0000000..08c1f1e
--- /dev/null
+++ b/docs/umi/isoseq-correct.md
@@ -0,0 +1,62 @@
+---
+layout: default
+parent: Single cell
+title: Barcode Correction Documentation via correct
+nav_order: 6
+---
+
+## Barcode Correction Documentation
+
+### Why Barcode Correction?
+
+Single-cell, spatially-resolved, and other barcoded sequencing applications
+rely on the accuracy of the cell or group barcode, which is typically chosen from a set of
+known candidates, often referred to as a "whitelist".
+
+This contrasts with molecular barcodes (UMIs, "unique molecular identifiers"), which are generated uniformly at random.
+
+This tool uses the set of known candidates to correct sequencing errors in cell barcode identification. There are two primary benefits:
+
+1. Increased yield
+2. Improved accuracy in downstream deduplication.
+
+By correcting errors in cell barcodes, the total number of usable reads is increased (typically ~5%).
+
+Once cell barcodes are corrected, the downstream `groupdedup` tool can perform deduplication much more efficiently
+than standard deduplication. This is because only reads sharing a cell barcode are compared, which dramatically reduces the search space compared to exhaustive pairwise comparisons.
+
+### What does Barcode Correction do?
+
+The tool takes a list of true barcodes and builds a locality-sensitive hashing (LSH) index over that set to facilitate fast nearest-neighbor queries.
+
+This remaps reads with cell barcodes to their nearest-neighbors within the truth set.
+
+### When would a user call this tool?
+
+Run this tool on barcode-tagged BAM files before deduplication (`isoseq3 groupdedup`).
+This provides substantial runtime improvements compared to `isoseq3 dedup`.
+
+## Usage
+
+### (with barcode-set in barcodes.txt)
+```
+isoseq3 correct --barcodes barcodes.txt input.bam output.bam
+```
+
+#### Tags
+This requires the existence of XC and XU barcode tags.
+The program will fail if either is missing.
+
+We also add or update the following tags:
+
+| Tag | Type | Short Name | Value |
+| --- | ---- | ---------- | ----- |
+|CR| string  | Cell Raw | Raw (uncorrected) barcode. |
+|CB| string  | Cell Barcode | Corrected cell/group barcode. |
+|XC| string  | Cell Barcode | Original Cell barcode. |
+|nc| int     | Number of Candidates | Number of candidate barcodes. |
+|oc| string  | Other Choices | String representation of other potential barcodes. |
+|gp| int     | Group Passes | Flag specifying whether or not the barcode for the given read passes filters. 1 for passing, 0 for failing. |
+|nb| int     | Number of Barcode Mismatches | Edit distance from the barcode for the read to the barcode to which it was reassigned. This is -1 if the barcode could not be corrected, and the edit distance otherwise. (This means 0 for an exact match.) |
+
+<img src="../../doc/img/isoseq.png"/>