From f8184bcc549dbe3ebadaf98c775e6660c2866b0c Mon Sep 17 00:00:00 2001 From: Daniel <104648079+PB-DB@users.noreply.github.com> Date: Thu, 12 May 2022 13:05:08 -0700 Subject: [PATCH] Docs Update: BAM Tags, changelog. (#20) --- docs/changelog.md | 23 +++++++++++--- docs/general-faq.md | 18 +++++++---- docs/index.md | 2 +- docs/isoseq-tags.md | 31 ++++++++++++++++++ docs/umi/cli-workflow.md | 65 +++++++++++++++++++++++++++++++++++--- docs/umi/isoseq-bcstats.md | 29 +++++++++++++++++ docs/umi/isoseq-correct.md | 62 ++++++++++++++++++++++++++++++++++++ 7 files changed, 215 insertions(+), 15 deletions(-) create mode 100644 docs/isoseq-tags.md create mode 100644 docs/umi/isoseq-bcstats.md create mode 100644 docs/umi/isoseq-correct.md diff --git a/docs/changelog.md b/docs/changelog.md index 99cbc89..a023cff 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -6,7 +6,22 @@ nav_order: 99 # Version changelog - * **3.4.0** + * **3.7.0** + * Adding `bcstats`, `correct`, and `groupdedup` to CLI + * `bcstats` emits frequency statistics for 10x barcodes + * `correct` uses a truth-set to correct sequencing errors in cell barcodes + * `groupdedup` provides substantial performance improvements over dedup + * Support SEGMENT read type + + * 3.6.0 + * Adding `tag` and `dedup` to CLI + + * 3.5.0 + * SMRT Link release 11.0 + * Remove support for CLR data and disable `polish` step + * Enable `cluster --use-qvs` as always on + + * 3.4.0 * SMRT Link release 10.0.0 * Add support for UMI and cell barcode handling, by adding `tag` and `dedup` * Add `refine --min-rq` to support RQ filtering for unfiltered @@ -22,7 +37,7 @@ nav_order: 99 * 3.2.1 * Fix a gff index 1-off bug in `collapse` * We have removed implicit dependencies from the bioconda recipe. Please - install `pbccs`, `lima`, and `pbcoretools` as needed. 
+ install `pbccs`, `lima`, and `pbcoretools` as needed * 3.2.0 * **`polish` dropped support for RS II datasets!** @@ -31,7 +46,7 @@ nav_order: 99 * Add `refine --min-polya-length` * Add `cluster --singletons` to output unclustered FLNCs; potential sample prep artifacts! - * Fix minimap2 bugs. Outputs might change slightly. + * Fix minimap2 bugs. Outputs might change slightly * 3.1.2 * Reduce `polish` memory footprint @@ -44,4 +59,4 @@ nav_order: 99 * 3.1.0 * We outsourced the poly(A) tail removal and concatemer detection into a new tool called `refine`. Your custom `primers.fasta` is used in this step to - detect concatemers. + detect concatemers diff --git a/docs/general-faq.md b/docs/general-faq.md index 630e60c..957159b 100644 --- a/docs/general-faq.md +++ b/docs/general-faq.md @@ -8,18 +8,24 @@ nav_order: 5 ## BAM tags explained Following BAM tags are being used: - - `ib` Barcode summary: triplets delimited by semicolons, each triplet contains two barcode indices and the ZMW counts, delimited by comma. Example: `0,1,20;0,3,5` - - `ic` Sum of number of passes from all ZMWs used to create consensus - - `im` ZMW names associated with this isoform - - `is` Number of ZMWs associated with this isoform + - `ib` Barcode summary: triplets delimited by semicolons, each triplet contains two barcode indices and the read counts, delimited by comma. Example: `0,1,20;0,3,5` + - `ic` Number of reads used to generate consensus. If less than `is`, this means that reads were down-sampled when consensus-calling + - `im` Read names associated with this isoform + - `is` Number of reads associated with this isoform - `it` List of barcodes/UMIs clipped during `tag` - `iz` Maximum number of subreads used for polishing - `rq` Predicted accuracy for polished isoform - `XA` Order of `tag` names - - `XC` barcode sequence `tag` + - `XC` Cell/group barcode sequence `tag` + - `CB` Cell/group barcode sequence `tag`. 
This is an alias for XC, but its presence indicates that the barcode has been corrected + - `CR` Raw cell/group barcode sequence `tag` - `XG` PacBio's `GGG` UMI suffix `tag` - `XM` UMI sequence `tag` - - `XO` overhang sequence `tag` + - `XO` Overhang sequence `tag` + - `nb` Edit distance between corrected cell/group barcode and raw cell/group barcode + - `gp` Pass/fail for cell/group barcode correction using a truth-set. 1 for pass, 0 for fail + - `nc` Number of known cell/group barcodes in the truth-set sharing the shortest distance from the raw barcode. If this number is > 1, this indicates ambiguity in remapping + - `oc` Original known cell/group barcodes in the truth-set sharing the shortest distance from the raw barcode Quality values are capped at `93`. diff --git a/docs/index.md b/docs/index.md index a5305b9..070bee4 100644 --- a/docs/index.md +++ b/docs/index.md @@ -26,7 +26,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie for information on Installation, Support, License, Copyright, and Disclaimer. ## Latest Version -Version **3.4.0**: [Full changelog here](/changelog) +Version **3.7.0**: [Full changelog here](/changelog) ## What's new! New documentation is up, a 1:1 port from the original GitHub docs with minor diff --git a/docs/isoseq-tags.md b/docs/isoseq-tags.md new file mode 100644 index 0000000..01ca356 --- /dev/null +++ b/docs/isoseq-tags.md @@ -0,0 +1,31 @@ +--- +layout: default +title: BAM Tags +nav_order: 8 +--- + +#### Iso-Seq Tags + +| Tag | Type | Short Name | Relevant Executable | Value | +| --- | ---- | ---------- | ----- | ----- | +|CR| string | Cell Raw | `correct` | Raw (uncorrected) barcode. | +|CB| string | Cell Barcode | `correct` | Corrected cell/group barcode. | +|UR| string | UMI Raw | None currently | Raw (uncorrected) molecular/UMI barcode. | +|UB| string | UMI Barcode | None currently | Corrected molecular/UMI barcode. | +|XM| string | UMI Barcode | `tag` | Corrected molecular/UMI barcode. 
| +|XC| string | Cell Barcode | `tag`, `correct` | Original Cell barcode. | +|XA| string | tag name order| `tag`, `correct` | Order of tag names. | +|nc| int | Number of Candidates | `correct` | Number of candidate barcodes. | +|oc| string | Other Choices | `correct` | String representation of other potential barcodes. | +|gp| int | Group Passes | `correct` | Flag specifying whether or not the barcode for the given read passes filters. 1 for passing, 0 for failing. | +|nb| int | Barcode Distance | `correct` | Edit distance from the barcode for the read to the barcode to which it was reassigned. This is 0 if the barcode matches exactly, -1 if the barcode could not be rescued, and the edit distance otherwise. | +|ic| int | input-consensus | `dedup`, `groupdedup` | Number of reads used to generate consensus. If less than `is`, this means that reads were down-sampled when consensus-calling. | +|is| int | input-sequences | `dedup`, `groupdedup` | Number of reads associated with isoform. | +|XO| string | X Overhang | `tag` | Overhang sequence tag. | +|XG| string | X GGG | `tag` | PacBio's GGG UMI suffix tag | +|rq | float | read quality | | Predicted accuracy for polished isoform | +|iz | int | maximum subreads used | | Maximum number of subreads used for polishing | +|it | string | trimmed | `tag` | List of barcodes/UMIs clipped during tag | +|im | string | names | `dedup`, `groupdedup` | List of names of input reads used in generating consensus | + + diff --git a/docs/umi/cli-workflow.md b/docs/umi/cli-workflow.md index b12b223..0c76b05 100644 --- a/docs/umi/cli-workflow.md +++ b/docs/umi/cli-workflow.md @@ -135,10 +135,54 @@ If you used more than one SMRT cells, merge all of your `.fltnc.bam` file $ ls movie1.fltnc.bam movie2.fltnc.bam movieN.fltnc.bam > fltnc.fofn -## Step 5 - Deduplication + +## Step 5 - Cell Barcode Correction +This step identifies 10x cell barcode errors and corrects them. 
The tool uses the 10x cell barcode whitelist to reassign erroneous barcodes based on edit distance. + + +**Method** + +First, the *correct* tool builds a Locality-Sensitive Hashing (LSH) index over the 10x whitelist barcode subsequences. +In the second step, *correct* uses the LSH index to map raw input barcodes to their nearest barcodes in the truth-set. + +For each input HiFi read containing a 10x cell barcode: + - If the barcode is in the whitelist, it is unchanged. + - If the barcode is not found in the whitelist, the index is queried for the closest match in the whitelist. + - Edit distance is calculated between all retrieved whitelist cell barcodes and the input barcode. + - The barcode with the lowest edit distance and lowest Hamming distance is output. + - By default, if the edit distance between the cell barcode and whitelist barcode is > 2, the read is marked as failing. + - If no candidates were found, the barcode is unchanged, and the read is marked as failing. + +**Input** The input file for correct is one FLTNC file: + - .fltnc.bam + +**Output** The following output files of correct contain reads with corrected cell barcodes: + - .bam + - .bam.pbi + +Example invocation: + $ isoseq correct --barcodes barcode_set.txt flnc.bam flnc.corrected.bam + + +## Step 6 - Deduplication This step performs PCR deduplicatation via clustering by UMI and cell barcodes (if available). -After deduplication, *dedup* generates one consensus sequence per founder molecule, -using a QV guided consensus approach. + +We provide two methods: *dedup* and *groupdedup*. + +They provide nearly identical functionality. The key difference is that *groupdedup* only deduplicates +reads sharing a cell barcode, and it requires both barcode correction with the *correct* tool and sorting by cell barcode (tag "CB"). +(Sorting a BAM by cell barcode may be efficiently accomplished by `samtools sort -t CB`.) 
+ +This is because sequencing errors introduce erroneous barcodes, yielding spurious reads. +*dedup* allows for barcode errors through pairwise barcode alignment, but *groupdedup* assumes that barcodes are correct. +Performing this correction step allows this faster *groupdedup* step to reasonably make this assumption while +also allowing for mismatches using the index. + +This can provide over 200x speed-ups, as well as substantially reducing RAM requirements. + + +After deduplication, *dedup* and *groupdedup* generate one consensus sequence per founder molecule, +using a QV guided consensus. **Method** @@ -148,11 +192,16 @@ Perform all vs all comparison and cluster two reads if: * pairwise concordance is at least 97% * alignment starts/ends within 5 bp of the other read * no more than 5 bps are deleted or inserted in a window of 20 bp (like in isoseq cluster) + * *groupdedup* only: these reads have the same cell barcode **Input** The input file for *dedup* is one FLTNC file: - `.fltnc.bam` or `fltnc.fofn` +The input file for *groupdedup* is one FLTNC file, sorted by 10x cell barcode tag: + - `.tagsort.bam` + + **Output** The following output files of *dedup* contain polished isoforms: - `.bam` @@ -161,6 +210,14 @@ The following output files of *dedup* contain polished isoforms: - `.bam.pbi` - `.transcriptset.xml` -Example invocation: +The following output files of *groupdedup* contain polished isoforms: + - `.bam` + - `.bam.pbi` + +Example invocation (*dedup*): $ isoseq dedup fltnc.fofn dedup.bam --verbose + +Example invocation (*groupdedup*): + + $ isoseq groupdedup fltnc.tagsort.bam dedup.bam diff --git a/docs/umi/isoseq-bcstats.md b/docs/umi/isoseq-bcstats.md new file mode 100644 index 0000000..e68c34d --- /dev/null +++ b/docs/umi/isoseq-bcstats.md @@ -0,0 +1,29 @@ +--- +layout: default +parent: Single cell +title: Barcode Statistics +nav_order: 7 +--- + +*** + +`isoseq3 bcstats` emits statistics for each barcode: + +1. Barcode sequence +2. 
Number of reads matching the barcode +3. Frequency Rank (within barcodes) +4. Number of unique molecular barcodes matching this barcode +5. Whether the barcode is a Group/Cell barcode or a Molecular Barcode/UMI + +If `--json` is unset, JSON summary information is written to stderr ("/dev/stderr"). +Similarly, if `-o` is unset, output TSV information is written to stdout ("/dev/stdout"). + +```bash +# Example: +isoseq3 bcstats --json sample.bcstats.json -o sample.bcstats.tsv sample.bam +``` + +By default, the program only emits stats on group barcodes. +Adding `--umi` will cause stats for the full molecular barcodes to be emitted as well. + + diff --git a/docs/umi/isoseq-correct.md b/docs/umi/isoseq-correct.md new file mode 100644 index 0000000..08c1f1e --- /dev/null +++ b/docs/umi/isoseq-correct.md @@ -0,0 +1,62 @@ +--- +layout: default +parent: Single cell +title: Barcode Correction Documentation via correct +nav_order: 6 +--- + +## Barcode Correction Documentation + +### Why Barcode Correction? + +Single-cell, spatially-resolved, and other barcoded sequencing applications +rely on the accuracy of the cell or group barcode, which is typically chosen from a set of +known candidates, often referred to as a "whitelist". + +This contrasts with the uniformly randomly-generated molecular barcodes (a.k.a. UMIs, "Unique molecular identifiers"). + +This tool uses the set of known candidates to correct sequencing errors in cell barcode identification. There are two primary benefits: + +1. Increased yield +2. Improved accuracy in downstream deduplication. + +By correcting errors in cell barcodes, the total number of usable reads is increased (typically ~5%). + +Once cell barcodes are corrected, the downstream groupdedup software tool can perform deduplication much more efficiently +than standard deduplication. This is because only reads sharing a cell barcode are compared, which dramatically reduces the search space compared to exhaustive pairwise comparisons. 
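
To make the search-space argument concrete, the following is a minimal sketch that only counts candidate read pairs; it does not reproduce the actual clustering criteria of `dedup` or `groupdedup`. The function names and the toy data set (1,000 cell barcodes with 10 reads each) are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from typing import Iterable, List


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n choose 2."""
    return n * (n - 1) // 2


def all_vs_all_comparisons(barcodes: List[str]) -> int:
    """Pairs examined when every read is compared with every other read."""
    return pair_count(len(barcodes))


def grouped_comparisons(barcodes: Iterable[str]) -> int:
    """Pairs examined when only reads sharing a cell barcode are compared."""
    return sum(pair_count(size) for size in Counter(barcodes).values())


if __name__ == "__main__":
    # Hypothetical toy input: 1,000 distinct cell barcodes, 10 reads each.
    reads = [f"BC{cell:04d}" for cell in range(1000) for _ in range(10)]
    print("all-vs-all pairs:", all_vs_all_comparisons(reads))
    print("grouped pairs:   ", grouped_comparisons(reads))
```

In this toy case the grouped approach examines roughly a thousandth of the pairs. The real ratio depends on how reads are distributed across cell barcodes.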
+ +### What does Barcode Correction do? + +The tool takes a list of true barcodes and builds a locality-sensitive hashing (LSH) index over that set to facilitate fast nearest-neighbor queries. + +This remaps reads with cell barcodes to their nearest neighbors within the truth set. + +### When would a user call this tool? + +Run this tool on barcode-tagged BAM files before deduplication (`isoseq3 groupdedup`). +This provides substantial runtime improvements compared to `isoseq3 dedup`. + +## Usage + +### (with barcode-set in barcodes.txt) +``` +isoseq3 correct --barcodes barcodes.txt input.bam output.bam +``` + +#### Tags +This requires the existence of XC and XU barcode tags. +The program will fail if either is missing. + +We also add or update the following tags: + +| Tag | Type | Short Name | Value | +| --- | ---- | ---------- | ----- | +|CR| string | Cell Raw | Raw (uncorrected) barcode. | +|CB| string | Cell Barcode | Corrected cell/group barcode. | +|XC| string | Cell Barcode | Original Cell barcode. | +|nc| int | Number of Candidates | Number of candidate barcodes. | +|oc| string | Other Choices | String representation of other potential barcodes. | +|gp| int | Group Passes | Flag specifying whether or not the barcode for the given read passes filters. 1 for passing, 0 for failing. | +|nb| int | Number of Barcode Mismatches | Edit distance from the barcode for the read to the barcode to which it was reassigned. This is -1 if the barcode could not be corrected, and the edit distance otherwise. (This means 0 for an exact match.) | + +
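
The tag semantics in the table above can be illustrated with a small sketch. This is not the `correct` implementation: it scans the whole whitelist by brute force instead of querying an LSH index, and several details are assumptions made for illustration only. These are the function names, the Hamming-distance convention for barcodes of unequal length, the alphabetical final tie-break, and leaving `CB` equal to the raw barcode with `nb = -1` when a read fails. The default threshold of 2 follows the documented default for marking reads as failing.

```python
from typing import Dict, List, Union

TagValue = Union[str, int]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hamming_distance(a: str, b: str) -> int:
    """Mismatching positions; a length difference counts as mismatches (assumption)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def correct_barcode(raw: str, whitelist: List[str], max_dist: int = 2) -> Dict[str, TagValue]:
    """Return illustrative CR/CB/nb/nc/gp values for one raw cell barcode."""
    if raw in whitelist:
        # Exact match: barcode unchanged, distance 0, read passes.
        return {"CR": raw, "CB": raw, "nb": 0, "nc": 1, "gp": 1}
    if not whitelist:
        # No candidates at all: barcode unchanged, read fails.
        return {"CR": raw, "CB": raw, "nb": -1, "nc": 0, "gp": 0}
    dists = {bc: edit_distance(raw, bc) for bc in whitelist}
    best = min(dists.values())
    candidates = [bc for bc, d in dists.items() if d == best]
    if best > max_dist:
        # Too far from every known barcode: keep the raw barcode, mark as failing.
        return {"CR": raw, "CB": raw, "nb": -1, "nc": len(candidates), "gp": 0}
    # Lowest edit distance, then lowest Hamming distance, then alphabetical order.
    chosen = min(candidates, key=lambda bc: (hamming_distance(raw, bc), bc))
    return {"CR": raw, "CB": chosen, "nb": best, "nc": len(candidates), "gp": 1}


if __name__ == "__main__":
    known = ["AAAA", "CCCC", "GGGG"]  # hypothetical toy whitelist
    for barcode in ["CCCC", "AAAT", "AACC", "TTTT"]:
        print(barcode, correct_barcode(barcode, known))
```

In the toy run, `AACC` is equally close to two whitelist entries, so `nc` is 2, which is the ambiguity case that the `nc` tag is meant to flag.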