From fe09b8fccc6ad05d7937bda8b8051ce2dd0a77db Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Armin=20To=CC=88pfer?=
Date: Wed, 23 Sep 2020 11:58:06 +0200
Subject: [PATCH] Version 2.0.0

---
 README.md | 59 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 97616eb..349a28b 100644
--- a/README.md
+++ b/README.md
@@ -64,11 +64,11 @@ The sort order is defined by the barcode indices, lowest first.
 
 *Lima* offers the following features:
  * Process both, CLR subreads and CCS reads
- * BAM in- and output
+ * BAM, FASTA, FASTQ in- and output
  * Extensive reports that allow in-depth quality control
  * Clip barcode sequences and annotate `bq` and `bc` tags
  * Agnostic of input barcode sequence orientation
- * Split output BAM files by barcode
+ * Split output files by barcode
  * Full PacBio dataset support
  * Peek into the first N ZMWs and get average barcode score
  * Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
@@ -76,7 +76,7 @@ The sort order is defined by the barcode indices, lowest first.
  * Double demux to remove PCR primers after barcode demultiplexing
 
 ## Latest Version
-Version **1.11.0**: [Full changelog here](#full-changelog)
+Version **2.0.0**: [Full changelog here](#full-changelog)
 
 ## Execution
 
@@ -86,13 +86,13 @@ Version **1.11.0**: [Full changelog here](#full-changelog)
 
 Run on CLR subread data:
 
-    lima movie.subreads.bam barcodes.fasta prefix.bam
-    lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
+    $ lima movie.subreads.bam barcodes.fasta prefix.bam
+    $ lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
 
 Run on CCS data:
 
-    lima --ccs movie.ccs.bam barcodes.fasta prefix.bam
-    lima --ccs movie.consensusreadset.xml barcodes.barcodeset.xml prefix.consensusreadset.xml
+    $ lima --ccs movie.ccs.bam barcodes.fasta prefix.bam
+    $ lima --ccs movie.consensusreadset.xml barcodes.barcodeset.xml prefix.consensusreadset.xml
 
 If you do not need to import the demultiplexed data into SMRT Link, it is advised
 to use `--no-pbi`, omit the pbi index file, to minimize time to result.
@@ -109,8 +109,8 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
 
 ### Example execution
 
-    lima m54317_180718_075644.subreadset.xml Sequel_RSII_384_barcodes_v1.barcodeset.xml \
-         m54317_180718_075644.demux.subreadset.xml --different --peek-guess
+    $ lima m54317_180718_075644.subreadset.xml Sequel_RSII_384_barcodes_v1.barcodeset.xml \
+           m54317_180718_075644.demux.subreadset.xml --different --peek-guess
 
 ## Input data
 
@@ -119,6 +119,8 @@ unaligned CCS reads, generated by [CCS](https://github.com/PacificBiosciences/cc
 both in the PacBio enhanced BAM format. If you want to demux RSII data, first
 use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
 with one file entry, either a SubreadSet or ConsensusReadSet, is also allowed.
+In addition, CCS reads input are also supported as FASTA or FASTQ, optionally
+gzipped.
 
 Barcodes are provided as a FASTA file, one entry per barcode sequence,
 **no duplicate** sequences, only upper-case bases,
@@ -159,7 +161,7 @@ prefix as the output file, omitting suffixes `.bam`, `.subreadset.xml`, and
 `.consensusreadset.xml`. The report infix is `lima`.
 Example:
 
-    lima m54007_170702_064558.subreads.bam barcode.fasta /my/path/m54007_170702_064558_demux.subreadset.xml
+    $ lima m54007_170702_064558.subreads.bam barcode.fasta /my/path/m54007_170702_064558_demux.subreadset.xml
 
 For all output files, the prefix will be `/my/path/m54007_170702_064558_demux.`
 
@@ -167,6 +169,38 @@ For all output files, the prefix will be `/my/path/m54007_170702_064558_demux.`
 The first file `prefix.bam` contains clipped records, annotated with barcode
 tags, that passed filters.
 
+### FASTA/Q
+Alternatively, if output file is fasta or fastq, the header of each sequence
+contains all tags, separated by a single whitespace, that would be present in
+the BAM format. Example FASTQ header:
+
+    @m54006_171006_044150/4588126/ccs bc=3,3 bl=CGCGCGTGTGTGCGTG bq=100 bt=CGCGCGTGTGTGCGTG bx=16,16 cx=12 qe=2235 ql=p\tttropqorrtnnH qs=16 qt=G^\IGR]K8S>>^\^p
+
+### In- and output compatibility matrix:
+
+For CLR data, only XML and BAM are valid in- and output file types.
+
+For CCS data, use following compatibility matrix:
+
+| In/Out | XML | BAM | FASTQ | FASTA |
+| ------ | :-: | :-: | :---: | :---: |
+| XML    | YES | YES |  YES  |  YES  |
+| BAM    | YES | YES |  YES  |  YES  |
+| FASTQ  | no  | no  |  YES  |  YES  |
+| FASTA  | no  | no  |  no   |  YES  |
+
+This means, you can use CCS FASTQ reads as input and FASTA as output, but
+not BAM as output.
+
+Working example:
+
+    $ lima movie.Q20.fastq Sequel_RSII_384_barcodes_v1.fasta demuxed.fastq --same
+
+Failing example:
+
+    $ lima movie.Q20.fastq Sequel_RSII_384_barcodes_v1.fasta demuxed.bam --same
+    FATAL -|- Unsupported combination of FASTQ input and BAM output.
+
 ### Report
 The second file is `prefix.lima.report`, a tab-separated file about each ZMW, unfiltered.
 This report contains any information necessary to investigate the demultiplexing
@@ -1069,7 +1103,10 @@ any parameters now, but worth future investigation.
 
 ## Full Changelog
 
- * **1.11.0**:
+ * **2.0.0**:
+   * Add support for FASTA and FASTQ
+   * Fix `-k` with by-strand HiFi reads
+ * 1.11.0:
    * Add barcode to read groups, use one barcode pair per RG
    * Fix double demux, used to clip wrongly for the second round of demuxing
  * 1.10.0: