From 0d5a75af33433e00325cb78ee3be8f2c07bd8c60 Mon Sep 17 00:00:00 2001
From: Rob Patro <rob.patro@gmail.com>
Date: Fri, 6 Dec 2024 21:17:59 -0500
Subject: [PATCH] update docs

---
 docs/source/atac.rst     | 46 +++++++++++++++++++++++++++++++++++++++-
 docs/source/conf.py      |  9 ++++----
 docs/source/overview.rst |  4 ++--
 3 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/docs/source/atac.rst b/docs/source/atac.rst
index fb61852..e90d50f 100644
--- a/docs/source/atac.rst
+++ b/docs/source/atac.rst
@@ -1,4 +1,48 @@
+****
 atac
-====
+****
 
 The ``atac`` command exposes the functionality of ``alevin-fry`` for processing RAD files containing scATAC-seq data.  The ``atac`` command sets the *mode* of ``alevin-fry``, and this command itself takes one of several various sub-commands (``generate-permit-list`` and ``sort`` being the primary ones).
+
+generate-permit-list (atac)
+===========================
+
+This command takes as input the directory containing a RAD file (created by ``piscem``), and it determines which cell barcodes should be associated with "true" cells, which should be corrected to
+some "true" barcode, and which should simply be ignored or discarded.
+
+This command has several required arguments: the path to an input directory ``--input``,
+the path to an output directory ``--output-dir`` (which will be created if it
+doesn't exist), and a path to the barcode permit-list file. The permit-list argument works as follows:
+
+* ``--unfiltered-pl <plist>``: This option accepts as an argument a list of *possible* barcodes for the sample.  For example, this is the flag you should use if you wish to provide an "external permit list", like the 10x v2 or 10x v3 permit lists. Unlike with the ``--valid-bc`` flag, the list passed to this argument is the set of all possible barcodes for the technology being processed, and it is likely that most of the barcodes in the file do not correspond to cells present in this particular sample.  When using this argument, you may also pass the ``--min-reads`` argument to set the minimum frequency with which a barcode must be seen in order to be retained.  The algorithm used here will pass over the input records (mapped reads) and count how many times each of the barcodes in the unfiltered permit list occurs exactly.  Any barcode occurring >= ``--min-reads`` times will be considered a present cell.  Subsequently, all barcodes that did not match a present cell will be searched (at an edit distance of up to 1) against the barcodes determined to correspond to present cells.  If an initially non-matching barcode has a unique neighbor among the barcodes for present cells, it will be corrected to that barcode, but if it has no 1-edit neighbor, or if it has 2 or more 1-edit neighbors among that list (i.e. its correction would be ambiguous), then the record is discarded.
+
+
+output
+------
+
+The ``generate-permit-list`` command outputs a number of different files in the output directory.  Not all files are relevant to users of ``alevin-fry``, but the files are described here.
+
+1. The file ``bin_lens.bin`` is a binary file that records the lengths of the bins used for creating temporary files for sorting.
+
+2. The file ``bin_recs.bin`` is a binary file that encodes where records should be routed during the sorting phase.
+
+3. The file ``permit_freq.bin`` is a binary file that encodes information about the frequency of occurrence of different barcodes in the permit list.
+
+4. The file ``permit_map.bin`` is a binary file (a serde-serialized HashMap) that maps each barcode in the input RAD file that is within an edit distance of 1 of some *true* barcode to the true barcode to which it corrects.  This allows the ``collate`` command to group together all of the read records corresponding to the same *corrected* barcode.
+
+5. The file ``generate_permit_list.json`` is a JSON file containing information about the run of the command.
+
+
+sort (atac)
+===========
+
+This command takes as input the directory containing the original RAD file (created by ``piscem``) and the output directory generated by the ``generate-permit-list`` command above.  It parses the input RAD file, buckets and then sorts the records by genomic location, and produces a globally-sorted BED file for downstream analysis.  The process is highly multi-threaded, and the number of threads can be chosen by passing the appropriate value to the ``--threads`` option.  The output BED file can *optionally* be compressed if the ``--compress`` flag is passed to the ``sort`` command.  The output of the ``sort`` command is described below.
+
+output
+------
+
+The ``sort`` command outputs the following files:
+
+1. The ``sort.json`` file is a JSON file containing information about how the ``sort`` command was run.
+
+2. The ``map.bed`` file (or ``map.bed.gz`` if the ``--compress`` flag was passed) contains the output records in BED format, which can be provided to a peak caller like `MACS <https://github.com/macs3-project/MACS/>`_.
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 595706b..4815530 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -18,11 +18,11 @@
 # -- Project information -----------------------------------------------------
 
 project = 'alevin-fry'
-copyright = '2021-2022, Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, Rob Patro'
-author = 'Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, Rob Patro'
+copyright = '2021-2024, Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, Noor Pratap Singh, Rob Patro'
+author = 'Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, Noor Pratap Singh, Rob Patro'
 
 # The full version, including alpha/beta/rc tags
-release = '0.7.0'
+release = '0.11.0'
 
 master_doc = 'index'
 
@@ -31,8 +31,7 @@
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = [ 'sphinx.ext.autosectionlabel'
-]
+extensions = []
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']
diff --git a/docs/source/overview.rst b/docs/source/overview.rst
index 37fa9a2..9fac792 100644
--- a/docs/source/overview.rst
+++ b/docs/source/overview.rst
@@ -3,7 +3,7 @@ Overview
 
 `alevin-fry`` is a suite of tools for the rapid, accurate and memory-frugal processing single-cell and single-nucleus sequencing data. It consumes RAD files generated by `salmon alevin`, and performs common operations like generating permit lists, and estimating the number of distinct molecules from each gene within each cell. The focus in `alevin-fry`` is on safety, accuracy and efficiency (in terms of both time and memory usage).
 
-You can read the paper describing alevin fry, "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data" `here <https://www.nature.com/articles/s41592-022-01408-3>`_, and the pre-print `on bioRxiv <https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1>`_.
+You can read the paper describing alevin-fry, "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data" `in Nature Methods <https://www.nature.com/articles/s41592-022-01408-3>`_, and the pre-print `on bioRxiv <https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1>`_.
 
 Other resources for alevin-fry
 ==============================
@@ -38,4 +38,4 @@ The `fishpond <https://mikelove.github.io/fishpond/>`_ package contains many met
 by `salmon <https://github.com/COMBINE-lab/salmon>`_ and `alevin-fry` into R easy.  In particular, you can find documentation on the 
 `loadFry function here <https://mikelove.github.io/fishpond/reference/loadFry.html>`_.  This makes it easy to import USA-mode quantification 
 results into a `SingleCellExperiment <https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html>`_ object, and to properly 
-extract or combine the spliced, unspliced, and ambiguous count components.
\ No newline at end of file
+extract or combine the spliced, unspliced, and ambiguous count components.