Merge pull request #44 from COMBINE-lab/optional_pyranges
Optional pyranges
DongzeHE authored Dec 31, 2024
2 parents ad07985 + 3cdefdf commit c8c94ba
Showing 7 changed files with 48 additions and 26 deletions.
11 changes: 9 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,14 @@
# pyroe

## About `pyroe`
The main purpose of `pyroe` is to provide the python interface for loading the quantification results of single-cell sequencing data generated by [`alevin-fry`](https://github.com/COMBINE-lab/alevin-fry) and [`simpleaf`](https://github.com/COMBINE-lab/simpleaf).
- The major function of pyroe is the [`load_fry`](https://pyroe.readthedocs.io/en/latest/processing_fry_quants.html#load-fry-full-usage) function, which loads the quantification results into an [`anndata`](https://anndata.readthedocs.io/en/latest/) object to perform downstream analysis provided by [`scanpy`](https://scanpy.readthedocs.io/en/stable/). It provides many options for constructing the final `anndata` object by combining the count matrices representing different splicing statuses in different ways.
- Moreover, `pyroe` provides the interface for the [`quantaf`](https://combine-lab.github.io/quantaf/) project, which is a database containing the quantification results of many publicly available datasets.


### Background
[`Alevin-fry`](https://github.com/COMBINE-lab/alevin-fry) is a fast, accurate, and memory-frugal quantification tool for preprocessing single-cell RNA-sequencing data. Detailed information can be found in the alevin-fry [pre-print](https://www.biorxiv.org/content/10.1101/2021.06.29.450377v2) and [paper](https://www.nature.com/articles/s41592-022-01408-3).

The `pyroe` package provides useful functions for analyzing single-cell or single-nucleus RNA-sequencing data using `alevin-fry`. The documentation for `pyroe` has its own dedicated website. Please visit the [ReadTheDocs pyroe website here](https://pyroe.readthedocs.io).
[`simpleaf`](https://github.com/COMBINE-lab/simpleaf) provides a simple and easy-to-use interface for running `alevin-fry`, and also more advanced features such as designing and executing custom workflows for single-cell data analysis. ([Paper](https://doi.org/10.1093/bioinformatics/btad614) and [Documentation](https://simpleaf.readthedocs.io/en/latest/))

## Major Updates
Since Pyroe v0.10.0, the functionality for creating augmented transcriptome references and generating the gene ID to gene name file has been moved to the [`roers`](https://github.com/COMBINE-lab/roers) package, which is automatically installed together with [`simpleaf`](https://github.com/COMBINE-lab/alevin-fry). For all our users, we recommend using the simplified command line interface provided in [`simpleaf`](https://simpleaf.readthedocs.io/en/latest/) to process your single-cell sequencing data. The [`simpleaf index`](https://simpleaf.readthedocs.io/en/latest/index-command.html) command will automatically generate the augmented transcriptome reference (including the gene ID to gene name file) and index the reference for you.
11 changes: 9 additions & 2 deletions bin/pyroe
@@ -1,13 +1,20 @@
#!/usr/bin/env python

import logging

from pyroe import make_splici_txome, make_spliceu_txome
from pyroe import id_to_name

if make_spliceu_txome is None or make_splici_txome is None or id_to_name is None:
    raise ImportError("To run pyroe CLI, Please install pyranges, biopython and bedtools.")
from pyroe import fetch_processed_quant
from pyroe import convert
from pyroe import id_to_name
from pyroe import output_formats

# because of pyranges, we need to ignore FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


if __name__ == "__main__":
    import argparse
    import sys
17 changes: 9 additions & 8 deletions docs/source/building_splici_index.rst
@@ -1,11 +1,11 @@
#################################################################################
Preparing an expanded transcriptome reference for quantification with alevin-fry
(Deprecated since v0.10.0) Preparing an expanded transcriptome reference for quantification with alevin-fry
#################################################################################

The USA mode in alevin-fry requires an expanded index reference, in which sequences represent spliced and unspliced transcripts. Pyroe provides CLI programs and python functions to build the pre-defined expanded references, the spliced + intronic (*splici*) reference, which includes the spliced transcripts plus the (merged and collapsed) intronic sequences of each gene and the spliced + unspliced (*spliceu*) reference, which consists of the spliced transcripts plus the unspliced transcript (genes' entire genomic interval) of each gene. The ``make_splici_txome()`` and ``make_spliceu_txome()`` python functions are designed to make the *splici* and *spliceu* reference by taking a genome FASTA file and a gene annotation GTF file as the input. Furthermore, the

Preparing a *spliced+intronic* transcriptome reference
-------------------------------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The *splici* index reference of a given species consists of the transcriptome of the species, i.e., the spliced transcripts and the intronic sequences of the species. Within a gene, if the flanked intronic sequences overlap with each other, the overlapped intronic sequences will be collapsed as a single intronic sequence to make sure each base will appear only once in the intronic sequences of each gene.
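The collapsing step described above can be sketched as a plain interval merge. This is only an illustration, not pyroe's implementation: the function name `collapse_introns` and the half-open `(start, end)` coordinate convention are assumptions made for the example.

```python
from typing import List, Tuple

# Half-open [start, end) coordinates on one gene; this convention is an
# assumption for the sketch, not something taken from pyroe.
Interval = Tuple[int, int]


def collapse_introns(introns: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so that each base is covered only once."""
    merged: List[Interval] = []
    for start, end in sorted(introns):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Two flanked introns that overlap, plus one that stands alone.
flanked = [(90, 210), (190, 310), (500, 600)]
print(collapse_introns(flanked))  # [(90, 310), (500, 600)]
```

Sorting first means each interval only needs to be compared with the most recently merged one, so the merge itself is a single pass.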

@@ -29,8 +29,8 @@ The `pyroe make-spliced+intronic` program writes three files to your specified o
* A three-column transcript-name-to-gene-name file that stores the name of each reference sequence in the splici index reference, their corresponding gene name, and the splicing status (`S` for spliced and `U` for unspliced) of those transcripts.
* A two-column TSV file that maps gene ids (used as the keys in eventual alevin-fry output) to gene names. This can later be used with the ``pyroe convert`` command line program to convert gene ids to gene names in the count matrix.

Full usage
^^^^^^^^^^
**Full usage**


.. code::
@@ -120,7 +120,7 @@ The ``pyroe make-spliced+intronic`` command line program calls the ``make_splici
Nothing will be returned. The splici reference files will be written to disk.
Preparing a *spliced+unspliced* transcriptome reference
-------------------------------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Recently, `He et al., 2023 <https://www.biorxiv.org/content/10.1101/2023.01.04.522742>`_ introduced the spliced + unspliced (*spliceu*) index in alevin-fry. This requires the spliced + unspliced transcriptome reference, where the unspliced transcripts of each gene represent the entire genomic interval of that gene. Details about the *spliceu* can be found in `the preprint <https://www.biorxiv.org/content/10.1101/2023.01.04.522742>`_. To make the spliceu reference using pyroe, one can call the ``make_spliceu_txome()`` python function or ``pyroe make-spliced+unspliced`` or its alias ``pyroe make-spliceu`` from the command line. The following example shows the shell command of building a spliceu reference from a given reference set in the directory ``spliceu_txome``.

@@ -132,8 +132,8 @@ Recently, `He et al., 2023 <https://www.biorxiv.org/content/10.1101/2023.01.04.5
spliceu_txome \
--filename-prefix spliceu
Full usage
^^^^^^^^^^
**Full usage**


.. code::
@@ -208,7 +208,8 @@ The ``pyroe make-spliced+unspliced`` command line program calls the ``make_splic
Notes on the input gene annotation GTF files for building an expanded reference
----------------------------------------------------------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pyroe builds expanded transcriptome references, the spliced + intronic (*splici*) and the spliced + unspliced (*spliceu*) transcriptome reference, based on a genome build FASTA file and a gene annotation GTF file.

The input GTF file will be processed before extracting unspliced sequences. If pyroe finds invalid records, a ``clean_gtf.gtf`` file will be generated in the specified output directory. **Note** : The features extracted in the spliced + unspliced transcriptome will not necessarily be those present in the ``clean_gtf.gtf`` file — as this command will prefer the input in the user-provided file wherever possible. One can rerun pyroe using the ``clean_gtf.gtf`` file if needed. More specifically:
2 changes: 1 addition & 1 deletion docs/source/geneid_to_name.rst
@@ -1,4 +1,4 @@
Generating a gene id to gene name mapping
(Deprecated since v0.10.0) Generating a gene id to gene name mapping
=========================================

It is often useful to perform analyses with gene *names* rather than gene *identifiers*. The `convert <https://pyroe.readthedocs.io/en/latest/converting_quants.html>`_ command of ``pyroe`` allows you to specify an id to name mapping so that the converted output matrix will be labeled with gene names rather than identifiers. However, you must provide it with a 2-column tab-separated file mapping IDs to names. This command can help you with that task.
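As a rough illustration of the 2-column tab-separated file mentioned above, the sketch below derives an id-to-name mapping from GTF-style lines using only the standard library. It is not the pyroe command itself: the helper names `gene_id_to_name` and `write_tsv`, and the choice to fall back to the gene id when no `gene_name` attribute is present, are assumptions made for the example.

```python
import csv
import io
import re
from typing import Dict, Iterable, TextIO

# Matches GTF attribute pairs such as: gene_id "ENSG0001";
_ATTR = re.compile(r'(\w+) "([^"]*)"')


def gene_id_to_name(gtf_lines: Iterable[str]) -> Dict[str, str]:
    """Collect gene_id -> gene_name from the ninth GTF column."""
    mapping: Dict[str, str] = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(_ATTR.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Assumption: reuse the id when the record carries no gene_name.
        mapping.setdefault(gene_id, attrs.get("gene_name", gene_id))
    return mapping


def write_tsv(mapping: Dict[str, str], handle: TextIO) -> None:
    """Write one 'id<TAB>name' row per gene."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(mapping.items())


example = [
    '1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "ABC";\n',
    '1\tsrc\tgene\t200\t300\t.\t-\t.\tgene_id "G2";\n',
]
buffer = io.StringIO()
write_tsv(gene_id_to_name(example), buffer)
print(buffer.getvalue(), end="")
```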
12 changes: 9 additions & 3 deletions docs/source/index.rst
@@ -4,18 +4,24 @@ Welcome to the documentation for pyroe
What is pyroe?
===================

The pyroe package provides useful functions for analyzing single-cell or single-nucleus RNA-sequencing data using `alevin-fry`. Since `simpleaf` version 0.14.0, `roers <https://github.com/COMBINE-lab/roers>`_, instead of pyroe, became as the default augmented reference constructor for `alevin-fry` and `simpleaf`. Now, the main purpose of `pyroe` is to provide the function `load_fry` to load `alevin-fry` quantification results into Python as an `anndata <http://anndata.readthedocs.io/>`_ object, so as to be compatible with `scanpy <https://scanpy.readthedocs.io/en/stable/index.html>`_. If you have trouble installing `pyroe`, you can also define the ``load_fry`` function in your own Python script, the definition of ``load_fry`` can be found at here: `load_fry <https://github.com/COMBINE-lab/pyroe/blob/main/src/pyroe/load_fry.py>`_.
The pyroe package provides useful functions for analyzing single-cell or single-nucleus RNA-sequencing data using `alevin-fry`.
The main purpose of `pyroe` is to provide the function `load_fry` to load `alevin-fry` and `simpleaf` quantification results into Python as an `anndata <http://anndata.readthedocs.io/>`_ object, so as to perform downstream analysis provided by `scanpy <https://scanpy.readthedocs.io/en/stable/index.html>`_. Moreover, `pyroe` also provides functions to fetch the pre-computed quantification results from the `quantaf <https://combine-lab.github.io/quantaf/>`_ database.

In previous versions (before v0.10.0), pyroe also provided functions to construct the augmented transcriptome references. Since `simpleaf` version 0.14.0, `roers <https://github.com/COMBINE-lab/roers>`_, instead of pyroe, has been the default augmented reference constructor for `alevin-fry` and `simpleaf`. If you would like to use the deprecated functions to construct the augmented references, please install an older version of pyroe. Note that old versions of pyroe are compatible with pandas versions below 2.0.0, so we suggest installing them in an isolated conda environment so as not to affect the other packages on your system.

**Note that** although pyroe is available on bioconda and can be easily installed, if you encounter any problem during installation, you can define the `load_fry` function locally in your Python script by copying the function definition found `here <https://github.com/COMBINE-lab/pyroe/blob/main/src/pyroe/load_fry.py>`_. The only dependency of `load_fry` is `scanpy <https://scanpy.readthedocs.io/en/stable/installation.html>`_.


.. toctree::
   :maxdepth: 2
   :caption: Contents:

   installing
   building_splici_index
   processing_fry_quants
   converting_quants
   fetching_processed_quants
   building_splici_index
   geneid_to_name
   converting_quants
   LICENSE.rst

Indices and tables
10 changes: 2 additions & 8 deletions setup.cfg
@@ -16,20 +16,14 @@ classifiers =
packages = find:
package_dir =
    = src
scripts =
    bin/pyroe
# scripts =
# bin/pyroe
python_requires = >=3.7
include_package_data = True
install_requires =
    pandas >= 1.3.0, < 2.2.0
    pyranges == 0.0.129
    biopython >= 1.77
    packaging >= 21.0
    scanpy >= 1.8.2

# [options.extras_require]
# scanpy =
# scanpy >= 1.8.2

[options.packages.find]
where = src
11 changes: 9 additions & 2 deletions src/pyroe/__init__.py
@@ -1,12 +1,19 @@
__version__ = "0.10.0"

from pyroe.load_fry import load_fry
from pyroe.make_txome import make_splici_txome, make_spliceu_txome
from pyroe.fetch_processed_quant import fetch_processed_quant
from pyroe.load_processed_quant import load_processed_quant
from pyroe.ProcessedQuant import ProcessedQuant
from pyroe.convert import convert
from pyroe.id_to_name import id_to_name
from pyroe.pyroe_utils import output_formats


# try:
#     from pyroe.make_txome import make_splici_txome, make_spliceu_txome
#     from pyroe.id_to_name import id_to_name
# except ImportError:
#     make_splici_txome = None
#     make_spliceu_txome = None
#     id_to_name = None

# flake8: noqa
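
For readers unfamiliar with the optional-dependency pattern that the commented-out `try`/`except` block and the `None` check in `bin/pyroe` point at, here is a minimal standalone sketch. It does not import pyroe; the helpers `optional_import` and `require` are hypothetical names introduced for illustration, and `math` stands in for an optional package such as pyranges.

```python
import importlib
from typing import Any, Optional


def optional_import(module_name: str, attr: str) -> Optional[Any]:
    """Return module.attr, or None when the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


def require(obj: Optional[Any], feature: str, hint: str) -> Any:
    """Fail at the point of use, with an actionable message, if obj is None."""
    if obj is None:
        raise ImportError(f"{feature} is unavailable; {hint}")
    return obj


# 'math' always exists, so this import succeeds.
sqrt = optional_import("math", "sqrt")
# A deliberately nonexistent module yields None instead of raising.
missing = optional_import("module_that_does_not_exist_123", "anything")

print(require(sqrt, "sqrt", "install math")(9.0))  # 3.0
print(missing is None)  # True
```

Deferring the error to the point of use keeps the core import working when the optional packages are absent, while the command line entry point can still refuse to start with a clear message.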
