Merge pull request #84 from KevinMenden/development

Release v1.1.0 PR
KevinMenden · Mar 25, 2021 · 3028486
2 parents 6a7354e + 238c2dc
commit 3028486
Showing 21 changed files with 483 additions and 550 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,15 @@
 # Scaden Changelog
 
+## Version 1.1.0
+
+* Reduced memory usage of `scaden simulate` significantly by performing simulation for one dataset at a time.
+* Using `.h5ad` format to store simulated data
+* Allow reading data in `.h5ad` format for improved performance (courtesy of @eboileau)
+* Improved logging and using rich progress bar for training
+* Gene subsetting is now done only when merging datasets, which makes it possible to generate different combinations
+of simulated datasets
+* Added `scaden merge` command which allows merging of previously created datasets  
+
 ### Version 1.0.2
 
 * General improvement of logging using the 'rich' library for colorized output

diff --git a/README.md b/README.md
@@ -1,12 +1,10 @@
 ![Scaden](docs/img/scaden_logo.png)
 
 
-![Scaden version](https://img.shields.io/badge/scaden-v1.0.2-cyan)
-
+![Scaden version](https://img.shields.io/badge/scaden-v1.1.0-cyan)
 ![MIT](https://img.shields.io/badge/License-MIT-black)
 ![Install with pip](https://img.shields.io/badge/Install%20with-pip-blue)
-![Install with Bioconda](https://img.shields.io/badge/Install%20with-conda-green)
-![Downloads](https://static.pepy.tech/personalized-badge/scaden?period=total&units=international_system&left_color=blue&right_color=green&left_text=Downloads)
+[![Downloads](https://pepy.tech/badge/scaden)](https://pepy.tech/project/scaden)
 ![Docker](https://github.com/kevinmenden/scaden/workflows/Docker/badge.svg)
 ![Scaden CI](https://github.com/kevinmenden/scaden/workflows/Scaden%20CI/badge.svg)
 
@@ -39,7 +37,8 @@ To install Scaden via pip, simply run the following command:
 
 
 ### Bioconda
-You can also install Scaden via bioconda, using:
+Bioconda installation is currently not supported for the newest Scaden versions, but this will hopefully change soon.
+It is therefore highly recommended to install via pip.
 
 `conda install -c bioconda scaden`
 

diff --git a/docs/blog.md b/docs/blog.md
@@ -0,0 +1,14 @@
+# Scaden Blog
+Apart from the changelog, this is a more informal section where I describe new features
+that have been (or will be) implemented in Scaden.
+
+# Scaden v1.1.0 - Performance Improvements and `scaden merge` tool (21.03.2021)
+
+Scaden v1.1.0 significantly reduces memory consumption in the data simulation step, which was a frequently requested improvement.
+Now, instead of using about 4 GB of memory to simulate a small dataset, Scaden only uses 1 GB. Memory usage no longer increases
+with the number of datasets. This makes it possible to create datasets from large collections of scRNA-seq datasets without
+needing excessive memory. Furthermore, Scaden now stores the simulated data in `.h5ad` format with the full list of genes.
+This way you can simulate from a scRNA-seq dataset once and combine it with other datasets in the future. To help with this,
+I've added the `scaden merge` command, which takes a list of datasets or a directory with `.h5ad` datasets and creates
+a new training dataset from it.
+
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -1,5 +1,16 @@
 # Scaden Changelog
 
+## Version 1.1.0
+
+* Reduced memory usage of `scaden simulate` significantly by performing simulation for one dataset at a time.
+* Using `.h5ad` format to store simulated data
+* Allow reading data in `.h5ad` format for improved performance (courtesy of @eboileau)
+* Improved logging and using rich progress bar for training
+* Gene subsetting is now done only when merging datasets, which makes it possible to generate different combinations
+of simulated datasets
+* Added `scaden merge` command which allows merging of previously created datasets  
+
+
 ### Version 1.0.2
 
 * General improvement of logging using the 'rich' library for colorized output

diff --git a/docs/index.md b/docs/index.md
@@ -8,3 +8,7 @@ at the [DZNE Tübingen](https://www.dzne.de/en/about-us/sites/tuebingen/) and th
 
 A paper describing Scaden has been published in Science Advances:
 [Deep-learning based cell composition analysis from tissue expression profiles](https://advances.sciencemag.org/content/6/30/eaba2619)
+
+For information about how to install Scaden, go to the [Installation](installation.md) section. Look in the [Usage](usage.md)
+section for general help with Scaden usage. In the [Datasets](datasets.md) section you'll find a list of prepared training datasets.
+You can also have a look in the [Blog](blog.md) section, where I summarize new features that are added to Scaden.
diff --git a/docs/installation.md b/docs/installation.md
@@ -10,7 +10,8 @@ To install Scaden via pip, simply run the following command:
 
 
 ## Bioconda
-You can also install Scaden via bioconda, using::
+Bioconda installation is currently not supported for the newest Scaden versions, but this will hopefully change soon.
+It is therefore highly recommended to install via pip.
 
 `conda install -c bioconda scaden`
 

diff --git a/docs/usage.md b/docs/usage.md
@@ -120,13 +120,20 @@ An example for a pattern would be `*_counts.txt`. This pattern would find the fo
 
 Make sure to include an `*` in your pattern!
 
-This command will create the artificial samples in the current working directory. You can also specificy an output directory using the `--out` parameter. Scaden will also directly create a .h5ad file in this directory, which is the file you will need for training. By default, this file will be called `data.h5ad`, however you can change the prefix using the `--prefix` flag.
+This command will create the artificial samples in the current working directory. You can also specify an output directory using the `--out` parameter.
+Scaden will also directly create a .h5ad file in this directory, which is the file you will need for training.
+By default, this file will be called `data.h5ad`, however you can change the prefix using the `--prefix` flag.
+
+Alternatively, you can manually merge `.h5ad` files that have been created with `scaden simulate` from v1.1.0 on using
+the `scaden merge` command. Either point it to a directory of `.h5ad` files, or give it a comma-separated list of files
+to merge. Type `scaden merge --help` for details.
 
 ## File Formats
 For Scaden to work properly, your input files have to be correctly formatted. As long as you use Scaden's built-in functionality to generate the training data, you should have no problem 
 with formatting there. The prediction file, however, you have to format yourself. This should be a file of shape m X n, where m are your features (genes) and n your samples. So each row corresponds to 
 a gene, and each column to a sample. Leave the column name for the genes empty (just put a `\t` there). This is a rather standard format to store gene expression tables, so you should not have much work ensuring that the
-format fits.
+format fits. Since version `v1.1.0` it is also possible to load data for simulation in `.h5ad` format for improved performance. In this case, the AnnData object should have
+a `Celltype` column in the `obs` field.
 
 Your data can either be raw counts or normalized, just make sure that they are not in logarithmic space already. When loading a prediction file, Scaden applies its scaling procedure to it, which involves taking the logarithm of your counts.
 So as long as they are not already in logarithmic space, Scaden will be able to handle both raw and normalized counts / expression values.

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -4,5 +4,6 @@ nav:
   - Installation: installation.md
   - Usage: usage.md
   - Datasets: datasets.md
+  - Blog: blog.md
   - Changelog: changelog.md
 theme: readthedocs
diff --git a/scaden/__main__.py b/scaden/__main__.py
@@ -5,12 +5,13 @@
 import rich.logging
 import logging
 import os
+import tensorflow as tf
 from scaden.train import training
 from scaden.predict import prediction
 from scaden.process import processing
 from scaden.simulate import simulation
 from scaden.example import exampleData
-
+from scaden.merge import merge_datasets
 """
 
 author: Kevin Menden
@@ -30,7 +31,7 @@
     )
 )
 
-os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"
+os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
 
 
 def main():
@@ -146,7 +147,7 @@ def predict(data_path, model_dir, outname, seed):
     "--var_cutoff",
     default=0.1,
     help="Filter out genes with a variance less than the specified cutoff. A low cutoff is recommended,"
-    "this should only remove genes that are obviously uninformative.",
+         " this should only remove genes that are obviously uninformative.",
 )
 def process(data_path, prediction_data, processed_path, var_cutoff):
     """ Process a dataset for training """
@@ -185,15 +186,22 @@ def process(data_path, prediction_data, processed_path, var_cutoff):
     "-u",
     multiple=True,
     default=["unknown"],
-    help="Specifiy cell types to merge into the unknown category. Specify this flag for every cell type you want to merge in unknown. [default: unknown]",
+    help="Specify cell types to merge into the unknown category. Specify this flag for every cell type you want to "
+         "merge in unknown. [default: unknown]",
 )
 @click.option(
     "--prefix",
     "-p",
     default="data",
     help="Prefix to append to training .h5ad file [default: data]",
 )
-def simulate(out, data, cells, n_samples, pattern, unknown, prefix):
+@click.option(
+    "--data-format",
+    "-f",
+    default="txt",
+    help="Data format of scRNA-seq data, can be 'txt' or 'h5ad' [default: 'txt']",
+)
+def simulate(out, data, cells, n_samples, pattern, unknown, prefix, data_format):
     """ Create artificial bulk RNA-seq data from scRNA-seq dataset(s)"""
     simulation(
         simulate_dir=out,
@@ -203,21 +211,37 @@ def simulate(out, data, cells, n_samples, pattern, unknown, prefix):
         pattern=pattern,
         unknown_celltypes=unknown,
         out_prefix=prefix,
+        fmt=data_format
     )
 
 
+"""
+Merge simulated datasets
+"""
+
+
+@cli.command()
+@click.option("--data", "-d", default=".", help="Directory containing simulated datasets (in .h5ad format)")
+@click.option("--prefix", "-p", default="data", help="Prefix of output file [default: data]")
+@click.option("--files", "-f", default=None, help="Comma-separated list of filenames to merge")
+def merge(data, prefix, files):
+    """ Merge simulated datasets into one training dataset """
+    merge_datasets(data_dir=data, prefix=prefix, files=files)
+
+
 """
 Generate example data
 """
 
 
 @cli.command()
-@click.option("--out", "-o", default="./", help="Directory to store output files in")
 @click.option("--cells", "-c", default=10, help="Number of cells [default: 10]")
+@click.option("--types", "-t", default=5, help="Number of cell types [default: 5]")
 @click.option("--genes", "-g", default=100, help="Number of genes [default: 100]")
 @click.option("--out", "-o", default="./", help="Output directory [default: ./]")
 @click.option(
     "--samples", "-n", default=10, help="Number of bulk samples [default: 10]"
 )
-def example(cells, genes, samples, out):
-    exampleData(n_cells=cells, n_genes=genes, n_samples=samples, out_dir=out)
+def example(cells, genes, samples, out, types):
+    """ Generate an example dataset """
+    exampleData(n_cells=cells, n_genes=genes, n_samples=samples, out_dir=out, n_types=types)
diff --git a/scaden/example.py b/scaden/example.py
@@ -2,23 +2,26 @@
 Generate random example data which allows for testing and
 to give users examples for the input format
 """
-import string
 import random
 import os
 import logging
 import pandas as pd
 import numpy as np
+import sys
 
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.INFO)
 
 
-def exampleData(n_cells=10, n_genes=100, n_samples=10, out_dir="./"):
+def exampleData(n_cells=10, n_genes=100, n_samples=10, n_types=5, out_dir="./"):
     """
     Generate an example scRNA-seq count file
     :param n: number of cells
     :param g: number of genes
     """
+    if n_types > n_cells:
+        logger.error("You can't specify more cell types than cells!")
+        sys.exit(1)
 
     # Generate example scRNA-seq data
     counts = np.random.randint(low=0, high=1000, size=(n_cells, n_genes))
@@ -28,7 +31,7 @@ def exampleData(n_cells=10, n_genes=100, n_samples=10, out_dir="./"):
     df = pd.DataFrame(counts, columns=gene_names)
 
     # Generate example celltype labels
-    celltypes = ["celltype"] * np.random.randint(low=2, high=n_cells - 1)
+    celltypes = ["celltype"] * np.random.randint(n_types)
     for i in range(len(celltypes)):
         celltypes[i] = celltypes[i] + str(i)
     celltype_list = random.choices(celltypes, k=n_cells)
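
A note on the hunk above: `np.random.randint(n_types)` samples from the half-open interval `[0, n_types)`, so the number of generated labels varies between runs and can be zero, in which case `random.choices` raises `IndexError` on the empty list. The sketch below is one possible alternative, assuming the intent is exactly `n_types` distinct label names. `make_celltype_labels` and its `seed` parameter are hypothetical and not part of Scaden.

```python
import random


def make_celltype_labels(n_cells, n_types, seed=None):
    """Assign one of n_types labels to each of n_cells cells (hypothetical helper)."""
    if n_types < 1 or n_types > n_cells:
        raise ValueError("n_types must be between 1 and n_cells")
    rng = random.Random(seed)
    # Build exactly n_types label names: celltype0, celltype1, ...
    celltypes = [f"celltype{i}" for i in range(n_types)]
    # Sample with replacement, so not every label is guaranteed to appear
    return rng.choices(celltypes, k=n_cells)


labels = make_celltype_labels(n_cells=10, n_types=5, seed=0)
```

Because the labels are drawn with replacement, a small `n_cells` may still leave some cell types unused; guaranteeing full coverage would need a different sampling scheme.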

diff --git a/scaden/merge.py b/scaden/merge.py
@@ -0,0 +1,18 @@
+from scaden.simulation import BulkSimulator
+
+"""
+Merge simulated datasets
+"""
+
+
+def merge_datasets(data_dir, prefix, files=None):
+
+    bulk_simulator = BulkSimulator()
+
+    if files:
+        files = files.split(",")
+
+    # Merge the resulting datasets
+    bulk_simulator.merge_datasets(data_dir=data_dir,
+                                  files=files,
+                                  out_name=prefix + ".h5ad")
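
The `--files` handling above relies on a plain `str.split(",")`, which keeps any whitespace around the names (for example in `a.h5ad, b.h5ad`) and yields an empty entry for a trailing comma. A minimal sketch of a more forgiving parser follows. `parse_file_list` is a hypothetical helper used for illustration only and is not part of Scaden.

```python
def parse_file_list(files):
    """Split a comma-separated string of filenames (hypothetical helper)."""
    if not files:
        # No --files value given: signal "use the data directory instead"
        return None
    # Strip whitespace around each name, then drop empty entries
    names = [name.strip() for name in files.split(",")]
    return [name for name in names if name]


print(parse_file_list("a.h5ad, b.h5ad,"))  # prints ['a.h5ad', 'b.h5ad']
```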
diff --git a/scaden/model/scaden.py b/scaden/model/scaden.py
@@ -10,11 +10,12 @@
 import pandas as pd
 from anndata import read_h5ad
 import collections
-from .functions import dummy_labels, sample_scaling
-from tqdm import tqdm
+from .functions import sample_scaling
+from rich.progress import Progress, BarColumn
 
 logger = logging.getLogger(__name__)
-
+tf.get_logger().setLevel('ERROR')
+os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
 
 class Scaden(object):
     """
@@ -297,25 +298,33 @@ def train(self, input_path, train_datasets):
         )
 
         # Training loop
-        pbar = tqdm(range(self.num_steps))
-        for step, _ in enumerate(pbar):
+        progress_bar = Progress(
+            "[bold blue]{task.description}",
+            "[bold cyan]Step: {task.fields[step]}, Loss: {task.fields[loss]}",
+            BarColumn(bar_width=None),
+        )
+
+        training_progress = progress_bar.add_task(self.model_name, total=self.num_steps, step=0, loss=1)
+        with progress_bar:
+
+            for step in range(self.num_steps):
+
+                x, y = self.data_iter.get_next()
 
-            x, y = self.data_iter.get_next()
+                with tf.GradientTape() as tape:
+                    self.logits = self.model(x, training=True)
+                    loss = self.compute_loss(self.logits, y)
 
-            with tf.GradientTape() as tape:
-                self.logits = self.model(x, training=True)
-                loss = self.compute_loss(self.logits, y)
+                grads = tape.gradient(loss, self.model.trainable_weights)
 
-            grads = tape.gradient(loss, self.model.trainable_weights)
+                optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
 
-            optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
+                progress_bar.update(training_progress, advance=1, step=step, loss=f"{loss:.4f}")
 
-            desc = f"Step: {step}, Loss: {loss:.4f}"
-            pbar.set_description(desc=desc)
+                # Collect garbage after 100 steps - otherwise runs out of memory
+                if step % 100 == 0:
+                    gc.collect()
 
-            # Collect garbage after 100 steps - otherwise runs out of memory
-            if step % 100 == 0:
-                gc.collect()
 
         # Save the trained model
         self.model.save(self.model_dir)
@@ -326,11 +335,10 @@ def train(self, input_path, train_datasets):
             os.path.join(self.model_dir, "genes.txt"), sep="\t"
         )
 
-    def predict(self, input_path, out_name="scaden_predictions.txt"):
+    def predict(self, input_path):
         """
         Perform prediction with a pre-trained model
-        :param out_dir: path to store results in
-        :param training_data: the dataset used for training
+        :param input_path: prediction data path
         :return:
         """
         # Load signature genes and celltype labels
@@ -347,4 +355,4 @@ def predict(self, input_path, out_name="scaden_predictions.txt"):
         pred_df = pd.DataFrame(
             predictions, columns=self.labels, index=self.sample_names
         )
-        return pred_df
+        return pred_df
diff --git a/scaden/predict.py b/scaden/predict.py
@@ -52,7 +52,7 @@ def prediction(model_dir, data_path, out_name, seed=0):
     )
     # Predict ratios
     preds_256 = cdn256.predict(
-        input_path=data_path, out_name="scaden_predictions_m256.txt"
+        input_path=data_path
     )
 
     # Mid model predictions
@@ -65,7 +65,7 @@ def prediction(model_dir, data_path, out_name, seed=0):
     )
     # Predict ratios
     preds_512 = cdn512.predict(
-        input_path=data_path, out_name="scaden_predictions_m512.txt"
+        input_path=data_path
     )
 
     # Large model predictions
@@ -78,7 +78,7 @@ def prediction(model_dir, data_path, out_name, seed=0):
     )
     # Predict ratios
     preds_1024 = cdn1024.predict(
-        input_path=data_path, out_name="scaden_predictions_m1024.txt"
+        input_path=data_path
     )
 
     # Average predictions

diff --git a/scaden/preprocessing/__init__.py b/scaden/preprocessing/__init__.py