Merge pull request #84 from KevinMenden/development
Release v1.1.0 PR
KevinMenden authored Mar 25, 2021
2 parents 6a7354e + 238c2dc commit 3028486
Showing 21 changed files with 483 additions and 550 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,15 @@
# Scaden Changelog

## Version 1.1.0

* Reduced memory usage of `scaden simulate` significantly by performing simulation for one dataset at a time.
* Using `.h5ad` format to store simulated data
* Allow reading data in `.h5ad` format for improved performance (courtesy of @eboileau)
* Improved logging and using rich progress bar for training
* Gene subsetting is now done only when merging datasets, which allows generating different combinations
of simulated datasets
* Added `scaden merge` command which allows merging of previously created datasets

### Version 1.0.2

* General improvement of logging using the 'rich' library for colorized output
9 changes: 4 additions & 5 deletions README.md
@@ -1,12 +1,10 @@
![Scaden](docs/img/scaden_logo.png)


![Scaden version](https://img.shields.io/badge/scaden-v1.0.2-cyan)

![Scaden version](https://img.shields.io/badge/scaden-v1.1.0-cyan)
![MIT](https://img.shields.io/badge/License-MIT-black)
![Install with pip](https://img.shields.io/badge/Install%20with-pip-blue)
![Install with Bioconda](https://img.shields.io/badge/Install%20with-conda-green)
![Downloads](https://static.pepy.tech/personalized-badge/scaden?period=total&units=international_system&left_color=blue&right_color=green&left_text=Downloads)
[![Downloads](https://pepy.tech/badge/scaden)](https://pepy.tech/project/scaden)
![Docker](https://github.com/kevinmenden/scaden/workflows/Docker/badge.svg)
![Scaden CI](https://github.com/kevinmenden/scaden/workflows/Scaden%20CI/badge.svg)

@@ -39,7 +37,8 @@ To install Scaden via pip, simply run the following command:


### Bioconda
You can also install Scaden via bioconda, using:
Bioconda installation is currently not supported for the newest Scaden versions, but this will hopefully change soon.
It is therefore highly recommended to install via pip.

`conda install -c bioconda scaden`

14 changes: 14 additions & 0 deletions docs/blog.md
@@ -0,0 +1,14 @@
# Scaden Blog
Apart from the changelog, this is a more informal section where I will write about new features
that have been (or will be) implemented in Scaden.

# Scaden v1.1.0 - Performance Improvements and `scaden merge` tool (21.03.2021)

Scaden v1.1.0 brings significantly improved memory consumption for the data simulation step, which was a frequently requested feature.
Now, instead of using about 4 GB of memory to simulate a small dataset, Scaden only uses 1 GB. Memory usage does not increase
with the number of datasets anymore. This makes it possible to create datasets from large collections of scRNA-seq datasets without
needing excessive memory. Furthermore, Scaden now stores the simulated data in `.h5ad` format with the full list of genes.
This way you can simulate from a scRNA-seq dataset once and combine it with other datasets in the future. To help with this,
I've added the `scaden merge` command, which takes a list of datasets or a directory with `.h5ad` datasets and creates
a new training dataset from it.
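
The idea of keeping the full gene list in every simulated dataset and subsetting only at merge time can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Scaden's implementation: `merge_on_shared_genes` and the dict-based dataset layout are hypothetical stand-ins for the `.h5ad` files.

```python
def merge_on_shared_genes(datasets):
    """Merge datasets on the genes they all share.

    Hypothetical layout: each dataset is a dict with a "genes" list and a
    "samples" list, where every sample lists expression values in the same
    order as "genes".
    """
    if not datasets:
        raise ValueError("need at least one dataset")

    # Genes present in every dataset, kept in the order of the first one.
    shared = set(datasets[0]["genes"])
    for ds in datasets[1:]:
        shared &= set(ds["genes"])
    genes = [g for g in datasets[0]["genes"] if g in shared]

    # Reorder every sample so that its columns follow the shared gene order.
    merged_samples = []
    for ds in datasets:
        index = {g: i for i, g in enumerate(ds["genes"])}
        for sample in ds["samples"]:
            merged_samples.append([sample[index[g]] for g in genes])

    return {"genes": genes, "samples": merged_samples}


a = {"genes": ["A", "B", "C"], "samples": [[1, 2, 3], [4, 5, 6]]}
b = {"genes": ["C", "A", "D"], "samples": [[7, 8, 9]]}
merged = merge_on_shared_genes([a, b])
```

Because each dataset keeps all of its genes until this step, the same simulated file can later be merged with a different partner and yield a different shared gene set.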

11 changes: 11 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,16 @@
# Scaden Changelog

## Version 1.1.0

* Reduced memory usage of `scaden simulate` significantly by performing simulation for one dataset at a time.
* Using `.h5ad` format to store simulated data
* Allow reading data in `.h5ad` format for improved performance (courtesy of @eboileau)
* Improved logging and using rich progress bar for training
* Gene subsetting is now done only when merging datasets, which allows generating different combinations
of simulated datasets
* Added `scaden merge` command which allows merging of previously created datasets


### Version 1.0.2

* General improvement of logging using the 'rich' library for colorized output
4 changes: 4 additions & 0 deletions docs/index.md
@@ -8,3 +8,7 @@ at the [DZNE Tübingen](https://www.dzne.de/en/about-us/sites/tuebingen/) and th

A paper describing Scaden has been published in Science Advances:
[Deep-learning based cell composition analysis from tissue expression profiles](https://advances.sciencemag.org/content/6/30/eaba2619)

For information about how to install Scaden, go to the [Installation](installation.md) section. Look in the [Usage](usage.md)
section for general help with Scaden usage. In the [Datasets](datasets.md) section you'll find a list of prepared training datasets.
You can also have a look in the [Blog](blog.md) section, where I summarize new features that are added to Scaden.
3 changes: 2 additions & 1 deletion docs/installation.md
@@ -10,7 +10,8 @@ To install Scaden via pip, simply run the following command:


## Bioconda
You can also install Scaden via bioconda, using::
Bioconda installation is currently not supported for the newest Scaden versions, but this will hopefully change soon.
It is therefore highly recommended to install via pip.

`conda install -c bioconda scaden`

11 changes: 9 additions & 2 deletions docs/usage.md
@@ -120,13 +120,20 @@ An example for a pattern would be `*_counts.txt`. This pattern would find the fo

Make sure to include an `*` in your pattern!
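
To see how a wildcard pattern such as `*_counts.txt` selects files, here is a small sketch using Python's `fnmatch` module. It only illustrates the matching rule; `find_count_files` is a hypothetical helper, and Scaden's own file lookup may work differently.

```python
import fnmatch


def find_count_files(filenames, pattern):
    """Return the file names matching a shell-style wildcard pattern."""
    if "*" not in pattern:
        # Without a wildcard the pattern can match at most one literal name.
        raise ValueError("pattern must contain a '*' wildcard")
    return sorted(f for f in filenames if fnmatch.fnmatchcase(f, pattern))


files = [
    "dataset1_counts.txt",
    "dataset1_celltypes.txt",
    "dataset2_counts.txt",
    "notes.md",
]
matches = find_count_files(files, "*_counts.txt")
```

Here `matches` contains only the two `*_counts.txt` files; the cell type file and the unrelated file are skipped.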

This command will create the artificial samples in the current working directory. You can also specify an output directory using the `--out` parameter. Scaden will also directly create a .h5ad file in this directory, which is the file you will need for training. By default, this file will be called `data.h5ad`, however you can change the prefix using the `--prefix` flag.
This command will create the artificial samples in the current working directory. You can also specify an output directory using the `--out` parameter.
Scaden will also directly create a .h5ad file in this directory, which is the file you will need for training.
By default, this file will be called `data.h5ad`, however you can change the prefix using the `--prefix` flag.

Alternatively, you can manually merge `.h5ad` files that have been created with `scaden simulate` from v1.1.0 on using
the `scaden merge` command. Either point it to a directory of `.h5ad` files, or give it a comma-separated list of files
to merge. Type `scaden merge --help` for details.
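
The two ways of selecting input for a merge (a directory of `.h5ad` files or a comma-separated list) could be resolved roughly as follows. This is a sketch only: `resolve_merge_inputs` is a hypothetical helper, and giving the explicit list precedence over the directory is an assumption, not documented Scaden behaviour.

```python
import os


def resolve_merge_inputs(data_dir=".", files=None):
    """Return the list of .h5ad files to merge.

    Assumption: an explicit comma-separated list wins over the directory.
    """
    if files:
        # Split the list and drop stray whitespace and empty entries.
        return [name.strip() for name in files.split(",") if name.strip()]
    # Otherwise collect every .h5ad file found in the directory.
    return sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith(".h5ad")
    )


selected = resolve_merge_inputs(files="pbmc1.h5ad, pbmc2.h5ad")
```

Stripping whitespace around each name makes the list tolerant of input like `a.h5ad, b.h5ad`; whether the real command does the same is not stated in the source.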

## File Formats
For Scaden to work properly, your input files have to be correctly formatted. As long as you use Scaden's built-in functionality to generate the training data, you should have no problem
with formatting there. The prediction file, however, you have to format yourself. This should be a file of shape m x n, where m is the number of features (genes) and n the number of samples. So each row corresponds to
a gene, and each column to a sample. Leave the column name for the genes empty (just put a `\t` there). This is a rather standard format for storing gene expression tables, so it should not take much work to make sure that the
format fits.
format fits. Since version `v1.1.0` it is also possible to load data for simulation in `.h5ad` format for improved performance. In this case, the AnnData object should have
a `Celltype` column in the `obs` field.
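
As an illustration of the prediction file layout described above (genes as rows, samples as columns, empty header cell for the gene column), the following sketch writes such a tab-separated table. `write_expression_table`, the gene names, and the sample names are made up for the example.

```python
import io


def write_expression_table(handle, genes, samples, values):
    """Write a genes x samples table; values[i][j] is gene i in sample j."""
    # Header: empty first cell (the gene column has no name), then samples.
    handle.write("\t".join([""] + list(samples)) + "\n")
    for gene, row in zip(genes, values):
        handle.write("\t".join([gene] + [str(v) for v in row]) + "\n")


buffer = io.StringIO()
write_expression_table(
    buffer,
    genes=["GeneA", "GeneB"],
    samples=["sample1", "sample2", "sample3"],
    values=[[10, 0, 3], [5, 7, 1]],
)
table_text = buffer.getvalue()
```

Passing an open file instead of the `io.StringIO` buffer would write the same content to disk.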

Your data can either be raw counts or normalized, just make sure that they are not in logarithmic space already. When loading a prediction file, Scaden applies its scaling procedure to it, which involves taking the logarithm of your counts.
So as long as they are not already in logarithmic space, Scaden will be able to handle both raw and normalized counts / expression values.
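
To show why log-transformed input is a problem, here is a rough sketch of a per-sample scaling step. It assumes a `log2(x + 1)` transform followed by min-max scaling to [0, 1]; the text above only states that a logarithm is taken, so the exact formula and the name `scale_sample` are assumptions.

```python
import math


def scale_sample(values):
    """Log-transform one sample, then rescale it to the range [0, 1]."""
    logged = [math.log2(v + 1) for v in values]  # assumed transform
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # A constant sample carries no information; map it to zeros.
        return [0.0 for _ in logged]
    return [(v - lo) / (hi - lo) for v in logged]


scaled = scale_sample([0, 1, 3, 7])
```

If the input were already in logarithmic space, a step like this would apply the logarithm a second time and compress the differences between genes.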
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -4,5 +4,6 @@ nav:
- Installation: installation.md
- Usage: usage.md
- Datasets: datasets.md
- Blog: blog.md
- Changelog: changelog.md
theme: readthedocs
40 changes: 32 additions & 8 deletions scaden/__main__.py
@@ -5,12 +5,13 @@
import rich.logging
import logging
import os
import tensorflow as tf
from scaden.train import training
from scaden.predict import prediction
from scaden.process import processing
from scaden.simulate import simulation
from scaden.example import exampleData

from scaden.merge import merge_datasets
"""
author: Kevin Menden
@@ -30,7 +31,7 @@
)
)

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"


def main():
@@ -146,7 +147,7 @@ def predict(data_path, model_dir, outname, seed):
"--var_cutoff",
default=0.1,
help="Filter out genes with a variance less than the specified cutoff. A low cutoff is recommended, "
"this should only remove genes that are obviously uninformative.",
"this should only remove genes that are obviously uninformative.",
)
def process(data_path, prediction_data, processed_path, var_cutoff):
""" Process a dataset for training """
@@ -185,15 +186,22 @@ def process(data_path, prediction_data, processed_path, var_cutoff):
"-u",
multiple=True,
default=["unknown"],
help="Specify cell types to merge into the unknown category. Specify this flag for every cell type you want to merge in unknown. [default: unknown]",
help="Specify cell types to merge into the unknown category. Specify this flag for every cell type you want to "
"merge in unknown. [default: unknown]",
)
@click.option(
"--prefix",
"-p",
default="data",
help="Prefix to append to training .h5ad file [default: data]",
)
def simulate(out, data, cells, n_samples, pattern, unknown, prefix):
@click.option(
"--data-format",
"-f",
default="txt",
help="Data format of scRNA-seq data, can be 'txt' or 'h5ad' [default: 'txt']",
)
def simulate(out, data, cells, n_samples, pattern, unknown, prefix, data_format):
""" Create artificial bulk RNA-seq data from scRNA-seq dataset(s)"""
simulation(
simulate_dir=out,
@@ -203,21 +211,37 @@ def simulate(out, data, cells, n_samples, pattern, unknown, prefix):
pattern=pattern,
unknown_celltypes=unknown,
out_prefix=prefix,
fmt=data_format
)


"""
Merge simulated datasets
"""


@cli.command()
@click.option("--data", "-d", default=".", help="Directory containing simulated datasets (in .h5ad format)")
@click.option("--prefix", "-p", default="data", help="Prefix of output file [default: data]")
@click.option("--files", "-f", default=None, help="Comma-separated list of filenames to merge")
def merge(data, prefix, files):
""" Merge simulated datasets into one training dataset """
merge_datasets(data_dir=data, prefix=prefix, files=files)


"""
Generate example data
"""


@cli.command()
@click.option("--out", "-o", default="./", help="Directory to store output files in")
@click.option("--cells", "-c", default=10, help="Number of cells [default: 10]")
@click.option("--types", "-t", default=5, help="Number of cell types [default: 5]")
@click.option("--genes", "-g", default=100, help="Number of genes [default: 100]")
@click.option("--out", "-o", default="./", help="Output directory [default: ./]")
@click.option(
"--samples", "-n", default=10, help="Number of bulk samples [default: 10]"
)
def example(cells, genes, samples, out):
exampleData(n_cells=cells, n_genes=genes, n_samples=samples, out_dir=out)
def example(cells, genes, samples, out, types):
""" Generate an example dataset """
exampleData(n_cells=cells, n_genes=genes, n_samples=samples, out_dir=out, n_types=types)
9 changes: 6 additions & 3 deletions scaden/example.py
@@ -2,23 +2,26 @@
Generate random example data which allows for testing and
to give users examples for the input format
"""
import string
import random
import os
import logging
import pandas as pd
import numpy as np
import sys

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def exampleData(n_cells=10, n_genes=100, n_samples=10, out_dir="./"):
def exampleData(n_cells=10, n_genes=100, n_samples=10, n_types=5, out_dir="./"):
"""
Generate an example scRNA-seq count file
:param n_cells: number of cells
:param n_genes: number of genes
"""
if n_types > n_cells:
logger.error("You can't specify more cell types than cells!")
sys.exit(1)

# Generate example scRNA-seq data
counts = np.random.randint(low=0, high=1000, size=(n_cells, n_genes))
@@ -28,7 +31,7 @@ def exampleData(n_cells=10, n_genes=100, n_samples=10, out_dir="./"):
df = pd.DataFrame(counts, columns=gene_names)

# Generate example celltype labels
celltypes = ["celltype"] * np.random.randint(low=2, high=n_cells - 1)
celltypes = ["celltype"] * n_types  # one label per requested type; np.random.randint(n_types) can return 0 and leave no labels
for i in range(len(celltypes)):
celltypes[i] = celltypes[i] + str(i)
celltype_list = random.choices(celltypes, k=n_cells)
18 changes: 18 additions & 0 deletions scaden/merge.py
@@ -0,0 +1,18 @@
from scaden.simulation import BulkSimulator

"""
Merge simulated datasets
"""


def merge_datasets(data_dir, prefix, files=None):

bulk_simulator = BulkSimulator()

if files:
files = files.split(",")

# Merge the resulting datasets
bulk_simulator.merge_datasets(data_dir=data_dir,
files=files,
out_name=prefix + ".h5ad")
48 changes: 28 additions & 20 deletions scaden/model/scaden.py
@@ -10,11 +10,12 @@
import pandas as pd
from anndata import read_h5ad
import collections
from .functions import dummy_labels, sample_scaling
from tqdm import tqdm
from .functions import sample_scaling
from rich.progress import Progress, BarColumn

logger = logging.getLogger(__name__)

tf.get_logger().setLevel('ERROR')
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

class Scaden(object):
"""
@@ -297,25 +298,33 @@ def train(self, input_path, train_datasets):
)

# Training loop
pbar = tqdm(range(self.num_steps))
for step, _ in enumerate(pbar):
progress_bar = Progress(
"[bold blue]{task.description}",
"[bold cyan]Step: {task.fields[step]}, Loss: {task.fields[loss]}",
BarColumn(bar_width=None),
)

training_progress = progress_bar.add_task(self.model_name, total=self.num_steps, step=0, loss=1)
with progress_bar:

for step in range(self.num_steps):

x, y = self.data_iter.get_next()

x, y = self.data_iter.get_next()
with tf.GradientTape() as tape:
self.logits = self.model(x, training=True)
loss = self.compute_loss(self.logits, y)

with tf.GradientTape() as tape:
self.logits = self.model(x, training=True)
loss = self.compute_loss(self.logits, y)
grads = tape.gradient(loss, self.model.trainable_weights)

grads = tape.gradient(loss, self.model.trainable_weights)
optimizer.apply_gradients(zip(grads, self.model.trainable_weights))

optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
progress_bar.update(training_progress, advance=1, step=step, loss=f"{loss:.4f}")

desc = f"Step: {step}, Loss: {loss:.4f}"
pbar.set_description(desc=desc)
# Collect garbage after 100 steps - otherwise runs out of memory
if step % 100 == 0:
gc.collect()

# Collect garbage after 100 steps - otherwise runs out of memory
if step % 100 == 0:
gc.collect()

# Save the trained model
self.model.save(self.model_dir)
@@ -326,11 +335,10 @@ def train(self, input_path, train_datasets):
os.path.join(self.model_dir, "genes.txt"), sep="\t"
)

def predict(self, input_path, out_name="scaden_predictions.txt"):
def predict(self, input_path):
"""
Perform prediction with a pre-trained model
:param out_dir: path to store results in
:param training_data: the dataset used for training
:param input_path: prediction data path
:return:
"""
# Load signature genes and celltype labels
@@ -347,4 +355,4 @@ def predict(self, input_path, out_name="scaden_predictions.txt"):
pred_df = pd.DataFrame(
predictions, columns=self.labels, index=self.sample_names
)
return pred_df
return pred_df
6 changes: 3 additions & 3 deletions scaden/predict.py
@@ -52,7 +52,7 @@ def prediction(model_dir, data_path, out_name, seed=0):
)
# Predict ratios
preds_256 = cdn256.predict(
input_path=data_path, out_name="scaden_predictions_m256.txt"
input_path=data_path
)

# Mid model predictions
@@ -65,7 +65,7 @@ def prediction(model_dir, data_path, out_name, seed=0):
)
# Predict ratios
preds_512 = cdn512.predict(
input_path=data_path, out_name="scaden_predictions_m512.txt"
input_path=data_path
)

# Large model predictions
@@ -78,7 +78,7 @@ def prediction(model_dir, data_path, out_name, seed=0):
)
# Predict ratios
preds_1024 = cdn1024.predict(
input_path=data_path, out_name="scaden_predictions_m1024.txt"
input_path=data_path
)

# Average predictions
Empty file removed scaden/preprocessing/__init__.py
Empty file.