Merge pull request #7 from jlab/dev

dev
jlab · Nov 11, 2024 · 4b11aa2
2 parents e2a70d2 + 941574e
commit 4b11aa2

Showing 10 changed files with 561 additions and 94 deletions.
diff --git a/.github/workflows/github_tests.yml b/.github/workflows/github_tests.yml
@@ -13,6 +13,10 @@ jobs:
     steps:
     - name: Checkout Repo
       uses: actions/checkout@v2
+      with:
+          lfs: true
+    - name: Checkout LFS objects
+      run: git lfs checkout
 
     - name: Set up Python
       uses: actions/setup-python@v2
@@ -27,7 +31,7 @@ jobs:
 
     - name: Run tests with pytest
       run: |
-        $CONDA/bin/pytest tests --doctest-modules --cov=src/marbel --cov-report=xml
+          $CONDA/bin/pytest tests --doctest-modules --cov=src/marbel --cov-report=xml
 
     - name: Convert coverage to lcov format
       run: |
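
The checkout change above turns on Git LFS so that large data files arrive as real content rather than pointer stubs. As a minimal sketch (not part of this pull request), a test could guard against unfetched LFS objects by looking for the header line that Git LFS writes at the start of every pointer file. The helper name `is_lfs_pointer` is hypothetical.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` starts with the Git LFS pointer header.

    Hypothetical helper for illustration; it only inspects the first bytes
    of the file and does not validate the rest of the pointer.
    """
    with path.open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

A test suite could call this on each bundled data file and fail early with a clear message if any of them is still a pointer.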

diff --git a/README.md b/README.md
@@ -2,47 +2,72 @@
 
 # marbel (MetAtranscriptomic Reference Builder Evaluation Library)
 
-This project generates an in silico metatranscriptomic dataset based on specified parameters.
+This project generates an *in silico* metatranscriptomic dataset based on specified parameters.
 
 ## Installation
 
-### Conda build and install (recommended)
+### Install guide for development purposes
 
-It is recomended to install the package with conda install.
+#### Install git-lfs (absolutely necessary)
 
-Build the package with:
+Before cloning the repo, you need to have git-lfs installed. If you do not have git-lfs but have root rights, install it with:
 
-`conda build . `
+```
+sudo apt-get install git-lfs
+```
 
-For this you need to have conda-build installed `(conda install conda-build`)
+If you already cloned the repo, remove it, install git-lfs and clone again.
 
-Create new environment and install package:
+#### Install miniconda (if not installed already)
 
 ```
-conda create -n marbel
-conda activate marbel
-conda install --use-local marbel
-```
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
 
-### Install by hand (for development purposes)
+bash Miniconda3-latest-Linux-x86_64.sh
+```
 
-You need to install [R](https://www.r-project.org/about.html) and the R library polyester. Polyester can be installed with
+#### Create conda env
 
 ```
-R
-if (!require("BiocManager", quietly = TRUE))
-    install.packages("BiocManager")
-BiocManager::install("polyester")
+conda create -n marbel python=3.10 r-base
+conda activate marbel
+```
 
+#### Install g++ (optional, for performance)
 
 ```
+sudo apt-get install g++
+```
+
+#### Clone repository
 
-Install the package:
+git clone https://github.com/jlab/marbel.git
+
+#### Install the package:
 
 ```
+cd marbel
 pip install -e .
 ```
 
+### (Not ready, this is for later) Conda build and install
+
+It is recommended to install the package with conda install.
+
+Build the package with:
+
+`conda build . `
+
+For this you need to have conda-build installed (`conda install conda-build`)
+
+Create new environment and install package:
+
+```
+conda create -n marbel
+conda activate marbel
+conda install --use-local marbel
+```
+
 ## Usage
 
 To get help on how to use the script, run:
@@ -54,20 +79,70 @@ marbel --help
 ### Command Line Arguments
 
 ```
-Usage: marbel [OPTIONS] 
-Options:
- --n-species                 INTEGER                 Number of species to be drawn for the metatranscriptomic in silico dataset [default: 20]
- --n-orthogroups             INTEGER                 Number of orthologous groups to be drawn for the metatranscriptomic in silico dataset [default: 1000]
- --n-samples                 <INTEGER INTEGER>...    Number of samples to be created for the metatranscriptomic in silico datasetthe first number is the number of samples for group 1 and the second is the number of samples for group 2 [default: 10, 10]  
-  --outdir                   TEXT    Output directory for the metatranscriptomic in silico dataset [default: simulated_reads]
-  --max-phylo-distance        TEXT    Maximum mean phylogenetic distance for orthologous groups. Specify stricter limit to avoid groups with a more diverse phylogenetic distance. [default: None]
-  --min-identity              FLOAT   Minimum mean sequence identity score for orthologous groups. Specify for more stringent identity requirements. [default: None]
-  --deg-ratio                <FLOAT FLOAT>... Ratio of up- and down-regulated genes. The first value is the ratio of up-regulated genes, the second represents the ratio of down-regulated genes [default: 0.1, 0.1]
-  --seed                      INTEGER Seed for sampling. Set for reproducibility [default: None]
-  --read-length               INTEGER Read length for the generated reads [default: 100]
-  --output-format             [fastq.gz|fastq|fasta] Output format for the reads [default: fastq.gz]
-  --version                           Show the version and exit.
-  --help                              Show this message and exit.
+# Usage: marbel [OPTIONS]
+
+## Options:
+- `--n-species` **INTEGER**  
+  Number of species to be drawn for the metatranscriptomic in silico dataset.  
+  **[default: 20]**
+
+- `--n-orthogroups` **INTEGER**  
+  Number of orthologous groups to be drawn for the metatranscriptomic in silico dataset.  
+  **[default: 1000]**
+
+- `--n-samples` **<INTEGER INTEGER>...**  
+  Number of samples to be created for the metatranscriptomic in silico dataset. The first number represents the number of samples for group 1, and the second is for group 2.  
+  **[default: 10, 10]**
+
+- `--outdir` **TEXT**  
+  Output directory for the metatranscriptomic in silico dataset.  
+  **[default: simulated_reads]**
+
+- `--max-phylo-distance` **[phylum|class|order|family|genus]**  
+  Maximum mean phylogenetic distance for orthologous groups. Specify a stricter limit to avoid groups with a more diverse phylogenetic distance.  
+  **[default: None]**
+
+- `--min-identity` **FLOAT**  
+  Minimum mean sequence identity score for orthologous groups. Specify for more stringent identity requirements.  
+  **[default: None]**
+
+- `--dge-ratio` **FLOAT**  
+  Ratio of up- and down-regulated genes. The first value is the ratio of up-regulated genes, and the second represents the ratio of down-regulated genes.  
+  **[default: 0.1]**
+
+- `--seed` **INTEGER**  
+  Seed for sampling. Set for reproducibility.  
+  **[default: None]**
+
+- `--error-model` **[basic|perfect|HiSeq|NextSeq|NovaSeq|Miseq-20|Miseq-24|Miseq-28|Miseq-32]**  
+  Sequencer model for the reads. Use `basic` or `perfect` (no errors) for custom read length.  
+  **[default: HiSeq]**
+
+- `--compressed / --no-compressed`  
+  Compress the output FASTQ files.  
+  **[default: compressed]**
+
+- `--read-length` **INTEGER**  
+  Read length for the generated reads. Only available when `--error-model` is `basic` or `perfect`.  
+  **[default: None]**
+
+- `--library-size` **INTEGER**  
+  Library size for the reads.  
+  **[default: 100000]**
+
+- `--library-size-distribution` **[poisson|uniform|negative_binomial]**  
+  Distribution for the library size.  
+  **[default: uniform]**
+
+- `--threads` **INTEGER**  
+  Number of threads to be used.  
+  **[default: 10]**
+
+- `--version`  
+  Show the version and exit.
+
+- `--help`  
+  Show this message and exit.
 
 ```
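
The option list above says that `--read-length` is only available with the `basic` or `perfect` error models. The following is a minimal sketch of how such a constraint could be checked before any reads are simulated. It is not marbel's actual validation code; the function name `check_read_length` and the positivity check are assumptions made for illustration.

```python
from typing import Optional

# Error models as listed in the README options above.
ERROR_MODELS = {
    "basic", "perfect", "HiSeq", "NextSeq", "NovaSeq",
    "Miseq-20", "Miseq-24", "Miseq-28", "Miseq-32",
}
# Models for which the README says a custom read length may be given.
CUSTOM_LENGTH_MODELS = {"basic", "perfect"}


def check_read_length(error_model: str, read_length: Optional[int]) -> None:
    """Raise ValueError for option combinations the README describes as unsupported.

    Hypothetical helper; marbel's own checks may differ.
    """
    if error_model not in ERROR_MODELS:
        raise ValueError(f"unknown error model: {error_model!r}")
    if read_length is None:
        return  # default: no custom read length requested
    if error_model not in CUSTOM_LENGTH_MODELS:
        raise ValueError(
            "--read-length requires --error-model basic or perfect, "
            f"got {error_model!r}"
        )
    # Assumption: a read length must be a positive integer.
    if read_length <= 0:
        raise ValueError("--read-length must be a positive integer")
```

Calling `check_read_length("basic", 150)` returns silently, while `check_read_length("HiSeq", 150)` raises a `ValueError`.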
 

diff --git a/environment.yml b/environment.yml
@@ -4,7 +4,12 @@ channels:
   - conda-forge
   - defaults
 dependencies:
+  - pandas
+  - numpy
   - flake8
   - pytest
   - pytest-cov
   - coverage >= 6  # to ensure lcov option is available
+  - pip:
+    - ./
+
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "marbel"
-version = "0.0.1"
+version = "0.0.2"
 authors = [
   { name="Timo Wentong Lin", email="[email protected]" },
 
@@ -41,3 +41,6 @@ include = [
     "src/marbel/data/orthologues_processed_combined_all.parquet",
     "src/marbel/data/EDGAR_all_species.newick",
 ]
+
+[tool.hatch.metadata]
+allow-direct-references = true
diff --git a/requirements.txt b/requirements.txt
@@ -1,8 +1,9 @@
-arviz==0.18.0
+arviz
 pymc
 typer
 rpy2
 biopython
 pyarrow
 typing_extensions
-ete3
+ete3
+InSilicoSeq @ git+https://github.com/jlab/InSilicoSeq.git
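
The new `InSilicoSeq` entry uses the `name @ URL` direct-reference form from PEP 508, and the `pyproject.toml` hunk sets `allow-direct-references = true`, which Hatchling requires before it accepts such references in project metadata. As a rough sketch of how the two forms in `requirements.txt` differ, the hypothetical helper below splits a requirement line into its name part and optional URL. It is not a full PEP 508 parser; `packaging.requirements.Requirement` would be the more robust choice if that library is available.

```python
from typing import Optional, Tuple


def split_direct_reference(line: str) -> Tuple[str, Optional[str]]:
    """Split a requirement line into (requirement, url).

    Hypothetical helper for illustration. The URL is None when the line
    has no ' @ ' direct reference. Only the first '@' is treated as the
    separator, so an '@' inside the URL itself is preserved.
    """
    name, sep, url = line.strip().partition("@")
    if not sep:
        return name.strip(), None
    return name.strip(), url.strip()


if __name__ == "__main__":
    for entry in ["arviz", "InSilicoSeq @ git+https://github.com/jlab/InSilicoSeq.git"]:
        print(split_direct_reference(entry))
```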