Genom Classify

Project Overview

This project retrieves genomic data for bacterial and archaeal organisms from the NCBI database, cleans it, and trains Random Forest models across a grid of k-mer sizes and chunk sizes. The pipeline evaluates each model's performance and saves the results for further analysis.
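
As a point of reference, a k-mer is simply a substring of length k, and the models are trained on k-mer counts computed from fixed-length chunks of each genome. A minimal illustration of the idea (not taken from the project code):

    from collections import Counter

    def kmer_counts(sequence, k):
        # Slide a window of length k across the sequence and count each k-mer.
        return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

    # 3-mers of a short fragment:
    print(kmer_counts("ACGTACGT", 3))
    # Counter({'ACG': 2, 'CGT': 2, 'GTA': 1, 'TAC': 1})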

Setup Instructions

1. Create a Conda Environment

conda create -n genom_classify python=3.12 -y

2. Activate the Environment

conda activate genom_classify

3. Install Dependencies

pip install -r requirements.txt

Running the Project

Step 1: Data Processing

To process the genomic data, run the following command from the root of the project:

python data_processing.py

This script performs the following tasks:

  • Retrieves genomes for more than 100 bacterial and archaeal organisms from the NCBI database.
  • Filters out genomes with invalid sequences (e.g., sequences containing characters other than A, C, G, or T); a filtering sketch follows this list.
  • Randomly selects 10 organisms from the filtered data and keeps them in memory for use in the modeling pipeline.
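
The actual helpers in data_processing.py are not shown here; the sketch below illustrates the filtering and selection logic under the assumption that genomes arrive as plain sequence strings. The function names are hypothetical:

    import random

    VALID_BASES = set("ACGT")

    def is_valid(sequence):
        # Keep a genome only if every base is A, C, G, or T.
        return set(sequence.upper()) <= VALID_BASES

    def select_organisms(genomes, n=10, seed=42):
        # genomes: dict mapping organism name -> sequence string.
        valid = {name: seq for name, seq in genomes.items() if is_valid(seq)}
        rng = random.Random(seed)
        chosen = rng.sample(sorted(valid), k=min(n, len(valid)))
        return {name: valid[name] for name in chosen}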

Step 2: Model Training and Evaluation

Run the main pipeline using:

python main.py

This script trains Random Forest models using varying configurations of:

  • k-mer sizes: [2, 3, 4, 5, 6, 7, 8]
  • chunk sizes: [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]

For each combination of parameters, the pipeline (a simplified training sketch follows this list):

  1. Trains a Random Forest model.
  2. Saves the trained model to the models/ directory.
  3. Generates test gene CSV files in the test_genes/ directory.
  4. Outputs a grid_search_results.csv file with accuracy metrics for each configuration.
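
The following sketch shows how one (k, chunk size) configuration could be trained and scored with scikit-learn; it assumes the in-memory genomes from Step 1 and uses CountVectorizer character n-grams as k-mer counts. Function names are illustrative, not the actual main.py API:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def chunk(sequence, chunk_size):
        # Split a genome into fixed-length, non-overlapping chunks.
        return [sequence[i:i + chunk_size]
                for i in range(0, len(sequence) - chunk_size + 1, chunk_size)]

    def evaluate_config(genomes, k, chunk_size):
        # genomes: dict mapping organism name -> sequence string.
        chunks, labels = [], []
        for name, seq in genomes.items():
            for c in chunk(seq, chunk_size):
                chunks.append(c)
                labels.append(name)
        # Character n-grams of length k are exactly the k-mer counts.
        X = CountVectorizer(analyzer="char", ngram_range=(k, k)).fit_transform(chunks)
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.2, random_state=0)
        model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
        return accuracy_score(y_test, model.predict(X_test))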

Recommendation

Running the full pipeline over every k-mer size and chunk size takes a significant amount of time. For initial testing, modify main.py to restrict the grid to:

  • k-mer sizes: [2, 3, 4]
  • chunk sizes: [500, 1000, 1500]

This reduces runtime considerably while still letting you validate that the pipeline works end to end.
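
The variable names below are hypothetical; look for the equivalent lists near the top of main.py:

    # Hypothetical names; adapt to whatever main.py actually calls them.
    KMER_SIZES = [2, 3, 4]           # instead of [2, 3, 4, 5, 6, 7, 8]
    CHUNK_SIZES = [500, 1000, 1500]  # instead of [500, 1000, ..., 4000]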

Project Structure

  • data_processing.py: Handles data retrieval, filtering, and selection.
  • main.py: Main script for model training and evaluation.
  • models/: Directory where trained models are saved.
  • test_genes/: Directory where test gene CSV files are saved.
  • grid_search_results.csv: Contains accuracy metrics for each model configuration.
  • requirements.txt: Lists all Python dependencies required for the project.

Future Steps

Once the pipeline is validated and running efficiently, you can expand the parameter grid or experiment with different machine learning models and hyperparameters to improve performance.
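
For instance, any scikit-learn classifier with the usual fit/predict interface can stand in for the Random Forest in the sketch above; one possible drop-in (illustrative, not part of the project):

    from sklearn.ensemble import GradientBoostingClassifier

    # Swap in for RandomForestClassifier in evaluate_config above;
    # fit and predict work the same way.
    model = GradientBoostingClassifier(random_state=0)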
