This project processes genomic data from bacterial and archaeal organisms retrieved from the NCBI database, cleans the data, and trains Random Forest models using varying k-mer sizes and chunk sizes. The pipeline evaluates model performance and saves results for further analysis.
conda create -n genom_classify python=3.12 -y
conda activate genom_classify
pip install -r requirements.txt
To process the genomic data, run the following command from the root of the project:
python data_processing.py
This script performs the following tasks:
- Retrieves genomes for more than 100 bacterial and archaeal organisms from the NCBI database.
- Filters out genomes with invalid sequences (e.g., sequences containing letters other than A, C, G, or T).
- Randomly selects 10 organisms from the filtered data and saves them to memory for use in the modeling pipeline.
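The filtering and selection steps above can be pictured with a short sketch. The function and variable names below are illustrative only and are not necessarily those used in `data_processing.py`; retrieval from NCBI is omitted here.

```python
import random

VALID_BASES = set("ACGT")

def is_valid_sequence(seq: str) -> bool:
    # Keep only sequences composed entirely of A, C, G, or T.
    return set(seq.upper()) <= VALID_BASES

def select_organisms(genomes: dict[str, str], n: int = 10, seed: int = 42) -> dict[str, str]:
    # genomes maps organism name -> genome sequence (hypothetical structure).
    filtered = {name: seq for name, seq in genomes.items() if is_valid_sequence(seq)}
    random.seed(seed)
    chosen = random.sample(sorted(filtered), k=min(n, len(filtered)))
    return {name: filtered[name] for name in chosen}
```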
Run the main pipeline using:
python main.py
This script trains Random Forest models using varying configurations of:
- k-mer sizes: [2, 3, 4, 5, 6, 7, 8]
- chunk sizes: [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
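Each chunk of `chunk_size` bases is typically converted into a k-mer frequency vector before it is fed to the Random Forest. The sketch below illustrates that idea with hypothetical helper names; the actual feature extraction in `main.py` may differ.

```python
from itertools import product
import numpy as np

def split_into_chunks(genome: str, chunk_size: int) -> list[str]:
    # Non-overlapping windows of chunk_size bases; a trailing partial chunk is dropped.
    return [genome[i:i + chunk_size]
            for i in range(0, len(genome) - chunk_size + 1, chunk_size)]

def kmer_frequency_vector(chunk: str, k: int) -> np.ndarray:
    # Fixed feature order: all 4**k possible k-mers over A, C, G, T.
    # Assumes the chunk was already filtered to contain only A, C, G, T.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(chunk) - k + 1):
        counts[index[chunk[i:i + k]]] += 1
    total = counts.sum()
    return counts / total if total else counts
```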
For each combination of parameters, the pipeline:
- Trains a Random Forest model.
- Saves the trained model to the `models/` directory.
- Generates test gene CSV files in the `test_genes/` directory.
- Outputs a `grid_search_results.csv` file with accuracy metrics for each configuration.
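A minimal sketch of that grid-search loop is shown below. It assumes a `build_dataset(k, chunk_size)` helper that returns feature and label arrays; the function names, file names, and hyperparameters are assumptions for illustration and may not match the actual code in `main.py`.

```python
import csv
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

KMER_SIZES = [2, 3, 4, 5, 6, 7, 8]
CHUNK_SIZES = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]

def run_grid_search(build_dataset, out_csv="grid_search_results.csv"):
    # build_dataset(k, chunk_size) is assumed to return (X, y) for that configuration.
    Path("models").mkdir(exist_ok=True)
    rows = []
    for k in KMER_SIZES:
        for chunk_size in CHUNK_SIZES:
            X, y = build_dataset(k, chunk_size)
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, stratify=y, random_state=42
            )
            model = RandomForestClassifier(n_estimators=100, random_state=42)
            model.fit(X_train, y_train)
            acc = accuracy_score(y_test, model.predict(X_test))
            joblib.dump(model, f"models/rf_k{k}_chunk{chunk_size}.joblib")
            rows.append({"kmer_size": k, "chunk_size": chunk_size, "accuracy": acc})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["kmer_size", "chunk_size", "accuracy"])
        writer.writeheader()
        writer.writerows(rows)
```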
Running the full pipeline with all k-mer sizes and chunk sizes takes a significant amount of time. For initial testing, modify `main.py` to restrict the search to:
- k-mer sizes: [2, 3, 4]
- chunk sizes: [500, 1000, 1500]
This will significantly reduce runtime while allowing you to validate the pipeline functionality.
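If the grid is defined as lists near the top of `main.py`, the restriction is a two-line change; the variable names below are assumptions and should be adjusted to match the actual script.

```python
# Reduced grid for a quick validation run (names are illustrative):
KMER_SIZES = [2, 3, 4]
CHUNK_SIZES = [500, 1000, 1500]
```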
- `data_processing.py`: Handles data retrieval, filtering, and selection.
- `main.py`: Main script for model training and evaluation.
- `models/`: Directory where trained models are saved.
- `test_genes/`: Directory where test gene CSV files are saved.
- `grid_search_results.csv`: Contains accuracy metrics for each model configuration.
- `requirements.txt`: Lists all Python dependencies required for the project.
Once the pipeline is validated and running efficiently, you can expand the parameter grid or experiment with different machine learning models and hyperparameters to improve performance.
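As one example of such an experiment, a gradient-boosted classifier can be swapped in wherever the Random Forest is constructed. This is a sketch under assumed hyperparameters, not part of the current pipeline.

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Drop-in replacement for RandomForestClassifier in the training loop;
# the hyperparameters below are illustrative starting points, not tuned values.
model = HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1, random_state=42)
```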