MethyLYZR

MethyLYZR is a Naïve Bayesian framework for rapid brain tumor classification from sparse epigenomic data.

This repository provides code to apply the model and instructions on how to set it up with a pre-trained Naïve Bayes model for classification of brain tumor samples from sparse methylation data into 91 CNS tumor classes defined by Capper et al. (2018) and 3 metastases classes (i.e., from breast cancer, lung cancer, and melanoma).

Please refer to our publication for detailed information: Rapid brain tumor classification from sparse epigenomic data.

📌 Contents

Requirements
- Hardware Requirements
- Software Requirements
Installation Guide
- Setting Up the Environment
- Data Requirements
Recommended Preprocessing
Running MethyLYZR for Tumor Classification (MethyLYZR.py)
- Demo
Data Availability
Intraoperative Workflow Video
Citation
License

Requirements

Hardware Requirements

MethyLYZR.py requires a 'standard' personal computer with minimum 4 GB RAM or a compute server. While MethyLYZR.py does not utilize parallel processing and can technically run on systems with a single-core CPU, we recommend a minimum of a dual-core processor to ensure the system remains responsive for multitasking. When working with large datasets, additional memory is advisable. The runtimes stated below are generated using a computer (Apple iMac) with 64 GB RAM, 10 cores@3 GHz.

Software Requirements

OS Requirements

The code has been tested on the following systems:

macOS 13.2.1
Internal Linux distribution

Python Dependencies

The Python package dependencies are listed in requirements.txt.

Installation Guide

Before proceeding with the installation, make sure that Python 3.5 or higher is installed. The code was developed and tested using Python version 3.7.11.

Setting Up the Environment

Follow these steps to set up an environment with all required Python packages:

Create a virtual environment for isolation of the dependencies

python -m venv venv

Activate the environment

source venv/bin/activate

Install the required dependencies using the requirements.txt file

pip install -r requirements.txt

Data Requirements

Download the required datasets from here. Ensure that the files are stored in the /model folder before running the model.

Recommended Preprocessing

Basecalling of Raw Nanopore signals

First, the raw Nanopore signals should be processed into bases using the dorado basecall server from ONT (https://github.com/nanoporetech/dorado/) implemented in the sequencing software MinKnow, using the high-accuracy basecalling model with 5mC modifications (dna_r9.4.1_450bps_modbases_5mc_cg_hac.cfg).

The obtained reads should then be mapped to human reference genome GRCH38.p13 using, e.g., minimap2, and saved into BAM files. Information on modified bases should use the MM and ML tags defined in the Sequence Alignment/Map Optional Fields Specification.

Processing of BAM files (bam2feather.py)

The extraction of methylation values from the BAM files into the required file format can be done using the provided Python script bam2feather.py. In short, the script first filters for primary alignments with a minimal mapping quality of 10, as reported by minimap2. Positions are further filtered on loci corresponding to a genomic position on the Illumina Infinium Human Methylation 450K BeadChip. The per read and CpG methylation probability is calculated using the SAM MM and ML tags. The output is written to disk as a feather file in the format required for MethyLYZR prediction (see below).

⚠️ Note: When intending to filter methylation calls by read start times, it is crucial to verify the accuracy of the read start times as provided by the basecaller. We have observed that discrepancies in software versions can lead to inaccuracies in the reconstructed timing of base calls. Therefore, we strongly advise a thorough validation of read start times against your basecalling software's output to ensure the reliability of methylation timing analysis.

Input arguments for bam2feather.py

Parameter	Description
-i, --inputs	directory with BAM files (required)
-s, --sample	sample (file) name (required)
-o, --output	output directory (required)
--sites	file with CpG site annotation in BED format (required)
-r, --recursive	recursive monitor subdirectories (default: False, action: store_true)
--io_threads	number of threads used for io-handling (default: 2)
--methylation_threads	number of threads used for methylation data extraction (default: 4)
--filter	string (present in the path) for filtering on barcodes

An example run of the bam2feather.py script:

bam2feather.py -i input_dir_bams -s sample1 -r --sites data/EPIC_CpGannotation.bed -o output_dir --methylation_threads 24 --io_threads 8

Output Feather table format

The bam2feather.py script outputs a feather file, in the following format:

Column	Description
epic_id	EPIC array probe ID associated with CpG site
methylation	methylation probability [0,1]
scores_per_read	number of measurements (covered by reference) on same read
read_no	read number
binary_methylation	0 if methylation < 0.8, else 1
read_id	read identifier
pct_identity	percent read identity to reference
start_time	time in seconds after start of sequencing
run_id	sequencing run identifier

💡 This feather file format is the required input format for MethyLYZR.py.

Running MethyLYZR for Tumor Classification (MethyLYZR.py)

To classify a given Nanopore sample using the pre-trained Naïve Bayes model for CNS tumor class prediction, use:

python MethyLYZR.py -i sampleX.feather -s sampleX -o results

Pre-trained Model

Components making up the trained Naïve Bayes model stored in model/:

centroids (betas_mean.feather)
- mean methylation profiles per class
feature weights (W_RELIEF.feather)
- class-specific weights derived from ReliefF-based method
class priors (class_priors.csv)
- prior probabilities per class derived from relative frequencies in training data

This model encompasses 91 CNS + 3 metastasis classes.

Input Arguments for MethyLYZR.py

Parameter	Description
-i, --inputs	methylation data input feather file (required)
-s, --sample	sample name (required)
-o, --output	output directory (required)
-c, --centroids	directory of pre-trained class centroids (default: 'model/betas_mean.feather')
-w, --weights	directory of pre-trained class centroids (default: 'model/W_RELIEF.feather')
-p, --priors	directory of pre-trained class priors (default: 'model/class_priors.csv')
--minNoise	minimum noise value added to centroids (default: 0.05)
--methLowerBound	lower bound for calling methylated loci (default: 0.8)
--methUpperBound	upper bound for calling unmethylated loci (default: 0.2)

What it does

reads the pre-trained model (centroids + feature weights + class priors)
loads sample from corresponding feather file
preprocesses the methylation measurements
- filtering for (methylation probability < methUpperBound) OR (methylation probabilty > methLowerBound)
- filtering for number of features per read <= 10
- calculating noise
- calculating read weights
- binarize methylation probabilities
predicts posterior probabilities for all 94 classes using Naïve Bayesian framework
outputs
- a ranked list with the posterior probabilites for all classes as a CSV
- a barplot showing the posterior probabilities of the top 5 hits as a PDF

Posterior Probabilities

The probabilities of predictions output by MethyLYZR.py range from 0 to 1 and reflect the certainty of the prediction.

Scores > 0.6: Predictions with probabilities above 0.6 can be regarded as high-certainty predictions.
Scores <= 0.6: Predictions with scores below 0.6 are considered to have lower certainty. These results suggest that the model has lower confidence in the prediction, and they should be treated accordingly.

Demo

This demo illustrates how to use MethyLYZR to process sample data and generate output files. We use IEG14_4d93b1bb47f2b6315c552126160915afc2a89ca7.feather as input to produce a CSV table of the predictions and a PDF with visualized top 5 classes.

Preprocess the input data using bam2feather.py

python bam2feather.py -i IEG14_HG38/ -s IEG14_4d93b1bb47f2b6315c552126160915afc2a89ca7 -r --sites data/EPIC_CpGannotation.bed -o demo_outputs --methylation_threads 24 --io_threads 8

Follow these steps to replicate the demo:

Run MethyLYZR.py on the preprocessed data for test sample IEG14 (:hourglass_flowing_sand: Runtime: 59.7 seconds)

python MethyLYZR.py -i data/ONT_15min_runs/IEG14_4d93b1bb47f2b6315c552126160915afc2a89ca7.feather -s IEG14 -o results \
-c model/betas_mean.feather -w model/W_RELIEF.feather -p model/class_priors.csv --minNoise 0.05 --methLowerBound 0.8 --methUpperBound 0.2

This will print the following log to the command line

-------------------------------------------------------
  M e t h y L Y Z R
-------------------------------------------------------

Reading model from model/betas_mean.feather
                     model/W_RELIEF.feather
                     model/class_priors.csv
-------------------------------------------------------
Sample: IEG14
-------------------------------------------------------

Loading data from data/ONT_15min_runs/IEG14_4d93b1bb47f2b6315c552126160915afc2a89ca7.feather

Preprocessing data...
 - Filtering values...
 - Calculating noise...
 - Calculating read weights...

Predicting classes...

Saving results to results/MethyLYZR_IEG14.csv
                    results/MethyLYZR_IEG14.pdf
-------------------------------------------------------
Done.
-------------------------------------------------------

Review the output generated by this demo.

📋 Ranked table with posteriors of all classes
📊 Barplot with top 5 predicted classes

Data Availability

This dataset provides methylation data from ONT (R9 and R10) and PacBio sequencing, stored in feather format. These data were generated from brain tumor biopsies and can be used directly as input for MethyLYZR.py, or filtered by timestamps to post-hoc simulate shorter sequencing runs.

Intraoperative Workflow Video

This video provides a showcase of the intraoperative setup with real-time recordings of 10 clinical demonstrator samples processed under intraoperative clinical conditions at the Point-of-Care. It demonstrates the time required for:

DNA extraction
Library preparation
Sequencing

Citation

Brändl, B., Steiger, M., Kubelt, C., Rohrandt, C., Zhu, Z., Evers, M., Wang, G., Schuldt, B., Afflerbach, A.K., Wong, D., Lum, A., Halldorsson, S., Djirackor, L., Leske, H., Magadeeva, S., Smičius, R., Quedenau, C., Schmidt, N.O., Schüller, U., Vik-Mo, E.O., Proescholdt, M., Riemenschneider, M.J., Zadeh, G., Ammerpohl, O., Yip, S., Synowitz, M., van Bömmel, A., Kretzmer, H., & Müller, F.-J. Rapid brain tumor classification from sparse epigenomic data. Nat. Med. (2025). https://doi.org/10.1038/s41591-024-03435-3

License

All source code and binary files are available under the following provisions:

Any copyright or patent right is owned by and proprietary material of the Max-Planck-Gesellschaft zur Förderung der Wissenschaften e.V. ("MPG"). MPG makes no representations or warranties of any kind concerning the source code, neither express nor implied, and the absence of any legal or actual defects, whether discoverable or not.

The source code and binary files available are subject to a non-exclusive, revocable, non-transferable, and limited right to use for the exclusive purpose of undertaking academic or not-for-profit research, which is granted hereby. Use of the Code or any part thereof for commercial or clinical purposes and alterations are strictly prohibited in the absence of a Commercial License Agreement from MPG (Contact: [email protected]).

Download and/or use of the source code and binary files under the aforementioned research license comes with the acceptance of the license agreement provided with the download. Any further distribution is prohibited.

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
data		data
img		img
model		model
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
MethyLYZR.py		MethyLYZR.py
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MethyLYZR

📌 Contents

Requirements

Hardware Requirements

Software Requirements

OS Requirements

Python Dependencies

Installation Guide

Setting Up the Environment

Data Requirements

Recommended Preprocessing

Basecalling of Raw Nanopore signals

Processing of BAM files (bam2feather.py)

Input arguments for bam2feather.py

An example run of the bam2feather.py script:

Output Feather table format

Running MethyLYZR for Tumor Classification (MethyLYZR.py)

Pre-trained Model

Input Arguments for MethyLYZR.py

What it does

Posterior Probabilities

Demo

Data Availability

Intraoperative Workflow Video

Citation

License

About

Releases

Packages

Languages

License

marasteiger/MethyLYZR

Folders and files

Latest commit

History

Repository files navigation

MethyLYZR

📌 Contents

Requirements

Hardware Requirements

Software Requirements

OS Requirements

Python Dependencies

Installation Guide

Setting Up the Environment

Data Requirements

Recommended Preprocessing

Basecalling of Raw Nanopore signals

Processing of BAM files (bam2feather.py)

Input arguments for bam2feather.py

An example run of the bam2feather.py script:

Output Feather table format

Running MethyLYZR for Tumor Classification (MethyLYZR.py)

Pre-trained Model

Input Arguments for MethyLYZR.py

What it does

Posterior Probabilities

Demo

Data Availability

Intraoperative Workflow Video

Citation

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages