dev-22 use same application code for local and swarm training #24

Open
wants to merge 23 commits into
base: main
897c361
started splitting 3dcnn_ptl application code in swarm and standalone …
oleschwen Jan 31, 2025
19049f5
renamed method to better reflect its purpose, pass site name and numb…
oleschwen Feb 3, 2025
d3c5ee3
script to run local training in the same docker container as the swar…
oleschwen Feb 3, 2025
a87ac50
run swarm or test/local training controlled by flag in minimal example
oleschwen Feb 5, 2025
f01ef5b
control swarm/local training via environment variable for minimal exa…
oleschwen Feb 5, 2025
339ed08
set environment variables for swarm vs local training in startup kit
oleschwen Feb 7, 2025
392caf6
added environment variable in call script
oleschwen Feb 7, 2025
20c941d
local training not implemented yet
oleschwen Feb 7, 2025
963897e
copy source code into image, clean up folder
oleschwen Feb 10, 2025
132190c
testing image does not need a separate copy any more
oleschwen Feb 10, 2025
a7dd55d
tried using single main.py for local and swarm training, untested so far
oleschwen Feb 10, 2025
3067a0b
simplify updating image by copying later
oleschwen Feb 10, 2025
3e202db
fixed missing import, typo, conversion
oleschwen Feb 10, 2025
913f9cf
set environment variables in Docker container, run with output captured
oleschwen Feb 10, 2025
2373841
use fixed versions of apt and python packages and install nvflare dep…
oleschwen Feb 11, 2025
6a43443
include minimal training in docker startup script, refactored, and us…
oleschwen Feb 11, 2025
b61bb86
use correct site name
oleschwen Feb 11, 2025
5d0932a
removed outdated script, now that local training is called via the st…
oleschwen Feb 11, 2025
c19ed27
check if test data directory exists and log message (rather than cras…
oleschwen Feb 11, 2025
9de26a8
fixed variable capitalization
oleschwen Feb 11, 2025
b34c11d
partially described how swarm participants can run a pre-flight check…
oleschwen Feb 11, 2025
c893f86
removed hard-coded paths
oleschwen Feb 11, 2025
624a72e
fixed commands in documentation and used clearer argument name
oleschwen Feb 17, 2025
131 changes: 105 additions & 26 deletions README.md
@@ -1,7 +1,4 @@

# MediSwarm

## Introduction
# Introduction
MediSwarm is an open-source project dedicated to advancing medical deep learning through swarm intelligence, leveraging the NVFlare platform. Developed in collaboration with the Odelia consortium, this repository aims to create a decentralized and collaborative framework for medical research and applications.

## Key Features
@@ -11,21 +8,21 @@ MediSwarm is an open-source project dedicated to advancing medical deep learning
- **Collaborative Research:** Facilitates collaboration among medical researchers and institutions for enhanced outcomes.
- **Extensible Framework:** Designed to support various medical applications and easily integrate with existing workflows.

### Prerequisites
#### Hardware recommendations
## Prerequisites
### Hardware recommendations
* 64 GB of RAM (32 GB is the absolute minimum)
* 16 CPU cores (8 is the absolute minimum)
* an NVIDIA GPU with 48 GB of RAM (24 GB is the minimum)
* 8 TB of Storage (4 TB is the absolute minimum)

We demonstrate that the system can run on lightweight hardware like this. For less than 10k EUR, you can configure systems from suppliers like Lambda, Dell Precision, and Dell Alienware.

#### Operating System
### Operating System
* Ubuntu 20.04 LTS

## Usage for Developers
# Usage for Developers

### Setup
## Setup

0. **Clone the repository:**

@@ -34,7 +31,7 @@ We demonstrate that the system can run on lightweight hardware like this. For le
cd MediSwarm
```

### Running the Application
## Running the Application

1. **CIFAR-10 example:**
See [cifar10/README.md](application/jobs/cifar10/README.md)
@@ -43,35 +40,117 @@ We demonstrate that the system can run on lightweight hardware like this. For le
3. **3D CNN for classifying breast tumors:**
See [3dcnn_ptl/README.md](application/jobs/3dcnn_ptl/README.md)

## Running Tests
### Running Tests

1. Build the required docker image (TODO should this use images from the registry?)
```bash
docker build -t nvflare-pt-dev:3dcnn . -f docker_config/Dockerfile_3dcnn
docker build -t nvflare-pt-dev:testing . -f docker_config/Dockerfile_testing
```
1. Build the testing docker image
```bash
docker build -t nvflare-pt-dev:3dcnn . -f docker_config/Dockerfile_3dcnn
docker build -t nvflare-pt-dev:testing . -f docker_config/Dockerfile_testing
```
2. Run the Tests via
```bash
./runTestsInDocker.sh
```
```bash
./runTestsInDocker.sh
```
3. You should see
1. several expected errors and warnings printed from unit tests that should succeed overall, and a coverage report
2. output of a successful simulation run with two nodes
3. output of a successful proof-of-concept run with two nodes
1. several expected errors and warnings printed from unit tests that should succeed overall, and a coverage report
2. output of a successful simulation run with two nodes
3. output of a successful proof-of-concept run with two nodes
4. Optionally, uncomment running NVFlare unit tests in `_runTestsInsideDocker.sh`

## License

## Contributing Application Code

* take a look at application/jobs/minimal_training_pytorch_cnn for a minimal example of how PyTorch code can be adapted to work with NVFlare
* take a look at application/jobs/3dcnn_ptl for a more realistic example of PyTorch code that can run in the swarm
* TODO more detailed instructions

## Setting up a Swarm

* currently described [here](/application/jobs/3dcnn_ptl/README.md)

# Usage for Swarm Participants

## Setup

1. TODO compute node according to spec, installation of docker, openvpn, …

## Prepare Dataset

* TODO which data is expected in which folder structure + table structure

## Prepare Training Participation

1. TODO steps until startup kit has been extracted

## Run Pre-Flight Check

1. Directories
```bash
export SITE_NAME=<the name of your site> # TODO should be defined above, also needed for dataset location
export DATADIR=<path to where the directory $SITE_NAME containing your local data is stored>
export SCRATCHDIR=<path to where the training can store temporary files>
```
2. From the directory where you unpacked the startup kit,
```bash
cd $SITE_NAME/startup
```
3. Verify that your Docker/GPU setup is working
```bash
./docker.sh --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU all --dummy_training
```
* This will pull the Docker image, which might take a while.
* The “training” itself should take less than a minute and does not yield a meaningful classification performance.
4. Verify that your local data can be accessed and the model can be trained locally
```bash
./docker.sh --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU all --dummy_training
```
* Training time depends on the size of the local dataset
* TODO update call when handling of the number of epochs has been implemented
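Before calling `docker.sh`, it can help to confirm that the directories from step 1 actually exist. The following is a minimal sketch in Python, assuming the layout described above (`$DATADIR` contains a subdirectory named `$SITE_NAME`). The function name `check_preflight_dirs` is hypothetical and not part of the startup kit.

```python
import os
from pathlib import Path


def check_preflight_dirs(env=None):
    """Return a list of human-readable problems; an empty list means the layout looks fine.

    Hypothetical helper: only the three variable names come from the README.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("SITE_NAME", "DATADIR", "SCRATCHDIR"):
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    if problems:
        return problems

    # Assumption from the README: the local data lives in $DATADIR/$SITE_NAME.
    site_data = Path(env["DATADIR"]) / env["SITE_NAME"]
    if not site_data.is_dir():
        problems.append(f"expected data directory {site_data} does not exist")
    scratch = Path(env["SCRATCHDIR"])
    if not scratch.is_dir():
        problems.append(f"scratch directory {scratch} does not exist")
    return problems


if __name__ == "__main__":
    for problem in check_preflight_dirs():
        print(f"pre-flight problem: {problem}")
```

Returning a list of problems instead of raising on the first one lets a participant fix all path issues in one pass before the Docker image is pulled.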

## Start Swarm Node

1. From the directory where you unpacked the startup kit
```bash
cd $SITE_NAME/startup # skip this if you just ran the pre-flight check
```
2. Start the client
```bash
rm -rf ../pid.fl ../daemon_pid.fl nohup.out # clean up potential leftovers from previous run
./docker.sh --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU all --start_client
```
3. Console output is captured in `nohup.out`, which may have been created by the root user in the container, so make it readable:
```bash
sudo chmod a+r nohup.out
```
4. Output files
* TODO describe
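The cleanup in step 2 can also be scripted. This is a minimal sketch in Python, assuming it is pointed at `$SITE_NAME/startup` like the shell command above. The helper name `remove_leftovers` is hypothetical, and unlike `rm -rf` it only deletes regular files.

```python
from pathlib import Path

# Files the shell command above removes, relative to the startup directory.
LEFTOVERS = ("../pid.fl", "../daemon_pid.fl", "nohup.out")


def remove_leftovers(startup_dir="."):
    """Delete leftover files from a previous run and return the paths that were removed.

    Hypothetical helper: files that do not exist are skipped silently.
    """
    removed = []
    for relative in LEFTOVERS:
        path = Path(startup_dir) / relative
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```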

## Run Local Training

1. From the directory where you unpacked the startup kit
```bash
cd $SITE_NAME/startup
```
2. Start local training
```bash
./docker.sh --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU all --local_training
```
* TODO update when handling of the number of epochs has been implemented
3. Output files
* TODO describe

# License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Maintainers
# Maintainers
[Jeff](https://github.com/Ultimate-Storm)
[Ole Schwen](mailto:[email protected])
[Steffen Renisch](mailto:[email protected])

## Contributing
# Contributing
Feel free to dive in! [Open an issue](https://github.com/KatherLab/MediSwarm/issues) or submit pull requests.

## Credits
# Credits
This project utilizes platforms and resources from the following repositories:

- **[NVFLARE](https://github.com/NVIDIA/NVFlare)**: NVFLARE (NVIDIA Federated Learning Application Runtime Environment) is an open-source framework that provides a robust and scalable platform for federated learning applications. We have integrated NVFLARE to efficiently handle the federated learning aspects of our project.
11 changes: 9 additions & 2 deletions _runTestsInsideDocker.sh
@@ -8,17 +8,24 @@

# run unit tests of ODELIA swarm learning and report coverage
export MPLCONFIGDIR=/tmp
cd tests/unit_tests/controller
PYTHONPATH=/workspace/controller/controller python3 -m coverage run --source=/workspace/controller/controller -m unittest discover
cd /MediSwarm/tests/unit_tests/controller
PYTHONPATH=/MediSwarm/controller/controller python3 -m coverage run --source=/MediSwarm/controller/controller -m unittest discover
coverage report -m
rm .coverage

# run standalone version of minimal example
cd /workspace/application/jobs/minimal_training_pytorch_cnn/app/custom/
export TRAINING_MODE="local_training"
./main.py

# run simulation mode for minimal example
cd /workspace
export TRAINING_MODE="swarm"
nvflare simulator -w /tmp/minimal_training_pytorch_cnn -n 2 -t 2 application/jobs/minimal_training_pytorch_cnn -c simulated_node_0,simulated_node_1

# run proof-of-concept mode for minimal example
cd /workspace
export TRAINING_MODE="swarm"
nvflare poc prepare -c poc_client_0 poc_client_1
nvflare poc prepare-jobs-dir -j application/jobs/
nvflare poc start -ex [email protected]
154 changes: 27 additions & 127 deletions application/jobs/3dcnn_ptl/app/custom/main.py
100644 → 100755
@@ -1,150 +1,50 @@
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset
from collections import Counter
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
import nvflare.client.lightning as flare
from data.datamodules import DataModule
from model_selector import select_model
from env_config import load_environment_variables, load_prediction_modules, prepare_dataset, generate_run_directory
import nvflare.client as flare_util
#!/usr/bin/env python3

import os
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

flare_util.init()
import nvflare.client.lightning as flare
import nvflare.client as flare_util
import torch

SITE_NAME=flare.get_site_name()
import threedcnn_ptl

#TODO: Set max_epochs based on the data set size
NUM_EPOCHS_FOR_SITE = { "TUD_1": 2,
"TUD_2": 4,
"TUD_3": 8,
"MEVIS_1": 2,
"MEVIS_2": 4,
"UKA": 2,
}
TRAINING_MODE = os.getenv("TRAINING_MODE")

if SITE_NAME in NUM_EPOCHS_FOR_SITE.keys():
    MAX_EPOCHS = NUM_EPOCHS_FOR_SITE[SITE_NAME]
if TRAINING_MODE == "swarm":
    flare_util.init()
    SITE_NAME=flare.get_site_name()
    NUM_EPOCHS = threedcnn_ptl.get_num_epochs_per_round(SITE_NAME)
elif TRAINING_MODE == "local_training":
    SITE_NAME=os.getenv("SITE_NAME")
    NUM_EPOCHS = int(os.getenv("NUM_EPOCHS"))
else:
    MAX_EPOCHS = 5
    raise Exception(f"Illegal TRAINING_MODE {TRAINING_MODE}")

print(f"Site name: {SITE_NAME}")
print(f"Max epochs set to: {MAX_EPOCHS}")

def main():
    """
    Main function for training and evaluating the model using NVFlare and PyTorch Lightning.
    """
    logger = threedcnn_ptl.set_up_logging()
    try:
        env_vars = load_environment_variables()
        logger.info(f'Model name: {env_vars["model_name"]}')

        predict, prediction_flag = load_prediction_modules(env_vars['prediction_flag'])
        ds, task_data_name = prepare_dataset(env_vars['task_data_name'], env_vars['data_dir'], site_name=SITE_NAME)
        path_run_dir = generate_run_directory(env_vars['scratch_dir'], env_vars['task_data_name'], env_vars['model_name'], env_vars['local_compare_flag'])

        accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
        logger.info(f"Using {accelerator} for training")

        labels = ds.get_labels()

        # Generate indices and perform stratified split
        indices = list(range(len(ds)))
        train_indices, val_indices = train_test_split(indices, test_size=0.2, stratify=labels, random_state=42)

        # Create training and validation subsets
        ds_train = Subset(ds, train_indices)
        ds_val = Subset(ds, val_indices)

        # Extract training labels using the train_indices
        train_labels = [labels[i] for i in train_indices]
        label_counts = Counter(train_labels)

        # Calculate the total number of samples in the training set
        total_samples = len(train_labels)

        # Print the percentage of the training set for each label
        for label, count in label_counts.items():
            percentage = (count / total_samples) * 100
            logger.info(f"Label '{label}': {percentage:.2f}% of the training set, Exact count: {count}")

        logger.info(f"Total number of different labels in the training set: {len(label_counts)}")

        adsValData = DataLoader(ds_val, batch_size=2, shuffle=False)
        logger.info(f'adsValData type: {type(adsValData)}')

        train_size = len(ds_train)
        val_size = len(ds_val)
        logger.info(f'Train size: {train_size}')
        logger.info(f'Val size: {val_size}')

        max_epochs = env_vars['max_epochs']
        #cal_max_epochs = cal_max_epochs(max_epochs, cal_weightage(train_size))
        #logger.info(f"Max epochs set to: {cal_max_epochs}")

        dm = DataModule(
            ds_train=ds_train,
            ds_val=ds_val,
            batch_size=1,
            num_workers=16,
            pin_memory=True,
        )

        # Initialize the model
        model_name = env_vars['model_name']
        model = select_model(model_name)
        logger.info(f"Using model: {model_name}")

        to_monitor = "val/AUC_ROC"
        min_max = "max"
        log_every_n_steps = 1

        checkpointing = ModelCheckpoint(
            dirpath=str(path_run_dir),
            monitor=to_monitor,
            save_last=True,
            save_top_k=2,
            mode=min_max,
        )

        trainer = Trainer(
            accelerator=accelerator,
            precision=16,
            default_root_dir=str(path_run_dir),
            callbacks=[checkpointing],
            enable_checkpointing=True,
            check_val_every_n_epoch=1,
            log_every_n_steps=log_every_n_steps,
            max_epochs=MAX_EPOCHS,
            num_sanity_val_steps=2,
            logger=TensorBoardLogger(save_dir=path_run_dir)
        )
        data_module, model, checkpointing, trainer, path_run_dir, env_vars = threedcnn_ptl.prepare_training(logger, NUM_EPOCHS, SITE_NAME)

        flare.patch(trainer) # Patch trainer to enable swarm learning
        torch.autograd.set_detect_anomaly(True)
        if TRAINING_MODE == "swarm":
            flare.patch(trainer) # Patch trainer to enable swarm learning
            torch.autograd.set_detect_anomaly(True)

        logger.info(f"Site name: {flare.get_site_name()}")
            logger.info(f"Site name: {SITE_NAME}")

        while flare.is_running():
            input_model = flare.receive()
            logger.info(f"Current round: {input_model.current_round}")
            while flare.is_running():
                input_model = flare.receive()
                logger.info(f"Current round: {input_model.current_round}")

            logger.info("--- Validate global model ---")
            trainer.validate(model, datamodule=dm)
                threedcnn_ptl.validate_and_train(logger, data_module, model, trainer)

            logger.info("--- Train new model ---")
            trainer.fit(model, datamodule=dm)
        elif TRAINING_MODE == "preflight_check" or TRAINING_MODE == "local_training":
            threedcnn_ptl.validate_and_train(logger, data_module, model, trainer)

        model.save_best_checkpoint(trainer.logger.log_dir, checkpointing.best_model_path)
        predict(path_run_dir, os.path.join(env_vars['data_dir'], task_data_name, 'test'), model_name, last_flag=False, prediction_flag=prediction_flag)
        logger.info('Training completed successfully')
        threedcnn_ptl.finalize_training(logger, model, checkpointing, trainer, path_run_dir, env_vars)
    except Exception as e:
        logger.error(f"Error in main function: {e}")
        raise
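
As written in this diff, the module-level check accepts only `swarm` and `local_training`, while `main()` also branches on `preflight_check`, and `int(os.getenv("NUM_EPOCHS"))` fails with a `TypeError` when the variable is unset. One way the mode handling might be centralized is sketched below. It is an illustration and not the code in this PR: `resolve_training_config` is a hypothetical name, `get_num_epochs_per_round` is only a stand-in for `threedcnn_ptl.get_num_epochs_per_round` (whose implementation is not shown here) built from the per-site table and fallback of 5 that this diff removes, and treating `preflight_check` like `local_training` is an assumption.

```python
import os

VALID_MODES = ("swarm", "preflight_check", "local_training")

# Per-site values copied from the table removed in this diff; 5 was its fallback.
NUM_EPOCHS_FOR_SITE = {"TUD_1": 2, "TUD_2": 4, "TUD_3": 8, "MEVIS_1": 2, "MEVIS_2": 4, "UKA": 2}
DEFAULT_NUM_EPOCHS = 5


def get_num_epochs_per_round(site_name):
    """Hypothetical stand-in for threedcnn_ptl.get_num_epochs_per_round."""
    return NUM_EPOCHS_FOR_SITE.get(site_name, DEFAULT_NUM_EPOCHS)


def resolve_training_config(env, swarm_site_name=None):
    """Return (mode, site_name, num_epochs) or raise ValueError with a clear message.

    swarm_site_name stands in for the value flare.get_site_name() would supply.
    """
    mode = env.get("TRAINING_MODE")
    if mode not in VALID_MODES:
        raise ValueError(f"Illegal TRAINING_MODE {mode!r}, expected one of {VALID_MODES}")
    if mode == "swarm":
        if swarm_site_name is None:
            raise ValueError("swarm mode needs the site name reported by NVFlare")
        return mode, swarm_site_name, get_num_epochs_per_round(swarm_site_name)

    # Assumption: preflight_check and local_training both read their settings from the environment.
    site_name = env.get("SITE_NAME")
    if not site_name:
        raise ValueError("SITE_NAME must be set outside swarm mode")
    raw_epochs = env.get("NUM_EPOCHS")
    if raw_epochs is None:
        raise ValueError("NUM_EPOCHS must be set outside swarm mode")
    return mode, site_name, int(raw_epochs)
```

Passing the environment as a plain mapping keeps the function testable without NVFlare installed; a caller would hand in `os.environ` and, in swarm mode, the site name obtained after `flare_util.init()`.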