KatherLab · oleschwen · Jan 31, 2025 · Feb 3, 2025 · Feb 3, 2025 · Feb 5, 2025
diff --git a/README.md b/README.md
@@ -1,7 +1,4 @@
-
-# MediSwarm
-
-## Introduction
+# Introduction
 MediSwarm is an open-source project dedicated to advancing medical deep learning through swarm intelligence, leveraging the NVFlare platform. Developed in collaboration with the Odelia consortium, this repository aims to create a decentralized and collaborative framework for medical research and applications.
 
 ## Key Features
@@ -11,21 +8,21 @@ MediSwarm is an open-source project dedicated to advancing medical deep learning
 - **Collaborative Research:** Facilitates collaboration among medical researchers and institutions for enhanced outcomes.
 - **Extensible Framework:** Designed to support various medical applications and easily integrate with existing workflows.
 
-### Prerequisites
-#### Hardware recommendations
+## Prerequisites
+### Hardware recommendations
 * 64 GB of RAM (32 GB is the absolute minimum)
 * 16 CPU cores (8 is the absolute minimum)
 * an NVIDIA GPU with 48 GB of RAM (24 GB is the minimum)
 * 8 TB of Storage (4 TB is the absolute minimum)
 
 We demonstrate that the system can run on lightweight hardware like this. For less than 10k EUR, you can configure systems from suppliers like Lambda, Dell Precision, and Dell Alienware.
 
-#### Operating System
+### Operating System
 * Ubuntu 20.04 LTS
 
-## Usage for Developers
+# Usage for Developers
 
-### Setup
+## Setup
 
 0. **Clone the repository:**
 
@@ -34,7 +31,7 @@ We demonstrate that the system can run on lightweight hardware like this. For le
     cd MediSwarm
     ```
 
-### Running the Application
+## Running the Application
 
 1. **CIFAR-10 example:**
    See [cifar10/README.md](application/jobs/cifar10/README.md)
@@ -43,35 +40,117 @@ We demonstrate that the system can run on lightweight hardware like this. For le
 3. **3D CNN for classifying breast tumors:**
    See [3dcnn_ptl/README.md](application/jobs/3dcnn_ptl/README.md)
 
-## Running Tests
+### Running Tests
 
-1. Build the required docker image (TODO should this use images from the registry?)
-    ```bash
-    docker build -t nvflare-pt-dev:3dcnn   . -f docker_config/Dockerfile_3dcnn
-    docker build -t nvflare-pt-dev:testing . -f docker_config/Dockerfile_testing
-    ```
+1. Build the Docker images required for testing
+   ```bash
+   docker build -t nvflare-pt-dev:3dcnn   . -f docker_config/Dockerfile_3dcnn
+   docker build -t nvflare-pt-dev:testing . -f docker_config/Dockerfile_testing
+   ```
 2. Run the Tests via
-    ```bash
-    ./runTestsInDocker.sh
-    ```
+   ```bash
+   ./runTestsInDocker.sh
+   ```
 3. You should see
-  1. several expected errors and warnings printed from unit tests that should succeed overall, and a coverage report
-  2. output of a successful simulation run with two nodes
-  3. output of a successful proof-of-concept run run with two nodes
+   1. several expected errors and warnings printed from unit tests that should succeed overall, and a coverage report
+   2. output of a successful simulation run with two nodes
+   3. output of a successful proof-of-concept run with two nodes
 4. Optionally, uncomment running NVFlare unit tests in `_runTestsInsideDocker.sh`
 
-## License
+
+## Contributing Application Code
+
+* Take a look at `application/jobs/minimal_training_pytorch_cnn` for a minimal example of how PyTorch code can be adapted to work with NVFlare
+* Take a look at `application/jobs/3dcnn_ptl` for a more realistic example of PyTorch code that can run in the swarm
+* TODO more detailed instructions
+
+## Setting up a Swarm
+
+* Currently described [here](/application/jobs/3dcnn_ptl/README.md)
+
+# Usage for Swarm Participants
+
+## Setup
+
+1. TODO compute node according to spec, installation of docker, openvpn, …
+
+## Prepare Dataset
+
+* TODO which data is expected in which folder structure + table structure
+
+## Prepare Training Participation
+
+1. TODO steps until startup kit has been extracted
+
+## Run Pre-Flight Check
+
+1. Directories
+   ```bash
+   export SITE_NAME=<the name of your site>  # TODO should be defined above, also needed for dataset location
+   export DATADIR=<path to where the directory $SITE_NAME containing your local data is stored>
+   export SCRATCHDIR=<path to where the training can store temporary files>
+   ```
+2. From the directory where you unpacked the startup kit,
+   ```bash
+   cd $SITE_NAME/startup
+   ```
+3. Verify that your Docker/GPU setup is working
+   ```bash
+   ./docker.sh --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU all --dummy_training
+   ```
+   * This will pull the Docker image, which might take a while.
+   * The “training” itself should take less than a minute and does not yield meaningful classification performance.
+4. Verify that your local data can be accessed and the model can be trained locally
+   ```bash
+   ./docker.sh --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU all --dummy_training
+   ```
+   * Training time depends on the size of the local dataset.
+   * TODO update call when handling of the number of epochs has been implemented
+
+## Start Swarm Node
+
+1. From the directory where you unpacked the startup kit
+   ```bash
+   cd $SITE_NAME/startup  # skip this if you just ran the pre-flight check
+   ```
+2. Start the client
+   ```bash
+   rm -rf ../pid.fl ../daemon_pid.fl nohup.out  # clean up potential leftovers from previous run
+   ./docker.sh --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU all --start_client
+   ```
+3. Console output is captured in `nohup.out`, which may have been created by the root user in the container, so make it readable:
+   ```bash
+   sudo chmod a+r nohup.out
+   ```
+4. Output files
+   * TODO describe
+
+## Run Local Training
+
+1. From the directory where you unpacked the startup kit
+   ```bash
+   cd $SITE_NAME/startup
+   ```
+2. Start local training
+   ```bash
+   ./docker.sh --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU all --local_training
+   ```
+   * TODO update when handling of the number of epochs has been implemented
+3. Output files
+   * TODO describe
+
+# License
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 
-## Maintainers
+# Maintainers
 [Jeff](https://github.com/Ultimate-Storm)
 [Ole Schwen](mailto:[email protected])
 [Steffen Renisch](mailto:[email protected])
 
-## Contributing
+# Contributing
 Feel free to dive in! [Open an issue](https://github.com/KatherLab/MediSwarm/issues) or submit pull requests.
 
-## Credits
+# Credits
 This project utilizes platforms and resources from the following repositories:
 
 - **[NVFLARE](https://github.com/NVIDIA/NVFlare)**: NVFLARE (NVIDIA Federated Learning Application Runtime Environment) is an open-source framework that provides a robust and scalable platform for federated learning applications. We have integrated NVFLARE to efficiently handle the federated learning aspects of our project.
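
The pre-flight section added to the README relies on three environment variables being set before `docker.sh` is called. A minimal Python sketch of checking them up front could look as follows. The helper name `check_preflight_env` is hypothetical and not part of MediSwarm. It assumes what the README states (that `$DATADIR` contains a directory named `$SITE_NAME`) and, as an additional assumption, that `$SCRATCHDIR` already exists.

```python
import os


def check_preflight_env(env):
    """Return a list of problems with the pre-flight environment (empty if none).

    Hypothetical helper for illustration, not part of MediSwarm.
    """
    problems = []
    # All three variables are exported in step 1 of the pre-flight check.
    for name in ("SITE_NAME", "DATADIR", "SCRATCHDIR"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    if problems:
        return problems

    # The README describes DATADIR as the parent of a directory named SITE_NAME.
    site_data = os.path.join(env["DATADIR"], env["SITE_NAME"])
    if not os.path.isdir(site_data):
        problems.append(f"{site_data} is not a directory")
    # Assumption: the scratch directory must exist before the container starts.
    if not os.path.isdir(env["SCRATCHDIR"]):
        problems.append(f"{env['SCRATCHDIR']} is not a directory")
    return problems
```

Calling `check_preflight_env(os.environ)` before running `./docker.sh` would surface a missing variable or misplaced data directory without waiting for the Docker image to be pulled.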

diff --git a/_runTestsInsideDocker.sh b/_runTestsInsideDocker.sh
@@ -8,17 +8,24 @@
 
 # run unit tests of ODELIA swarm learning and report coverage
 export MPLCONFIGDIR=/tmp
-cd tests/unit_tests/controller
-PYTHONPATH=/workspace/controller/controller python3 -m coverage run --source=/workspace/controller/controller -m unittest discover
+cd /MediSwarm/tests/unit_tests/controller
+PYTHONPATH=/MediSwarm/controller/controller python3 -m coverage run --source=/MediSwarm/controller/controller -m unittest discover
 coverage report -m
 rm .coverage
 
+# run standalone version of minimal example
+cd /workspace/application/jobs/minimal_training_pytorch_cnn/app/custom/
+export TRAINING_MODE="local_training"
+./main.py
+
 # run simulation mode for minimal example
 cd /workspace
+export TRAINING_MODE="swarm"
 nvflare simulator -w /tmp/minimal_training_pytorch_cnn -n 2 -t 2 application/jobs/minimal_training_pytorch_cnn -c simulated_node_0,simulated_node_1
 
 # run proof-of-concept mode for minimal example
 cd /workspace
+export TRAINING_MODE="swarm"
 nvflare poc prepare -c poc_client_0 poc_client_1
 nvflare poc prepare-jobs-dir -j application/jobs/
 nvflare poc start -ex [email protected]

diff --git a/application/jobs/3dcnn_ptl/app/custom/main.py b/application/jobs/3dcnn_ptl/app/custom/main.py
@@ -1,150 +1,50 @@
-from sklearn.model_selection import train_test_split
-from torch.utils.data import DataLoader, Subset
-from collections import Counter
-import torch
-from pytorch_lightning import Trainer
-from pytorch_lightning.callbacks import ModelCheckpoint
-from pytorch_lightning.loggers import TensorBoardLogger
-import nvflare.client.lightning as flare
-from data.datamodules import DataModule
-from model_selector import select_model
-from env_config import load_environment_variables, load_prediction_modules, prepare_dataset, generate_run_directory
-import nvflare.client as flare_util
+#!/usr/bin/env python3
 
 import os
-import logging
 
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-flare_util.init()
+import nvflare.client.lightning as flare
+import nvflare.client as flare_util
+import torch
 
-SITE_NAME=flare.get_site_name()
+import threedcnn_ptl
 
-#TODO: Set max_epochs based on the data set size
-NUM_EPOCHS_FOR_SITE = { "TUD_1":   2,
-                        "TUD_2":   4,
-                        "TUD_3":   8,
-                        "MEVIS_1": 2,
-                        "MEVIS_2": 4,
-                        "UKA":     2,
-                      }
+TRAINING_MODE = os.getenv("TRAINING_MODE")
 
-if SITE_NAME in NUM_EPOCHS_FOR_SITE.keys():
-    MAX_EPOCHS = NUM_EPOCHS_FOR_SITE[SITE_NAME]
+if TRAINING_MODE == "swarm":
+    flare_util.init()
+    SITE_NAME = flare.get_site_name()
+    NUM_EPOCHS = threedcnn_ptl.get_num_epochs_per_round(SITE_NAME)
+elif TRAINING_MODE in ("preflight_check", "local_training"):
+    SITE_NAME = os.getenv("SITE_NAME")
+    NUM_EPOCHS = int(os.getenv("NUM_EPOCHS"))
 else:
-    MAX_EPOCHS = 5
+    raise Exception(f"Illegal TRAINING_MODE {TRAINING_MODE}")
 
-print(f"Site name: {SITE_NAME}")
-print(f"Max epochs set to: {MAX_EPOCHS}")
 
 def main():
     """
     Main function for training and evaluating the model using NVFlare and PyTorch Lightning.
     """
+    logger = threedcnn_ptl.set_up_logging()
     try:
-        env_vars = load_environment_variables()
-        logger.info(f'Model name: {env_vars["model_name"]}')
-
-        predict, prediction_flag = load_prediction_modules(env_vars['prediction_flag'])
-        ds, task_data_name = prepare_dataset(env_vars['task_data_name'], env_vars['data_dir'], site_name=SITE_NAME)
-        path_run_dir = generate_run_directory(env_vars['scratch_dir'], env_vars['task_data_name'], env_vars['model_name'], env_vars['local_compare_flag'])
-
-        accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
-        logger.info(f"Using {accelerator} for training")
-
-        labels = ds.get_labels()
-
-        # Generate indices and perform stratified split
-        indices = list(range(len(ds)))
-        train_indices, val_indices = train_test_split(indices, test_size=0.2, stratify=labels, random_state=42)
-
-        # Create training and validation subsets
-        ds_train = Subset(ds, train_indices)
-        ds_val = Subset(ds, val_indices)
-
-        # Extract training labels using the train_indices
-        train_labels = [labels[i] for i in train_indices]
-        label_counts = Counter(train_labels)
-
-        # Calculate the total number of samples in the training set
-        total_samples = len(train_labels)
-
-        # Print the percentage of the training set for each label
-        for label, count in label_counts.items():
-            percentage = (count / total_samples) * 100
-            logger.info(f"Label '{label}': {percentage:.2f}% of the training set, Exact count: {count}")
-
-        logger.info(f"Total number of different labels in the training set: {len(label_counts)}")
-
-        adsValData = DataLoader(ds_val, batch_size=2, shuffle=False)
-        logger.info(f'adsValData type: {type(adsValData)}')
-
-        train_size = len(ds_train)
-        val_size = len(ds_val)
-        logger.info(f'Train size: {train_size}')
-        logger.info(f'Val size: {val_size}')
-
-        max_epochs = env_vars['max_epochs']
-        #cal_max_epochs = cal_max_epochs(max_epochs, cal_weightage(train_size))
-        #logger.info(f"Max epochs set to: {cal_max_epochs}")
-
-        dm = DataModule(
-            ds_train=ds_train,
-            ds_val=ds_val,
-            batch_size=1,
-            num_workers=16,
-            pin_memory=True,
-        )
-
-        # Initialize the model
-        model_name = env_vars['model_name']
-        model = select_model(model_name)
-        logger.info(f"Using model: {model_name}")
-
-        to_monitor = "val/AUC_ROC"
-        min_max = "max"
-        log_every_n_steps = 1
-
-        checkpointing = ModelCheckpoint(
-            dirpath=str(path_run_dir),
-            monitor=to_monitor,
-            save_last=True,
-            save_top_k=2,
-            mode=min_max,
-        )
-
-        trainer = Trainer(
-            accelerator=accelerator,
-            precision=16,
-            default_root_dir=str(path_run_dir),
-            callbacks=[checkpointing],
-            enable_checkpointing=True,
-            check_val_every_n_epoch=1,
-            log_every_n_steps=log_every_n_steps,
-            max_epochs=MAX_EPOCHS,
-            num_sanity_val_steps=2,
-            logger=TensorBoardLogger(save_dir=path_run_dir)
-        )
+        data_module, model, checkpointing, trainer, path_run_dir, env_vars = threedcnn_ptl.prepare_training(logger, NUM_EPOCHS, SITE_NAME)
 
-        flare.patch(trainer)  # Patch trainer to enable swarm learning
-        torch.autograd.set_detect_anomaly(True)
+        if TRAINING_MODE == "swarm":
+            flare.patch(trainer)  # Patch trainer to enable swarm learning
+            torch.autograd.set_detect_anomaly(True)
 
-        logger.info(f"Site name: {flare.get_site_name()}")
+            logger.info(f"Site name: {SITE_NAME}")
 
-        while flare.is_running():
-            input_model = flare.receive()
-            logger.info(f"Current round: {input_model.current_round}")
+            while flare.is_running():
+                input_model = flare.receive()
+                logger.info(f"Current round: {input_model.current_round}")
 
-            logger.info("--- Validate global model ---")
-            trainer.validate(model, datamodule=dm)
+                threedcnn_ptl.validate_and_train(logger, data_module, model, trainer)
 
-            logger.info("--- Train new model ---")
-            trainer.fit(model, datamodule=dm)
+        elif TRAINING_MODE in ("preflight_check", "local_training"):
+            threedcnn_ptl.validate_and_train(logger, data_module, model, trainer)
 
-        model.save_best_checkpoint(trainer.logger.log_dir, checkpointing.best_model_path)
-        predict(path_run_dir, os.path.join(env_vars['data_dir'], task_data_name, 'test'), model_name, last_flag=False, prediction_flag=prediction_flag)
-        logger.info('Training completed successfully')
+        threedcnn_ptl.finalize_training(logger, model, checkpointing, trainer, path_run_dir, env_vars)
     except Exception as e:
         logger.error(f"Error in main function: {e}")
         raise
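
The new `main.py` branches on three `TRAINING_MODE` values: `"swarm"`, `"preflight_check"`, and `"local_training"`. The following is a minimal sketch of resolving the site name and epoch count for all three in one place. It assumes that the two standalone modes read `SITE_NAME` and `NUM_EPOCHS` from the environment in the same way. The function `resolve_site_and_epochs` and its callable parameters are hypothetical; the callables stand in for `flare.get_site_name` and `threedcnn_ptl.get_num_epochs_per_round` so that the sketch runs without NVFlare installed.

```python
# Modes that run without a swarm; main.py handles both with the same call.
STANDALONE_MODES = ("preflight_check", "local_training")


def resolve_site_and_epochs(env, get_swarm_site_name, get_epochs_for_site):
    """Return (site_name, num_epochs) for the TRAINING_MODE found in env.

    Hypothetical helper for illustration, not part of MediSwarm.
    """
    mode = env.get("TRAINING_MODE")
    if mode == "swarm":
        # In swarm mode the site name comes from NVFlare, not the environment.
        site_name = get_swarm_site_name()
        return site_name, get_epochs_for_site(site_name)
    if mode in STANDALONE_MODES:
        # Fail with a readable message instead of int(None) raising TypeError.
        missing = [key for key in ("SITE_NAME", "NUM_EPOCHS") if key not in env]
        if missing:
            raise ValueError(f"{mode} requires {', '.join(missing)}")
        return env["SITE_NAME"], int(env["NUM_EPOCHS"])
    raise ValueError(f"Illegal TRAINING_MODE {mode!r}")
```

Passing the environment and the lookups as arguments keeps the mode handling testable in isolation; in `main.py` the call would receive `os.environ` and the real NVFlare and `threedcnn_ptl` functions.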