diff --git a/examples/getting_started/pt/nvflare_pt_getting_started.ipynb b/examples/getting_started/pt/nvflare_pt_getting_started.ipynb
index 696538d122..6d49374309 100644
--- a/examples/getting_started/pt/nvflare_pt_getting_started.ipynb
+++ b/examples/getting_started/pt/nvflare_pt_getting_started.ipynb
@@ -258,7 +258,11 @@
{
"cell_type": "markdown",
"id": "1b70da5d-ba8b-4e65-b47f-44bb9bddae4d",
- "metadata": {},
+ "metadata": {
+ "jupyter": {
+ "source_hidden": true
+ }
+ },
"source": [
"#### 2. Define a FedJob\n",
"The `FedJob` is used to define how controllers and executors are placed within a federated job using the `to(object, target)` routine.\n",
@@ -271,7 +275,11 @@
"cell_type": "code",
"execution_count": null,
"id": "13771bfb-901f-485a-9a23-84db1ccd5fe4",
- "metadata": {},
+ "metadata": {
+ "jupyter": {
+ "source_hidden": true
+ }
+ },
"outputs": [],
"source": [
"from src.net import Net\n",
@@ -368,7 +376,11 @@
{
"cell_type": "markdown",
"id": "9ac3f0a8-06bb-4bea-89d3-4a5fc5b76c63",
- "metadata": {},
+ "metadata": {
+ "jupyter": {
+ "source_hidden": true
+ }
+ },
"source": [
"#### 6. Run FL Simulation\n",
"Finally, we can run our FedJob in simulation using NVFlare's [simulator](https://nvflare.readthedocs.io/en/main/user_guide/nvflare_cli/fl_simulator.html) under the hood. We can also specify which GPU should be used to run this client, which is helpful for simulated environments. The results will be saved in the specified `workdir`."
@@ -438,7 +450,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.8.19"
+ "version": "3.11.7"
}
},
"nbformat": 4,
diff --git a/examples/tutorials/self-paced-training/.gitignore b/examples/tutorials/self-paced-training/.gitignore
new file mode 100644
index 0000000000..76a8eead5c
--- /dev/null
+++ b/examples/tutorials/self-paced-training/.gitignore
@@ -0,0 +1,3 @@
+.virtual_documents
+
+job_configs/
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.0_introduction/introduction.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.0_introduction/introduction.ipynb
index 9a2cfe64f7..277560b683 100644
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.0_introduction/introduction.ipynb
+++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.0_introduction/introduction.ipynb
@@ -11,9 +11,9 @@
],
"metadata": {
"kernelspec": {
- "display_name": "nvflare_example",
+ "display_name": "Python 3 (ipykernel)",
"language": "python",
- "name": "nvflare_example"
+ "name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -25,7 +25,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.2"
+ "version": "3.11.7"
}
},
"nbformat": 4,
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/figs/nvflare_brats18.png b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/figs/nvflare_brats18.png
deleted file mode 100644
index 577c046bb7..0000000000
Binary files a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/figs/nvflare_brats18.png and /dev/null differ
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/privacy_filtering.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/privacy_filtering.ipynb
index 6993c58b0b..515d1bdb66 100644
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/privacy_filtering.ipynb
+++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/privacy_filtering.ipynb
@@ -5,43 +5,23 @@
"id": "1398ef0a-f189-4d04-a8a9-276a17ab0f8b",
"metadata": {},
"source": [
- "# Federated Learning with Differential Privacy for BraTS18 Segmentation\n",
+ "# Federated Learning with Differential Privacy\n",
"\n",
- "Please make sure you set up virtual environment and follows [example root readme](../../README.md)\n",
+ "Please make sure you set up a virtual environment and follow [example root readme](../../README.md) before starting this notebook.\n",
+ "Then, install the requirements.\n",
"\n",
- "## Introduction to MONAI, BraTS and Differential Privacy"
+ "
NOTE Some of the cells below generate long text output. We're using
%%capture --no-display --no-stderr cell_output
to suppress this output. Comment or delete this line in the cells below to restore full output.
"
]
},
{
- "cell_type": "markdown",
- "id": "af5f3c69-aeba-4cea-89c8-d54fd6520ab1",
- "metadata": {},
- "source": [
- "### MONAI\n",
- "This example shows how to use [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) on medical image applications.\n",
- "It uses [MONAI](https://github.com/Project-MONAI/MONAI),\n",
- "which is a PyTorch-based, open-source framework for deep learning in healthcare imaging, part of the PyTorch Ecosystem."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8a3b7e0b-9dbd-4d21-8b59-a3d08cf2b2bb",
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "5002e45c-f58e-4f68-bb5a-9626e084947f",
"metadata": {},
+ "outputs": [],
"source": [
- "### BraTS\n",
- "The application shown in this example is volumetric (3D) segmentation of brain tumor subregions from multimodal MRIs based on BraTS 2018 data.\n",
- "It uses a deep network model published by [Myronenko 2018](https://arxiv.org/abs/1810.11654) [1].\n",
- "\n",
- "The model is trained to segment 3 nested subregions of primary brain tumors (gliomas): the \"enhancing tumor\" (ET), the \"tumor core\" (TC), the \"whole tumor\" (WT) based on 4 aligned input MRI scans (T1c, T1, T2, FLAIR). \n",
- "\n",
- "
\\n\n",
- "\n",
- "- The ET is described by areas that show hyper intensity in T1c when compared to T1, but also when compared to \"healthy\" white matter in T1c. \n",
- "- The TC describes the bulk of the tumor, which is what is typically resected. The TC entails the ET, as well as the necrotic (fluid-filled) and the non-enhancing (solid) parts of the tumor. \n",
- "- The WT describes the complete extent of the disease, as it entails the TC and the peritumoral edema (ED), which is typically depicted by hyper-intense signal in FLAIR.\n",
- "\n",
- "To run this example, please make sure you have downloaded BraTS 2018 data, which can be obtained from [Multimodal Brain Tumor Segmentation Challenge (BraTS) 2018](https://www.med.upenn.edu/cbica/brats2018.html) [2-6]. Please download the data to [./dataset_brats18/dataset](./dataset_brats18/dataset). It should result in a sub-folder `./dataset_brats18/dataset/training`.\n",
- "In this example, we split BraTS18 dataset into [4 subsets](./dataset_brats18/datalist) for 4 clients. Each client requires at least a 12 GB GPU to run. "
+ "%%capture --no-display --no-stderr cell_output\n",
+ "!pip install -r requirements.txt"
]
},
{
@@ -50,29 +30,8 @@
"metadata": {},
"source": [
"### Differential Privacy (DP)\n",
- "[Differential Privacy (DP)](https://arxiv.org/abs/1910.00962) [7] is method for ensuring that Federated Learning (FL) preserves privacy by obfuscating the model updates sent from clients to the central server.\n",
- "This example shows the usage of a MONAI-based trainer for medical image applications with NVFlare, as well as the usage of DP filters in your FL training. DP is added as a filter in `config_fed_client.json`. Here, we use the \"Sparse Vector Technique\", i.e. the [SVTPrivacy](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.filters.svt_privacy.html) protocol, as utilized in [Li et al. 2019](https://arxiv.org/abs/1910.00962) [7] (see [Lyu et al. 2016](https://arxiv.org/abs/1603.01699) [8] for more information)."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "33306153-a8c5-4c2b-9eea-28c3e0d705a6",
- "metadata": {},
- "source": [
- "## Prepare local configs\n",
- "First, we add the image and datalist directory roots to `config_train.json` files for generating the absolute path to the dataset by replacing the `DATASET_ROOT` and `DATALIST_ROOT` placeholders. In the current folder structure, it will be `${PWD}/dataset_brats18/dataset` for `DATASET_ROOT` and `${PWD}/dataset_brats18/datalist` for `DATALIST_ROOT` but you can update the below `sed` commands if the data is located somewhere else."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "id": "be8a7f47-2a93-4992-85c0-d597f4ecf3d5",
- "metadata": {},
- "outputs": [],
- "source": [
- "%%bash\n",
- "sed -i \"s|DATASET_ROOT|${PWD}/dataset_brats18/dataset|g\" config_train.json\n",
- "sed -i \"s|DATALIST_ROOT|${PWD}/dataset_brats18/datalist|g\" config_train.json"
+ "[Differential Privacy (DP)](https://arxiv.org/abs/1910.00962) [7] is a method for ensuring that Federated Learning (FL) preserves privacy by obfuscating the model updates sent from clients to the central server. \n",
+ "This example shows the usage of a CIFAR-10 training code with NVFlare, as well as the usage of DP filters in your FL training. DP is added as a filter in `config_fed_client.json`. Here, we use the \"Sparse Vector Technique\", i.e. the [SVTPrivacy](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.filters.svt_privacy.html) protocol, as utilized in [Li et al. 2019](https://arxiv.org/abs/1910.00962) [7] (see [Lyu et al. 2016](https://arxiv.org/abs/1603.01699) [8] for more information)."
]
},
{
@@ -83,33 +42,8 @@
"## Run experiments with FL simulator\n",
"### Training with FL simulator\n",
"FL simulator is used to simulate FL experiments or debug codes, not for real FL deployment.\n",
- "In this example, we assume four local GPUs with at least 12GB of memory are available."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8d256389-4112-46e6-86bd-115c9bf2e189",
- "metadata": {},
- "source": [
- "Then, we can run the FL simulator with 1 client for centralized training"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a9583165-af58-45ed-a86d-9fbfc74d80ca",
- "metadata": {},
- "outputs": [],
- "source": [
- "!python3 -u -m nvflare.private.fed.app.simulator.simulator './configs/brats_central' -w './workspace_brats/brats_central' -n 1 -t 1 -gpu 0"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f1121bc2-1118-4b64-8d61-06e1a49bc7ef",
- "metadata": {},
- "source": [
- "Similarly, run the FL simulator with 4 clients for federated learning by running"
+ "\n",
+ "First, train a model using the FedAvg algorithm with four clients without DP."
]
},
{
@@ -119,7 +53,7 @@
"metadata": {},
"outputs": [],
"source": [
- "!nvflare simulator './configs/brats_fedavg' -w './workspace_brats/brats_fedavg' -n 4 -t 4 -gpu 0,1,2,3"
+ "!nvflare simulator './configs/brats_fedavg' -w './workspace_brats/brats_fedavg' -n 4 -t 4"
]
},
{
@@ -137,7 +71,7 @@
"metadata": {},
"outputs": [],
"source": [
- "!nvflare simulator './configs/brats_fedavg_dp' -w './workspace_brats/brats_fedavg_dp' -n 4 -t 4 -gpu 0,1,2,3"
+ "!nvflare simulator './configs/brats_fedavg_dp' -w './workspace_brats/brats_fedavg_dp' -n 4 -t 4"
]
},
{
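For intuition about what the SVT-based filter referenced in this notebook does to a client's model update, here is a rough conceptual sketch: only a small fraction of update components is released, selected via a noisy threshold, and the released values are clipped and noised before leaving the client. This illustrates the general sparse-vector idea only; it is not the actual `SVTPrivacy` implementation, and the function name and the way `fraction`, `epsilon`, `gamma`, and `tau` are used are assumptions made for the sketch.

```python
import numpy as np

def svt_sanitize_sketch(update, fraction=0.1, epsilon=0.1, gamma=1e-5, tau=1e-6, seed=0):
    """Conceptual sparse-vector-style sanitization of a flat model update (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_share = int(fraction * update.size)                        # at most this many components are released
    noisy_threshold = tau + rng.laplace(scale=2.0 * gamma / epsilon)
    noisy_scores = np.abs(update) + rng.laplace(scale=4.0 * gamma / epsilon, size=update.shape)
    selected = np.where(noisy_scores >= noisy_threshold)[0][:n_share]
    sanitized = np.zeros_like(update)
    clipped = np.clip(update[selected], -gamma, gamma)            # clip before adding release noise
    sanitized[selected] = clipped + rng.laplace(scale=2.0 * gamma / epsilon, size=clipped.shape)
    return sanitized

# Toy example: sanitize a 10-parameter update before it would be sent to the server.
print(svt_sanitize_sketch(np.linspace(-0.5, 0.5, 10)))
```

In the notebook itself, the filter is configured through the `brats_fedavg_dp` job config (`config_fed_client.json`) rather than hand-written code like this.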
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/learners/supervised_learner.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/learners/supervised_learner.py
deleted file mode 100644
index e71c16f627..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/learners/supervised_learner.py
+++ /dev/null
@@ -1,329 +0,0 @@
-# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import copy
-from abc import abstractmethod
-
-import numpy as np
-import torch
-from torch.utils.tensorboard import SummaryWriter
-
-from nvflare.apis.dxo import DXO, DataKind, MetaKey, from_shareable
-from nvflare.apis.fl_constant import FLContextKey, ReturnCode
-from nvflare.apis.fl_context import FLContext
-from nvflare.apis.shareable import Shareable, make_reply
-from nvflare.apis.signal import Signal
-from nvflare.app_common.abstract.learner_spec import Learner
-from nvflare.app_common.app_constant import AppConstants, ValidateType
-
-
-class SupervisedLearner(Learner):
- def __init__(
- self,
- aggregation_epochs: int = 1,
- train_task_name: str = AppConstants.TASK_TRAIN,
- ):
- """Simple Supervised Trainer.
- This provides the basic functionality of a local learner: perform before-train validation on
- global model at the beginning of each round, perform local training, and send the updated weights.
- No model will be saved locally, tensorboard record for local loss and global model validation score.
- Enabled both FedAvg and FedProx
-
- Args:
- train_config_filename: directory of config file.
- aggregation_epochs: the number of training epochs for a round. Defaults to 1.
- train_task_name: name of the task to train the model.
-
- Returns:
- a Shareable with the updated local model after running `execute()`
- """
- super().__init__()
- # trainer init happens at the very beginning, only the basic info regarding the trainer is set here
- # the actual run has not started at this point
- self.aggregation_epochs = aggregation_epochs
- self.train_task_name = train_task_name
- # Epoch counter
- self.epoch_of_start_time = 0
- self.epoch_global = 0
- # FedProx related
- self.fedproxloss_mu = 0.0
- self.criterion_prox = None
-
- def initialize(self, parts: dict, fl_ctx: FLContext):
- # when a run starts, this is where the actual settings get initialized for trainer
-
- # set the paths according to fl_ctx
- engine = fl_ctx.get_engine()
- ws = engine.get_workspace()
- app_dir = ws.get_app_dir(fl_ctx.get_job_id())
-
- # get and print the args
- fl_args = fl_ctx.get_prop(FLContextKey.ARGS)
- self.client_id = fl_ctx.get_identity_name()
- self.log_info(
- fl_ctx,
- f"Client {self.client_id} initialized with args: \n {fl_args}",
- )
-
- # set local tensorboard writer for local validation score of global model
- self.writer = SummaryWriter(app_dir)
- # set the training-related contexts, this is task-specific
- self.train_config(fl_ctx)
-
- @abstractmethod
- def train_config(self, fl_ctx: FLContext):
- """Traning configurations customized to individual tasks
- This can be specified / loaded in any ways
- as long as they are made available for further training and validation
- some potential items include but not limited to:
- self.lr
- self.fedproxloss_mu
- self.model
- self.device
- self.optimizer
- self.criterion
- self.transform_train
- self.transform_valid
- self.transform_post
- self.train_loader
- self.valid_loader
- self.inferer
- self.valid_metric
- """
- raise NotImplementedError
-
- @abstractmethod
- def finalize(self, fl_ctx: FLContext):
- # collect threads, close files here
- pass
-
- def local_train(
- self,
- fl_ctx,
- train_loader,
- model_global,
- abort_signal: Signal,
- ):
- """Typical training logic
- Total local epochs: self.aggregation_epochs
- Load data pairs from train_loader: image / label
- Compute outputs with self.model
- Compute loss with self.criterion
- Add fedprox loss
- Update model
- """
- for epoch in range(self.aggregation_epochs):
- if abort_signal.triggered:
- return make_reply(ReturnCode.TASK_ABORTED)
- self.model.train()
- epoch_len = len(train_loader)
- self.epoch_global = self.epoch_of_start_time + epoch
- self.log_info(
- fl_ctx,
- f"Local epoch {self.client_id}: {epoch + 1}/{self.aggregation_epochs} (lr={self.lr})",
- )
- for i, batch_data in enumerate(train_loader):
- if abort_signal.triggered:
- return make_reply(ReturnCode.TASK_ABORTED)
- inputs = batch_data["image"].to(self.device)
- labels = batch_data["label"].to(self.device)
-
- # forward + backward + optimize
- outputs = self.model(inputs)
- loss = self.criterion(outputs, labels)
-
- # FedProx loss term
- if self.fedproxloss_mu > 0:
- fed_prox_loss = self.criterion_prox(self.model, model_global)
- loss += fed_prox_loss
-
- self.optimizer.zero_grad()
- loss.backward()
- self.optimizer.step()
- current_step = epoch_len * self.epoch_global + i
- self.writer.add_scalar("train_loss", loss.item(), current_step)
-
- def local_valid(
- self,
- model,
- valid_loader,
- abort_signal: Signal,
- tb_id=None,
- record_epoch=None,
- ):
- """Typical validation logic
- Load data pairs from train_loader: image / label
- Compute outputs with self.model
- Perform post transform (binarization, etc.)
- Compute evaluation metric with self.valid_metric
- Add score to tensorboard record with specified id
- """
- model.eval()
- with torch.no_grad():
- metric = 0
- for i, batch_data in enumerate(valid_loader):
- if abort_signal.triggered:
- return make_reply(ReturnCode.TASK_ABORTED)
- val_images = batch_data["image"].to(self.device)
- val_labels = batch_data["label"].to(self.device)
- # Inference
- val_outputs = self.inferer(val_images, model)
- val_outputs = self.transform_post(val_outputs)
- # Compute metric
- metric_score = self.valid_metric(y_pred=val_outputs, y=val_labels)
- metric += metric_score.item()
- # compute mean dice over whole validation set
- metric /= len(valid_loader)
- # tensorboard record id, add to record if provided
- if tb_id:
- self.writer.add_scalar(tb_id, metric_score, record_epoch)
- return metric
-
- def train(
- self,
- shareable: Shareable,
- fl_ctx: FLContext,
- abort_signal: Signal,
- ) -> Shareable:
- """Typical training task pipeline with potential HE and fedprox functionalities
- Get global model weights (potentially with HE)
- Prepare for fedprox loss
- Local training
- Return updated weights (model_diff)
- """
- if abort_signal.triggered:
- return make_reply(ReturnCode.TASK_ABORTED)
-
- # get round information
- current_round = shareable.get_header(AppConstants.CURRENT_ROUND)
- total_rounds = shareable.get_header(AppConstants.NUM_ROUNDS)
- self.log_info(fl_ctx, f"Current/Total Round: {current_round + 1}/{total_rounds}")
- self.log_info(fl_ctx, f"Client identity: {fl_ctx.get_identity_name()}")
-
- # update local model weights with received weights
- dxo = from_shareable(shareable)
- global_weights = dxo.data
-
- # Before loading weights, tensors might need to be reshaped to support HE for secure aggregation.
- local_var_dict = self.model.state_dict()
- model_keys = global_weights.keys()
- for var_name in local_var_dict:
- if var_name in model_keys:
- weights = global_weights[var_name]
- try:
- # reshape global weights to compute difference later on
- global_weights[var_name] = np.reshape(weights, local_var_dict[var_name].shape)
- # update the local dict
- local_var_dict[var_name] = torch.as_tensor(global_weights[var_name])
- except Exception as e:
- raise ValueError("Convert weight from {} failed with error: {}".format(var_name, str(e)))
- self.model.load_state_dict(local_var_dict)
-
- # local steps
- epoch_len = len(self.train_loader)
- self.log_info(fl_ctx, f"Local steps per epoch: {epoch_len}")
-
- # make a copy of model_global as reference for potential FedProx loss
- if self.fedproxloss_mu > 0:
- model_global = copy.deepcopy(self.model)
- for param in model_global.parameters():
- param.requires_grad = False
- else:
- model_global = None
-
- # local train
- self.local_train(
- fl_ctx=fl_ctx,
- train_loader=self.train_loader,
- model_global=model_global,
- abort_signal=abort_signal,
- )
- if abort_signal.triggered:
- return make_reply(ReturnCode.TASK_ABORTED)
- self.epoch_of_start_time += self.aggregation_epochs
-
- # compute delta model, global model has the primary key set
- local_weights = self.model.state_dict()
- model_diff = {}
- for name in global_weights:
- if name not in local_weights:
- continue
- model_diff[name] = np.subtract(local_weights[name].cpu().numpy(), global_weights[name], dtype=np.float32)
- if np.any(np.isnan(model_diff[name])):
- self.system_panic(f"{name} weights became NaN...", fl_ctx)
- return make_reply(ReturnCode.EXECUTION_EXCEPTION)
-
- # flush the tb writer
- self.writer.flush()
-
- # build the shareable
- dxo = DXO(data_kind=DataKind.WEIGHT_DIFF, data=model_diff)
- dxo.set_meta_prop(MetaKey.NUM_STEPS_CURRENT_ROUND, epoch_len)
-
- self.log_info(fl_ctx, "Local epochs finished. Returning shareable")
- return dxo.to_shareable()
-
- def validate(self, shareable: Shareable, fl_ctx: FLContext, abort_signal: Signal) -> Shareable:
- """Typical validation task pipeline with potential HE functionality
- Get global model weights (potentially with HE)
- Validation on local data
- Return validation score
- """
- if abort_signal.triggered:
- return make_reply(ReturnCode.TASK_ABORTED)
-
- # validation on global model
- model_owner = "global_model"
-
- # update local model weights with received weights
- dxo = from_shareable(shareable)
- global_weights = dxo.data
-
- # Before loading weights, tensors might need to be reshaped to support HE for secure aggregation.
- local_var_dict = self.model.state_dict()
- model_keys = global_weights.keys()
- n_loaded = 0
- for var_name in local_var_dict:
- if var_name in model_keys:
- weights = torch.as_tensor(global_weights[var_name], device=self.device)
- try:
- # update the local dict
- local_var_dict[var_name] = torch.as_tensor(torch.reshape(weights, local_var_dict[var_name].shape))
- n_loaded += 1
- except Exception as e:
- raise ValueError("Convert weight from {} failed with error: {}".format(var_name, str(e)))
- self.model.load_state_dict(local_var_dict)
- if n_loaded == 0:
- raise ValueError(f"No weights loaded for validation! Received weight dict is {global_weights}")
-
- # before_train_validate only, can extend to other validate types
- validate_type = shareable.get_header(AppConstants.VALIDATE_TYPE)
- if validate_type == ValidateType.BEFORE_TRAIN_VALIDATE:
- # perform valid before local train
- global_metric = self.local_valid(
- self.model,
- self.valid_loader,
- abort_signal,
- tb_id="val_metric_global_model",
- record_epoch=self.epoch_global,
- )
- if abort_signal.triggered:
- return make_reply(ReturnCode.TASK_ABORTED)
- self.log_info(fl_ctx, f"val_metric_global_model ({model_owner}): {global_metric:.4f}")
- # validation metrics will be averaged with weights at server end for best model record
- metric_dxo = DXO(data_kind=DataKind.METRICS, data={MetaKey.INITIAL_METRICS: global_metric}, meta={})
- metric_dxo.set_meta_prop(MetaKey.NUM_STEPS_CURRENT_ROUND, len(self.valid_loader))
- return metric_dxo.to_shareable()
- else:
- return make_reply(ReturnCode.VALIDATE_TYPE_UNKNOWN)
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/learners/supervised_monai_brats_learner.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/learners/supervised_monai_brats_learner.py
deleted file mode 100644
index 9c414aee44..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/learners/supervised_monai_brats_learner.py
+++ /dev/null
@@ -1,269 +0,0 @@
-# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import json
-import os
-
-import numpy as np
-import torch
-import torch.optim as optim
-from monai.data import CacheDataset, DataLoader, Dataset, load_decathlon_datalist
-from monai.inferers import SlidingWindowInferer
-from monai.losses import DiceLoss
-from monai.metrics import DiceMetric
-from monai.networks.nets.segresnet import SegResNet
-from monai.transforms import (
- Activations,
- AsDiscrete,
- Compose,
- ConvertToMultiChannelBasedOnBratsClassesd,
- DivisiblePadd,
- EnsureChannelFirstd,
- LoadImaged,
- NormalizeIntensityd,
- Orientationd,
- RandFlipd,
- RandScaleIntensityd,
- RandShiftIntensityd,
- RandSpatialCropd,
- Spacingd,
-)
-from pt.learners.supervised_learner import SupervisedLearner
-from pt.utils.custom_client_datalist_json_path import custom_client_datalist_json_path
-
-from nvflare.apis.fl_constant import ReturnCode
-from nvflare.apis.fl_context import FLContext
-from nvflare.apis.shareable import make_reply
-from nvflare.apis.signal import Signal
-from nvflare.app_common.app_constant import AppConstants
-from nvflare.app_opt.pt.fedproxloss import PTFedProxLoss
-
-
-class SupervisedMonaiBratsLearner(SupervisedLearner):
- def __init__(
- self,
- train_config_filename,
- aggregation_epochs: int = 1,
- train_task_name: str = AppConstants.TASK_TRAIN,
- ):
- """MONAI Learner for BraTS18 segmentation task.
- It inherits from SupervisedLearner.
-
- Args:
- train_config_filename: path for config file, this is an addition term for config loading
- aggregation_epochs: the number of training epochs for a round.
- train_task_name: name of the task to train the model.
-
- Returns:
- a Shareable with the updated local model after running `execute()`
- """
- super().__init__(
- aggregation_epochs=aggregation_epochs,
- train_task_name=train_task_name,
- )
- self.train_config_filename = train_config_filename
- self.config_info = None
-
- def train_config(self, fl_ctx: FLContext):
- """MONAI traning configuration
- Here, we use a json to specify the needed parameters
- """
-
- # Load training configurations json
- engine = fl_ctx.get_engine()
- ws = engine.get_workspace()
- app_config_dir = ws.get_app_config_dir(fl_ctx.get_job_id())
- train_config_file_path = os.path.join(app_config_dir, self.train_config_filename)
- if not os.path.isfile(train_config_file_path):
- self.log_error(
- fl_ctx,
- f"Training configuration file does not exist at {train_config_file_path}",
- )
- with open(train_config_file_path) as file:
- self.config_info = json.load(file)
-
- # Get the config_info
- self.lr = self.config_info["learning_rate"]
- self.fedproxloss_mu = self.config_info["fedproxloss_mu"]
- cache_rate = self.config_info["cache_dataset"]
- dataset_base_dir = self.config_info["dataset_base_dir"]
- datalist_json_path = self.config_info["datalist_json_path"]
- self.roi_size = self.config_info.get("roi_size", (224, 224, 144))
- self.infer_roi_size = self.config_info.get("infer_roi_size", (240, 240, 160))
-
- # Get datalist json
- datalist_json_path = custom_client_datalist_json_path(datalist_json_path, self.client_id)
-
- # Set datalist
- train_list = load_decathlon_datalist(
- data_list_file_path=datalist_json_path,
- is_segmentation=True,
- data_list_key="training",
- base_dir=dataset_base_dir,
- )
- valid_list = load_decathlon_datalist(
- data_list_file_path=datalist_json_path,
- is_segmentation=True,
- data_list_key="validation",
- base_dir=dataset_base_dir,
- )
- self.log_info(
- fl_ctx,
- f"Training Size: {len(train_list)}, Validation Size: {len(valid_list)}",
- )
-
- # Set the training-related context
- self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
- self.model = SegResNet(
- blocks_down=[1, 2, 2, 4],
- blocks_up=[1, 1, 1],
- init_filters=16,
- in_channels=4,
- out_channels=3,
- dropout_prob=0.2,
- ).to(self.device)
- self.optimizer = optim.Adam(self.model.parameters(), lr=self.lr, weight_decay=1e-5)
- self.lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(self.optimizer, T_max=100)
- self.criterion = DiceLoss(
- smooth_nr=0,
- smooth_dr=1e-5,
- squared_pred=True,
- to_onehot_y=False,
- sigmoid=True,
- )
-
- if self.fedproxloss_mu > 0:
- self.log_info(fl_ctx, f"using FedProx loss with mu {self.fedproxloss_mu}")
- self.criterion_prox = PTFedProxLoss(mu=self.fedproxloss_mu)
-
- self.transform_train = Compose(
- [
- # load Nifti image
- LoadImaged(keys=["image", "label"]),
- EnsureChannelFirstd(keys="image"),
- ConvertToMultiChannelBasedOnBratsClassesd(keys="label"),
- Spacingd(
- keys=["image", "label"],
- pixdim=(1.0, 1.0, 1.0),
- mode=("bilinear", "nearest"),
- ),
- Orientationd(keys=["image", "label"], axcodes="RAS"),
- RandSpatialCropd(keys=["image", "label"], roi_size=self.roi_size, random_size=False),
- RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
- RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=1),
- RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=2),
- NormalizeIntensityd(keys="image", nonzero=True, channel_wise=True),
- RandScaleIntensityd(keys="image", factors=0.1, prob=1.0),
- RandShiftIntensityd(keys="image", offsets=0.1, prob=1.0),
- ]
- )
- self.transform_valid = Compose(
- [
- LoadImaged(keys=["image", "label"]),
- EnsureChannelFirstd(keys="image"),
- ConvertToMultiChannelBasedOnBratsClassesd(keys="label"),
- Spacingd(
- keys=["image", "label"],
- pixdim=(1.0, 1.0, 1.0),
- mode=("bilinear", "nearest"),
- ),
- DivisiblePadd(keys=["image", "label"], k=32),
- Orientationd(keys=["image", "label"], axcodes="RAS"),
- NormalizeIntensityd(keys="image", nonzero=True, channel_wise=True),
- ]
- )
- self.transform_post = Compose([Activations(sigmoid=True), AsDiscrete(threshold=0.5)])
-
- # Set dataset
- if cache_rate > 0.0:
- self.train_dataset = CacheDataset(
- data=train_list,
- transform=self.transform_train,
- cache_rate=cache_rate,
- num_workers=1,
- )
- self.valid_dataset = CacheDataset(
- data=valid_list,
- transform=self.transform_valid,
- cache_rate=cache_rate,
- num_workers=1,
- )
- else:
- self.train_dataset = Dataset(
- data=train_list,
- transform=self.transform_train,
- )
- self.valid_dataset = Dataset(
- data=valid_list,
- transform=self.transform_valid,
- )
-
- self.train_loader = DataLoader(
- self.train_dataset,
- batch_size=1,
- shuffle=True,
- num_workers=1,
- )
- self.valid_loader = DataLoader(
- self.valid_dataset,
- batch_size=1,
- shuffle=False,
- num_workers=1,
- )
-
- # Set inferer and evaluation metric
- self.inferer = SlidingWindowInferer(roi_size=self.infer_roi_size, sw_batch_size=1, overlap=0.5)
- self.valid_metric = DiceMetric(include_background=True, reduction="mean")
-
- # Brats has 3 classes, so the metric computation needs some change
- def local_valid(
- self,
- model,
- valid_loader,
- abort_signal: Signal,
- tb_id=None,
- record_epoch=None,
- ):
- """Typical validation logic
- Load data pairs from train_loader: image / label
- Compute outputs with self.model
- Perform post transform (binarization, etc.)
- Compute evaluation metric with self.valid_metric
- Add score to tensorboard record with specified id
- """
- model.eval()
- with torch.no_grad():
- metric = 0
- ct = 0
- for i, batch_data in enumerate(valid_loader):
- if abort_signal.triggered:
- return make_reply(ReturnCode.TASK_ABORTED)
- val_images = batch_data["image"].to(self.device)
- val_labels = batch_data["label"].to(self.device)
- # Inference
- val_outputs = self.inferer(val_images, model)
- val_outputs = self.transform_post(val_outputs)
- # Compute metric
- metric_score = self.valid_metric(y_pred=val_outputs, y=val_labels)
- for sub_region in range(3):
- metric_score_single = metric_score[0][sub_region].item()
- if not np.isnan(metric_score_single):
- metric += metric_score_single
- ct += 1
- # compute mean dice over whole validation set
- metric /= ct
- # tensorboard record id, add to record if provided
- if tb_id:
- self.writer.add_scalar(tb_id, metric, record_epoch)
- return metric
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/utils/custom_client_datalist_json_path.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/utils/custom_client_datalist_json_path.py
deleted file mode 100644
index df685536ed..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/pt/utils/custom_client_datalist_json_path.py
+++ /dev/null
@@ -1,30 +0,0 @@
-# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-
-
-def custom_client_datalist_json_path(datalist_json_path: str, client_id: str) -> str:
- """
- Customize datalist_json_path for each client
- Args:
- datalist_json_path: root path containing all jsons
- client_id: e.g., site-2
- """
- # Customize datalist_json_path for each client
- datalist_json_path_client = os.path.join(
- datalist_json_path,
- client_id + ".json",
- )
- return datalist_json_path_client
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/requirements.txt b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/requirements.txt
deleted file mode 100644
index 4f2d4382b8..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/requirements.txt
+++ /dev/null
@@ -1,10 +0,0 @@
-nvflare~=2.5.0rc
-torch
-torchvision
-tensorboard
-monai
-tqdm
-nibabel
-tensorflow
-seaborn
-matplotlib
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/brats_3d_test_only.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/brats_3d_test_only.py
deleted file mode 100644
index 883dad7132..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/brats_3d_test_only.py
+++ /dev/null
@@ -1,150 +0,0 @@
-# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-
-import numpy as np
-import torch
-from monai.data import DataLoader, Dataset, load_decathlon_datalist
-from monai.inferers import SlidingWindowInferer
-from monai.metrics import DiceMetric
-from monai.networks.nets.segresnet import SegResNet
-from monai.transforms import (
- Activations,
- AsDiscrete,
- Compose,
- ConvertToMultiChannelBasedOnBratsClassesd,
- DivisiblePadd,
- EnsureChannelFirstd,
- LoadImaged,
- NormalizeIntensityd,
- Orientationd,
- Spacingd,
-)
-
-
-def main():
- parser = argparse.ArgumentParser(description="Model Testing")
- parser.add_argument("--model_path", type=str)
- parser.add_argument("--dataset_base_dir", default="../dataset_brats18/dataset", type=str)
- parser.add_argument("--datalist_json_path", default="../dataset_brats18/datalist/site-All.json", type=str)
- args = parser.parse_args()
-
- # Set basic settings and paths
- dataset_base_dir = args.dataset_base_dir
- datalist_json_path = args.datalist_json_path
- model_path = args.model_path
- infer_roi_size = (240, 240, 160)
-
- device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
-
- # Set datalists
- test_list = load_decathlon_datalist(
- data_list_file_path=datalist_json_path,
- is_segmentation=True,
- data_list_key="validation",
- base_dir=dataset_base_dir,
- )
- print(f"Testing Size: {len(test_list)}")
-
- # Network, optimizer, and loss
- model = SegResNet(
- blocks_down=[1, 2, 2, 4],
- blocks_up=[1, 1, 1],
- init_filters=16,
- in_channels=4,
- out_channels=3,
- dropout_prob=0.2,
- ).to(device)
- model_weights = torch.load(model_path)
- model_weights = model_weights["model"]
- model.load_state_dict(model_weights)
-
- # Inferer, evaluation metric
- inferer = SlidingWindowInferer(roi_size=infer_roi_size, sw_batch_size=1, overlap=0.5)
- valid_metric = DiceMetric(include_background=True, reduction="mean")
-
- transform = Compose(
- [
- LoadImaged(keys=["image", "label"]),
- EnsureChannelFirstd(keys="image"),
- ConvertToMultiChannelBasedOnBratsClassesd(keys="label"),
- Spacingd(
- keys=["image", "label"],
- pixdim=(1.0, 1.0, 1.0),
- mode=("bilinear", "nearest"),
- ),
- DivisiblePadd(keys=["image", "label"], k=32),
- Orientationd(keys=["image", "label"], axcodes="RAS"),
- NormalizeIntensityd(keys="image", nonzero=True, channel_wise=True),
- ]
- )
- transform_post = Compose([Activations(sigmoid=True), AsDiscrete(threshold=0.5)])
-
- # Set dataset
- test_dataset = Dataset(data=test_list, transform=transform)
- test_loader = DataLoader(
- test_dataset,
- batch_size=1,
- shuffle=False,
- num_workers=1,
- )
-
- # Train
- model.eval()
- with torch.no_grad():
- metric = 0
- metric_tc = 0
- metric_wt = 0
- metric_et = 0
- ct = 0
- ct_tc = 0
- ct_wt = 0
- ct_et = 0
- for i, batch_data in enumerate(test_loader):
- images = batch_data["image"].to(device)
- labels = batch_data["label"].to(device)
- # Inference
- outputs = inferer(images, model)
- outputs = transform_post(outputs)
- # Compute metric
- metric_score = valid_metric(y_pred=outputs, y=labels)
- if not np.isnan(metric_score[0][0].item()):
- metric += metric_score[0][0].item()
- ct += 1
- metric_tc += metric_score[0][0].item()
- ct_tc += 1
- if not np.isnan(metric_score[0][1].item()):
- metric += metric_score[0][1].item()
- ct += 1
- metric_wt += metric_score[0][1].item()
- ct_wt += 1
- if not np.isnan(metric_score[0][2].item()):
- metric += metric_score[0][2].item()
- ct += 1
- metric_et += metric_score[0][2].item()
- ct_et += 1
- # compute mean dice over whole validation set
- metric_tc /= ct_tc
- metric_wt /= ct_wt
- metric_et /= ct_et
- metric /= ct
- print(f"Test Dice: {metric:.4f}, Valid count: {ct}")
- print(f"Test Dice TC: {metric_tc:.4f}, Valid count: {ct_tc}")
- print(f"Test Dice WT: {metric_wt:.4f}, Valid count: {ct_wt}")
- print(f"Test Dice ET: {metric_et:.4f}, Valid count: {ct_et}")
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/plot_tensorboard_events.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/plot_tensorboard_events.py
deleted file mode 100644
index 075fda9ed6..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/plot_tensorboard_events.py
+++ /dev/null
@@ -1,117 +0,0 @@
-# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import glob
-import os
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-import tensorflow as tf
-
-# poc workspace
-client_results_root = "../workspace_brats/"
-
-# All sites used the same validation set for Brats, so only 1 site's record is needed
-site_num = 1
-client_pre = "app_site-"
-sites_fl = [str(site + 1) for site in range(site_num)]
-
-# Central vs. FedAvg vs. FedAvg_DP
-experiments = {
- "brats_central": {"tag": "val_metric_global_model", "site": "All"},
- "brats_fedavg": {"tag": "val_metric_global_model"},
- "brats_fedavg_dp": {"tag": "val_metric_global_model"},
-}
-
-weight = 0.8
-
-
-def smooth(scalars, weight): # Weight between 0 and 1
- last = scalars[0] # First value in the plot (first timestep)
- smoothed = list()
- for point in scalars:
- smoothed_val = last * weight + (1 - weight) * point # Calculate smoothed value
- smoothed.append(smoothed_val) # Save it
- last = smoothed_val # Anchor the last smoothed value
- return smoothed
-
-
-def read_eventfile(filepath, tags=["val_metric_global_model"]):
- data = {}
- for summary in tf.compat.v1.train.summary_iterator(filepath):
- for v in summary.summary.value:
- if v.tag in tags:
- if v.tag in data.keys():
- data[v.tag].append([summary.step, v.simple_value])
- else:
- data[v.tag] = [[summary.step, v.simple_value]]
- return data
-
-
-def add_eventdata(data, config, filepath, tag="val_metric_global_model"):
- event_data = read_eventfile(filepath, tags=[tag])
- assert len(event_data[tag]) > 0, f"No data for key {tag}"
-
- metric = []
- for e in event_data[tag]:
- # print(e)
- data["Config"].append(config)
- data["Epoch"].append(e[0])
- metric.append(e[1])
-
- metric = smooth(metric, weight)
- for entry in metric:
- data["Dice"].append(entry)
-
- print(f"added {len(event_data[tag])} entries for {tag}")
-
-
-def main():
- plt.figure()
- num_site = len(sites_fl)
- i = 1
- # add event files
-
- data = {"Config": [], "Epoch": [], "Dice": []}
-
- for site in sites_fl:
- # clear data for each site
- data = {"Config": [], "Epoch": [], "Dice": []}
- for config, exp in experiments.items():
- spec_site = exp.get("site", None)
- if spec_site is not None:
- record_path = os.path.join(
- client_results_root + config, "simulate_job", client_pre + spec_site, "events.*"
- )
- else:
- record_path = os.path.join(client_results_root + config, "simulate_job", client_pre + site, "events.*")
-
- eventfile = glob.glob(record_path, recursive=True)
- print(record_path, len(eventfile))
- assert len(eventfile) == 1, "No unique event file found!"
- eventfile = eventfile[0]
- print("adding", eventfile)
- add_eventdata(data, config, eventfile, tag=exp["tag"])
-
- ax = plt.subplot(1, num_site, i)
- ax.set_title(site)
- sns.lineplot(x="Epoch", y="Dice", hue="Config", data=data)
- # ax.set_xlim([0, 1000])
- i = i + 1
- plt.subplots_adjust(hspace=0.3)
- plt.show()
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/plot_tensorboard_events_poc.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/plot_tensorboard_events_poc.py
deleted file mode 100644
index f4b4ee55df..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/plot_tensorboard_events_poc.py
+++ /dev/null
@@ -1,125 +0,0 @@
-# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import glob
-import os
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-import tensorflow as tf
-
-# poc workspace
-client_results_root = "../workspace_brats"
-
-# All sites used the same validation set, so only 1 site's record is needed
-site_num = 1
-site_pre = "site-"
-
-# Central vs. FedAvg vs. FedAvg_DP
-experiments = {
- "brats_central": {"tag": "val_metric_global_model", "site": "All"},
- "brats_fedavg": {"tag": "val_metric_global_model"},
- "brats_fedavg_dp": {"tag": "val_metric_global_model"},
-}
-
-weight = 0.8
-
-
-def smooth(scalars, weight): # Weight between 0 and 1
- last = scalars[0] # First value in the plot (first timestep)
- smoothed = list()
- for point in scalars:
- smoothed_val = last * weight + (1 - weight) * point # Calculate smoothed value
- smoothed.append(smoothed_val) # Save it
- last = smoothed_val # Anchor the last smoothed value
- return smoothed
-
-
-def find_job_id(workdir, fl_app_name="prostate_central"):
- """Find the first matching experiment"""
- target_path = os.path.join(workdir, "*", "fl_app.txt")
- fl_app_files = glob.glob(target_path, recursive=True)
- assert len(fl_app_files) > 0, f"No `fl_app.txt` files found in workdir={workdir}."
- for fl_app_file in fl_app_files:
- with open(fl_app_file, "r") as f:
- _fl_app_name = f.read()
- if fl_app_name == _fl_app_name: # alpha will be matched based on value in config file
- job_id = os.path.basename(os.path.dirname(fl_app_file))
- return job_id
- raise ValueError(f"No job id found for fl_app_name={fl_app_name} in workdir={workdir}")
-
-
-def read_eventfile(filepath, tags=["val_metric_global_model"]):
- data = {}
- for summary in tf.compat.v1.train.summary_iterator(filepath):
- for v in summary.summary.value:
- if v.tag in tags:
- if v.tag in data.keys():
- data[v.tag].append([summary.step, v.simple_value])
- else:
- data[v.tag] = [[summary.step, v.simple_value]]
- return data
-
-
-def add_eventdata(data, config, filepath, tag="val_metric_global_model"):
- event_data = read_eventfile(filepath, tags=[tag])
- assert len(event_data[tag]) > 0, f"No data for key {tag}"
-
- metric = []
- for e in event_data[tag]:
- # print(e)
- data["Config"].append(config)
- data["Epoch"].append(e[0])
- metric.append(e[1])
-
- metric = smooth(metric, weight)
- for entry in metric:
- data["Dice"].append(entry)
-
- print(f"added {len(event_data[tag])} entries for {tag}")
-
-
-def main():
- plt.figure()
- i = 1
- # add event files
- data = {"Config": [], "Epoch": [], "Dice": []}
- for site in range(site_num):
- # clear data for each site
- site = site + 1
- data = {"Config": [], "Epoch": [], "Dice": []}
- for config, exp in experiments.items():
- job_id = find_job_id(workdir=client_results_root + "/site-1", fl_app_name=config)
- print(f"Found run {job_id} for {config}")
- spec_site = exp.get("site", None)
- if spec_site is not None:
- record_path = os.path.join(client_results_root, site_pre + spec_site, job_id, "*", "events.*")
- else:
- record_path = os.path.join(client_results_root, site_pre + str(site), job_id, "*", "events.*")
- eventfile = glob.glob(record_path, recursive=True)
- assert len(eventfile) == 1, "No unique event file found!"
- eventfile = eventfile[0]
- print("adding", eventfile)
- add_eventdata(data, config, eventfile, tag=exp["tag"])
-
- ax = plt.subplot(1, site_num, i)
- ax.set_title(site)
- sns.lineplot(x="Epoch", y="Dice", hue="Config", data=data)
- i = i + 1
- plt.subplots_adjust(hspace=0.3)
- plt.show()
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/testing_models_3d.sh b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/testing_models_3d.sh
deleted file mode 100755
index 0811a27e20..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/testing_models_3d.sh
+++ /dev/null
@@ -1,11 +0,0 @@
-workspace_path="../workspace_brats"
-dataset_path="../dataset_brats18/dataset"
-datalist_path="../dataset_brats18/datalist"
-
-echo "Centralized"
-python3 brats_3d_test_only.py --model_path "${workspace_path}/brats_central/simulate_job/app_server/best_FL_global_model.pt" --dataset_base_dir ${dataset_path} --datalist_json_path "${datalist_path}/site-All.json"
-echo "FedAvg"
-python3 brats_3d_test_only.py --model_path "${workspace_path}/brats_fedavg/simulate_job/app_server/best_FL_global_model.pt" --dataset_base_dir ${dataset_path} --datalist_json_path "${datalist_path}/site-All.json"
-echo "FedAvgDP"
-python3 brats_3d_test_only.py --model_path "${workspace_path}/brats_fedavg_dp/simulate_job/app_server/best_FL_global_model.pt" --dataset_base_dir ${dataset_path} --datalist_json_path "${datalist_path}/site-All.json"
-
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/testing_models_3d_poc.sh b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/testing_models_3d_poc.sh
deleted file mode 100755
index 05d4ba78da..0000000000
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.1_privacy_filter/result_stat/testing_models_3d_poc.sh
+++ /dev/null
@@ -1,11 +0,0 @@
-# Replace the job_ids with the ones from your workspace
-job_id_cen="b6c7e274-67f9-402e-8fcd-f54f3cea40a9"
-job_id_avg="9b05aa10-c79f-444c-b6b0-2ad7a3538e79"
-job_id_avg_dp="c98bdbaf-fdf6-4786-b905-b6a1ba7e398c"
-
-echo "Centralized"
-python3 brats_3d_test_only.py --model_path "${workspace_path}/server/transfer/${job_id_cen}/workspace/app_server/best_FL_global_model.pt" --dataset_base_dir ${dataset_path} --datalist_json_path "${datalist_path}/site-All.json"
-echo "FedAvg"
-python3 brats_3d_test_only.py --model_path "${workspace_path}/server/transfer/${job_id_avg}/workspace/app_server/best_FL_global_model.pt" --dataset_base_dir ${dataset_path} --datalist_json_path "${datalist_path}/site-All.json"
-echo "FedAvgDP"
-python3 brats_3d_test_only.py --model_path "${workspace_path}/server/transfer/${job_id_avg_dp}/workspace/app_server/best_FL_global_model.pt" --dataset_base_dir ${dataset_path} --datalist_json_path "${datalist_path}/site-All.json"
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/privacy_with_differential_privacy.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/privacy_with_differential_privacy.ipynb
index e52a52b58d..924fd7e69f 100644
--- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/privacy_with_differential_privacy.ipynb
+++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/privacy_with_differential_privacy.ipynb
@@ -1,9 +1,798 @@
{
"cells": [
+ {
+ "cell_type": "markdown",
+ "id": "1398ef0a-f189-4d04-a8a9-276a17ab0f8b",
+ "metadata": {},
+ "source": [
+ "# Federated Learning with Differential Privacy\n",
+ "\n",
+ "Please make sure you set up a virtual environment and follow [example root readme](../../README.md) before starting this notebook.\n",
+ "Then, install the requirements.\n",
+ "\n",
+ " NOTE Some of the cells below generate long text output. We're using
%%capture --no-display --no-stderr cell_output
to suppress this output. Comment or delete this line in the cells below to restore full output.
"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "5002e45c-f58e-4f68-bb5a-9626e084947f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%capture --no-display --no-stderr cell_output\n",
+ "import sys\n",
+ "!{sys.executable} -m pip install -r requirements.txt"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bddd90a1-fe96-4f24-b360-bbe73b24e34a",
+ "metadata": {},
+ "source": [
+ "### Differential Privacy (DP)\n",
+ "[Differential Privacy (DP)](https://arxiv.org/abs/1910.00962) [7] is a rigorous mathematical framework designed to provide strong privacy guarantees when handling sensitive data. In the context of Federated Learning (FL), DP plays a crucial role in safeguarding user information by introducing randomness into the training process. Specifically, it ensures privacy by adding carefully calibrated noise to the model updates—such as gradients or weights—before they are transmitted from clients to the central server. This obfuscation mechanism makes it statistically difficult to infer whether any individual data point contributed to a particular update, thereby protecting user-specific information.\n",
+ "\n",
+ "By integrating DP into FL, even if an adversary gains access to the aggregated updates or models, the added noise prevents them from accurately deducing sensitive details about any individual client's data. Common approaches include \n",
+ "\n",
+ "1. **local differential privacy (LDP)**, where noise is added directly on the client side before updates are sent\n",
+ "2. **global differential privacy (GDP)**, where noise is injected after aggregation at the server.\n",
+ "\n",
+ "The balance between privacy and model utility is typically managed through a privacy budget (ϵ), which quantifies the trade-off between the level of noise added and the resulting model accuracy.\n",
+ "\n",
+ "\n",
+ "This example shows the usage of a CIFAR-10 training code with NVFlare, as well as the usage of **local** DP filters in your FL training. Here, we use the \"Sparse Vector Technique\", i.e. the [SVTPrivacy](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.filters.svt_privacy.html) protocol, as utilized in [Li et al. 2019](https://arxiv.org/abs/1910.00962) [7] (see [Lyu et al. 2016](https://arxiv.org/abs/1603.01699) [8] for more information). \n",
+ "\n",
+ "DP is added as a filter using the [FedJob API](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html#fedjob-api) you should have seen in prior chapters."
+ ]
+ },
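To make that last point concrete, a minimal sketch of attaching the SVT-based DP filter with the FedJob API might look like the following. The argument values are illustrative placeholders rather than recommended settings, and the sketch assumes the `job` and `n_clients` variables defined in the cells below.

```python
from nvflare.app_common.filters.svt_privacy import SVTPrivacy
from nvflare.job_config.defs import FilterType

# Illustrative DP filter; the fraction/epsilon/noise_var/gamma/tau values are placeholders.
dp_filter = SVTPrivacy(fraction=0.1, epsilon=0.1, noise_var=0.1, gamma=1e-5, tau=1e-6)

# Apply the filter to each client's task result, i.e. the model update sent back to the server.
for i in range(n_clients):
    job.to(dp_filter, f"site-{i+1}", tasks=["train"], filter_type=FilterType.TASK_RESULT)
```

Because it is applied to the task result, the noise is added on the client side before the update leaves the site, matching the local DP description above.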
+ {
+ "cell_type": "markdown",
+ "id": "9b0c692a-16dc-4ef9-a432-4b7375a2a7d6",
+ "metadata": {},
+ "source": [
+ "## Run experiments with FL simulator\n",
+ "FL simulator is used to simulate FL experiments or debug codes, not for real FL deployment.\n",
+ "\n",
+ "First, train a model using the FedAvg algorithm with four clients without DP."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "64c3fe64-3915-4c6a-9bed-694d205b0940",
+ "metadata": {},
+ "source": [
+ "#### 0. Download the CIFAR-10 data\n",
+ "First, we download the CIFAR-10 dataset to avoid clients overwriting each other's local dataset during this simulation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "1609d2de-d033-45a1-b9fa-1ba311bd00e0",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Dataset CIFAR10\n",
+ " Number of datapoints: 50000\n",
+ " Root location: /tmp/nvflare/data\n",
+ " Split: Train"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import torchvision\n",
+ "DATASET_PATH = \"/tmp/nvflare/data\"\n",
+ "torchvision.datasets.CIFAR10(root=DATASET_PATH, train=True, download=True)\n",
+ "torchvision.datasets.CIFAR10(root=DATASET_PATH, train=False, download=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "85b05b1b-31d6-4e9a-a07f-90cf9fba37b7",
+ "metadata": {},
+ "source": [
+ "#### 1. Define a FedJob\n",
+ "The `FedJob` is used to define how controllers and executors are placed within a federated job using the `to(object, target)` routine.\n",
+ "\n",
+ "Here we use a PyTorch `BaseFedJob`, where we can define the job name and the initial global model.\n",
+ "The `BaseFedJob` automatically configures components for model persistence, model selection, and TensorBoard streaming for convenience."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "73b59f47-a0c5-4038-abf4-80aefc122c1b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from src.net import Net\n",
+ "\n",
+ "from nvflare.app_common.workflows.fedavg import FedAvg\n",
+ "from nvflare.app_opt.pt.job_config.base_fed_job import BaseFedJob\n",
+ "from nvflare.job_config.script_runner import ScriptRunner\n",
+ "\n",
+ "job = BaseFedJob(\n",
+ " name=\"cifar10_pt_fedavg\",\n",
+ " initial_model=Net(),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4e00fbca-4c8a-4e3f-b258-0f0601291aa4",
+ "metadata": {},
+ "source": [
+ "#### 2. Define the Controller Workflow\n",
+ "Define the controller workflow and send it to the server."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "3283f379-8d85-4a9d-9723-1ca926e10405",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "n_clients = 2\n",
+ "num_rounds = 5\n",
+ "\n",
+ "controller = FedAvg(\n",
+ " num_clients=n_clients,\n",
+ " num_rounds=num_rounds,\n",
+ ")\n",
+ "job.to(controller, \"server\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e43d8cd1-a6c3-476a-bb7b-3603f434d509",
+ "metadata": {},
+ "source": [
+ "That completes the components that need to be defined on the server."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "05a27daa-8d73-4bfb-9400-db9c33e24f7e",
+ "metadata": {},
+ "source": [
+ "#### 3. Add clients\n",
+ "Next, we can use the `ScriptRunner` and send it to each of the clients to run our training script.\n",
+ "\n",
+ "Note that our script could have additional input arguments, such as batch size or data path, but we don't use them here for simplicity."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "1d8bfa4b-307f-4880-abbc-6788abe0dc59",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for i in range(n_clients):\n",
+ " runner = ScriptRunner(\n",
+ " script=\"src/cifar10_fl.py\"\n",
+ " )\n",
+ " job.to(runner, f\"site-{i+1}\")"
+ ]
+ },
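+ {
+ "cell_type": "markdown",
+ "id": "script-runner-args-note",
+ "metadata": {},
+ "source": [
+ "If the training script took command-line arguments, they could be passed through the runner's `script_args` parameter. The sketch below uses assumed argument names; our `cifar10_fl.py` script does not actually parse them:\n",
+ "\n",
+ "```python\n",
+ "# Sketch only: forward hypothetical CLI arguments to the client training script.\n",
+ "runner = ScriptRunner(\n",
+ "    script=\"src/cifar10_fl.py\",\n",
+ "    script_args=\"--batch_size 32 --data_path /tmp/nvflare/data\",\n",
+ ")\n",
+ "```"
+ ]
+ },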
+ {
+ "cell_type": "markdown",
+ "id": "f4a13662-97bf-444b-bfcd-d70f67fc7e80",
+ "metadata": {},
+ "source": [
+ "That's it!\n",
+ "\n",
+ "#### 4. Optionally export the job\n",
+ "Now, we could export the job and submit it to a real NVFlare deployment using the [Admin client](https://nvflare.readthedocs.io/en/main/real_world_fl/operation.html) or [FLARE API](https://nvflare.readthedocs.io/en/main/real_world_fl/flare_api.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "9f8e3d58-a5bc-44f2-997e-2f653b36739e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "job.export_job(\"job_configs\")"
+ ]
+ },
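+ {
+ "cell_type": "markdown",
+ "id": "flare-api-submit-note",
+ "metadata": {},
+ "source": [
+ "With a provisioned deployment, the exported job could then be submitted through the FLARE API roughly as follows (a sketch; the admin user name and startup-kit path are placeholders for your own deployment):\n",
+ "\n",
+ "```python\n",
+ "from nvflare.fuel.flare_api.flare_api import new_secure_session\n",
+ "\n",
+ "# Placeholders: use the admin identity and startup kit of your own deployment.\n",
+ "sess = new_secure_session(\"admin@nvidia.com\", \"/path/to/admin/startup/kit\")\n",
+ "job_id = sess.submit_job(\"job_configs/cifar10_pt_fedavg\")\n",
+ "print(job_id)\n",
+ "```"
+ ]
+ },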
+ {
+ "cell_type": "markdown",
+ "id": "0f5135c9-13c2-4e3c-8e53-4047bccf42ec",
+ "metadata": {},
+ "source": [
+ "#### 5. Run FL Simulation\n",
+ "Finally, we can run our FedJob in simulation using NVFlare's [simulator](https://nvflare.readthedocs.io/en/main/user_guide/nvflare_cli/fl_simulator.html) under the hood. We can also specify which GPU should be used to run the clients, which is helpful for simulated environments. Here, we run all clients on the same GPU (tested with an NVIDIA A6000 GPU with 48 GB memory).\n",
+ "\n",
+ "The results will be saved in the specified `workdir`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4ebc6f36-4d0c-41ac-8540-32dac533043a",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2025-02-24 19:08:50,030 - SimulatorRunner - INFO - Create the Simulator Server.\n",
+ "2025-02-24 19:08:50,033 - CoreCell - INFO - server: creating listener on tcp://0:51787\n",
+ "2025-02-24 19:08:50,054 - CoreCell - INFO - server: created backbone external listener for tcp://0:51787\n",
+ "2025-02-24 19:08:50,054 - ConnectorManager - INFO - 3593243: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}\n",
+ "2025-02-24 19:08:50,055 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:10719] is starting\n",
+ "2025-02-24 19:08:50,555 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:10719\n",
+ "2025-02-24 19:08:50,555 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:51787] is starting\n",
+ "2025-02-24 19:08:50,557 - SimulatorServer - INFO - max_reg_duration=60.0\n",
+ "2025-02-24 19:08:50,638 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 49271\n",
+ "2025-02-24 19:08:50,638 - SimulatorRunner - INFO - Deploy the Apps.\n",
+ "2025-02-24 19:08:50,642 - SimulatorRunner - INFO - Create the simulate clients.\n",
+ "2025-02-24 19:08:50,644 - Communicator - INFO - Trying to register with server ...\n",
+ "2025-02-24 19:08:50,645 - ClientManager - INFO - authenticated client site-1\n",
+ "2025-02-24 19:08:50,645 - ClientManager - INFO - Client: New client site-1@192.168.1.203 joined. Sent token: 7423b93b-60c1-4566-af34-ded1e3541d75. Total clients: 1\n",
+ "2025-02-24 19:08:50,645 - Communicator - INFO - register RC: ok\n",
+ "2025-02-24 19:08:50,645 - FederatedClient - INFO - Successfully registered client:site-1 for project simulator_server. Token:7423b93b-60c1-4566-af34-ded1e3541d75 SSID:\n",
+ "2025-02-24 19:08:50,646 - Communicator - INFO - Trying to register with server ...\n",
+ "2025-02-24 19:08:50,647 - ClientManager - INFO - authenticated client site-2\n",
+ "2025-02-24 19:08:50,647 - ClientManager - INFO - Client: New client site-2@192.168.1.203 joined. Sent token: 1054b060-6584-40c0-993c-b21760a7e5f8. Total clients: 2\n",
+ "2025-02-24 19:08:50,647 - Communicator - INFO - register RC: ok\n",
+ "2025-02-24 19:08:50,647 - FederatedClient - INFO - Successfully registered client:site-2 for project simulator_server. Token:1054b060-6584-40c0-993c-b21760a7e5f8 SSID:\n",
+ "2025-02-24 19:08:50,647 - SimulatorRunner - INFO - Set the client status ready.\n",
+ "2025-02-24 19:08:50,647 - SimulatorRunner - INFO - Deploy and start the Server App.\n",
+ "2025-02-24 19:08:50,648 - Cell - INFO - Register blob CB for channel='server_command', topic='*'\n",
+ "2025-02-24 19:08:50,648 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'\n",
+ "2025-02-24 19:08:50,648 - ServerCommandAgent - INFO - ServerCommandAgent cell register_request_cb: server.simulate_job\n",
+ "2025-02-24 19:08:50,653 - IntimeModelSelector - INFO - model selection weights control: {}\n",
+ "2025-02-24 19:08:51,838 - AuxRunner - INFO - registered aux handler for topic __sync_runner__\n",
+ "2025-02-24 19:08:51,838 - AuxRunner - INFO - registered aux handler for topic __job_heartbeat__\n",
+ "2025-02-24 19:08:51,839 - AuxRunner - INFO - registered aux handler for topic __task_check__\n",
+ "2025-02-24 19:08:51,839 - AuxRunner - INFO - registered aux handler for topic RM.RELIABLE_REQUEST\n",
+ "2025-02-24 19:08:51,839 - AuxRunner - INFO - registered aux handler for topic RM.RELIABLE_REPLY\n",
+ "2025-02-24 19:08:51,840 - ReliableMessage - INFO - enabled reliable message: max_request_workers=20 query_interval=2.0\n",
+ "2025-02-24 19:08:51,840 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job]: Server runner starting ...\n",
+ "2025-02-24 19:08:51,840 - TBAnalyticsReceiver - INFO - [identity=simulator_server, run=simulate_job]: Tensorboard records can be found in /tmp/nvflare/cifar10_pt_fedavg/server/simulate_job/tb_events you can view it using `tensorboard --logdir=/tmp/nvflare/cifar10_pt_fedavg/server/simulate_job/tb_events`\n",
+ "2025-02-24 19:08:51,840 - AuxRunner - INFO - registered aux handler for topic fed.event\n",
+ "2025-02-24 19:08:51,840 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job]: starting workflow controller () ...\n",
+ "2025-02-24 19:08:51,841 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: Initializing BaseModelController workflow.\n",
+ "2025-02-24 19:08:51,841 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: Workflow controller () started\n",
+ "2025-02-24 19:08:51,841 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: Beginning model controller run.\n",
+ "2025-02-24 19:08:51,841 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: Start FedAvg.\n",
+ "2025-02-24 19:08:51,841 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: loading initial model from persistor\n",
+ "2025-02-24 19:08:51,841 - PTFileModelPersistor - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: Both source_ckpt_file_full_name and ckpt_preload_path are not provided. Using the default model weights initialized on the persistor side.\n",
+ "2025-02-24 19:08:51,841 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: Round 0 started.\n",
+ "2025-02-24 19:08:51,841 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: Sampled clients: ['site-1', 'site-2']\n",
+ "2025-02-24 19:08:51,841 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: Sending task train to ['site-1', 'site-2']\n",
+ "2025-02-24 19:08:51,842 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: scheduled task train\n",
+ "2025-02-24 19:08:52,703 - SimulatorClientRunner - INFO - Start the clients run simulation.\n",
+ "2025-02-24 19:08:53,705 - SimulatorClientRunner - INFO - Simulate Run client: site-1 on GPU group: 0\n",
+ "2025-02-24 19:08:53,705 - SimulatorClientRunner - INFO - Simulate Run client: site-2 on GPU group: 0\n",
+ "2025-02-24 19:08:54,735 - ClientTaskWorker - INFO - ClientTaskWorker started to run\n",
+ "2025-02-24 19:08:54,749 - ClientTaskWorker - INFO - ClientTaskWorker started to run\n",
+ "2025-02-24 19:08:54,811 - CoreCell - INFO - site-1.simulate_job: created backbone external connector to tcp://localhost:51787\n",
+ "2025-02-24 19:08:54,811 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:51787] is starting\n",
+ "2025-02-24 19:08:54,811 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:35308 => 127.0.0.1:51787] is created: PID: 3593265\n",
+ "2025-02-24 19:08:54,813 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00005 127.0.0.1:51787 <= 127.0.0.1:35308] is created: PID: 3593243\n",
+ "2025-02-24 19:08:54,818 - CoreCell - INFO - site-2.simulate_job: created backbone external connector to tcp://localhost:51787\n",
+ "2025-02-24 19:08:54,818 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:51787] is starting\n",
+ "2025-02-24 19:08:54,819 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:35322 => 127.0.0.1:51787] is created: PID: 3593266\n",
+ "2025-02-24 19:08:54,819 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00006 127.0.0.1:51787 <= 127.0.0.1:35322] is created: PID: 3593243\n",
+ "2025-02-24 19:08:56,629 - AuxRunner - INFO - registered aux handler for topic __end_run__\n",
+ "2025-02-24 19:08:56,629 - AuxRunner - INFO - registered aux handler for topic __end_run__\n",
+ "2025-02-24 19:08:56,630 - AuxRunner - INFO - registered aux handler for topic __do_task__\n",
+ "2025-02-24 19:08:56,630 - AuxRunner - INFO - registered aux handler for topic __do_task__\n",
+ "2025-02-24 19:08:56,630 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'\n",
+ "2025-02-24 19:08:56,630 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'\n",
+ "2025-02-24 19:08:57,135 - Cell - INFO - broadcast: channel='aux_communication', topic='__sync_runner__', targets=['server.simulate_job'], timeout=2.0\n",
+ "2025-02-24 19:08:57,136 - Cell - INFO - broadcast: channel='aux_communication', topic='__sync_runner__', targets=['server.simulate_job'], timeout=2.0\n",
+ "2025-02-24 19:08:57,149 - ClientRunner - INFO - [identity=site-2, run=simulate_job]: synced to Server Runner in 0.5139973163604736 seconds\n",
+ "2025-02-24 19:08:57,149 - AuxRunner - INFO - registered aux handler for topic RM.RELIABLE_REQUEST\n",
+ "2025-02-24 19:08:57,149 - AuxRunner - INFO - registered aux handler for topic RM.RELIABLE_REPLY\n",
+ "2025-02-24 19:08:57,149 - ReliableMessage - INFO - enabled reliable message: max_request_workers=20 query_interval=2.0\n",
+ "2025-02-24 19:08:57,150 - TaskScriptRunner - INFO - start task run() with full path: /tmp/nvflare/cifar10_pt_fedavg/site-2/simulate_job/app_site-2/custom/src/cifar10_fl.py\n",
+ "2025-02-24 19:08:57,151 - AuxRunner - INFO - registered aux handler for topic fed.event\n",
+ "2025-02-24 19:08:57,152 - ClientRunner - INFO - [identity=site-2, run=simulate_job]: client runner started\n",
+ "2025-02-24 19:08:57,152 - ClientTaskWorker - INFO - Initialize ClientRunner for client: site-2\n",
+ "2025-02-24 19:08:57,152 - ClientRunner - INFO - [identity=site-1, run=simulate_job]: synced to Server Runner in 0.5166869163513184 seconds\n",
+ "2025-02-24 19:08:57,152 - AuxRunner - INFO - registered aux handler for topic RM.RELIABLE_REQUEST\n",
+ "2025-02-24 19:08:57,153 - AuxRunner - INFO - registered aux handler for topic RM.RELIABLE_REPLY\n",
+ "2025-02-24 19:08:57,153 - ReliableMessage - INFO - enabled reliable message: max_request_workers=20 query_interval=2.0\n",
+ "2025-02-24 19:08:57,154 - TaskScriptRunner - INFO - start task run() with full path: /tmp/nvflare/cifar10_pt_fedavg/site-1/simulate_job/app_site-1/custom/src/cifar10_fl.py\n",
+ "2025-02-24 19:08:57,157 - AuxRunner - INFO - registered aux handler for topic fed.event\n",
+ "2025-02-24 19:08:57,157 - ClientRunner - INFO - [identity=site-1, run=simulate_job]: client runner started\n",
+ "2025-02-24 19:08:57,157 - ClientTaskWorker - INFO - Initialize ClientRunner for client: site-1\n",
+ "2025-02-24 19:08:57,161 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: assigned task to client site-1: name=train, id=8a2f7d61-32b7-4171-8a15-337ef647b7a9\n",
+ "2025-02-24 19:08:57,161 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: sent task assignment to client. client_name:site-1 task_id:8a2f7d61-32b7-4171-8a15-337ef647b7a9\n",
+ "2025-02-24 19:08:57,162 - GetTaskCommand - INFO - return task to client. client_name: site-1 task_name: train task_id: 8a2f7d61-32b7-4171-8a15-337ef647b7a9 sharable_header_task_id: 8a2f7d61-32b7-4171-8a15-337ef647b7a9\n",
+ "2025-02-24 19:08:57,163 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: assigned task to client site-2: name=train, id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff\n",
+ "2025-02-24 19:08:57,166 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: sent task assignment to client. client_name:site-2 task_id:9fb8f6f3-500f-4e60-89ed-486fa85e41ff\n",
+ "2025-02-24 19:08:57,166 - GetTaskCommand - INFO - return task to client. client_name: site-2 task_name: train task_id: 9fb8f6f3-500f-4e60-89ed-486fa85e41ff sharable_header_task_id: 9fb8f6f3-500f-4e60-89ed-486fa85e41ff\n",
+ "2025-02-24 19:08:57,176 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251471 Bytes) time: 0.023582 seconds\n",
+ "2025-02-24 19:08:57,176 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:08:57,176 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff\n",
+ "2025-02-24 19:08:57,176 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:08:57,177 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: execute for task (train)\n",
+ "2025-02-24 19:08:57,177 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251471 Bytes) time: 0.019749 seconds\n",
+ "2025-02-24 19:08:57,177 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:08:57,177 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=8a2f7d61-32b7-4171-8a15-337ef647b7a9\n",
+ "2025-02-24 19:08:57,178 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:08:57,178 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: execute for task (train)\n",
+ "2025-02-24 19:08:57,178 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: send data to peer\n",
+ "2025-02-24 19:08:57,178 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: send data to peer\n",
+ "2025-02-24 19:08:57,179 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: sending payload to peer\n",
+ "2025-02-24 19:08:57,183 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: sending payload to peer\n",
+ "2025-02-24 19:08:57,183 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: Waiting for result from peer\n",
+ "2025-02-24 19:08:57,188 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: Waiting for result from peer\n",
+ "2025-02-24 19:09:00,077 - nvflare.app_common.executors.task_script_runner - INFO - current_round=0\n",
+ "2025-02-24 19:09:00,077 - nvflare.app_common.executors.task_script_runner - INFO - current_round=0\n",
+ "2025-02-24 19:09:09,812 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 2.205\n",
+ "2025-02-24 19:09:09,933 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 2.213\n",
+ "2025-02-24 19:09:19,046 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 1.836\n",
+ "2025-02-24 19:09:19,373 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 1.839\n",
+ "2025-02-24 19:09:28,238 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 1.702\n",
+ "2025-02-24 19:09:28,424 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 1.674\n",
+ "2025-02-24 19:09:37,353 - nvflare.app_common.executors.task_script_runner - INFO - [1, 8000] loss: 1.596\n",
+ "2025-02-24 19:09:37,429 - nvflare.app_common.executors.task_script_runner - INFO - [1, 8000] loss: 1.554\n",
+ "2025-02-24 19:09:46,599 - nvflare.app_common.executors.task_script_runner - INFO - [1, 10000] loss: 1.494\n",
+ "2025-02-24 19:09:46,676 - nvflare.app_common.executors.task_script_runner - INFO - [1, 10000] loss: 1.515\n",
+ "2025-02-24 19:09:55,702 - nvflare.app_common.executors.task_script_runner - INFO - [1, 12000] loss: 1.457\n",
+ "2025-02-24 19:09:56,237 - nvflare.app_common.executors.task_script_runner - INFO - [1, 12000] loss: 1.464\n",
+ "2025-02-24 19:10:07,486 - nvflare.app_common.executors.task_script_runner - INFO - [2, 2000] loss: 1.378\n",
+ "2025-02-24 19:10:08,068 - nvflare.app_common.executors.task_script_runner - INFO - [2, 2000] loss: 1.379\n",
+ "2025-02-24 19:10:16,561 - nvflare.app_common.executors.task_script_runner - INFO - [2, 4000] loss: 1.366\n",
+ "2025-02-24 19:10:17,116 - nvflare.app_common.executors.task_script_runner - INFO - [2, 4000] loss: 1.331\n",
+ "2025-02-24 19:10:25,815 - nvflare.app_common.executors.task_script_runner - INFO - [2, 6000] loss: 1.343\n",
+ "2025-02-24 19:10:26,357 - nvflare.app_common.executors.task_script_runner - INFO - [2, 6000] loss: 1.311\n",
+ "2025-02-24 19:10:34,679 - nvflare.app_common.executors.task_script_runner - INFO - [2, 8000] loss: 1.296\n",
+ "2025-02-24 19:10:35,449 - nvflare.app_common.executors.task_script_runner - INFO - [2, 8000] loss: 1.321\n",
+ "2025-02-24 19:10:43,891 - nvflare.app_common.executors.task_script_runner - INFO - [2, 10000] loss: 1.292\n",
+ "2025-02-24 19:10:44,582 - nvflare.app_common.executors.task_script_runner - INFO - [2, 10000] loss: 1.294\n",
+ "2025-02-24 19:10:53,105 - nvflare.app_common.executors.task_script_runner - INFO - [2, 12000] loss: 1.260\n",
+ "2025-02-24 19:10:53,779 - nvflare.app_common.executors.task_script_runner - INFO - [2, 12000] loss: 1.258\n",
+ "2025-02-24 19:10:55,521 - nvflare.app_common.executors.task_script_runner - INFO - Finished Training\n",
+ "2025-02-24 19:10:56,136 - nvflare.app_common.executors.task_script_runner - INFO - Finished Training\n",
+ "2025-02-24 19:11:03,896 - nvflare.app_common.executors.task_script_runner - INFO - Accuracy of the network on the 10000 test images: 8 %\n",
+ "2025-02-24 19:11:03,900 - InProcessClientAPI - INFO - Try to send local model back to peer \n",
+ "2025-02-24 19:11:04,206 - nvflare.app_common.executors.task_script_runner - INFO - Accuracy of the network on the 10000 test images: 8 %\n",
+ "2025-02-24 19:11:04,209 - InProcessClientAPI - INFO - Try to send local model back to peer \n",
+ "2025-02-24 19:11:04,275 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: finished processing task\n",
+ "2025-02-24 19:11:04,277 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: try #1: sending task result to server\n",
+ "2025-02-24 19:11:04,277 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: checking task ...\n",
+ "2025-02-24 19:11:04,277 - Cell - INFO - broadcast: channel='aux_communication', topic='__task_check__', targets=['server.simulate_job'], timeout=5.0\n",
+ "2025-02-24 19:11:04,284 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: start to send task result to server\n",
+ "2025-02-24 19:11:04,285 - FederatedClient - INFO - Starting to push execute result.\n",
+ "2025-02-24 19:11:04,291 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job]: got result from client site-2 for task: name=train, id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff\n",
+ "2025-02-24 19:11:04,301 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: finished processing task\n",
+ "2025-02-24 19:11:04,302 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: try #1: sending task result to server\n",
+ "2025-02-24 19:11:04,302 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: checking task ...\n",
+ "2025-02-24 19:11:04,302 - Cell - INFO - broadcast: channel='aux_communication', topic='__task_check__', targets=['server.simulate_job'], timeout=5.0\n",
+ "2025-02-24 19:11:04,370 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: finished processing client result by controller\n",
+ "2025-02-24 19:11:04,371 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-2 task_id:9fb8f6f3-500f-4e60-89ed-486fa85e41ff\n",
+ "2025-02-24 19:11:04,374 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: start to send task result to server\n",
+ "2025-02-24 19:11:04,375 - FederatedClient - INFO - Starting to push execute result.\n",
+ "2025-02-24 19:11:04,375 - Communicator - INFO - SubmitUpdate size: 251.4KB (251449 Bytes). time: 0.090221 seconds\n",
+ "2025-02-24 19:11:04,375 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9fb8f6f3-500f-4e60-89ed-486fa85e41ff]: task result sent to server\n",
+ "2025-02-24 19:11:04,376 - ClientTaskWorker - INFO - Finished one task run for client: site-2 interval: 2 task_processed: True\n",
+ "2025-02-24 19:11:04,381 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job]: got result from client site-1 for task: name=train, id=8a2f7d61-32b7-4171-8a15-337ef647b7a9\n",
+ "2025-02-24 19:11:04,446 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: finished processing client result by controller\n",
+ "2025-02-24 19:11:04,446 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: task train exit with status TaskCompletionStatus.OK\n",
+ "2025-02-24 19:11:04,446 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-1 task_id:8a2f7d61-32b7-4171-8a15-337ef647b7a9\n",
+ "2025-02-24 19:11:04,448 - Communicator - INFO - SubmitUpdate size: 251.4KB (251449 Bytes). time: 0.073337 seconds\n",
+ "2025-02-24 19:11:04,449 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: task result sent to server\n",
+ "2025-02-24 19:11:04,449 - ClientTaskWorker - INFO - Finished one task run for client: site-1 interval: 2 task_processed: True\n",
+ "2025-02-24 19:11:04,647 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: aggregating 2 update(s) at round 0\n",
+ "2025-02-24 19:11:04,649 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: Start persist model on server.\n",
+ "2025-02-24 19:11:04,653 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: End persist model on server.\n",
+ "2025-02-24 19:11:04,653 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: Round 1 started.\n",
+ "2025-02-24 19:11:04,653 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: Sampled clients: ['site-1', 'site-2']\n",
+ "2025-02-24 19:11:04,654 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: Sending task train to ['site-1', 'site-2']\n",
+ "2025-02-24 19:11:04,654 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=8a2f7d61-32b7-4171-8a15-337ef647b7a9]: scheduled task train\n",
+ "2025-02-24 19:11:06,381 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: assigned task to client site-2: name=train, id=9add7403-38d0-4d78-8519-aef6b8cbf33e\n",
+ "2025-02-24 19:11:06,382 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: sent task assignment to client. client_name:site-2 task_id:9add7403-38d0-4d78-8519-aef6b8cbf33e\n",
+ "2025-02-24 19:11:06,382 - GetTaskCommand - INFO - return task to client. client_name: site-2 task_name: train task_id: 9add7403-38d0-4d78-8519-aef6b8cbf33e sharable_header_task_id: 9add7403-38d0-4d78-8519-aef6b8cbf33e\n",
+ "2025-02-24 19:11:06,388 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251536 Bytes) time: 0.010425 seconds\n",
+ "2025-02-24 19:11:06,388 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:11:06,388 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=9add7403-38d0-4d78-8519-aef6b8cbf33e\n",
+ "2025-02-24 19:11:06,388 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:11:06,388 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: execute for task (train)\n",
+ "2025-02-24 19:11:06,389 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: send data to peer\n",
+ "2025-02-24 19:11:06,389 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: sending payload to peer\n",
+ "2025-02-24 19:11:06,389 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: Waiting for result from peer\n",
+ "2025-02-24 19:11:06,403 - nvflare.app_common.executors.task_script_runner - INFO - current_round=1\n",
+ "2025-02-24 19:11:06,454 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: assigned task to client site-1: name=train, id=0dbaa782-7d3e-43e9-9a6b-b544e9756505\n",
+ "2025-02-24 19:11:06,454 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: sent task assignment to client. client_name:site-1 task_id:0dbaa782-7d3e-43e9-9a6b-b544e9756505\n",
+ "2025-02-24 19:11:06,455 - GetTaskCommand - INFO - return task to client. client_name: site-1 task_name: train task_id: 0dbaa782-7d3e-43e9-9a6b-b544e9756505 sharable_header_task_id: 0dbaa782-7d3e-43e9-9a6b-b544e9756505\n",
+ "2025-02-24 19:11:06,464 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251536 Bytes) time: 0.013328 seconds\n",
+ "2025-02-24 19:11:06,464 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:11:06,464 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=0dbaa782-7d3e-43e9-9a6b-b544e9756505\n",
+ "2025-02-24 19:11:06,464 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:11:06,464 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: execute for task (train)\n",
+ "2025-02-24 19:11:06,467 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: send data to peer\n",
+ "2025-02-24 19:11:06,467 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: sending payload to peer\n",
+ "2025-02-24 19:11:06,468 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: Waiting for result from peer\n",
+ "2025-02-24 19:11:06,711 - nvflare.app_common.executors.task_script_runner - INFO - current_round=1\n",
+ "2025-02-24 19:11:15,732 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 1.211\n",
+ "2025-02-24 19:11:16,137 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 1.218\n",
+ "2025-02-24 19:11:24,739 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 1.205\n",
+ "2025-02-24 19:11:25,382 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 1.214\n",
+ "2025-02-24 19:11:34,008 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 1.203\n",
+ "2025-02-24 19:11:34,680 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 1.192\n",
+ "2025-02-24 19:11:43,449 - nvflare.app_common.executors.task_script_runner - INFO - [1, 8000] loss: 1.183\n",
+ "2025-02-24 19:11:43,918 - nvflare.app_common.executors.task_script_runner - INFO - [1, 8000] loss: 1.203\n",
+ "2025-02-24 19:11:52,674 - nvflare.app_common.executors.task_script_runner - INFO - [1, 10000] loss: 1.207\n",
+ "2025-02-24 19:11:53,265 - nvflare.app_common.executors.task_script_runner - INFO - [1, 10000] loss: 1.182\n",
+ "2025-02-24 19:12:01,900 - nvflare.app_common.executors.task_script_runner - INFO - [1, 12000] loss: 1.170\n",
+ "2025-02-24 19:12:02,417 - nvflare.app_common.executors.task_script_runner - INFO - [1, 12000] loss: 1.175\n",
+ "2025-02-24 19:12:13,777 - nvflare.app_common.executors.task_script_runner - INFO - [2, 2000] loss: 1.101\n",
+ "2025-02-24 19:12:13,933 - nvflare.app_common.executors.task_script_runner - INFO - [2, 2000] loss: 1.097\n",
+ "2025-02-24 19:12:23,017 - nvflare.app_common.executors.task_script_runner - INFO - [2, 4000] loss: 1.101\n",
+ "2025-02-24 19:12:23,040 - nvflare.app_common.executors.task_script_runner - INFO - [2, 4000] loss: 1.084\n",
+ "2025-02-24 19:12:32,259 - nvflare.app_common.executors.task_script_runner - INFO - [2, 6000] loss: 1.095\n",
+ "2025-02-24 19:12:32,275 - nvflare.app_common.executors.task_script_runner - INFO - [2, 6000] loss: 1.101\n",
+ "2025-02-24 19:12:41,397 - nvflare.app_common.executors.task_script_runner - INFO - [2, 8000] loss: 1.095\n",
+ "2025-02-24 19:12:41,454 - nvflare.app_common.executors.task_script_runner - INFO - [2, 8000] loss: 1.102\n",
+ "2025-02-24 19:12:50,551 - nvflare.app_common.executors.task_script_runner - INFO - [2, 10000] loss: 1.084\n",
+ "2025-02-24 19:12:50,599 - nvflare.app_common.executors.task_script_runner - INFO - [2, 10000] loss: 1.082\n",
+ "2025-02-24 19:12:59,667 - nvflare.app_common.executors.task_script_runner - INFO - [2, 12000] loss: 1.106\n",
+ "2025-02-24 19:12:59,724 - nvflare.app_common.executors.task_script_runner - INFO - [2, 12000] loss: 1.120\n",
+ "2025-02-24 19:13:02,077 - nvflare.app_common.executors.task_script_runner - INFO - Finished Training\n",
+ "2025-02-24 19:13:02,146 - nvflare.app_common.executors.task_script_runner - INFO - Finished Training\n",
+ "2025-02-24 19:13:10,301 - nvflare.app_common.executors.task_script_runner - INFO - Accuracy of the network on the 10000 test images: 57 %\n",
+ "2025-02-24 19:13:10,304 - InProcessClientAPI - INFO - Try to send local model back to peer \n",
+ "2025-02-24 19:13:10,316 - nvflare.app_common.executors.task_script_runner - INFO - Accuracy of the network on the 10000 test images: 57 %\n",
+ "2025-02-24 19:13:10,319 - InProcessClientAPI - INFO - Try to send local model back to peer \n",
+ "2025-02-24 19:13:10,490 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: finished processing task\n",
+ "2025-02-24 19:13:10,491 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: try #1: sending task result to server\n",
+ "2025-02-24 19:13:10,491 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: checking task ...\n",
+ "2025-02-24 19:13:10,491 - Cell - INFO - broadcast: channel='aux_communication', topic='__task_check__', targets=['server.simulate_job'], timeout=5.0\n",
+ "2025-02-24 19:13:10,497 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: start to send task result to server\n",
+ "2025-02-24 19:13:10,497 - FederatedClient - INFO - Starting to push execute result.\n",
+ "2025-02-24 19:13:10,502 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job]: got result from client site-2 for task: name=train, id=9add7403-38d0-4d78-8519-aef6b8cbf33e\n",
+ "2025-02-24 19:13:10,503 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: validation metric 57 from client site-2\n",
+ "2025-02-24 19:13:10,547 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: finished processing task\n",
+ "2025-02-24 19:13:10,547 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: try #1: sending task result to server\n",
+ "2025-02-24 19:13:10,547 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: checking task ...\n",
+ "2025-02-24 19:13:10,548 - Cell - INFO - broadcast: channel='aux_communication', topic='__task_check__', targets=['server.simulate_job'], timeout=5.0\n",
+ "2025-02-24 19:13:10,584 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: finished processing client result by controller\n",
+ "2025-02-24 19:13:10,585 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-2 task_id:9add7403-38d0-4d78-8519-aef6b8cbf33e\n",
+ "2025-02-24 19:13:10,587 - Communicator - INFO - SubmitUpdate size: 251.4KB (251449 Bytes). time: 0.089770 seconds\n",
+ "2025-02-24 19:13:10,587 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=9add7403-38d0-4d78-8519-aef6b8cbf33e]: task result sent to server\n",
+ "2025-02-24 19:13:10,587 - ClientTaskWorker - INFO - Finished one task run for client: site-2 interval: 2 task_processed: True\n",
+ "2025-02-24 19:13:10,588 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: start to send task result to server\n",
+ "2025-02-24 19:13:10,588 - FederatedClient - INFO - Starting to push execute result.\n",
+ "2025-02-24 19:13:10,592 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job]: got result from client site-1 for task: name=train, id=0dbaa782-7d3e-43e9-9a6b-b544e9756505\n",
+ "2025-02-24 19:13:10,593 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: validation metric 57 from client site-1\n",
+ "2025-02-24 19:13:10,660 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: finished processing client result by controller\n",
+ "2025-02-24 19:13:10,661 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-1 task_id:0dbaa782-7d3e-43e9-9a6b-b544e9756505\n",
+ "2025-02-24 19:13:10,663 - Communicator - INFO - SubmitUpdate size: 251.4KB (251449 Bytes). time: 0.074279 seconds\n",
+ "2025-02-24 19:13:10,663 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: task result sent to server\n",
+ "2025-02-24 19:13:10,663 - ClientTaskWorker - INFO - Finished one task run for client: site-1 interval: 2 task_processed: True\n",
+ "2025-02-24 19:13:10,790 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: task train exit with status TaskCompletionStatus.OK\n",
+ "2025-02-24 19:13:10,792 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: new best validation metric at round 1: 57.0\n",
+ "2025-02-24 19:13:10,794 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: aggregating 2 update(s) at round 1\n",
+ "2025-02-24 19:13:10,795 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: Start persist model on server.\n",
+ "2025-02-24 19:13:10,797 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: End persist model on server.\n",
+ "2025-02-24 19:13:10,797 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: Round 2 started.\n",
+ "2025-02-24 19:13:10,798 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: Sampled clients: ['site-1', 'site-2']\n",
+ "2025-02-24 19:13:10,798 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: Sending task train to ['site-1', 'site-2']\n",
+ "2025-02-24 19:13:10,798 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=0dbaa782-7d3e-43e9-9a6b-b544e9756505]: scheduled task train\n",
+ "2025-02-24 19:13:12,593 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: assigned task to client site-2: name=train, id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999\n",
+ "2025-02-24 19:13:12,594 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: sent task assignment to client. client_name:site-2 task_id:05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999\n",
+ "2025-02-24 19:13:12,594 - GetTaskCommand - INFO - return task to client. client_name: site-2 task_name: train task_id: 05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999 sharable_header_task_id: 05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999\n",
+ "2025-02-24 19:13:12,600 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251536 Bytes) time: 0.011558 seconds\n",
+ "2025-02-24 19:13:12,600 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:13:12,600 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999\n",
+ "2025-02-24 19:13:12,601 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:13:12,601 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: execute for task (train)\n",
+ "2025-02-24 19:13:12,601 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: send data to peer\n",
+ "2025-02-24 19:13:12,601 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: sending payload to peer\n",
+ "2025-02-24 19:13:12,601 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: Waiting for result from peer\n",
+ "2025-02-24 19:13:12,668 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: assigned task to client site-1: name=train, id=eb82f454-28bc-416d-9160-2b6dba551dbf\n",
+ "2025-02-24 19:13:12,668 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: sent task assignment to client. client_name:site-1 task_id:eb82f454-28bc-416d-9160-2b6dba551dbf\n",
+ "2025-02-24 19:13:12,669 - GetTaskCommand - INFO - return task to client. client_name: site-1 task_name: train task_id: eb82f454-28bc-416d-9160-2b6dba551dbf sharable_header_task_id: eb82f454-28bc-416d-9160-2b6dba551dbf\n",
+ "2025-02-24 19:13:12,673 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251536 Bytes) time: 0.008248 seconds\n",
+ "2025-02-24 19:13:12,673 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:13:12,673 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=eb82f454-28bc-416d-9160-2b6dba551dbf\n",
+ "2025-02-24 19:13:12,674 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:13:12,674 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: execute for task (train)\n",
+ "2025-02-24 19:13:12,674 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: send data to peer\n",
+ "2025-02-24 19:13:12,674 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: sending payload to peer\n",
+ "2025-02-24 19:13:12,675 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: Waiting for result from peer\n",
+ "2025-02-24 19:13:12,806 - nvflare.app_common.executors.task_script_runner - INFO - current_round=2\n",
+ "2025-02-24 19:13:12,821 - nvflare.app_common.executors.task_script_runner - INFO - current_round=2\n",
+ "2025-02-24 19:13:22,334 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 1.000\n",
+ "2025-02-24 19:13:22,474 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 0.994\n",
+ "2025-02-24 19:13:31,550 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 1.018\n",
+ "2025-02-24 19:13:31,826 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 1.038\n",
+ "2025-02-24 19:13:40,684 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 1.015\n",
+ "2025-02-24 19:13:41,066 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 1.029\n",
+ "2025-02-24 19:13:50,173 - nvflare.app_common.executors.task_script_runner - INFO - [1, 8000] loss: 1.036\n",
+ "2025-02-24 19:13:50,473 - nvflare.app_common.executors.task_script_runner - INFO - [1, 8000] loss: 1.034\n",
+ "2025-02-24 19:13:59,415 - nvflare.app_common.executors.task_script_runner - INFO - [1, 10000] loss: 1.028\n",
+ "2025-02-24 19:13:59,893 - nvflare.app_common.executors.task_script_runner - INFO - [1, 10000] loss: 1.036\n",
+ "2025-02-24 19:14:08,820 - nvflare.app_common.executors.task_script_runner - INFO - [1, 12000] loss: 1.032\n",
+ "2025-02-24 19:14:09,310 - nvflare.app_common.executors.task_script_runner - INFO - [1, 12000] loss: 1.032\n",
+ "2025-02-24 19:14:20,227 - nvflare.app_common.executors.task_script_runner - INFO - [2, 2000] loss: 0.927\n",
+ "2025-02-24 19:14:21,448 - nvflare.app_common.executors.task_script_runner - INFO - [2, 2000] loss: 0.959\n",
+ "2025-02-24 19:14:29,471 - nvflare.app_common.executors.task_script_runner - INFO - [2, 4000] loss: 0.961\n",
+ "2025-02-24 19:14:30,653 - nvflare.app_common.executors.task_script_runner - INFO - [2, 4000] loss: 0.941\n",
+ "2025-02-24 19:14:38,611 - nvflare.app_common.executors.task_script_runner - INFO - [2, 6000] loss: 0.957\n",
+ "2025-02-24 19:14:39,992 - nvflare.app_common.executors.task_script_runner - INFO - [2, 6000] loss: 0.960\n",
+ "2025-02-24 19:14:47,633 - nvflare.app_common.executors.task_script_runner - INFO - [2, 8000] loss: 0.973\n",
+ "2025-02-24 19:14:49,100 - nvflare.app_common.executors.task_script_runner - INFO - [2, 8000] loss: 0.980\n",
+ "2025-02-24 19:14:57,086 - nvflare.app_common.executors.task_script_runner - INFO - [2, 10000] loss: 0.994\n",
+ "2025-02-24 19:14:58,645 - nvflare.app_common.executors.task_script_runner - INFO - [2, 10000] loss: 0.992\n",
+ "2025-02-24 19:15:06,125 - nvflare.app_common.executors.task_script_runner - INFO - [2, 12000] loss: 0.988\n",
+ "2025-02-24 19:15:07,495 - nvflare.app_common.executors.task_script_runner - INFO - [2, 12000] loss: 0.974\n",
+ "2025-02-24 19:15:08,416 - nvflare.app_common.executors.task_script_runner - INFO - Finished Training\n",
+ "2025-02-24 19:15:09,882 - nvflare.app_common.executors.task_script_runner - INFO - Finished Training\n",
+ "2025-02-24 19:15:16,461 - nvflare.app_common.executors.task_script_runner - INFO - Accuracy of the network on the 10000 test images: 63 %\n",
+ "2025-02-24 19:15:16,466 - InProcessClientAPI - INFO - Try to send local model back to peer \n",
+ "2025-02-24 19:15:16,693 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: finished processing task\n",
+ "2025-02-24 19:15:16,694 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: try #1: sending task result to server\n",
+ "2025-02-24 19:15:16,695 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: checking task ...\n",
+ "2025-02-24 19:15:16,695 - Cell - INFO - broadcast: channel='aux_communication', topic='__task_check__', targets=['server.simulate_job'], timeout=5.0\n",
+ "2025-02-24 19:15:16,701 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: start to send task result to server\n",
+ "2025-02-24 19:15:16,701 - FederatedClient - INFO - Starting to push execute result.\n",
+ "2025-02-24 19:15:16,706 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job]: got result from client site-2 for task: name=train, id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999\n",
+ "2025-02-24 19:15:16,706 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: validation metric 63 from client site-2\n",
+ "2025-02-24 19:15:16,788 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: finished processing client result by controller\n",
+ "2025-02-24 19:15:16,789 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-2 task_id:05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999\n",
+ "2025-02-24 19:15:16,790 - Communicator - INFO - SubmitUpdate size: 251.4KB (251449 Bytes). time: 0.089340 seconds\n",
+ "2025-02-24 19:15:16,791 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=05e2dc1d-d1b7-4b00-bf7c-bf2a6930a999]: task result sent to server\n",
+ "2025-02-24 19:15:16,791 - ClientTaskWorker - INFO - Finished one task run for client: site-2 interval: 2 task_processed: True\n",
+ "2025-02-24 19:15:17,207 - nvflare.app_common.executors.task_script_runner - INFO - Accuracy of the network on the 10000 test images: 63 %\n",
+ "2025-02-24 19:15:17,211 - InProcessClientAPI - INFO - Try to send local model back to peer \n",
+ "2025-02-24 19:15:17,284 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: finished processing task\n",
+ "2025-02-24 19:15:17,285 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: try #1: sending task result to server\n",
+ "2025-02-24 19:15:17,285 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: checking task ...\n",
+ "2025-02-24 19:15:17,285 - Cell - INFO - broadcast: channel='aux_communication', topic='__task_check__', targets=['server.simulate_job'], timeout=5.0\n",
+ "2025-02-24 19:15:17,291 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: start to send task result to server\n",
+ "2025-02-24 19:15:17,292 - FederatedClient - INFO - Starting to push execute result.\n",
+ "2025-02-24 19:15:17,297 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job]: got result from client site-1 for task: name=train, id=eb82f454-28bc-416d-9160-2b6dba551dbf\n",
+ "2025-02-24 19:15:17,297 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: validation metric 63 from client site-1\n",
+ "2025-02-24 19:15:17,372 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: finished processing client result by controller\n",
+ "2025-02-24 19:15:17,372 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-1 task_id:eb82f454-28bc-416d-9160-2b6dba551dbf\n",
+ "2025-02-24 19:15:17,374 - Communicator - INFO - SubmitUpdate size: 251.4KB (251449 Bytes). time: 0.082439 seconds\n",
+ "2025-02-24 19:15:17,375 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: task result sent to server\n",
+ "2025-02-24 19:15:17,375 - ClientTaskWorker - INFO - Finished one task run for client: site-1 interval: 2 task_processed: True\n",
+ "2025-02-24 19:15:17,389 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: task train exit with status TaskCompletionStatus.OK\n",
+ "2025-02-24 19:15:17,390 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: new best validation metric at round 2: 63.0\n",
+ "2025-02-24 19:15:17,393 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: aggregating 2 update(s) at round 2\n",
+ "2025-02-24 19:15:17,394 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: Start persist model on server.\n",
+ "2025-02-24 19:15:17,395 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: End persist model on server.\n",
+ "2025-02-24 19:15:17,395 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: Round 3 started.\n",
+ "2025-02-24 19:15:17,395 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: Sampled clients: ['site-1', 'site-2']\n",
+ "2025-02-24 19:15:17,395 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: Sending task train to ['site-1', 'site-2']\n",
+ "2025-02-24 19:15:17,395 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=eb82f454-28bc-416d-9160-2b6dba551dbf]: scheduled task train\n",
+ "2025-02-24 19:15:18,796 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: assigned task to client site-2: name=train, id=c0887cf1-c746-41f2-9df1-18e93fd35e0f\n",
+ "2025-02-24 19:15:18,797 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: sent task assignment to client. client_name:site-2 task_id:c0887cf1-c746-41f2-9df1-18e93fd35e0f\n",
+ "2025-02-24 19:15:18,797 - GetTaskCommand - INFO - return task to client. client_name: site-2 task_name: train task_id: c0887cf1-c746-41f2-9df1-18e93fd35e0f sharable_header_task_id: c0887cf1-c746-41f2-9df1-18e93fd35e0f\n",
+ "2025-02-24 19:15:18,802 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251536 Bytes) time: 0.009684 seconds\n",
+ "2025-02-24 19:15:18,802 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:15:18,802 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=c0887cf1-c746-41f2-9df1-18e93fd35e0f\n",
+ "2025-02-24 19:15:18,803 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:15:18,803 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: execute for task (train)\n",
+ "2025-02-24 19:15:18,803 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: send data to peer\n",
+ "2025-02-24 19:15:18,803 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: sending payload to peer\n",
+ "2025-02-24 19:15:18,804 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: Waiting for result from peer\n",
+ "2025-02-24 19:15:18,969 - nvflare.app_common.executors.task_script_runner - INFO - current_round=3\n",
+ "2025-02-24 19:15:19,379 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: assigned task to client site-1: name=train, id=b38e5e7a-7489-4d0f-87de-43a88889522c\n",
+ "2025-02-24 19:15:19,379 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: sent task assignment to client. client_name:site-1 task_id:b38e5e7a-7489-4d0f-87de-43a88889522c\n",
+ "2025-02-24 19:15:19,379 - GetTaskCommand - INFO - return task to client. client_name: site-1 task_name: train task_id: b38e5e7a-7489-4d0f-87de-43a88889522c sharable_header_task_id: b38e5e7a-7489-4d0f-87de-43a88889522c\n",
+ "2025-02-24 19:15:19,384 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251536 Bytes) time: 0.008228 seconds\n",
+ "2025-02-24 19:15:19,385 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:15:19,385 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=b38e5e7a-7489-4d0f-87de-43a88889522c\n",
+ "2025-02-24 19:15:19,385 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:15:19,385 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: execute for task (train)\n",
+ "2025-02-24 19:15:19,386 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: send data to peer\n",
+ "2025-02-24 19:15:19,386 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: sending payload to peer\n",
+ "2025-02-24 19:15:19,386 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: Waiting for result from peer\n",
+ "2025-02-24 19:15:19,714 - nvflare.app_common.executors.task_script_runner - INFO - current_round=3\n",
+ "2025-02-24 19:15:27,656 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 0.875\n",
+ "2025-02-24 19:15:28,837 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 0.860\n",
+ "2025-02-24 19:15:36,813 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 0.894\n",
+ "2025-02-24 19:15:38,003 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 0.874\n",
+ "2025-02-24 19:15:45,601 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 0.919\n",
+ "2025-02-24 19:15:46,932 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 0.935\n",
+ "2025-02-24 19:15:54,897 - nvflare.app_common.executors.task_script_runner - INFO - [1, 8000] loss: 0.920\n",
+ "2025-02-24 19:15:56,149 - nvflare.app_common.executors.task_script_runner - INFO - [1, 8000] loss: 0.933\n",
+ "2025-02-24 19:16:04,242 - nvflare.app_common.executors.task_script_runner - INFO - [1, 10000] loss: 0.935\n",
+ "2025-02-24 19:16:05,360 - nvflare.app_common.executors.task_script_runner - INFO - [1, 10000] loss: 0.942\n",
+ "2025-02-24 19:16:13,680 - nvflare.app_common.executors.task_script_runner - INFO - [1, 12000] loss: 0.944\n",
+ "2025-02-24 19:16:14,625 - nvflare.app_common.executors.task_script_runner - INFO - [1, 12000] loss: 0.935\n",
+ "2025-02-24 19:16:25,646 - nvflare.app_common.executors.task_script_runner - INFO - [2, 2000] loss: 0.833\n",
+ "2025-02-24 19:16:26,669 - nvflare.app_common.executors.task_script_runner - INFO - [2, 2000] loss: 0.843\n",
+ "2025-02-24 19:16:34,976 - nvflare.app_common.executors.task_script_runner - INFO - [2, 4000] loss: 0.881\n",
+ "2025-02-24 19:16:35,833 - nvflare.app_common.executors.task_script_runner - INFO - [2, 4000] loss: 0.850\n",
+ "2025-02-24 19:16:44,190 - nvflare.app_common.executors.task_script_runner - INFO - [2, 6000] loss: 0.868\n",
+ "2025-02-24 19:16:45,245 - nvflare.app_common.executors.task_script_runner - INFO - [2, 6000] loss: 0.879\n",
+ "2025-02-24 19:16:53,259 - nvflare.app_common.executors.task_script_runner - INFO - [2, 8000] loss: 0.884\n",
+ "2025-02-24 19:16:54,480 - nvflare.app_common.executors.task_script_runner - INFO - [2, 8000] loss: 0.878\n",
+ "2025-02-24 19:17:02,438 - nvflare.app_common.executors.task_script_runner - INFO - [2, 10000] loss: 0.890\n",
+ "2025-02-24 19:17:03,663 - nvflare.app_common.executors.task_script_runner - INFO - [2, 10000] loss: 0.894\n",
+ "2025-02-24 19:17:11,600 - nvflare.app_common.executors.task_script_runner - INFO - [2, 12000] loss: 0.891\n",
+ "2025-02-24 19:17:12,803 - nvflare.app_common.executors.task_script_runner - INFO - [2, 12000] loss: 0.892\n",
+ "2025-02-24 19:17:14,002 - nvflare.app_common.executors.task_script_runner - INFO - Finished Training\n",
+ "2025-02-24 19:17:15,278 - nvflare.app_common.executors.task_script_runner - INFO - Finished Training\n",
+ "2025-02-24 19:17:22,212 - nvflare.app_common.executors.task_script_runner - INFO - Accuracy of the network on the 10000 test images: 65 %\n",
+ "2025-02-24 19:17:22,216 - InProcessClientAPI - INFO - Try to send local model back to peer \n",
+ "2025-02-24 19:17:22,402 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: finished processing task\n",
+ "2025-02-24 19:17:22,403 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: try #1: sending task result to server\n",
+ "2025-02-24 19:17:22,403 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: checking task ...\n",
+ "2025-02-24 19:17:22,403 - Cell - INFO - broadcast: channel='aux_communication', topic='__task_check__', targets=['server.simulate_job'], timeout=5.0\n",
+ "2025-02-24 19:17:22,408 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: start to send task result to server\n",
+ "2025-02-24 19:17:22,408 - FederatedClient - INFO - Starting to push execute result.\n",
+ "2025-02-24 19:17:22,414 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job]: got result from client site-2 for task: name=train, id=c0887cf1-c746-41f2-9df1-18e93fd35e0f\n",
+ "2025-02-24 19:17:22,415 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: validation metric 65 from client site-2\n",
+ "2025-02-24 19:17:22,492 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: finished processing client result by controller\n",
+ "2025-02-24 19:17:22,492 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-2 task_id:c0887cf1-c746-41f2-9df1-18e93fd35e0f\n",
+ "2025-02-24 19:17:22,495 - Communicator - INFO - SubmitUpdate size: 251.4KB (251449 Bytes). time: 0.087271 seconds\n",
+ "2025-02-24 19:17:22,495 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=c0887cf1-c746-41f2-9df1-18e93fd35e0f]: task result sent to server\n",
+ "2025-02-24 19:17:22,496 - ClientTaskWorker - INFO - Finished one task run for client: site-2 interval: 2 task_processed: True\n",
+ "2025-02-24 19:17:22,920 - nvflare.app_common.executors.task_script_runner - INFO - Accuracy of the network on the 10000 test images: 65 %\n",
+ "2025-02-24 19:17:22,923 - InProcessClientAPI - INFO - Try to send local model back to peer \n",
+ "2025-02-24 19:17:22,980 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: finished processing task\n",
+ "2025-02-24 19:17:22,981 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: try #1: sending task result to server\n",
+ "2025-02-24 19:17:22,981 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: checking task ...\n",
+ "2025-02-24 19:17:22,981 - Cell - INFO - broadcast: channel='aux_communication', topic='__task_check__', targets=['server.simulate_job'], timeout=5.0\n",
+ "2025-02-24 19:17:22,987 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: start to send task result to server\n",
+ "2025-02-24 19:17:22,987 - FederatedClient - INFO - Starting to push execute result.\n",
+ "2025-02-24 19:17:22,991 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job]: got result from client site-1 for task: name=train, id=b38e5e7a-7489-4d0f-87de-43a88889522c\n",
+ "2025-02-24 19:17:22,992 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: validation metric 65 from client site-1\n",
+ "2025-02-24 19:17:23,064 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: finished processing client result by controller\n",
+ "2025-02-24 19:17:23,065 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller]: task train exit with status TaskCompletionStatus.OK\n",
+ "2025-02-24 19:17:23,065 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-1 task_id:b38e5e7a-7489-4d0f-87de-43a88889522c\n",
+ "2025-02-24 19:17:23,067 - Communicator - INFO - SubmitUpdate size: 251.4KB (251449 Bytes). time: 0.079637 seconds\n",
+ "2025-02-24 19:17:23,067 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: task result sent to server\n",
+ "2025-02-24 19:17:23,067 - ClientTaskWorker - INFO - Finished one task run for client: site-1 interval: 2 task_processed: True\n",
+ "2025-02-24 19:17:23,265 - IntimeModelSelector - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: new best validation metric at round 3: 65.0\n",
+ "2025-02-24 19:17:23,268 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: aggregating 2 update(s) at round 3\n",
+ "2025-02-24 19:17:23,270 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: Start persist model on server.\n",
+ "2025-02-24 19:17:23,272 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: End persist model on server.\n",
+ "2025-02-24 19:17:23,273 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: Round 4 started.\n",
+ "2025-02-24 19:17:23,273 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: Sampled clients: ['site-1', 'site-2']\n",
+ "2025-02-24 19:17:23,273 - FedAvg - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: Sending task train to ['site-1', 'site-2']\n",
+ "2025-02-24 19:17:23,273 - WFCommServer - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=b38e5e7a-7489-4d0f-87de-43a88889522c]: scheduled task train\n",
+ "2025-02-24 19:17:24,500 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=944f7288-d797-4017-8729-cb3a0937b9ec]: assigned task to client site-2: name=train, id=944f7288-d797-4017-8729-cb3a0937b9ec\n",
+ "2025-02-24 19:17:24,500 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-2, peer_run=simulate_job, task_name=train, task_id=944f7288-d797-4017-8729-cb3a0937b9ec]: sent task assignment to client. client_name:site-2 task_id:944f7288-d797-4017-8729-cb3a0937b9ec\n",
+ "2025-02-24 19:17:24,501 - GetTaskCommand - INFO - return task to client. client_name: site-2 task_name: train task_id: 944f7288-d797-4017-8729-cb3a0937b9ec sharable_header_task_id: 944f7288-d797-4017-8729-cb3a0937b9ec\n",
+ "2025-02-24 19:17:24,506 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251536 Bytes) time: 0.009357 seconds\n",
+ "2025-02-24 19:17:24,506 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:17:24,506 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=944f7288-d797-4017-8729-cb3a0937b9ec\n",
+ "2025-02-24 19:17:24,506 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=944f7288-d797-4017-8729-cb3a0937b9ec]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:17:24,506 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=944f7288-d797-4017-8729-cb3a0937b9ec]: execute for task (train)\n",
+ "2025-02-24 19:17:24,507 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=944f7288-d797-4017-8729-cb3a0937b9ec]: send data to peer\n",
+ "2025-02-24 19:17:24,507 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=944f7288-d797-4017-8729-cb3a0937b9ec]: sending payload to peer\n",
+ "2025-02-24 19:17:24,507 - PTInProcessClientAPIExecutor - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=944f7288-d797-4017-8729-cb3a0937b9ec]: Waiting for result from peer\n",
+ "2025-02-24 19:17:24,719 - nvflare.app_common.executors.task_script_runner - INFO - current_round=4\n",
+ "2025-02-24 19:17:25,072 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=4ee559a9-2896-456d-937b-04cc3c7e0399]: assigned task to client site-1: name=train, id=4ee559a9-2896-456d-937b-04cc3c7e0399\n",
+ "2025-02-24 19:17:25,073 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site-1, peer_run=simulate_job, task_name=train, task_id=4ee559a9-2896-456d-937b-04cc3c7e0399]: sent task assignment to client. client_name:site-1 task_id:4ee559a9-2896-456d-937b-04cc3c7e0399\n",
+ "2025-02-24 19:17:25,073 - GetTaskCommand - INFO - return task to client. client_name: site-1 task_name: train task_id: 4ee559a9-2896-456d-937b-04cc3c7e0399 sharable_header_task_id: 4ee559a9-2896-456d-937b-04cc3c7e0399\n",
+ "2025-02-24 19:17:25,079 - Communicator - INFO - Received from simulator_server server. getTask: train size: 251.5KB (251536 Bytes) time: 0.010896 seconds\n",
+ "2025-02-24 19:17:25,080 - FederatedClient - INFO - pull_task completed. Task name:train Status:True \n",
+ "2025-02-24 19:17:25,080 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job]: got task assignment: name=train, id=4ee559a9-2896-456d-937b-04cc3c7e0399\n",
+ "2025-02-24 19:17:25,080 - ClientRunner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=4ee559a9-2896-456d-937b-04cc3c7e0399]: invoking task executor PTInProcessClientAPIExecutor\n",
+ "2025-02-24 19:17:25,080 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=4ee559a9-2896-456d-937b-04cc3c7e0399]: execute for task (train)\n",
+ "2025-02-24 19:17:25,080 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=4ee559a9-2896-456d-937b-04cc3c7e0399]: send data to peer\n",
+ "2025-02-24 19:17:25,081 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=4ee559a9-2896-456d-937b-04cc3c7e0399]: sending payload to peer\n",
+ "2025-02-24 19:17:25,081 - PTInProcessClientAPIExecutor - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=4ee559a9-2896-456d-937b-04cc3c7e0399]: Waiting for result from peer\n",
+ "2025-02-24 19:17:25,424 - nvflare.app_common.executors.task_script_runner - INFO - current_round=4\n",
+ "2025-02-24 19:17:33,621 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 0.761\n",
+ "2025-02-24 19:17:34,792 - nvflare.app_common.executors.task_script_runner - INFO - [1, 2000] loss: 0.776\n",
+ "2025-02-24 19:17:43,114 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 0.805\n",
+ "2025-02-24 19:17:44,116 - nvflare.app_common.executors.task_script_runner - INFO - [1, 4000] loss: 0.810\n",
+ "2025-02-24 19:17:52,321 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 0.831\n",
+ "2025-02-24 19:17:53,275 - nvflare.app_common.executors.task_script_runner - INFO - [1, 6000] loss: 0.817\n"
+ ]
+ }
+ ],
+ "source": [
+ "job.simulator_run(f\"/tmp/nvflare/{job.name}\", gpu=\"0\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c82a3be9-9e58-44ca-9d3f-e85456de7f12",
+ "metadata": {},
+ "source": [
+ "#### 6. Run FL Simulation with DP\n",
+ "Run the FL simulator with two clients for federated learning with differential privacy. The key now is to add a filer to each client that applies DP before sending the model updates back to the server\n",
+ "using the `job.to()` method.\n",
+ "\n",
+ "Let's create a new FedJob with the DP add through the [SVTPrivacy](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.filters.html#nvflare.app_common.filters.SVTPrivacy) Filter.\n",
+ "\n",
+ "> **Note:** Use `filter_type=FilterType.TASK_RESULT` as we are adding the filter on top of the model updates after local training."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "330e6fca-8098-4be4-8d75-6b5e7ab1869d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from nvflare import FilterType\n",
+ "from nvflare.app_common.filters import SVTPrivacy\n",
+ "\n",
+ "# Create BaseFedJob with initial model\n",
+ "job = BaseFedJob(\n",
+ " name=\"cifar10_fedavg_dp\",\n",
+ " initial_model=Net(),\n",
+ ")\n",
+ "\n",
+ "# Define the controller and send to server\n",
+ "controller = FedAvg(\n",
+ " num_clients=n_clients,\n",
+ " num_rounds=num_rounds,\n",
+ ")\n",
+ "job.to_server(controller)\n",
+ "\n",
+ "# Add clients\n",
+ "for i in range(n_clients):\n",
+ " runner = ScriptRunner(\n",
+ " script=\"src/cifar10_fl.py\"\n",
+ " )\n",
+ " job.to(runner, f\"site-{i+1}\")\n",
+ "\n",
+ " # add privacy filter.\n",
+ " dp_filter = SVTPrivacy(fraction=0.1, epsilon=0.1, noise_var=0.1, gamma=1e-5, tau=1e-6)\n",
+ " job.to(dp_filter, f\"site-{i+1}\", tasks=[\"train\"], filter_type=FilterType.TASK_RESULT)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a661c68e-6b7f-4215-93e3-d4fe55eb5e7e",
+ "metadata": {},
+ "source": [
+ "Finally, start the training"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fc6b911d-a171-49b1-ad2e-b0d73032110c",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "job.simulator_run(f\"/tmp/nvflare/{job.name}\", gpu=\"0\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fcf57a13-842b-46e0-9032-69d3a03b3182",
+ "metadata": {},
+ "source": [
+ "> **Note:** you can also try adding or combining the filters with other privacy filters or customize them. For example, use the [PercentilePrivacy](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.filters.html#nvflare.app_common.filters.PercentilePrivacy) filter based on Shokri and Shmatikov ([Privacy-preserving deep learning, CCS '15](https://dl.acm.org/doi/abs/10.1145/2810103.2813687)) or [ExcludeVars](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.filters.html#nvflare.app_common.filters.ExcludeVars) filter to exclude variables that shouldn't be shared with the server."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e90c423d-4784-4956-961f-e2ea1ef1b30e",
+ "metadata": {},
+ "source": [
+ "### 7. Visualize the results\n",
+ "You can plot the results by running `tensorboard --logdir /tmp/nvflare` in a new terminal\n",
+ "\n",
+ ""
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
- "id": "9bb84109-12db-4a55-aa44-4b9dc8d20659",
+ "id": "e5b1020d-0d2f-433b-81e4-065d484923fe",
"metadata": {},
"outputs": [],
"source": []
@@ -11,9 +800,9 @@
],
"metadata": {
"kernelspec": {
- "display_name": "nvflare_example",
+ "display_name": "Python 3 (ipykernel)",
"language": "python",
- "name": "nvflare_example"
+ "name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -25,7 +814,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.2"
+ "version": "3.11.7"
}
},
"nbformat": 4,
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/requirements.txt b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/requirements.txt
new file mode 100644
index 0000000000..f34de5ae20
--- /dev/null
+++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/requirements.txt
@@ -0,0 +1,4 @@
+nvflare~=2.5
+torch
+torchvision
+tensorboard
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/src/cifar10_fl.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/src/cifar10_fl.py
new file mode 100644
index 0000000000..19983285a9
--- /dev/null
+++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/src/cifar10_fl.py
@@ -0,0 +1,142 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch
+import torch.nn as nn
+import torch.optim as optim
+import torchvision
+import torchvision.transforms as transforms
+from net import Net
+
+# (1) import nvflare client API
+import nvflare.client as flare
+
+# (optional) metrics
+from nvflare.client.tracking import SummaryWriter
+
+# (optional) set a fixed location so we don't need to download the dataset every time
+DATASET_PATH = "/tmp/nvflare/data"
+# If available, we use GPU to speed things up.
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+
+def main():
+ transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
+
+ batch_size = 4
+ epochs = 2
+
+ trainset = torchvision.datasets.CIFAR10(root=DATASET_PATH, train=True, download=True, transform=transform)
+ trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=2)
+
+ testset = torchvision.datasets.CIFAR10(root=DATASET_PATH, train=False, download=True, transform=transform)
+ testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=2)
+
+ net = Net()
+
+ # (2) initializes NVFlare client API
+ flare.init()
+
+ # (Optional) compute unique seed from client name to initialize data loaders
+ client_name = flare.get_site_name()
+ seed = int.from_bytes(client_name.encode(), "big")
+ torch.manual_seed(seed)
+
+ summary_writer = SummaryWriter()
+ while flare.is_running():
+ # (3) receives FLModel from NVFlare
+ input_model = flare.receive()
+ print(f"current_round={input_model.current_round}")
+
+ # (4) loads model from NVFlare
+ net.load_state_dict(input_model.params)
+
+ criterion = nn.CrossEntropyLoss()
+ optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
+
+ # (optional) use GPU to speed things up
+ net.to(DEVICE)
+ # (optional) calculate total steps
+ steps = epochs * len(trainloader)
+ for epoch in range(epochs): # loop over the dataset multiple times
+
+ running_loss = 0.0
+ for i, data in enumerate(trainloader, 0):
+ # get the inputs; data is a list of [inputs, labels]
+ # (optional) use GPU to speed things up
+ inputs, labels = data[0].to(DEVICE), data[1].to(DEVICE)
+
+ # zero the parameter gradients
+ optimizer.zero_grad()
+
+ # forward + backward + optimize
+ outputs = net(inputs)
+ loss = criterion(outputs, labels)
+ loss.backward()
+ optimizer.step()
+
+ # print statistics
+ running_loss += loss.item()
+ if i % 2000 == 1999: # print every 2000 mini-batches
+ print(f"[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}")
+ global_step = input_model.current_round * steps + epoch * len(trainloader) + i
+
+ summary_writer.add_scalar(tag="loss_for_each_batch", scalar=running_loss, global_step=global_step)
+ running_loss = 0.0
+
+ print("Finished Training")
+
+ PATH = "./cifar_net.pth"
+ torch.save(net.state_dict(), PATH)
+
+ # (5) wraps evaluation logic into a method to re-use for
+ # evaluation on both trained and received model
+ def evaluate(input_weights):
+ net = Net()
+ net.load_state_dict(input_weights)
+ # (optional) use GPU to speed things up
+ net.to(DEVICE)
+
+ correct = 0
+ total = 0
+ # since we're not training, we don't need to calculate the gradients for our outputs
+ with torch.no_grad():
+ for data in testloader:
+ # (optional) use GPU to speed things up
+ images, labels = data[0].to(DEVICE), data[1].to(DEVICE)
+ # calculate outputs by running images through the network
+ outputs = net(images)
+ # the class with the highest energy is what we choose as prediction
+ _, predicted = torch.max(outputs.data, 1)
+ total += labels.size(0)
+ correct += (predicted == labels).sum().item()
+
+ print(f"Accuracy of the network on the 10000 test images: {100 * correct // total} %")
+ return 100 * correct // total
+
+ # (6) evaluate on received model for model selection
+ accuracy = evaluate(input_model.params)
+ summary_writer.add_scalar(tag="global_model_accuracy", scalar=accuracy, global_step=input_model.current_round)
+ # (7) construct trained FL model
+ output_model = flare.FLModel(
+ params=net.cpu().state_dict(),
+ metrics={"accuracy": accuracy},
+ meta={"NUM_STEPS_CURRENT_ROUND": steps},
+ )
+ # (8) send model back to NVFlare
+ flare.send(output_model)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/src/net.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/src/net.py
new file mode 100644
index 0000000000..47ac7e9589
--- /dev/null
+++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-5_Privacy_In_Federated_Learning/05.2_differency_privacy/src/net.py
@@ -0,0 +1,37 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class Net(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.conv1 = nn.Conv2d(3, 6, 5)
+ self.pool = nn.MaxPool2d(2, 2)
+ self.conv2 = nn.Conv2d(6, 16, 5)
+ self.fc1 = nn.Linear(16 * 5 * 5, 120)
+ self.fc2 = nn.Linear(120, 84)
+ self.fc3 = nn.Linear(84, 10)
+
+ def forward(self, x):
+ x = self.pool(F.relu(self.conv1(x)))
+ x = self.pool(F.relu(self.conv2(x)))
+ x = torch.flatten(x, 1) # flatten all dimensions except batch
+ x = F.relu(self.fc1(x))
+ x = F.relu(self.fc2(x))
+ x = self.fc3(x)
+ return x