
Commit

Merge branch 'main' into dashboard_prov
nvkevlu authored Feb 25, 2025
2 parents e384587 + db75f43 commit 7f62dfd
Showing 140 changed files with 3,094 additions and 2,976 deletions.
File renamed without changes.
3 changes: 0 additions & 3 deletions CONTRIBUTING.md
@@ -44,9 +44,6 @@ To collaborate efficiently, please read through this section and follow them.
* [Building documentation](#building-the-documentation)
* [Signing your work](#signing-your-work)

> Note:
> Some package dependencies require python<version>-dev for local development, such as
> python3.12-dev.

#### Checking the coding style
We check code style using flake8 and isort.
87 changes: 59 additions & 28 deletions examples/advanced/federated-statistics/README.md
@@ -2,7 +2,7 @@

## Objective
NVIDIA FLARE will provide built-in federated statistics operators (controllers and executors) that
can generate global statistics based on local client-side statistics.

At each client site, we could have one or more datasets (such as "train" and "test" datasets); each dataset may have many
features. For each feature in the dataset, we will calculate the statistics and then combine them to produce
@@ -19,14 +19,47 @@ The result should be visualized via the visualization utility in the notebook.

## Assumptions

Assume that clients will provide the following:
* Users need to specify the target statistics (e.g., count, histogram)
* Users need to provide the local statistics for the target statistics (by implementing the statistics_spec)
* Users need to provide the datasets and dataset features (feature name, data type)
* Note: count is always required as we use count to enforce data privacy policy

We only support **numerical features**, not categorical features. However, users can return all types of features;
the non-numerical features will be removed.
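
As a rough illustration of what "implementing the statistics_spec" means, here is a minimal sketch of a local statistics generator. It assumes the `Statistics` base class and the `Feature`/`DataType` types from `nvflare.app_common.abstract.statistics_spec`; exact signatures may vary by NVFLARE release:

```python
from typing import Dict, List

import pandas as pd

# assumed import path; check your NVFLARE version
from nvflare.app_common.abstract.statistics_spec import DataType, Feature, Statistics


class MyStatistics(Statistics):
    """Minimal sketch: count-only local statistics over pre-loaded DataFrames."""

    def __init__(self, data: Dict[str, pd.DataFrame]):
        self.data = data  # e.g. {"train": df_train, "test": df_test}

    def features(self) -> Dict[str, List[Feature]]:
        # report each dataset's numerical features (name + data type)
        return {name: [Feature("Age", DataType.INT)] for name in self.data}

    def count(self, dataset_name: str, feature_name: str) -> int:
        # count is always required: it backs the data privacy policy checks
        return int(self.data[dataset_name][feature_name].count())
```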


## Statistics

Federated statistics includes the following numeric statistical measures:
* count
* mean
* sum
* std_dev
* histogram
* quantile

We did not include min and max values to avoid data privacy concerns.
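
For example, the target statistics and their parameters are configured as a simple dictionary; this sketch mirrors the `statistic_configs` used in the df_stats job script later in this example:

```python
# target statistics configuration; keys mirror the df_stats job script below
statistic_configs = {
    "count": {},
    "mean": {},
    "sum": {},
    "stddev": {},
    "histogram": {"*": {"bins": 20}, "Age": {"bins": 20, "range": [0, 100]}},
    "quantile": {"*": [0.1, 0.5, 0.9]},
}
```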

### Quantile

Quantile statistics refers to statistical measures that divide a probability distribution or dataset into intervals with equal probabilities or proportions. Quantiles help summarize the distribution of data by providing key points that indicate how values are spread.

#### Key Quantiles:
1. Median (50th percentile): The middle value of a dataset, dividing it into two equal halves.
2. Quartiles (25th, 50th, 75th percentiles): Divide the data into four equal parts:
* Q1 (25th percentile): Lower quartile, below which 25% of the data falls.
* Q2 (50th percentile): Median.
* Q3 (75th percentile): Upper quartile, below which 75% of the data falls.
3. Deciles (10th, 20th, ..., 90th percentiles): Divide the data into ten equal parts.
4. Percentiles (1st, 2nd, ..., 99th): Divide the data into 100 equal parts.

#### Usage of Quantiles:
* Descriptive Statistics: Summarizes the spread of data.
* Outlier Detection: Helps identify extreme values.
* Machine Learning: Used in feature engineering, normalization, and decision tree algorithms.
* Risk Analysis: Used in finance (e.g., Value at Risk, VaR).
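
To make these definitions concrete, here is a small local (non-federated) sketch using numpy on synthetic data:

```python
import numpy as np

data = np.random.normal(loc=50, scale=10, size=10_000)  # synthetic feature values

q1, median, q3 = np.percentile(data, [25, 50, 75])     # quartiles
deciles = np.percentile(data, np.arange(10, 100, 10))  # deciles
print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}")
```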

## Examples

We provide several examples to demonstrate how the operators should be used.
@@ -57,20 +90,21 @@ The main steps are

The detailed example instructions can be found in [Data frame statistics](df_stats/README.md).


### COVID-19 Radiology Image Examples

The second example provided is an image histogram example. Unlike the **Tabular** data example,
the image examples show the following:
* The [image_statistics.py](image_stats/jobs/image_stats/app/custom/image_statistics.py) only needs
to calculate the count and histogram target statistics. Users only need to provide the calculation count, failure_count and histogram functions. There is no need to implement other metrics functions
(sum, mean, std_dev etc.) (get_failure_count by default returns 0)
* For each site's dataset, there are several thousand images; the local histogram is an aggregate histogram of all the image histograms (see the sketch after this list)
* The image files are large, so we can't load everything into memory and then calculate the statistics.
We will need to iterate through files for each calculation. For a single feature, this is acceptable. If there are multiple features,
such as multiple channels, reloading images to memory for each channel to do histogram calculation is wasteful
* Unlike [Data frame statistics](df_stats/README.md), the histogram bin's global range is pre-defined by the user as [0, 256],
whereas in [Data frame statistics](df_stats/README.md), besides "Age", all other features' histogram global bin range
is dynamically estimated based on local min/max values
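
A minimal sketch of the aggregation idea described above, assuming single-channel images readable with Pillow (the directory layout and file naming are illustrative):

```python
from pathlib import Path

import numpy as np
from PIL import Image  # assumes images are readable with Pillow


def local_image_histogram(image_dir: str, bins: int = 256, value_range=(0, 256)) -> np.ndarray:
    """Aggregate per-image histograms without holding all images in memory."""
    total = np.zeros(bins, dtype=np.int64)
    for path in sorted(Path(image_dir).glob("*.png")):  # iterate file by file
        pixels = np.asarray(Image.open(path)).ravel()   # single-channel pixels
        counts, _ = np.histogram(pixels, bins=bins, range=value_range)
        total += counts  # the site's local histogram is the sum over images
    return total
```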

An example of an image histogram (the underlying image files have only 1 channel)
@@ -155,6 +189,7 @@ The main steps are
* provide client-side configuration to specify the data input location
* provide a hierarchy specification file providing details about all the clients and their hierarchy.


## Privacy Policy and Privacy Filters

NVFLARE provides data privacy protection through privacy filters [privacy-management](https://nvflare.readthedocs.io/en/main/user_guide/security/site_policy_management.html#privacy-management)
@@ -178,22 +213,21 @@ defined and job doesn't specify the privacy scope, the job deployment will fail,

### Privacy Policy Instrumentation

There are different ways to set privacy filters depending on the use cases:

#### Set Privacy Policy as researcher

You can set the "task_result_filters" in config_fed_client.json to specify
the privacy control. This is useful when you are developing these filters.

#### Setup site privacy policy as org admin

Once the company decides to implement certain privacy policies independent of individual
jobs, one can copy the local directory privacy.json content to clients' local privacy.json (merge, not overwrite).
In this example, since we only have one app, we can simply copy the privacy.json from the local directory to:

* site-1/local/privacy.json
* site-2/local/privacy.json

We need to remove the same filters from the job definition in config_fed_client.json
by simply setting `"task_result_filters": []` (an empty list) to avoid **double filtering**.
@@ -304,10 +338,7 @@ sequenceDiagram




## Summary

We provided federated statistics operators that can easily aggregate and visualize the local statistics for
different data sites and features. We hope this feature will make it easier to perform federated data analysis.
59 changes: 56 additions & 3 deletions examples/advanced/federated-statistics/df_stats/README.md
@@ -17,6 +17,52 @@ cd NVFlare/examples/advanced/federated-statistics/df_stats
pip install -r requirements.txt
```


## Install fastdigest

If you intend to calculate quantiles, you need to install fastdigest.

```
pip install fastdigest==0.4.0
```

On Ubuntu, you might get the following error:

    Cargo, the Rust package manager, is not installed or is not on PATH.
    This package requires Rust and Cargo to compile extensions. Install it through
    the system's package manager or via https://rustup.rs/

    Checking for Rust toolchain....

This is because fastdigest (or its dependencies) requires Rust and Cargo to build.

To install Rust and Cargo on your Ubuntu system, run the provided script, which installs Rust via rustup:

```
cd NVFlare/examples/advanced/federated-statistics/df_stats
./install_cargo.sh
```

Then you can install fastdigest again:
```
pip install fastdigest==0.4.0
```

### Quantile Calculation

To calculate federated quantiles, we needed to select a package that satisfies the following constraints:

* Works in distributed systems
* Does not copy the original data (avoiding privacy leaks)
* Avoids transmitting large amounts of data
* Ideally, no system-level dependency

We chose the fastdigest Python package, a Rust-based package. The t-digest only carries the cluster coordinates; initially, each data point is in its own cluster. By default, we compress with max_bins = sqrt(datasize), so the original data points are not transmitted and the data won't leak. You can always override max_bins if you prefer more or less compression.
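
The merge property is what makes the t-digest federation-friendly: each site shares only its digest (cluster centroids and weights), and the server combines the digests. Below is a conceptual sketch using the pure-Python `tdigest` package's API for illustration; fastdigest, the Rust-backed package used in this example, implements the same t-digest idea:

```python
import numpy as np
from tdigest import TDigest  # pip install tdigest; illustrative stand-in for fastdigest

# each site builds a digest over its local data only
site1, site2 = TDigest(), TDigest()
site1.batch_update(np.random.normal(40, 10, 1000))  # synthetic site-1 values
site2.batch_update(np.random.normal(55, 12, 1500))  # synthetic site-2 values

# the server merges the digests -- no raw data points are transmitted
global_digest = site1 + site2
for p in (25, 50, 75):
    print(f"p{p}: {global_digest.percentile(p):.1f}")
```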



## 1. Prepare data

In this example, we are using UCI (University of California, Irvine) [adult dataset](https://archive.ics.uci.edu/dataset/2/adult)
@@ -165,8 +211,12 @@ statistics computing, we will only need to provide the following
"stddev": {},
"histogram": { "*": {"bins": 10 },
"Age": {"bins": 5, "range":[0,120]}
}
},
"quantile": {
"*": [25, 50, 75]
}
},
"writer_id": "stats_writer"
}
}
@@ -195,7 +245,8 @@ in FLARE job store.

### 5.2 client side configuration

First, we specify the built-in client-side executor: `StatisticsExecutor`, which takes a local stats generator ID.


```
"executor": {
```

@@ -248,7 +299,7 @@ In this example, task_result_filters is defined as a task privacy filter: `StatisticsPrivacyFilter`
`StatisticsPrivacyFilter` uses three separate `StatisticsPrivacyCleanser` components; you can find more details in
[local privacy policy](../local/privacy.json) and in the later discussion on privacy.

The policies that the privacy cleansers apply can be found in:
```
"components": [
{
@@ -311,6 +362,8 @@ to calculate the local statistics, we will need to implement a few methods
def histogram(self, dataset_name: str, feature_name: str, num_of_bins: int, global_min_value: float, global_max_value: float) -> Histogram:
def quantiles(self, dataset_name: str, feature_name: str, percentiles: List) -> Dict:
```
Since some features do not provide a histogram bin range, we will need to use the local min/max to estimate
the global min/max, and then use the global min/max as the histogram bin range for all clients.
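
A minimal sketch of how these two methods might be implemented over pre-loaded pandas DataFrames. The `Bin`/`Histogram`/`HistogramType` imports are assumed from `nvflare.app_common.abstract.statistics_spec`, and plain numpy stands in for the t-digest-based quantile computation used by the real example:

```python
from typing import Dict, List

import numpy as np

# assumed import path; check your NVFLARE version
from nvflare.app_common.abstract.statistics_spec import Bin, Histogram, HistogramType


def histogram(
    self, dataset_name: str, feature_name: str, num_of_bins: int, global_min_value: float, global_max_value: float
) -> Histogram:
    feature = self.data[dataset_name][feature_name].dropna()
    counts, edges = np.histogram(feature, bins=num_of_bins, range=(global_min_value, global_max_value))
    bins = [Bin(float(edges[i]), float(edges[i + 1]), int(counts[i])) for i in range(num_of_bins)]
    return Histogram(HistogramType.STANDARD, bins)


def quantiles(self, dataset_name: str, feature_name: str, percentiles: List) -> Dict:
    # numpy shown for clarity; assumes percentiles are given as fractions (e.g. 0.5)
    feature = self.data[dataset_name][feature_name].dropna()
    return {p: float(np.quantile(feature, p)) for p in percentiles}
```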
@@ -37,7 +37,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "c44a0217",
"metadata": {
"tags": []
@@ -81,7 +81,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"id": "93c62d5e",
"metadata": {
"tags": []
@@ -271,9 +271,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "nvflare_example",
"display_name": "nvflare-env",
"language": "python",
"name": "nvflare_example"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -285,7 +285,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
"version": "3.8.13"
}
},
"nbformat": 4,
15 changes: 15 additions & 0 deletions examples/advanced/federated-statistics/df_stats/install_cargo.sh
@@ -0,0 +1,15 @@

# fastdigest (or its dependencies) requires Rust and Cargo to build.
# You need to install Rust and Cargo on your Ubuntu system. Follow these steps:
# Install Rust and Cargo
# Run the following command to install Rust using rustup:


curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Then restart your terminal or run:

source $HOME/.cargo/env
# Verify Installation
# Check if Rust and Cargo are installed correctly:
rustc --version
cargo --version
@@ -21,10 +21,10 @@


class DFStatistics(DFStatisticsCore):
    def __init__(self, filename, data_root_dir="/tmp/nvflare/df_stats/data"):
        super().__init__()
        self.data_root_dir = data_root_dir
        self.filename = filename
        self.data: Optional[Dict[str, pd.DataFrame]] = None
        self.data_features = [
            "Age",
@@ -57,7 +57,7 @@ def load_data(self, fl_ctx: FLContext) -> Dict[str, pd.DataFrame]:
        self.log_info(fl_ctx, f"load data for client {client_name}")
        try:
            skip_rows = self.skip_rows[client_name]
            data_path = f"{self.data_root_dir}/{fl_ctx.get_identity_name()}/{self.filename}"
            # example of load data from CSV
            df: pd.DataFrame = pd.read_csv(
                data_path, names=self.data_features, sep=r"\s*,\s*", skiprows=skip_rows, engine="python", na_values="?"
@@ -20,9 +20,9 @@

def define_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--n_clients", type=int, default=2)
    parser.add_argument("-d", "--data_root_dir", type=str, nargs="?", default="/tmp/nvflare/df_stats/data")
    parser.add_argument("-o", "--stats_output_path", type=str, nargs="?", default="statistics/adults_stats.json")
    parser.add_argument("-j", "--job_dir", type=str, nargs="?", default="/tmp/nvflare/jobs/stats_df")
    parser.add_argument("-w", "--work_dir", type=str, nargs="?", default="/tmp/nvflare/jobs/stats_df/work_dir")
    parser.add_argument("-co", "--export_config", action="store_true", help="config only mode, export config")
@@ -45,12 +45,11 @@ def main():
"mean": {},
"sum": {},
"stddev": {},
"histogram": {"*": {"bins": 20}},
"Age": {"bins": 20, "range": [0, 10]},
"percentile": {"*": [25, 50, 75], "Age": [50, 95]},
"histogram": {"*": {"bins": 20}, "Age": {"bins": 20, "range": [0, 100]}},
"quantile": {"*": [0.1, 0.5, 0.9], "Age": [0.1, 0.5, 0.9]},
}
# define local stats generator
df_stats_generator = DFStatistics(data_root_dir=data_root_dir)
df_stats_generator = DFStatistics(filename="data.csv", data_root_dir=data_root_dir)

    job = StatsJob(
        job_name="stats_df",
@@ -63,6 +62,7 @@ job.setup_clients(sites)
    job.setup_clients(sites)

    if export_config:
        print("Exporting job config...", job_dir)
        job.export_job(job_dir)
    else:
        job.simulator_run(work_dir)
@@ -19,7 +19,7 @@
"range": [0,120]
}
},
"percentile": {
"quantile": {
"*": [25, 50, 75]
}
},
@@ -2,4 +2,4 @@ numpy
pandas
matplotlib
jupyterlab

1 change: 0 additions & 1 deletion examples/advanced/streaming/src/simple_controller.py
@@ -25,7 +25,6 @@


class SimpleController(Controller):
    def control_flow(self, abort_signal: Signal, fl_ctx: FLContext):
        logger.info(f"Entering control loop of {self.__class__.__name__}")
        engine = fl_ctx.get_engine()
