
Commit

Merge branch 'main' into dashboard_prov
nvkevlu authored Feb 25, 2025
2 parents e384587 + db75f43 commit 7f62dfd
Showing 140 changed files with 3,094 additions and 2,976 deletions.
File renamed without changes.
3 changes: 0 additions & 3 deletions CONTRIBUTING.md
@@ -44,9 +44,6 @@ To collaborate efficiently, please read through this section and follow them.
* [Building documentation](#building-the-documentation)
* [Signing your work](#signing-your-work)

> Note:
> Some package dependencies require python<version>-dev for local development, such as
> python3.12-dev.

#### Checking the coding style
We check code style using flake8 and isort.
87 changes: 59 additions & 28 deletions examples/advanced/federated-statistics/README.md
@@ -2,7 +2,7 @@

## Objective
NVIDIA FLARE will provide built-in federated statistics operators (controllers and executors) that
can generate global statistics based on local client-side statistics.

At each client site, we could have one or more datasets (such as "train" and "test" datasets); each dataset may have many
features. For each feature in the dataset, we will calculate the statistics and then combine them to produce
@@ -19,14 +19,47 @@ The result should be visualized via the visualization utility in the notebook.

## Assumptions

Assume that clients will provide the following:
* Users need to specify the target statistics (e.g., count, histogram)
* Users need to provide the local statistics for the target statistics (by implementing the statistics_spec)
* Users need to provide the datasets and dataset features (feature name, data type)
* Note: count is always required as we use count to enforce data privacy policy

We only support **numerical features**, not categorical features. However, users can return all types of features;
the non-numerical features will be removed.
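
As a rough illustration of what "implementing the statistics_spec" means, here is a minimal sketch of a local statistics generator. It assumes the `Statistics` base class and the `Feature`/`DataType` types from `nvflare.app_common.abstract.statistics_spec`; exact signatures may vary by NVFLARE release:

```python
from typing import Dict, List

import pandas as pd

# assumed import path; check your NVFLARE version
from nvflare.app_common.abstract.statistics_spec import DataType, Feature, Statistics


class MyStatistics(Statistics):
    """Minimal sketch: count-only local statistics over pre-loaded DataFrames."""

    def __init__(self, data: Dict[str, pd.DataFrame]):
        self.data = data  # e.g. {"train": df_train, "test": df_test}

    def features(self) -> Dict[str, List[Feature]]:
        # report each dataset's numerical features (name + data type)
        return {name: [Feature("Age", DataType.INT)] for name in self.data}

    def count(self, dataset_name: str, feature_name: str) -> int:
        # count is always required: it backs the data privacy policy checks
        return int(self.data[dataset_name][feature_name].count())
```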


## Statistics

Federated statistics includes the following numeric statistical measures:
* count
* mean
* sum
* std_dev
* histogram
* quantile

We did not include min and max values to avoid data privacy concerns.
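
For example, the target statistics and their parameters are configured as a simple dictionary; this sketch mirrors the `statistic_configs` used in the df_stats job script later in this example:

```python
# target statistics configuration; keys mirror the df_stats job script below
statistic_configs = {
    "count": {},
    "mean": {},
    "sum": {},
    "stddev": {},
    "histogram": {"*": {"bins": 20}, "Age": {"bins": 20, "range": [0, 100]}},
    "quantile": {"*": [0.1, 0.5, 0.9]},
}
```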

### Quantile

Quantile statistics refers to statistical measures that divide a probability distribution or dataset into intervals with equal probabilities or proportions. Quantiles help summarize the distribution of data by providing key points that indicate how values are spread.

#### Key Quantiles:
1. Median (50th percentile): The middle value of a dataset, dividing it into two equal halves.
2. Quartiles (25th, 50th, 75th percentiles): Divide the data into four equal parts:
* Q1 (25th percentile): Lower quartile, below which 25% of the data falls.
* Q2 (50th percentile): Median.
* Q3 (75th percentile): Upper quartile, below which 75% of the data falls.
3. Deciles (10th, 20th, ..., 90th percentiles): Divide the data into ten equal parts.
4. Percentiles (1st, 2nd, ..., 99th): Divide the data into 100 equal parts.

#### Usage of Quantiles:
* Descriptive Statistics: Summarizes the spread of data.
* Outlier Detection: Helps identify extreme values.
* Machine Learning: Used in feature engineering, normalization, and decision tree algorithms.
* Risk Analysis: Used in finance (e.g., Value at Risk, VaR).
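
To make these definitions concrete, here is a small local (non-federated) sketch using numpy on synthetic data:

```python
import numpy as np

data = np.random.normal(loc=50, scale=10, size=10_000)  # synthetic feature values

q1, median, q3 = np.percentile(data, [25, 50, 75])     # quartiles
deciles = np.percentile(data, np.arange(10, 100, 10))  # deciles
print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}")
```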

## Examples

We provide several examples to demonstrate how the operators should be used.
@@ -57,20 +90,21 @@ The main steps are

The detailed example instructions can be found in [Data frame statistics](df_stats/README.md).


### COVID-19 Radiology Image Examples

The second example provided is an image histogram example. Unlike the **Tabular** data example,
the image examples show the following:
* The [image_statistics.py](image_stats/jobs/image_stats/app/custom/image_statistics.py) only needs
to calculate the count and histogram target statistics. Users only need to provide the calculation count, failure_count and histogram functions. There is no need to implement other metrics functions
(sum, mean, std_dev etc.) (get_failure_count by default returns 0)
* For each site's dataset, there are several thousand images; the local histogram is an aggregate histogram of all the image histograms (see the sketch after this list)
* The image files are large, so we can't load everything into memory and then calculate the statistics.
We will need to iterate through files for each calculation. For a single feature, this is acceptable. If there are multiple features,
such as multiple channels, reloading images to memory for each channel to do histogram calculation is wasteful
* Unlike [Data frame statistics](df_stats/README.md), the histogram bin's global range is pre-defined by the user as [0, 256],
whereas in [Data frame statistics](df_stats/README.md), besides "Age", all other features' histogram global bin range
is dynamically estimated based on local min/max values
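
A minimal sketch of the aggregation idea described above, assuming single-channel images readable with Pillow (the directory layout and file naming are illustrative):

```python
from pathlib import Path

import numpy as np
from PIL import Image  # assumes images are readable with Pillow


def local_image_histogram(image_dir: str, bins: int = 256, value_range=(0, 256)) -> np.ndarray:
    """Aggregate per-image histograms without holding all images in memory."""
    total = np.zeros(bins, dtype=np.int64)
    for path in sorted(Path(image_dir).glob("*.png")):  # iterate file by file
        pixels = np.asarray(Image.open(path)).ravel()   # single-channel pixels
        counts, _ = np.histogram(pixels, bins=bins, range=value_range)
        total += counts  # the site's local histogram is the sum over images
    return total
```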

An example of an image histogram (the underlying image files have only 1 channel)
@@ -155,6 +189,7 @@ The main steps are
* provide client-side configuration to specify the data input location
* provide a hierarchy specification file providing details about all the clients and their hierarchy.


## Privacy Policy and Privacy Filters

NVFLARE provides data privacy protection through privacy filters [privacy-management](https://nvflare.readthedocs.io/en/main/user_guide/security/site_policy_management.html#privacy-management)
@@ -178,22 +213,21 @@ defined and job doesn't specify the privacy scope, the job deployment will fail,

### Privacy Policy Instrumentation

There are different ways to set privacy filters depending on the use cases:

#### Set Privacy Policy as researcher

You can set the "task_result_filters" in config_fed_client.json to specify
the privacy control. This is useful when you are developing these filters.

#### Setup site privacy policy as org admin

Once the company decides to implement certain privacy policies independent of individual
jobs, one can copy the local directory privacy.json content to clients' local privacy.json (merge, not overwrite).
In this example, since we only have one app, we can simply copy the privacy.json from the local directory to:

* site-1/local/privacy.json
* site-2/local/privacy.json

We need to remove the same filters from the job definition in config_fed_client.json
by simply setting `"task_result_filters": []` (an empty list) to avoid **double filtering**.
@@ -304,10 +338,7 @@ sequenceDiagram




## Summary

We provided federated statistics operators that can easily aggregate and visualize the local statistics for
different data sites and features. We hope this feature will make it easier to perform federated data analysis.
59 changes: 56 additions & 3 deletions examples/advanced/federated-statistics/df_stats/README.md
@@ -17,6 +17,52 @@ cd NVFlare/examples/advanced/federated-statistics/df_stats
pip install -r requirements.txt
```


## Install fastdigest

If you intend to calculate quantiles, you need to install fastdigest.

```
pip install fastdigest==0.4.0
```

On Ubuntu, you might get the following error:

    Cargo, the Rust package manager, is not installed or is not on PATH.
    This package requires Rust and Cargo to compile extensions. Install it through
    the system's package manager or via https://rustup.rs/

    Checking for Rust toolchain....

This is because fastdigest (or its dependencies) requires Rust and Cargo to build.

To install Rust and Cargo on your Ubuntu system, run the provided script, which installs Rust via rustup:

```
cd NVFlare/examples/advanced/federated-statistics/df_stats
./install_cargo.sh
```

Then you can install fastdigest again:
```
pip install fastdigest==0.4.0
```

### Quantile Calculation

To calculate federated quantiles, we needed to select a package that satisfies the following constraints:

* Works in distributed systems
* Does not copy the original data (avoiding privacy leaks)
* Avoids transmitting large amounts of data
* Ideally, no system-level dependency

We chose the fastdigest Python package, a Rust-based package. The t-digest only carries the cluster coordinates; initially, each data point is in its own cluster. By default, we compress with max_bins = sqrt(datasize), so the original data points are not transmitted and the data won't leak. You can always override max_bins if you prefer more or less compression.
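
The merge property is what makes the t-digest federation-friendly: each site shares only its digest (cluster centroids and weights), and the server combines the digests. Below is a conceptual sketch using the pure-Python `tdigest` package's API for illustration; fastdigest, the Rust-backed package used in this example, implements the same t-digest idea:

```python
import numpy as np
from tdigest import TDigest  # pip install tdigest; illustrative stand-in for fastdigest

# each site builds a digest over its local data only
site1, site2 = TDigest(), TDigest()
site1.batch_update(np.random.normal(40, 10, 1000))  # synthetic site-1 values
site2.batch_update(np.random.normal(55, 12, 1500))  # synthetic site-2 values

# the server merges the digests -- no raw data points are transmitted
global_digest = site1 + site2
for p in (25, 50, 75):
    print(f"p{p}: {global_digest.percentile(p):.1f}")
```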



## 1. Prepare data

In this example, we are using UCI (University of California, Irvine) [adult dataset](https://archive.ics.uci.edu/dataset/2/adult)
@@ -165,8 +211,12 @@ statistics computing, we will only need to provide the following
"stddev": {},
"histogram": { "*": {"bins": 10 },
"Age": {"bins": 5, "range":[0,120]}
}
},
"quantile": {
"*": [25, 50, 75]
}
},
"writer_id": "stats_writer"
}
}
@@ -195,7 +245,8 @@ in FLARE job store.

### 5.2 client side configuration

First, we specify the built-in client-side executor: `StatisticsExecutor`, which takes a local stats generator ID.


```
"executor": {
```

@@ -248,7 +299,7 @@ In this example, task_result_filters is defined as a task privacy filter: `StatisticsPrivacyFilter`
`StatisticsPrivacyFilter` uses three separate `StatisticsPrivacyCleanser` components; you can find more details in
[local privacy policy](../local/privacy.json) and in the later discussion on privacy.

The policies that the privacy cleansers apply can be found in:
```
"components": [
{
@@ -311,6 +362,8 @@ to calculate the local statistics, we will need to implement a few methods
def histogram(self, dataset_name: str, feature_name: str, num_of_bins: int, global_min_value: float, global_max_value: float) -> Histogram:
def quantiles(self, dataset_name: str, feature_name: str, percentiles: List) -> Dict:
```
Since some features do not provide a histogram bin range, we will need to use the local min/max to estimate
the global min/max, and then use the global min/max as the histogram bin range for all clients.
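
A minimal sketch of how these two methods might be implemented over pre-loaded pandas DataFrames. The `Bin`/`Histogram`/`HistogramType` imports are assumed from `nvflare.app_common.abstract.statistics_spec`, and plain numpy stands in for the t-digest-based quantile computation used by the real example:

```python
from typing import Dict, List

import numpy as np

# assumed import path; check your NVFLARE version
from nvflare.app_common.abstract.statistics_spec import Bin, Histogram, HistogramType


def histogram(
    self, dataset_name: str, feature_name: str, num_of_bins: int, global_min_value: float, global_max_value: float
) -> Histogram:
    feature = self.data[dataset_name][feature_name].dropna()
    counts, edges = np.histogram(feature, bins=num_of_bins, range=(global_min_value, global_max_value))
    bins = [Bin(float(edges[i]), float(edges[i + 1]), int(counts[i])) for i in range(num_of_bins)]
    return Histogram(HistogramType.STANDARD, bins)


def quantiles(self, dataset_name: str, feature_name: str, percentiles: List) -> Dict:
    # numpy shown for clarity; assumes percentiles are given as fractions (e.g. 0.5)
    feature = self.data[dataset_name][feature_name].dropna()
    return {p: float(np.quantile(feature, p)) for p in percentiles}
```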
@@ -37,7 +37,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "c44a0217",
"metadata": {
"tags": []
@@ -81,7 +81,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"id": "93c62d5e",
"metadata": {
"tags": []
@@ -271,9 +271,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "nvflare_example",
"display_name": "nvflare-env",
"language": "python",
"name": "nvflare_example"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -285,7 +285,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
"version": "3.8.13"
}
},
"nbformat": 4,
15 changes: 15 additions & 0 deletions examples/advanced/federated-statistics/df_stats/install_cargo.sh
@@ -0,0 +1,15 @@

# fastdigest (or its dependencies) requires Rust and Cargo to build.
# You need to install Rust and Cargo on your Ubuntu system. Follow these steps:
# Install Rust and Cargo
# Run the following command to install Rust using rustup:


curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Then restart your terminal or run:

source $HOME/.cargo/env
# Verify Installation
# Check if Rust and Cargo are installed correctly:
rustc --version
cargo --version
@@ -21,10 +21,10 @@


class DFStatistics(DFStatisticsCore):
    def __init__(self, filename, data_root_dir="/tmp/nvflare/df_stats/data"):
        super().__init__()
        self.data_root_dir = data_root_dir
        self.filename = filename
        self.data: Optional[Dict[str, pd.DataFrame]] = None
        self.data_features = [
            "Age",
@@ -57,7 +57,7 @@ def load_data(self, fl_ctx: FLContext) -> Dict[str, pd.DataFrame]:
        self.log_info(fl_ctx, f"load data for client {client_name}")
        try:
            skip_rows = self.skip_rows[client_name]
            data_path = f"{self.data_root_dir}/{fl_ctx.get_identity_name()}/{self.filename}"
            # example of load data from CSV
            df: pd.DataFrame = pd.read_csv(
                data_path, names=self.data_features, sep=r"\s*,\s*", skiprows=skip_rows, engine="python", na_values="?"
@@ -20,9 +20,9 @@

def define_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--n_clients", type=int, default=2)
    parser.add_argument("-d", "--data_root_dir", type=str, nargs="?", default="/tmp/nvflare/df_stats/data")
    parser.add_argument("-o", "--stats_output_path", type=str, nargs="?", default="statistics/adults_stats.json")
    parser.add_argument("-j", "--job_dir", type=str, nargs="?", default="/tmp/nvflare/jobs/stats_df")
    parser.add_argument("-w", "--work_dir", type=str, nargs="?", default="/tmp/nvflare/jobs/stats_df/work_dir")
    parser.add_argument("-co", "--export_config", action="store_true", help="config only mode, export config")
@@ -45,12 +45,11 @@ def main():
"mean": {},
"sum": {},
"stddev": {},
"histogram": {"*": {"bins": 20}},
"Age": {"bins": 20, "range": [0, 10]},
"percentile": {"*": [25, 50, 75], "Age": [50, 95]},
"histogram": {"*": {"bins": 20}, "Age": {"bins": 20, "range": [0, 100]}},
"quantile": {"*": [0.1, 0.5, 0.9], "Age": [0.1, 0.5, 0.9]},
}
# define local stats generator
df_stats_generator = DFStatistics(data_root_dir=data_root_dir)
df_stats_generator = DFStatistics(filename="data.csv", data_root_dir=data_root_dir)

    job = StatsJob(
        job_name="stats_df",
@@ -63,6 +62,7 @@ job.setup_clients(sites)
    job.setup_clients(sites)

    if export_config:
        print("Exporting job config...", job_dir)
        job.export_job(job_dir)
    else:
        job.simulator_run(work_dir)
@@ -19,7 +19,7 @@
"range": [0,120]
}
},
"percentile": {
"quantile": {
"*": [25, 50, 75]
}
},
@@ -2,4 +2,4 @@ numpy
pandas
matplotlib
jupyterlab

1 change: 0 additions & 1 deletion examples/advanced/streaming/src/simple_controller.py
@@ -25,7 +25,6 @@


class SimpleController(Controller):
    def control_flow(self, abort_signal: Signal, fl_ctx: FLContext):
        logger.info(f"Entering control loop of {self.__class__.__name__}")
        engine = fl_ctx.get_engine()
