diff --git a/docs/example_applications_algorithms.rst b/docs/example_applications_algorithms.rst index cfe9f54203..cbc0ba4bb2 100644 --- a/docs/example_applications_algorithms.rst +++ b/docs/example_applications_algorithms.rst @@ -23,58 +23,66 @@ NVIDIA FLARE has several tutorials and examples to help you get started with fed The following tutorials and quickstart guides walk you through some of these examples: - 1. **Hello World Examples** which can be run from the `hello_world notebook `_. + 1. **Hello World** introduction to NVFlare. - 1.1. Workflows + 1.1. Deep Learning to Federated Learning + * `Deep Learning to Federated Learning (GitHub) `_ - Example for converting Deep Learning (DL) to Federated Learning (FL). + + 1.2. Step-by-Step Examples + * `Step-by-Step Examples (GitHub) `__ - Step-by-step examples for running a federated learning project with NVFlare. + + 2. **Hello World Examples** which can be run from the `hello_world notebook `_. + + 2.1. Workflows * :ref:`Hello Scatter and Gather ` - Example using the Scatter And Gather (SAG) workflow with a Numpy trainer * :ref:`Hello Cross-Site Validation ` - Example using the Cross Site Model Eval workflow with a Numpy trainer * `Hello Cyclic Weight Transfer (GitHub) `_ - Example using the CyclicController workflow to implement `Cyclic Weight Transfer `_ with TensorFlow as the deep learning training framework - 1.2. Deep Learning + 2.2. Deep Learning * :ref:`Hello PyTorch ` - Example image classifier using FedAvg and PyTorch as the deep learning training framework * :ref:`Hello TensorFlow ` - Example image classifier using FedAvg and TensorFlow as the deep learning training frameworks - 2. **Tutorial notebooks** + 3. **Tutorial notebooks** * `Intro to the FL Simulator `_ - Shows how to use the :ref:`fl_simulator` to run a local simulation of an NVFLARE deployment to test and debug an application without provisioning a real FL project. 
* `Hello FLARE API `_ - Goes through the different commands of the :ref:`flare_api` to show the syntax and usage of each. * `NVFLARE in POC Mode `_ - Shows how to use :ref:`POC mode ` to test the features of a full FLARE deployment on a single machine. - 3. **FL algorithms** + 4. **FL algorithms** * `Federated Learning with CIFAR-10 (GitHub) `_ - Includes examples of using FedAvg, FedProx, FedOpt, SCAFFOLD, homomorphic encryption, and streaming of TensorBoard metrics to the server during training * :ref:`Federated XGBoost ` - Includes examples of histogram-based and tree-based algorithms. Tree-based algorithms also includes bagging and cyclic approaches - 4. **Traditional ML examples** + 5. **Traditional ML examples** * `Federated Linear Model with Scikit-learn (GitHub) `_ - For an example of using NVIDIA FLARE with `scikit-learn `_, a widely used open-source machine learning library that supports supervised and unsupervised learning. * `Federated K-Means Clustering with Scikit-learn (GitHub) `_ - NVIDIA FLARE with `scikit-learn `_ and k-Means. * `Federated SVM with Scikit-learn (GitHub) `_ - NVIDIA FLARE with `scikit-learn `_ and `SVM `_. * `Federated Learning for Random Forest based on XGBoost (GitHub) `_ - Example of using NVIDIA FLARE with `scikit-learn `_ and `Random Forest `_. - 5. **Medical Image Analysis** + 6. **Medical Image Analysis** * `MONAI Integration (GitHub) `_ - For an example of using NVIDIA FLARE to train a 3D medical image analysis model using federated averaging (FedAvg) and MONAI Bundle `MONAI `_ * `Federated Learning with Differential Privacy for BraTS18 segmentation (GitHub) `_ - Illustrates the use of differential privacy for training brain tumor segmentation models using federated learning * `Federated Learning for Prostate Segmentation from Multi-source Data (GitHub) `_ - Example of training a multi-institutional prostate segmentation model using `FedAvg `_, `FedProx `_, and `Ditto `_ - 6. **Federated Statistics** + 7. 
**Federated Statistics** * :ref:`Federated Statistic Overview ` - Discuss the overall federated statistics features * `Federated Statistics for medical imaging (Github) `_ - Example of gathering local image histogram to compute the global dataset histograms. * `Federated Statistics for tabular data with DataFrame (Github) `_ - Example of gathering local statistics summary from Pandas DataFrame to compute the global dataset statistics. * `Federated Statistics with Monai Statistics integration for Spleen CT Image (Github) `_ - Example demonstrated Monai statistics integration and few other features in federated statistics - 7. **Federated Site Policies** + 8. **Federated Site Policies** * `Federated Policies (Github) `_ - Discuss the federated site policies for authorization, resource and data privacy management - 8. **Experiment tracking** + 9. **Experiment tracking** * :ref:`FL Experiment Tracking with TensorBoard Streaming ` - Example building on Hello PyTorch with TensorBoard streaming from clients to server * :ref:`FL Experiment Tracking with MLflow ` - Example integrating Hello PyTorch with MLflow with streaming from clients to server - 9. **NLP** + 10. **NLP** * `NLP-NER (Github) `_ - Illustrates both `BERT `_ and `GPT-2 `_ models from `Hugging Face `_ (`BERT-base-uncased `_, `GPT-2 `_) on a Named Entity Recognition (NER) task using the `NCBI disease dataset `_. 
diff --git a/docs/examples/hello_world_examples.rst b/docs/examples/hello_world_examples.rst index 5cbf5f22c5..88b3d87fae 100644 --- a/docs/examples/hello_world_examples.rst +++ b/docs/examples/hello_world_examples.rst @@ -5,8 +5,11 @@ Hello World examples can be run from the `hello_world notebook + Step-by-Step Examples (GitHub) hello_scatter_and_gather hello_cross_val Hello Cyclic Weight Transfer (GitHub) hello_pt hello_tf2 + Hello Client Controlled Workflow (GitHub) diff --git a/docs/faq.rst b/docs/faq.rst index 300861ca1a..bba01fa0bd 100644 --- a/docs/faq.rst +++ b/docs/faq.rst @@ -181,22 +181,22 @@ Operational #. What is the difference between the Admin client and the FL client? - The :ref:`Admin client ` is used to control the state of the server's controller workflow and only interacts with the + The :ref:`FLARE Console ` is used to control the state of the server's controller workflow and only interacts with the server. FL clients poll the server and perform tasks based on the state of the server. The Admin client does not interact directly with FL client. #. Where does the Admin client run? - The :ref:`Admin client ` runs as a standalone process, typically on a researcher's workstation or laptop. + The :ref:`FLARE Console ` runs as a standalone process, typically on a researcher's workstation or laptop. #. What can you do with the Admin client? - The :ref:`Admin client ` is used to orchestrate the FL study, including starting and stopping server + The :ref:`FLARE Console ` is used to orchestrate the FL study, including starting and stopping server and clients, deploying applications, and managing FL experiments. #. How can I get the global model at the end of training? What can I do to resolve keys not matching with the model defined? 
- You can use the download_job command with the :ref:`Admin client ` to get the job result into the admin + You can use the download_job command with the :ref:`FLARE Console ` to get the job result into the admin transfer folder. The model is saved in a dict depending on the persistor you used, so you might need to access it with ``model.load_state_dict(torch.load(path_to_model)["model"])`` if you used PTFileModelPersistor because PTModelPersistenceFormatManager saves the model under the key "model". diff --git a/docs/programming_guide.rst b/docs/programming_guide.rst index 9ec5169a67..ecd029649b 100644 --- a/docs/programming_guide.rst +++ b/docs/programming_guide.rst @@ -41,7 +41,6 @@ Please refer to :ref:`application` for more details. programming_guide/data_exchange_object programming_guide/fl_context programming_guide/fl_component - programming_guide/serialization programming_guide/filters programming_guide/event_system programming_guide/component_configuration diff --git a/docs/programming_guide/serialization.rst b/docs/programming_guide/serialization.rst index 436491fb9d..b951886789 100644 --- a/docs/programming_guide/serialization.rst +++ b/docs/programming_guide/serialization.rst @@ -1,7 +1 @@ -.. _serialization: - -Serialization -============= - -Due to security concerns, `pickle ` has been replaced with FOBS (Flare object serialization) in NVFlare to exchange data between the server and clients. -See ``_ for usage guidelines. +See :ref:`serialization`. 
\ No newline at end of file diff --git a/docs/programming_guide/system_architecture.rst b/docs/programming_guide/system_architecture.rst index 55390b8506..b4c6ae9084 100644 --- a/docs/programming_guide/system_architecture.rst +++ b/docs/programming_guide/system_architecture.rst @@ -15,7 +15,7 @@ Concepts and System Components Spec-based Programming for System Service Objects ================================================= -NVIDIA FLARE 2.1.0 needs additional services to implement the HA feature: +NVIDIA FLARE needs additional services to implement the HA feature: storage, overseer, job definition management, etc. There are many ways to implement such services. For example, storage could be implemented with a file system, AWS S3, or some database technologies. Similarly, job definition management could be done with simple file reading or a sophisticated solution with a database or search engine. @@ -34,13 +34,13 @@ See the example :ref:`project_yml` for how these components are configured in St Overseer -------- -The Overseer is a system component newly introduced in 2.1.0 that determines the hot FL server at any time for high availability. +The Overseer is a system component that determines the hot FL server at any time for high availability. The name of the Overseer must be unique and in the format of fully qualified domain names. During provisioning time, if the name is specified incorrectly, either being duplicate or containing incompatible characters, the provision command will fail with an error message. It is possible to use a unique hostname rather than FQDN, with the IP mapped to the hostname by having it added to ``/etc/hosts``. -NVIDIA FLARE 2.1.0 comes with HTTPS-based overseer. Users are welcome to change the name and port arguments of the overseer +NVIDIA FLARE comes with an HTTPS-based overseer. Users are welcome to change the name and port arguments of the overseer in project.yml to fit their deployment environment. 
The Overseer will receive a Startup kit, which includes the start.sh shell script, its certificate and private key, @@ -66,7 +66,7 @@ their own Overseer Agent. NVIDIA FLARE provides two implementations: - :class:`HttpOverseerAgent` to work with the Overseer server. For NVIDIA - FLARE 2.1.0, the provisioning tool will automatically map parameters specified in Overseer into the arguments for + FLARE, the provisioning tool will automatically map parameters specified in Overseer into the arguments for the HttpOverseerAgent. - :class:`DummyOverseerAgent` is a dummy agent that simply returns the configured endpoint as the hot FL server. The dummy agent is used when a single FL server is configured diff --git a/docs/real_world_fl.rst b/docs/real_world_fl.rst index ed8190cdf1..c2e8cebd8c 100644 --- a/docs/real_world_fl.rst +++ b/docs/real_world_fl.rst @@ -14,6 +14,8 @@ help gather information to provision a project and distribute startup kits, see For more details on what you can do with apps with custom components and the flexibility that the Controller and Worker APIs bring, see the :ref:`programming_guide`. +For setting up authorization policies, see :ref:`federated authorization `. 
+ You can also see some `example applications `_ integrating with `Clara Train `_ and `MONAI `_ diff --git a/docs/resources/hub_site.png b/docs/resources/hub_site.png new file mode 100644 index 0000000000..f025fd6bb4 Binary files /dev/null and b/docs/resources/hub_site.png differ diff --git a/docs/resources/systems_multiple_hierarchies.png b/docs/resources/systems_multiple_hierarchies.png new file mode 100644 index 0000000000..a72f8760c0 Binary files /dev/null and b/docs/resources/systems_multiple_hierarchies.png differ diff --git a/docs/resources/t2_job_creation.png b/docs/resources/t2_job_creation.png new file mode 100644 index 0000000000..607d952558 Binary files /dev/null and b/docs/resources/t2_job_creation.png differ diff --git a/docs/user_guide.rst b/docs/user_guide.rst index ea1391040d..8e45de9fe8 100644 --- a/docs/user_guide.rst +++ b/docs/user_guide.rst @@ -22,10 +22,9 @@ which are explained in more detail in their own sections linked below. user_guide/dashboard_api user_guide/dashboard_ui user_guide/nvflare_security - user_guide/federated_authorization user_guide/site_policy_management - user_guide/authorization_policy_previewer user_guide/docker_compose user_guide/helm_chart user_guide/logging_configuration user_guide/confidential_computing + user_guide/hierarchy_unification_bridge diff --git a/docs/user_guide/authorization_policy_previewer.rst b/docs/user_guide/authorization_policy_previewer.rst index 0cfd067c0f..886569d954 100644 --- a/docs/user_guide/authorization_policy_previewer.rst +++ b/docs/user_guide/authorization_policy_previewer.rst @@ -1,63 +1 @@ -.. _authorization_policy_previewer: - -****************************** -Authorization Policy Previewer -****************************** - -Authorization is an important security feature of NVFLARE. In NVFLARE 2.2, each site defines its own authorization policy. 
Since authorization policy is vital for system security, and many people can now define policies, it's important to be able to validate the policies before deploying them to production. - -The Authorization Policy Previewer is a tool for validating authorization policy definitions. The tool provides an interactive user interface and commands for the user to validate different aspects of policy definitions: - - - Show defined roles and rights - - Show the content of the policy definition - - Show the permission matrix (role/right/conditions) - - Evaluate a right against a specified user - -Start Authorization Policy Previewer -====================================== -To start the Authorization Policy Previewer, enter this command on a terminal: - -.. code-block:: shell - - nvflare authz_preview -p - -The authorization_policy_file must be a JSON file that follows authorization file format. - -If the file is not a valid JSON file or does not follow authorization file format, this command will exit with exception. - -Execute Authorization Policy Previewer Commands -================================================ -If the Authorization Policy Previewer is successfully started, the prompt ">" will be displayed and for command input. - -To get the complete list of commands, enter "?" on the prompt. - -Most commands are self-explanatory, except for the "eval_right". With this command, you can evaluate a specified right against a specified user (name:org:role) to make sure the result is correct. - -Role Rights -=========== -Most permissions in the policy file may be defined with Command Categories. However, once the policy file is loaded, categories are already resolved to individual commands, following the fallback mechanism. - -Use the ``show_role_rights command`` to verify that all commands have the right permissions for all roles. - -Evaluate a Right -================ -The syntax of the ``eval_right`` command is: - -.. 
code-block:: shell - - eval_right site_org right_name user_name:org:role [submitter_name:org:role] - -where: - -.. code-block:: - - site_org - the organization of the site - right_name - the right to be evaluated. You can use the "show_rights" command to list all available commands. - User specification - a user spec has three pieces of information separated by colons. Name is the name of the user; org is the organization that the user belongs to; and role is the user's role. You can use the "show_roles" command to list all available roles. - Submitter specification - some job related commands can evaluate the relation between the user and the submitter of a job. Submitter spec has the same format as user spec. - -Please refer to :ref:`Federated Authorization ` for details on the right definition and evaluation. - -Stop Authorization Policy Previewer -====================================== -To exit from the Authorization Policy Previewer, enter the "bye" command at the prompt. +See :ref:`authorization_policy_previewer`. diff --git a/docs/user_guide/federated_authorization.rst b/docs/user_guide/federated_authorization.rst index afb8068c05..a3eae41b7a 100644 --- a/docs/user_guide/federated_authorization.rst +++ b/docs/user_guide/federated_authorization.rst @@ -1,220 +1 @@ -.. _federated_authorization: - -######################### -Federated Authorization -######################### - -Federated learning is conducted over computing resources owned by different organizations. Naturally these organizations have concerns about their computing resources being misused or abused. Even if an NVFLARE docker is trusted by participating orgs, researchers can still bring their own custom code to be part of a study (BYOC), which could be a big concern to many organizations. In addition, organizations may also have IP (intellectual property) requirements on the studies performed by their own researchers. 
- -NVFLARE comes with an authorization system that can help address these security concerns and IP requirements. With this system, an organization can define strict policy to control access to their computing resources and/or FL jobs. - -Here are some examples that an org can do: - - - Restrict BYOC to only the org's own researchers; - - Allow jobs only from its own researchers, or from specified other orgs, or even from specified trusted other researchers; - - Totally disable remote shell commands on its sites - - Allow the "ls" shell command but disable all other remote shell commands - -Centralized vs. Federated Authorization -======================================== -In NVFLARE before version 2.2.1, the authorization policy was centrally enforced by the FL Server. In a true federated environment, each organization should be able to define and enforce their own authorization policy instead of relying others (such as FL Server that is owned by a separate org) to do so. - -NVFLARE 2.2.1 changes the way authorization is implemented to federated authorization where each organization defines and enforces its own authorization policy: - - - Each organization defines its policy in its own authorization.json (in the local folder of the workspace) - - This locally defined policy is loaded by FL Clients owned by the organization - - The policy is also enforced by these FL Clients - -This decentralized authorization has an added benefit: since each organization takes care of its own authorization, there will be no need to update the policy of any other participants (FL Server or Clients) when a new orgs or clients are added. - -See `Federated Policies (Github) `_ for a working example with federated site policies for authorization. - -Simplified Authorization Policy Configuration -============================================== -Since each organization defines its own policy, there will be no need to centrally define all orgs and users. 
The policy configuration for an org is simply a matrix of role/right permissions. Each role/right combination in the permission matrix answers this question: what kind of users of this role can have this right? - -To answer this question, the role/right combination defines one or more conditions, and the user must meet one of these conditions to have the right. The set of conditions is called a control. - -Roles ------ -Users are classified into roles. NVFLARE defines four roles starting in 2.2.1: - - - Project Admin - this role is responsible for the whole FL project; - - Org Admin - this role is responsible for the administration of all sites in its org. Each org must have one Org Admin; - - Lead (researcher) - this role conducts FL studies - - Member (researcher) - this role observes the FL study but cannot submit jobs - -Rights ------- -NVFLARE 2.2.1 supports more accurate right definitions to be more flexible: - - - Each server-side admin command is a right! This makes it possible for an org to control each command explicitly; - - Admin commands are grouped into categories. For example, commands like abort_job, delete_job, start_app are in manage_job category; all shell commands are put into the shell_commands category. Each category is also a right. - - BYOC is now defined as a right so that some users are allowed to submit jobs with BYOC whereas some are not. - -This right system makes it easy to write simple policies that only use command categories. It also makes it possible to write policies to control individual commands. When both categories and commands are used, command-based control takes precedence over category-based control. - -See :ref:`command_categories` for command categories. - -Controls and Conditions ------------------------ -A *control* is a set of one or more conditions that is specified in the permission matrix. Conditions specify relationships among the subject user, the site, and the job submitter. 
The following are supported relationships: - - - The user belongs to the site's organization (user org = site org) - - The user is the job submitter (user name = submitter name) - - The user and the job submitter are in the same org (user org = submitter org) - - The user is a specified person (user name = specified name) - - The user is in a specified org (user org = specified org) - -Keep in mind that the relationship is always relative to the subject user - we check to see whether the user's name or org has the right relationship with the site or job submitter. - -Since conditions need to be expressed in the policy definition file (authorization.json), some concise and consistent notations are needed. The following are the notations for these conditions: - -.. csv-table:: - :header: Notation,Condition,Examples - :widths: 15, 20, 15 - - o:site,The user belongs to the site's organization - n:submitter,The user is the job submitter - o:submitter,The user and the job submitter belong to the same org - n:,The user is a specified person,n:john@nvidia.com - o:,The user is in a specified org,o:nvidia - -The words "site" and "submitter" are reserved. - -In addition, two words are used for extreme conditions: - - - Any user is allowed: any - - No user is allowed: none - -See :ref:`sample_auth_policy` for an example policy. - -Policy Evaluation ------------------ -Policy evaluation is to answer the question: is the user allowed to do this command? - -The following is the evaluation algorithm: - - - If a control is defined for this command and user role, then this control will be evaluated; - - Otherwise, if the command belongs to a category and a control is defined for the category and user role, then this control will be evaluated; - - Otherwise, return False - -As a shorthand, if the control is the same for all rights for a role, you can specify a control for a role without explicitly specifying rights one by one. 
For example, this is used for the "project_admin" role since this role can do everything. - -Command Authorization Process ------------------------------ -We know that users operate NVFLARE systems with admin commands via the FLARE Console. But when a user issues a command, how does authorization happen throughout the system? In NVFLARE 2.1 and before, the authorization policy is evaluated and enforced by the FL Server that processes the command. But in NVFLARE 2.2, this is totally changed. - -The command is still received by the FL Server. If the command only involves the Server, then the server's authorization policy is evaluated and enforced. If the command involves FL clients, then the command will be sent to those clients without any authorization evaluation on the server. When a client receives the command, it will evaluate its own authorization policy. The client will execute the command only if it passes authorization. It is therefore possible that some clients accept the command whereas some other clients do not. - -If a client rejects the command, it will return "authorization denied" error back to the server. - -Job Submission -^^^^^^^^^^^^^^ -Job submission is a special and important function in NVFLARE. The researcher uses the "submit_job" command to submit a job. But the job is not executed until it is scheduled and deployed later. Note that when the job is scheduled, the user may or may not be even online. - -Job authorization will be done in two places. When the job is submitted, only the Server will evaluate the "submit_job" right. If allowed, the job will be accepted into the Job Store. When the job is later scheduled for execution, all sites (FL Server and Clients) involved in the job will evaluate "submit_job" again based on its own authorization policy. If the job comes with custom code, the "byoc" right will also be evaluated. The job will be rejected if either right fails. 
- -Hence it is quite possible that the job is accepted at submission time, but cannot run due to authorization errors from FL clients. - -You may ask why we don't check authorization with each involved FL client at the time of job submission. There are three considerations: - -1) This will make the system more complicated since the server would need to interact with the clients -2) At the time of submission, some or all of the FL clients may not even be online -3) A job's clients could be open-ended in that it will be deployed to all available clients. The list of available clients could be different by the time the job is scheduled for execution. - -Job Management Commands -^^^^^^^^^^^^^^^^^^^^^^^ -There are multiple commands (clone_job, delete_job, download_job, etc.) in the "manage_jobs" category. Such commands are executed on the Server only and do not involve any FL clients. Hence even if an organization defines controls for these commands, these controls will have no effect. - -Job management command authorization often evaluates the relationship between the subject user and the job submitter, as shown in the examples. - -.. _command_categories: - -Appendix One - Command Categories -================================= - -.. 
code-block:: python - - class CommandCategory(object): - - MANAGE_JOB = "manage_job" - OPERATE = "operate" - VIEW = "view" - SHELL_COMMANDS = "shell_commands" - - - COMMAND_CATEGORIES = { - AC.ABORT: CommandCategory.MANAGE_JOB, - AC.ABORT_JOB: CommandCategory.MANAGE_JOB, - AC.START_APP: CommandCategory.MANAGE_JOB, - AC.DELETE_JOB: CommandCategory.MANAGE_JOB, - AC.DELETE_WORKSPACE: CommandCategory.MANAGE_JOB, - - AC.CHECK_STATUS: CommandCategory.VIEW, - AC.SHOW_STATS: CommandCategory.VIEW, - AC.RESET_ERRORS: CommandCategory.VIEW, - AC.SHOW_ERRORS: CommandCategory.VIEW, - AC.LIST_JOBS: CommandCategory.VIEW, - - AC.SYS_INFO: CommandCategory.OPERATE, - AC.RESTART: CommandCategory.OPERATE, - AC.SHUTDOWN: CommandCategory.OPERATE, - AC.REMOVE_CLIENT: CommandCategory.OPERATE, - AC.SET_TIMEOUT: CommandCategory.OPERATE, - AC.CALL: CommandCategory.OPERATE, - - AC.SHELL_CAT: CommandCategory.SHELL_COMMANDS, - AC.SHELL_GREP: CommandCategory.SHELL_COMMANDS, - AC.SHELL_HEAD: CommandCategory.SHELL_COMMANDS, - AC.SHELL_LS: CommandCategory.SHELL_COMMANDS, - AC.SHELL_PWD: CommandCategory.SHELL_COMMANDS, - AC.SHELL_TAIL: CommandCategory.SHELL_COMMANDS, - } - - -.. _sample_auth_policy: - -Appendix Two - Sample Policy with Explanations -============================================== - -This is an example authorization.json (in the local folder of the workspace for a site). - -.. 
code-block:: shell - - { - "format_version": "1.0", - "permissions": { - "project_admin": "any", # can do everything on my site - "org_admin": { - "submit_job": "none", # cannot submit jobs to my site - "manage_job": "o:submitter", # can only manage jobs submitted by people in the user's own org - "download_job": "o:submitter", # can only download jobs submitted by people in the user's own org - "view": "any", # can do commands in the "view" category - "operate": "o:site", # can do commands in the "operate" category only if the user is in my org - "shell_commands": "o:site" # can do shell commands only if the user is in my org - }, - "lead": { - "submit_job": "any", # can submit jobs to my sites - "byoc": "o:site", # can submit jobs with BYOC to my sites only if the user is in my org - "manage_job": "n:submitter", # can only manage the user's own jobs - "view": "any", # can do commands in "view" category - "operate": "o:site", # can do commands in "operate" category only if the user is in my org - "shell_commands": "none", # cannot do shell commands on my site - "ls": "o:site", # can do the "ls" shell command if the user is in my org - "grep": "o:site" # can do the "grep" shell command if the user is in my org - }, - "member": { - "submit_job": [ - "o:site", # can submit jobs to my site if the user is in my org - "O:orgA", # can submit jobs to my site if the user is in org "orgA" - "N:john" # can submit jobs to my site if the user is "john" - ], - "byoc": "none", # cannot submit BYOC jobs to my site - "manage_job": "none", # cannot manage jobs - "download_job": "n:submitter", # can download user's own jobs - "view": "any", # can do commands in the "view" category - "operate": "none" # cannot do commands in "operate" category - } - } - } +See :ref:`federated_authorization`. 
\ No newline at end of file diff --git a/docs/user_guide/hierarchy_unification_bridge.rst b/docs/user_guide/hierarchy_unification_bridge.rst new file mode 100644 index 0000000000..1876ed1e14 --- /dev/null +++ b/docs/user_guide/hierarchy_unification_bridge.rst @@ -0,0 +1,546 @@ +.. _hierarchy_unification_bridge: + +############################ +Hierarchy Unification Bridge +############################ + +************************** +Background and Motivations +************************** +Users have been working on the idea of making multiple FL systems work together to train a common model. Each FL system has its own server(s) and clients, +implemented with the same or different FL frameworks. All these FL systems are managed by a central server, called FL Hub, which is responsible for +coordinating the FL systems to work together in an orderly fashion. + +This proposal requires all FL frameworks to follow a common interaction protocol, which is not yet defined. Hence as a first step, the scope is +reduced to make all FLARE-based FL systems work together. + +FLARE is designed to support institution-based collaboration. This means that the number of clients per system is limited (< 100). Hierarchy Unification Bridge (HUB) is a +solution that can support systems that exceed this limit by making multiple FLARE systems work together in a hierarchical manner. At the top of the hierarchy is the Root System, +which is just a regular FLARE system that has an FL Server and multiple FL Clients. Each client site can be a simple site that runs regular training, or it can be an independent +FLARE system that has its own server and clients. This scheme can repeat many times to form a hierarchy as deep as needed. + +****** +Design +****** +The key to implementing this logical hierarchy is making the lower tier system (Tier 2 or T2) a client of the upper tier system (Tier 1 or T1). + +The following diagram shows how this is done: + +.. 
image:: ../resources/hub_site.png + +In this diagram, green blocks represent components of the T1 system, and blue blocks represent components of the T2 system. Though the T1 and T2 +systems are independent of each other, they belong to the same organization. They do not have to be on the same VM, but they must be able to access +shared file systems. + +Here is the general process flow: + + - T1 Server tries to schedule and deploy a job to T1 Client, as usual + - T1 Client Root receives the job and tries to deploy it + - The Deployer creates a job for the T2 system based on the T1 job and the preconfigured information + - The Deployer writes the created job to T2's job store + - T2 Server Root schedules and deploys the T2 job as usual + - T1 Server starts the T1 job, which causes the T1 job to be started on the T1 Client Job cell + - Similarly, T2 Server starts the T2 job and creates the T2 Server Job cell + - Now the job is running + - The T1 Client Job cell and T2 Server Job cell communicate with each other via a File Pipe to exchange task data and task results + - The T1 Client Job cell and T1 Server Job cell exchange task data/results as usual + +********** +Challenges +********** + +There are two main challenges to this design: + + - How to turn a job from the T1 system into a job in the T2 system? + - What workflow controller should the T2 system run to preserve the semantics of the T1 system's control logic? + +These two questions are closely related. The job's control logic is ultimately determined by the workflow running on T1's server. +The control logic could be fairly complex. For example, the SAG controller determines the tasks to be performed by the clients, +the aggregator to be used, as well as the number of rounds to be executed. All systems must work together on a round-by-round basis, +meaning that all system clients must participate in the training for each round, and aggregation must happen at the T1 Server at the +end of each round. 
It is not the case that each system performs its own SAG for the whole job and then aggregates its final result at the T1 server. + +As we know, an FL Client has no control logic - it merely executes tasks assigned by the server and submits task results. Since the +T2 system acts like a client of the T1 system, its goal is to have assigned tasks executed properly by its own clients. The question +is: how does the T2 Server know how to assign a task to its own clients? For example, if the task assigned by the T1 server is simply "train", +how does the T2 Server know whether it should broadcast the "train" task to its own clients, or have it done in a relay fashion? In the case +of broadcast, what should be done with the results submitted by its clients? Should they be aggregated locally before being sent back to the +T1 system, or should they simply be collected and then sent back to the T1 system? + +************************* +Operation-Driven Workflow +************************* + +First, some terminology: + +FL Operation +============ +An FL Operation describes how an FL task is to be done. FLARE supports two types of operations: *broadcast* (bcast) and *relay*. + +The *broadcast* operation specifies all the attributes of the Controller's ``broadcast_and_wait`` method: min_targets, wait_time_after_min_received, +timeout, etc. In addition, it also specifies how the aggregation is to be done (an aggregator component ID). + +Similarly, the *relay* operation specifies all the attributes of the Controller's ``relay_and_wait`` method. In addition, it can also specify the +shareable generator and persistor component IDs. + +FL Operator +=========== +An Operator is a Python class that implements an operation. For each supported operation, there is an Operator that implements its semantics, +written with the Controller API. + +HUB Controller +-------------- +The HUB Controller runs in T2's Server Job cell to control the workflow.
It is a general-purpose, operation-based controller with simple control logic: + + - Receives task data from the T1 system (HubExecutor) + - Determines the operation to be performed based on task data headers and/or job config + - Finds the Operator for the requested operation + - Invokes the Operator to execute the operation + - Sends the result back to the requester + +HUB Executor +------------ +The HUB Executor runs in T1's Client Job cell. It works with the HUB Controller to get the assigned task done and return the result to the T1 server. + +HUB Executor/Controller Interaction +----------------------------------- +The HUB Executor and the HUB Controller use a file-based mechanism (called File Pipe) to interact with each other: + + - The Executor waits to receive a task from the T1 server. + - The Executor creates a file for the received Task Data, and waits for the Task Result file from the T2 system. + - The Controller reads the task data file, which contains a Shareable object. + - From the headers of the task data object and the preconfigured operation information, the Controller determines the FL operation to perform and finds the Operator for it. + - The Controller invokes the Operator to get the task performed by its own clients. + - The Controller waits for the results from the Operator and creates the Task Result file. + - The Executor reads the Task Result and sends it back to the T1 server. + +Essentially, this operation-based controller makes the T2 system an FL Operation Process Engine (FLOPE): it simply executes operations requested by another system. +This allows the actual FL control logic to be run anywhere. For example, a researcher could run the training loop on her own machine, and only send training operations to the T2 system for execution.
+ + +Job Modifications +----------------- +For the HUB to work, T1's client must be running the HUB Executor (instead of the regular client trainer), and T2's server must be running the +HUB Controller (instead of the regular workflow configured in T1's server). This requires modification of the T1 job for the T1 client, and creation of the T2 job for the T2 system: + + - T1's config_fed_client.json is replaced with the template that uses the HUB Executor for all tasks (hub_client.json). This template also defines the File Pipe to be used for communication with the HUB Controller on T2. + - T2's config_fed_client.json is the same as the original T1's config_fed_client.json. + - T2's config_fed_server.json is based on the template that defines the HUB Controller (hub_server.json). This template also defines the File Pipe to be used for communication with the HUB Executor on T1. + - T1's config_fed_server.json may need to contain operation descriptions for all tasks. This information is added to T2's config_fed_server.json, and is used by the HUB Controller to determine and invoke operators. + +The following diagram shows how the T2 job (in green) is created based on T1's original job (in blue) and augmented with hub_server.json. + +.. image:: ../resources/t2_job_creation.png + +The following are examples of these templates: + +hub_client.json +^^^^^^^^^^^^^^^ + +.. code-block:: json + + { + "format_version": 2, + "executors": [ + { + "tasks": [ + "*" + ], + "executor": { + "id": "Executor", + "path": "nvflare.app_common.hub.hub_executor.HubExecutor", + "args": { + "pipe_id": "pipe", + "task_wait_time": 600, + "result_poll_interval": 0.5 + } + } + } + ], + "components": [ + { + "id": "pipe", + "path": "nvflare.fuel.utils.pipe.file_pipe.FilePipe", + "args": { + "root_path": "/tmp/nvflare/hub/pipe/a" + } + } + ] + } + + +hub_server.json +^^^^^^^^^^^^^^^ + +..
code-block:: json + + { + "format_version": 2, + "workflows": [ + { + "id": "controller", + "path": "nvflare.app_common.hub.hub_controller.HubController", + "args": { + "pipe_id": "pipe", + "task_wait_time": 60, + "task_data_poll_interval": 0.5 + } + } + ], + "components": [ + { + "id": "pipe", + "path": "nvflare.fuel.utils.pipe.file_pipe.FilePipe", + "args": { + "root_path": "/tmp/nvflare/hub/pipe/a" + } + } + ] + } + +As shown in the templates, the File Pipe on both sides must be configured to use the same root path. + +T1 App Deployer and T2 Job Store +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +T1's app deployer must be replaced with the HubAppDeployer, which performs the job modification and creation described above. + +Once the App Deployer creates the T2 job, it must write the job into T2's job store. This requires the T1 client to have access to T2's job store. + +Both of these are achieved by modifications to T1's local resources: + +.. code-block:: json + + { + "format_version": 2, + "client": { + "retry_timeout": 30, + "compression": "Gzip" + }, + "components": [ + { + "id": "resource_manager", + "path": "nvflare.app_common.resource_managers.list_resource_manager.ListResourceManager", + "args": { + "resources": { + "gpu": [0, 1, 2, 3] + } + } + }, + { + "id": "resource_consumer", + "path": "nvflare.app_common.resource_consumers.gpu_resource_consumer.GPUResourceConsumer", + "args": {} + }, + { + "id": "job_manager", + "path": "nvflare.apis.impl.job_def_manager.SimpleJobDefManager", + "args": { + "uri_root": "/tmp/nvflare/hub/jobs/t2a", + "job_store_id": "job_store" + } + }, + { + "id": "job_store", + "path": "nvflare.app_common.storages.filesystem_storage.FilesystemStorage" + }, + { + "id": "app_deployer", + "path": "nvflare.app_common.hub.hub_app_deployer.HubAppDeployer" + } + ] + } + +In this example, the App Deployer configuration is at the bottom, and the job store access configuration consists of the two components above it.
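The job transformation described above can be sketched as a simple config rewrite. This is an illustrative sketch only; `build_t2_job` and its dict-based configs are hypothetical stand-ins, while the real HubAppDeployer works on job folders and also handles job signing and storage.

```python
def build_t2_job(t1_client_config, t1_server_config, hub_client_tpl, hub_server_tpl):
    """Derive the T2 job configs from the original T1 job, per the rules above."""
    t2_job = {
        # T2 clients run the original T1 client config unchanged
        "config_fed_client": dict(t1_client_config),
        # T2's server runs the HubController template...
        "config_fed_server": dict(hub_server_tpl),
    }
    # ...augmented with the operator descriptions from T1's server config
    if "operators" in t1_server_config:
        t2_job["config_fed_server"]["operators"] = t1_server_config["operators"]
    # T1's client config is replaced by the HubExecutor template
    new_t1_client_config = dict(hub_client_tpl)
    return new_t1_client_config, t2_job
```

The returned T2 job is what the deployer would then write into T2's job store.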
+ +Job Submission +^^^^^^^^^^^^^^ +The user simply submits a regular job to the T1 system and does not need to be concerned with how the job is executed across multiple +systems. The T2 systems are just clients of the job. Since T2 systems use operation-based controllers, they need to be able to determine the operations for +received tasks. This is where the user needs to provide additional information about which operation is to be used for each task. This is achieved by +defining operators in the config_fed_server.json of the job config: + +.. code-block:: json + + { + "format_version": 2, + "operators": { + "train": { + "method": "bcast", + "aggregator": "aggregator", + "timeout": 600, + "min_targets": 1 + }, + "submit_model": { + "method": "bcast", + "aggregator": "model_collector", + "timeout": 600, + "min_targets": 1 + }, + "validate": { + "method": "bcast", + "aggregator": "val_collector", + "timeout": 600, + "min_targets": 1 + } + }, + "components": [ + { + "id": "aggregator", + "path": "nvflare.app_common.aggregators.intime_accumulate_model_aggregator.InTimeAccumulateWeightedAggregator", + "args": { + "expected_data_kind": "WEIGHTS" + } + }, + { + "id": "model_collector", + "path": "nvflare.app_common.aggregators.dxo_collector.DXOCollector", + "args": {} + }, + { + "id": "val_collector", + "path": "nvflare.app_common.aggregators.dxo_collector.DXOCollector", + "args": {} + } + ] + } + +This example shows how to configure operators for the tasks ``train``, ``submit_model``, and ``validate``. Note that they all use the ``bcast`` method, but use different aggregation techniques. + +.. note:: + + Jobs for all HUB systems use the same job ID, created by the root system. This makes it easier to correlate the jobs across all systems. + +*********************** +How to Set Up HUB Sites +*********************** + +As shown above, a HUB site has two entities running: an FL Client for the T1 system and an FL Server for the T2 system.
The two entities must be able to access a shared file system, though they don't have to be on the same VM. + +You don't need to do anything special to T2's FL Server - it's just a normal FLARE system. All the setup effort is on T1's FL Client. + +Step 1: Create a client for the T1 system +========================================= +This is the normal provision and setup process of the T1 system. Once completed, you should have the client configuration (workspace, startup kit, local folder, etc.) created. + +Step 2: Modify "/local/resources.json" +================================================ + +.. code-block:: json + + { + "format_version": 2, + "client": { + "retry_timeout": 30, + "compression": "Gzip", + "communication_timeout": 30 + }, + "components": [ + { + "id": "resource_manager", + "path": "nvflare.app_common.resource_managers.gpu_resource_manager.GPUResourceManager", + "args": { + "num_of_gpus": 0, + "mem_per_gpu_in_GiB": 0 + } + }, + { + "id": "resource_consumer", + "path": "nvflare.app_common.resource_consumers.gpu_resource_consumer.GPUResourceConsumer", + "args": {} + }, + { + "id": "job_manager", + "path": "nvflare.apis.impl.job_def_manager.SimpleJobDefManager", + "args": { + "uri_root": "/tmp/nvflare/jobs-storage/a", + "job_store_id": "job_store" + } + }, + { + "id": "job_store", + "path": "nvflare.app_common.storages.filesystem_storage.FilesystemStorage" + }, + { + "id": "app_deployer", + "path": "nvflare.app_common.hub.hub_app_deployer.HubAppDeployer" + } + ] + } + +You need to add three components: + + - ``job_manager`` - make sure that its "uri_root" is set to the correct path used by T2's server configuration. + - ``job_store`` - make sure it is configured exactly the same as in the T2 system + - ``app_deployer`` - you don't need to change anything + + +Step 3: Create hub_client.json in the client's "/local" folder +======================================================================== + +..
code-block:: json + + { + "format_version": 2, + "executors": [ + { + "tasks": [ + "*" + ], + "executor": { + "id": "executor", + "path": "nvflare.app_common.hub.hub_executor.HubExecutor", + "args": { + "pipe_id": "pipe" + } + } + } + ], + "components": [ + { + "id": "pipe", + "path": "nvflare.fuel.utils.pipe.file_pipe.FilePipe", + "args": { + "root_path": "/tmp/nvflare/pipe/a" + } + } + ] + } + +You can and should adjust the ``root_path`` parameter in the component above: + + - ``root_path`` - this is the root path used by the T1 system to exchange data with the T2 system. Make sure that this path is accessible to both the T1 and T2 systems, and that it is set to the same value as in Step 4. + +Configuring HubExecutor +----------------------- +You can further configure the HubExecutor with the following arguments: + + - ``task_wait_time`` - if specified, how long (in seconds) the HubExecutor will wait for a task result from the T2 system. Make sure you allow enough time for the T2 system to complete the task; otherwise T1 may abort the job prematurely. You don't have to specify a value. By default, the HubExecutor will keep waiting until either the result is received or the peer is disconnected. + - ``result_poll_interval`` - how often the HubExecutor tries to read task results from the pipe. It defaults to 0.1 seconds. You shouldn't need to change this value. + - ``task_read_wait_time`` - after sending a task to the peer, how long to wait for the peer to read the task data. If the task is not read by the peer before this time, the job will be aborted. This is usually because the T2 system is not running, or the job couldn't be scheduled or deployed. The default value of this argument is 10 seconds. If you want to change it, make sure that you give T2 enough time to get the job scheduled and started. This is especially important if the T2 system itself is also multi-tier. + +Step 4: Create hub_server.json in the client's "/local" folder +======================================================================== + +..
code-block:: json + + { + "format_version": 2, + "workflows": [ + { + "id": "controller", + "path": "nvflare.app_common.hub.hub_controller.HubController", + "args": { + "pipe_id": "pipe" + } + } + ], + "components": [ + { + "id": "pipe", + "path": "nvflare.fuel.utils.pipe.file_pipe.FilePipe", + "args": { + "root_path": "/tmp/nvflare/pipe/a" + } + } + ] + } + +You can and should adjust the ``root_path`` parameter in the component above: + + - ``root_path`` - this is the root path used by the T2 system to exchange data with the T1 system. Make sure that this path is accessible to both the T1 and T2 systems, and that it is set to the same value as in Step 3. + +Configuring HubController +------------------------- + +You can further configure the HubController with the following arguments: + + - ``task_wait_time`` - how long (in seconds) T2's HubController will wait for a task assignment from the T1 system. If you specify this value, make sure you allow enough time for T1 to deliver the task data; otherwise T2 may abort the job prematurely. You don't have to specify a value. By default, the HubController will keep waiting until either a task is received or the peer is disconnected. + - ``task_data_poll_interval`` - how often to try to read task data from the pipe. It defaults to 0.1 seconds. You shouldn't need to change this value. + +******************** +Multiple Hierarchies +******************** +This design allows a FLARE system to be part of multiple hierarchies, as shown here: + +.. image:: ../resources/systems_multiple_hierarchies.png + +In this example, Systems A and C are each in two hierarchies: R1 and R2. + +To implement this, the HUB site just needs one T1 configuration for each hierarchy. For instance, site A will have two T1 configurations: one for R1 and one for R2. +Both configurations must share the same setup for job_manager, job_store, and pipe path. + +Potentials +========== +The key to making all systems work together is the Operation-Driven workflow (the HubController).
It essentially makes the FLARE system an operation executor. Currently, +operations can only be invoked by the HubExecutor through the File Pipe, but they could easily be made invocable through messaging as well. For example, the FLARE API could be +enhanced to invoke operations, something like this: + +.. code-block:: python + + from nvflare.fuel.flare_api.flare_api import new_secure_session + + sess = new_secure_session() + task_data = ... + for r in range(100): + result = sess.call_operation( + method="bcast", + task=task_data, + aggregator="InTimeWeightAggregator", + timeout=300, + min_clients=3 + ) + # process result... + task_data = result + +Limitations +=========== + +Deploy Map cannot be supported at lower levels +---------------------------------------------- +The job is submitted at the root system level. FL clients in lower level systems are not available for the researcher to configure in the deploy map. As a result, each lower level system will deploy tasks to all of its clients. + +Operators can only be configured once unless prefixes are used +-------------------------------------------------------------- +You can configure different operators for different levels, provided that the different levels are provisioned with different project names. + +To configure operators for a specific level, simply add its project name as a prefix to the task name in the config_fed_server.json of the job: + +.. code-block:: json + + "operators": { + "train": { + "method": "bcast", + "aggregator": "aggregator", + "timeout": 60, + "min_targets": 1, + "wait_time_after_min_received": 30 + }, + "BC.train": { + "method": "relay" + } + } + +In this example, the project "BC" is configured to use the "relay" method for task "train", whereas all other levels (projects) use the default "bcast" method.
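The prefix rule above amounts to a two-step lookup: try the project-qualified task name first, then fall back to the plain task name. A minimal sketch of that lookup (illustrative only; the actual HubController resolution code may differ):

```python
def resolve_operator(operators, project_name, task_name):
    """Prefer a project-prefixed operator entry, falling back to the plain task name.

    operators: the "operators" dict from the job's config_fed_server.json
    project_name: the project name this level was provisioned with
    """
    prefixed = operators.get(project_name + "." + task_name)
    if prefixed is not None:
        return prefixed
    return operators.get(task_name)
```

With the config above, project "BC" would resolve "train" to the relay operation, while any other project would resolve it to the default bcast operation.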
+ +Job Signature cannot be validated at lower level systems +-------------------------------------------------------- +This is because the job submitted to a lower level system is modified from the original job. Hence the job signatures (which are based on the original job definition) can no longer be validated against the modified job definitions. + +Job signature validation is disabled for HUB-created jobs. + +Invisibility into lower levels +------------------------------ +Each system is provisioned independently and has its own admin server. The user can access these systems independently, but cannot view the details of lower +level systems through the root system. The only commands that have an impact on all levels are ``submit_job`` and ``abort_job``. + +The ``submit_job`` command issued at a level only affects that level and its lower level systems. Therefore, to execute a job at all levels, the command must be issued at the root level. + +The ``abort_job`` command issued at a level only affects that level and its lower level systems. Therefore, to abort a job at all levels, the command must be issued at the root level. + +Timing not guaranteed +--------------------- +Once a job is submitted, it is up to the lower level systems to schedule it. It is not guaranteed that all systems will be able to start the job at the same time, and the job +may not even be scheduled by a lower level system. In these cases, the job may be aborted when a lower level system cannot get the job scheduled in time. + +.. note:: + + The T1 client (HubExecutor) waits for a response from T2. It will cancel the job if T2 fails to respond within a configurable amount of time. Similarly, once started, + the T2 controller (HubController) waits for task data from T1. It will cancel the job if T1 fails to create the task within a configurable amount of time.
diff --git a/docs/user_guide/nvflare_cli/fl_simulator.rst b/docs/user_guide/nvflare_cli/fl_simulator.rst index 391003e0b6..d1abaaea61 100644 --- a/docs/user_guide/nvflare_cli/fl_simulator.rst +++ b/docs/user_guide/nvflare_cli/fl_simulator.rst @@ -782,6 +782,13 @@ application run. .. code-block:: python + import argparse + import sys + from sys import platform + + from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner + + def define_simulator_parser(simulator_parser): simulator_parser.add_argument("job_folder") simulator_parser.add_argument("-w", "--workspace", type=str, help="WORKSPACE folder") diff --git a/docs/user_guide/nvflare_security.rst b/docs/user_guide/nvflare_security.rst index 2f3db1c3d9..7079c1a4ba 100644 --- a/docs/user_guide/nvflare_security.rst +++ b/docs/user_guide/nvflare_security.rst @@ -4,88 +4,7 @@ NVIDIA FLARE Security **************************************** -The security framework of NVIDIA FLARE 2.2 has been reworked for better usability and to improve security. - -Terminologies -============= -For the ease of discussion, we'll start by defining a few terms. - -Project -------- -An FL study with identified participants. - -Org ---- -An organization that participates in the study. - -Site ----- -The computing system that runs NVFLARE application as part of the study. -There are two kinds of sites: Server and Clients. -Each site belongs to an organization. - -FL Server ------------- -An application running on a Server site responsible for client coordination based on federation workflows. There can be -one or more FL Servers for each project. - -FL Client ----------- -An application running on a client site that responds to Server's task assignments and performs learning actions based -on its local data. - -Overseer ----------- -An application responsible for overseeing overall system health and enabling seamless failover of FL servers. This -component is only needed for High Available. 
- -User ------ -A human that participates in the FL project. - -.. _nvflare_roles: - -Role ------- -A role defines a type of users that have certain privileges of system operations. Each user is assigned a role in the -project. There are four defined roles: Project Admin, Org Admin, Lead Researcher, and Member Researcher. - -.. _project_admin_role: - -Project Admin Role -^^^^^^^^^^^^^^^^^^^^ -The Project Admin is responsible for provisioning the participants and coordinating personnel from all sites for the project. -When using the Dashboard UI, the Project Admin is the administrator for the site and is responsible for inputting the -values to set up the project in the beginning and then approving the users and client sites while making edits if necessary. - -The Project Admin is also responsible for the management of the FL Server. - -There is only one Project Admin for each project. - -Org Admin Role -^^^^^^^^^^^^^^^^^^^^ -This role is responsible for the management of the sites of his/her organization. - -Lead Researcher Role -^^^^^^^^^^^^^^^^^^^^^^^ -This role can be configured for increased privileges for an organization for a scientist who works -with other researchers to ensure the success of the project. - -Member Researcher Role -^^^^^^^^^^^^^^^^^^^^^^^ -This role can be configured for another level of privileges a scientist who works with the Lead Researcher -to make sure his/her site is properly prepared for the project. - - -FLARE Console (previously called Admin Client) ----------------------------------------------- -An console application running on a user's machine that allows the user to perform NVFLARE system operations with a -command line interface. - -Provisioning Tool ------------------ -The tool used by Project Admin to provision all participating sites and users of the project. The output of the -Provisioning tool enables all participants (sites and users) to securely communicate with each other. 
+The security framework of NVIDIA FLARE has been reworked for better usability and to improve security. Security Framework =================== @@ -93,13 +12,26 @@ NVFLARE is an application running in the IT environment of each participating si application is the combination of the security measures implemented in this application and the security measures of the site's IT infrastructure. -NVFLARE implements security measures in the following areas: +NVFLARE implements security measures in the following areas (see each section below for details): - - Identity Security: the authentication and authorization of communicating parties - - Communication Security: the confidentiality of data communication messages. + - Identity Security: the authentication and authorization of communicating parties + - Site Policy Management: the policies for resource management, authorization, and privacy protection defined by each site + - Communication Security: the confidentiality of data communication messages - Message Serialization: techniques for ensuring safe serialization/deserialization process between communicating parties - - Data Privacy Protection: techniques for preventing local data from being leaked and/or reverse-engineered. - - Auditing: techniques for keep audit trails of critical events (e.g. commands issued by users, learning/training related events that can be analyzed to understand the final results) + - Data Privacy Protection: techniques for preventing local data from being leaked and/or reverse-engineered + - Auditing: techniques for keeping audit trails to record events (e.g. commands issued by users, learning/training related events that can be analyzed to understand the final results) + +.. 
toctree:: + :maxdepth: 1 + + security/terminologies_and_roles + security/identity_security + security/site_policy_management + security/authorization_policy_previewer + security/communication_security + security/serialization + security/data_privacy_protection + security/auditing All other security concerns must be handled by the site's IT security infrastructure. These include, but are not limited to: @@ -107,144 +39,15 @@ All other security concerns must be handled by the site's IT security infrastruc - Firewall policies - Data management policies: storage, retention, cleaning, distributions, access, etc. -Security Trust Boundary and Balance of Risk & Usability +Security Trust Boundary and Balance of Risk and Usability --------------------------------------------------------- -The security framework does not operate in vacuum, we assume the physical security is already in place for all +The security framework does not operate in a vacuum; we assume that physical security is already in place for all participating server and client machines. TLS provides the authentication mechanism within the trusted environments. -Under such circumstances, we trade off some of the security risk with ease of use when transferring data between client -and server in previous releases. The python pickle was used in NVFLARE 2.0. This trade-off caused some concern due to -the use of Pickle. To address such as concern, we replaced python pickle with Flare Object Serializer (FOBS). See -:ref:`serialization ` for details. - -Identity Security ------------------- -This area is concerned with these two trust issues: - - - Authentication: ensures communicating parties have enough confidence about each other's identities – everyone is who they claim to be. - - Authorization: ensures that the user can only do what he/she is authorized to do.
- -Authentication -^^^^^^^^^^^^^^^ -NVFLARE's authentication model is based on Public Key Infrastructure (PKI) technology: - - - For the FL project, the Project Admin uses the Provisioning Tool to create a Root CA with a self-signed root certificate. This Root CA will be used to issue all other certs needed by communicating parties. - - Identities involved in the study (Server(s), Clients, the Overseer, Users) are provisioned with the Provisioning Tool. Each identity is defined with a unique common name. For each identity, the Provisioning Tool generates a separate password-protected Startup Kit, which includes security credentials for mutual TLS authentication: - - The certificate of the Root CA - - The cert of the identity - - The private key of the identity - - Startup Kits are distributed to the intended identities: - - The FL Server's kit is sent to the Project Admin - - The kit for each FL Client is sent to the Org Admin responsible for the site - - FLARE Console (previously called Admin Client) kits are sent to the user(s) - - To ensure the integrity of the Startup Kit, each file in the kit is signed by the Root CA. - - Each Startup Kit also contains a "start.sh" file, which can be used to properly start the NVFLARE application. - - Once started, the Client tries to establish a mutually-authenticated TLS connection with the Server, using the PKI credentials in its Startup Kits. This is possible only if the client and the server both have the correct Startup Kits. - - Similarly, when a user tries to operate the NVFLARE system with the Admin Client app, the admin client tries to establish a mutually-authenticated TLS connection with the Server, using the PKI credentials in its Startup Kits. This is possible only if the admin client and the server both have the correct Startup Kits. The admin user also must enter his/her assigned user name correctly. - -The security of the system comes from the PKI credentials in the Startup Kits. 
As you can see, this mechanism involves manual processing and human interactions for Startup Kit distribution, and hence the identity security of the system depends on the trust of the involved people. To minimize security risk, we recommend that people involved follow these best practice guidelines: - - - The Project Admin, who is responsible for the provisioning process of the study, should protect the study's configuration files and store created Startup Kits securely. - - When distributing Startup Kits, the Project Admin should use trusted communication methods, and never send passwords of the Startup Kits in the same communication. It is preferred to send the Kits and passwords with different communication methods. - - Org Admin and users must protect their Startup Kits and only use them for intended purposes. - -.. note:: - - The provisioning tool tries to use the strongest cryptography suites possible when generating the PKI credentials. All of the certificates are compliant with the X.509 standard. All private keys are generated with a size of 2048-bits. The backend is openssl 1.1.1f, released on March 31, 2020, with no known CVE. All certificates expire within 360 days. - -.. note:: - - NVFLARE 2.2 implements a :ref:`website ` that supports user and site registration. Users will be able to download their Startup Kits (and other artifacts) from the website. - -Authorization -^^^^^^^^^^^^^^ -See :ref:`Federated Authorization ` - -Communication Security ------------------------ -All data communications are through secure channels established with mutually-authenticated TLS connections. The -communication protocol between the FL Server and clients is gRPC. The protocol between FLARE Console instances and the -FL Server is TCP. - -NVIDIA FLARE uses client-server communication architecture. The FL Server accepts connection requests from clients. -Clients never need to accept connection requests from anywhere. 
- -The IT infrastructure of the FL Server site must allow two ports to be opened: one for the FL Server to communicate with -FL Clients, and one for the FL Server to communicate with FLARE Console instances. Both ports should be unprivileged. -Specifically, we suggest against the use of port 443, the typical port number for HTTPS. This is because gRPC does -not exactly implement HTTPS to the letter, and the firewall of some sites may decide to block it. - -The IT infrastructure of FL Client sites must allow the FL application to connect to the address (domain and port) -opened by the FL server. - -Enhanced Message Serialization -------------------------------- -Prior to NVFLARE 2.1, messages between the FL server and clients were serialized with Python's pickle facility. Many people -have pointed out the potential security risks due to the flexibility of Pickle. - -NVFLARE now uses a more secure mechanism called FOBS (Flare OBject Serializer) for message serialization and -deserialization. See :ref:`serialization ` for details. - -Enhanced Auditing -------------------- -Prior to NVFLARE 2.2, the audit trail only includes user command events (on both server and client sites). NVFLARE 2.2 -enhances the audit trail by including critical job events generated by the learning process. - -Audit File Location -^^^^^^^^^^^^^^^^^^^^ -The audit file audit.txt is located in the root directory of the workspace. -Audit File Format -^^^^^^^^^^^^^^^^^^ -The audit file is a text file. Each line in the file is an event. Each event contains headers and an optional message. -Event headers are enclosed in square brackets. The following are some examples of events: - -.. 
code-block:: - - [E:b6ac4a2a-eb01-4123-b898-758f20dc028d][T:2022-09-13 13:56:01.280558][U:?][A:_cert_login admin@b.org] - [E:16392ed4-d6c7-490a-a84b-12685297e912][T:2022-09-1412:59:47.691957][U:trainer@b.org][A:train.deploy] - [E:636ee230-3534-45a2-9689-d0ec6c90ed45][R:9dbf4179-991b-4d67-be2f-8e4bac1b8eb2][T:2022-09-14 15:08:33.181712][J:c4886aa3-9547-4ba7-902e-eb5e52085bc2][A:train#39027d22-3c70-4438-9c6b-637c380b8669]received task from server - -Event Headers -^^^^^^^^^^^^^^^^^^ -Event headers specify meta information about the event. Each header is expressed with the header type (one character), -followed by a colon (:) and the value of the header. The following are defined header types and their values. - -.. csv-table:: - :header: Checks,Meaning,Value - :widths: 5, 10, 20 - - E,Event ID,A UUID - T,Timestamp,Time of the event - U,User,Name of the user - A,Action,User issued command or job's task name and ID - J,Job,ID of the job - R,Reference,Reference to peer's event ID - -Most of the headers are self-explanatory, except for the R header. Events can be related. For example, a user command -could cause an event to be recorded on both the server and clients. Similarly, a client's action could cause the server -to act on it (e.g. client submitting task results). The R header records the related event ID on the peer. Reference -event IDs can help to correlate events across the system. - -Data Privacy Protection -------------------------- -Federated learning activities are performed with task-based interactions between the server and FL clients: the server -issues tasks to the clients, and clients process tasks and return results back to the server. 
NVFLARE comes with a -general-purpose data filtering mechanism for processing task data and results: - - - On the Server: before task data is sent to the client, the configured "task_data_filters" defined in the job are executed; - - On the Client: when the task data is received by the client and before giving it to the executor for processing, NVFLARE framework applies configured "task_data_filters" defined in the job; - - On the Client: after the execution of the task by the executor and before sending the produced result back to the server, NVFLARE framework applies configured "task_result_filters" to the result before sending to the Server. - - On the Server: after receiving the task result from the client, the NVFLARE framework applies configured "task_result_filters" before giving it to the Controller for processing. - -This mechanism has been used for the purpose of data privacy protection on the client side. For example, differential -privacy filters can be applied to model weights before sending to the server for aggregation. - -NVFLARE has implemented some commonly used privacy protection filters: https://github.com/NVIDIA/NVFlare/tree/main/nvflare/app_common/filters - -Admin Capabilities -------------------- -The NVFLARE system is operated by users using the command line interface provided by the admin client. The following +Admin Capabilities Through FLARE Console +---------------------------------------- +The NVFLARE system is operated by users using the command line interface provided by the :ref:`FLARE Console `. The following types of commands are available: - Check system operating status @@ -254,7 +57,7 @@ types of commands are available: - Start, stop jobs - Clean up job workspaces -All admin commands are subject to authorization policies of the participating sites. +All commands are subject to authorization policies of the participating sites. 
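The four filter points described above can be illustrated with a small, framework-free sketch. The function names below (``clip_weights``, ``apply_filters``) are hypothetical stand-ins for illustration, not the actual NVFLARE filter API:

```python
# Framework-free sketch of NVFLARE's filter-chain idea; in NVFLARE, Filter
# components are configured per job, but plain functions stand in for them here.

def clip_weights(data, limit=1.0):
    # Example "task_result_filter": clip each weight into [-limit, limit]
    # before the result leaves the client (a crude privacy/robustness filter).
    return {k: max(-limit, min(limit, v)) for k, v in data.items()}

def apply_filters(filters, data):
    # Run data through the configured chain, as done at each of the four points.
    for f in filters:
        data = f(data)
    return data

# Client side: task_result_filters run on the result before it is sent back.
task_result_filters = [clip_weights]
result = {"w1": 2.5, "w2": -0.3}
filtered = apply_filters(task_result_filters, result)
print(filtered)  # {'w1': 1.0, 'w2': -0.3}
```

In a real job, such filters would be listed under "task_result_filters" in the job configuration; differential-privacy filters follow the same pattern.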
Dynamic Additions of Users and Sites
--------------------------------------
@@ -262,10 +65,3 @@ Federated Authorization makes it possible to dynamically add new users and sites
 always keep an up-to-date list of users and sites. This is because the user identity information (name, org, and role)
 is included in the certificate of the user; and each site now performs authorization based on its local policies
 (instead of the FL Server performing authorization for all sites).
-
-Site Policy Management
-------------------------
-Prior to NVFLARE 2.2, all policies (resource management, authorization and privacy protection) could only be centrally
-controlled by the FL Server. NVFLARE 2.2 made it possible for each site to define end enforce its own policies.
-
-See :ref:`site policy management `.
diff --git a/docs/user_guide/security/auditing.rst b/docs/user_guide/security/auditing.rst
new file mode 100644
index 0000000000..dbf69b4136
--- /dev/null
+++ b/docs/user_guide/security/auditing.rst
@@ -0,0 +1,42 @@
+.. _auditing:
+
+Auditing
+========
+NVFLARE has an auditing mechanism to record events that occur in the system. Both user command events
+and critical job events generated by the learning process are recorded.
+
+Audit File Location
+^^^^^^^^^^^^^^^^^^^^
+The audit file audit.txt is located in the root directory of the workspace.
+
+Audit File Format
+^^^^^^^^^^^^^^^^^^
+The audit file is a text file. Each line in the file is an event. Each event contains headers and an optional message.
+Event headers are enclosed in square brackets. The following are some examples of events:
+
+.. 
code-block::
+
+    [E:b6ac4a2a-eb01-4123-b898-758f20dc028d][T:2022-09-13 13:56:01.280558][U:?][A:_cert_login admin@b.org]
+    [E:16392ed4-d6c7-490a-a84b-12685297e912][T:2022-09-14 12:59:47.691957][U:trainer@b.org][A:train.deploy]
+    [E:636ee230-3534-45a2-9689-d0ec6c90ed45][R:9dbf4179-991b-4d67-be2f-8e4bac1b8eb2][T:2022-09-14 15:08:33.181712][J:c4886aa3-9547-4ba7-902e-eb5e52085bc2][A:train#39027d22-3c70-4438-9c6b-637c380b8669]received task from server
+
+Event Headers
+^^^^^^^^^^^^^^^^^^
+Event headers specify meta information about the event. Each header is expressed with the header type (one character),
+followed by a colon (:) and the value of the header. The following are the defined header types and their values.
+
+.. csv-table::
+    :header: Header Type,Meaning,Value
+    :widths: 5, 10, 20
+
+    E,Event ID,A UUID for the ID of the event
+    T,Timestamp,Time of the event
+    U,User,Name of the user
+    A,Action,User issued command or job's task name and ID
+    J,Job,ID of the job
+    R,Reference,Reference to peer's event ID
+
+Most of the headers are self-explanatory, except for the R header. Events can be related. For example, a user command
+could cause an event to be recorded on both the server and clients. Similarly, a client's action could cause the server
+to act on it (e.g. client submitting task results). The R header records the related event ID on the peer. Reference
+event IDs can help to correlate events across the system.
diff --git a/docs/user_guide/security/authorization_policy_previewer.rst b/docs/user_guide/security/authorization_policy_previewer.rst
new file mode 100644
index 0000000000..fcac5315fe
--- /dev/null
+++ b/docs/user_guide/security/authorization_policy_previewer.rst
@@ -0,0 +1,68 @@
+.. _authorization_policy_previewer:
+
+******************************
+Authorization Policy Previewer
+******************************
+
+:ref:`Authorization ` is an important security feature of NVFLARE. Since NVFLARE 2.2, each site defines its own authorization policy.
+Since authorization policy is vital for system security, and many people can now define policies, it's important to be able
+to validate the policies before deploying them to production.
+
+The Authorization Policy Previewer is a tool for validating authorization policy definitions. The tool provides an interactive
+user interface and commands for the user to validate different aspects of policy definitions:
+
+ - Show defined roles and rights
+ - Show the content of the policy definition
+ - Show the permission matrix (role/right/conditions)
+ - Evaluate a right against a specified user
+
+Start Authorization Policy Previewer
+======================================
+To start the Authorization Policy Previewer, enter this command on a terminal:
+
+.. code-block:: shell
+
+    nvflare authz_preview -p <authorization_policy_file>
+
+The authorization_policy_file must be a JSON file that follows the authorization file format.
+
+If the file is not a valid JSON file or does not follow the authorization file format, this command will exit with an exception.
+
+Execute Authorization Policy Previewer Commands
+================================================
+If the Authorization Policy Previewer is successfully started, the prompt ``>`` will be displayed, waiting for command input.
+
+To get the complete list of commands, enter "?" at the prompt.
+
+Most commands are self-explanatory, except for "eval_right". With this command, you can evaluate a specified right against a
+specified user (name:org:role) to make sure the result is correct.
+
+Role Rights
+===========
+Most permissions in the policy file may be defined with Command Categories. However, once the policy file is loaded, categories are
+already resolved to individual commands, following the fallback mechanism.
+
+Use the ``show_role_rights`` command to verify that all commands have the right permissions for all roles.
+
+Evaluate a Right
+================
+The syntax of the ``eval_right`` command is:
+
+.. 
code-block:: shell
+
+    eval_right site_org right_name user_name:org:role [submitter_name:org:role]
+
+where:
+
+.. code-block::
+
+    site_org - the organization of the site
+    right_name - the right to be evaluated. You can use the "show_rights" command to list all available rights.
+    User specification - a user spec has three pieces of information separated by colons. Name is the name of the user; org is the organization that the user belongs to; and role is the user's role. You can use the "show_roles" command to list all available roles.
+    Submitter specification - some job-related commands can evaluate the relation between the user and the submitter of a job. The submitter spec has the same format as the user spec.
+
+Please refer to :ref:`Federated Authorization ` for details on right definition and evaluation.
+
+Stop Authorization Policy Previewer
+======================================
+To exit the Authorization Policy Previewer, enter the "bye" command at the prompt.
diff --git a/docs/user_guide/security/communication_security.rst b/docs/user_guide/security/communication_security.rst
new file mode 100644
index 0000000000..3dd5d71609
--- /dev/null
+++ b/docs/user_guide/security/communication_security.rst
@@ -0,0 +1,19 @@
+.. _communication_security:
+
+Communication Security
+======================
+
+All data communications are through secure channels established with mutually-authenticated TLS connections. The
+communication protocol between the FL Server and clients is gRPC. The protocol between FLARE Console instances and the
+FL Server is TCP.
+
+NVIDIA FLARE uses a client-server communication architecture. The FL Server accepts connection requests from clients.
+Clients never need to accept connection requests from anywhere.
+
+The IT infrastructure of the FL Server site must allow two ports to be opened: one for the FL Server to communicate with
+FL Clients, and one for the FL Server to communicate with FLARE Console instances. 
Both ports should be unprivileged. +Specifically, we suggest against the use of port 443, the typical port number for HTTPS. This is because gRPC does +not exactly implement HTTPS to the letter, and the firewall of some sites may decide to block it. + +The IT infrastructure of FL Client sites must allow the FL application to connect to the address (domain and port) +opened by the FL server. diff --git a/docs/user_guide/security/data_privacy_protection.rst b/docs/user_guide/security/data_privacy_protection.rst new file mode 100644 index 0000000000..dd91a93f0d --- /dev/null +++ b/docs/user_guide/security/data_privacy_protection.rst @@ -0,0 +1,17 @@ +.. _data_privacy_protection: + +Data Privacy Protection +======================= +Federated learning activities are performed with task-based interactions between the server and FL clients: the server +issues tasks to the clients, and clients process tasks and return results back to the server. NVFLARE comes with a +general-purpose data :ref:`filtering mechanism ` for processing task data and results: + + - On the Server: before task data is sent to the client, the configured "task_data_filters" defined in the job are executed; + - On the Client: when the task data is received by the client and before giving it to the executor for processing, NVFLARE framework applies configured "task_data_filters" defined in the job; + - On the Client: after the execution of the task by the executor and before sending the produced result back to the server, NVFLARE framework applies configured "task_result_filters" to the result before sending to the Server. + - On the Server: after receiving the task result from the client, the NVFLARE framework applies configured "task_result_filters" before giving it to the Controller for processing. + +This mechanism has been used for the purpose of data privacy protection on the client side. 
For example, differential +privacy filters can be applied to model weights before sending to the server for aggregation. + +NVFLARE has implemented some commonly used privacy protection filters: https://github.com/NVIDIA/NVFlare/tree/main/nvflare/app_common/filters diff --git a/docs/user_guide/security/identity_security.rst b/docs/user_guide/security/identity_security.rst new file mode 100644 index 0000000000..606125de2b --- /dev/null +++ b/docs/user_guide/security/identity_security.rst @@ -0,0 +1,500 @@ +################# +Identity Security +################# +This area is concerned with these two trust issues: + + - Authentication: ensures communicating parties have enough confidence about each other's identities: everyone is who they claim to be. + - Authorization: ensures that the user can only do what he/she is authorized to do. + +Authentication +============== +NVFLARE's authentication model is based on Public Key Infrastructure (PKI) technology: + + - For the FL project, the Project Admin uses the Provisioning Tool to create a Root CA with a self-signed root certificate. This Root CA will be used to issue all other certs needed by communicating parties. + - Identities involved in the study (Server(s), Clients, the Overseer, Users) are provisioned with the Provisioning Tool. Each identity is defined with a unique common name. For each identity, the Provisioning Tool generates a separate password-protected Startup Kit, which includes security credentials for mutual TLS authentication: + - The certificate of the Root CA + - The cert of the identity + - The private key of the identity + - Startup Kits are distributed to the intended identities: + - The FL Server's kit is sent to the Project Admin + - The kit for each FL Client is sent to the Org Admin responsible for the site + - FLARE Console (previously called Admin Client) kits are sent to the user(s) + - To ensure the integrity of the Startup Kit, each file in the kit is signed by the Root CA. 
+ - Each Startup Kit also contains a "start.sh" file, which can be used to properly start the NVFLARE application. + - Once started, the Client tries to establish a mutually-authenticated TLS connection with the Server, using the PKI credentials in its Startup Kits. This is possible only if the client and the server both have the correct Startup Kits. + - Similarly, when a user tries to operate the NVFLARE system with the Admin Client app, the admin client tries to establish a mutually-authenticated TLS connection with the Server, using the PKI credentials in its Startup Kits. This is possible only if the admin client and the server both have the correct Startup Kits. The admin user also must enter his/her assigned user name correctly. + +The security of the system comes from the PKI credentials in the Startup Kits. As you can see, this mechanism involves manual processing and human interactions for Startup Kit distribution, and hence the identity security of the system depends on the trust of the involved people. To minimize security risk, we recommend that people involved follow these best practice guidelines: + + - The Project Admin, who is responsible for the provisioning process of the study, should protect the study's configuration files and store created Startup Kits securely. + - When distributing Startup Kits, the Project Admin should use trusted communication methods, and never send passwords of the Startup Kits in the same communication. It is preferred to send the Kits and passwords with different communication methods. + - Org Admin and users must protect their Startup Kits and only use them for intended purposes. + +.. note:: + + The provisioning tool tries to use the strongest cryptography suites possible when generating the PKI credentials. All of the certificates are compliant with the X.509 standard. All private keys are generated with a size of 2048-bits. The backend is openssl 1.1.1f, released on March 31, 2020, with no known CVE. 
All certificates expire within 360 days.
+
+.. note::
+
+    :ref:`NVFlare Dashboard ` is a website that supports user and site registration. Users will be able to download their Startup Kits (and other artifacts) from the website.
+
+
+.. _federated_authorization:
+
+Authorization: Federated Authorization
+======================================
+Federated learning is conducted over computing resources owned by different organizations. Naturally, these organizations have concerns
+about their computing resources being misused or abused. Even if an NVFLARE docker is trusted by participating orgs, researchers can
+still bring their own custom code to be part of a study (BYOC), which could be a big concern to many organizations. In addition,
+organizations may also have IP (intellectual property) requirements on the studies performed by their own researchers.
+
+NVFLARE comes with an authorization system that can help address these security concerns and IP requirements. With this system, an organization can define a strict policy to control access to its computing resources and/or FL jobs.
+
+Here are some examples of what an org can do:
+
+ - Restrict BYOC to only the org's own researchers;
+ - Allow jobs only from its own researchers, or from specified other orgs, or even from specified trusted other researchers;
+ - Totally disable remote shell commands on its sites
+ - Allow the "ls" shell command but disable all other remote shell commands
+
+Centralized vs. Federated Authorization
+---------------------------------------
+In NVFLARE before version 2.2.1, the authorization policy was centrally enforced by the FL Server. In a true federated environment, each organization should be able to define and enforce its own authorization policy instead of relying on others (such as the FL Server, which is owned by a separate org) to do so.
+
+NVFLARE now uses federated authorization, where each organization defines and enforces its own authorization policy:
+
+ - Each organization defines its policy in its own authorization.json (in the local folder of the workspace)
+ - This locally defined policy is loaded by the FL Clients owned by the organization
+ - The policy is also enforced by these FL Clients
+
+This decentralized authorization has an added benefit: since each organization takes care of its own authorization, there is no need to update the policy of any other participants (FL Server or Clients) when new orgs or clients are added.
+
+See `Federated Policies (Github) `_ for a working example with federated site policies for authorization.
+
+Simplified Authorization Policy Configuration
+---------------------------------------------
+Since each organization defines its own policy, there is no need to centrally define all orgs and users. The policy configuration for an org is simply a matrix of role/right permissions. Each role/right combination in the permission matrix answers this question: what kind of users of this role can have this right?
+
+To answer this question, the role/right combination defines one or more conditions, and the user must meet one of these conditions to have the right. The set of conditions is called a control.
+
+Roles
+^^^^^
+Users are classified into roles. NVFLARE defines four roles:
+
+ - Project Admin - this role is responsible for the whole FL project;
+ - Org Admin - this role is responsible for the administration of all sites in its org. Each org must have one Org Admin;
+ - Lead (researcher) - this role conducts FL studies
+ - Member (researcher) - this role observes the FL study but cannot submit jobs
+
+Rights
+^^^^^^
+NVFLARE supports more granular right definitions for greater flexibility:
+
+ - Each server-side admin command is a right! 
This makes it possible for an org to control each command explicitly; + - Admin commands are grouped into categories. For example, commands like abort_job, delete_job, start_app are in manage_job category; all shell commands are put into the shell_commands category. Each category is also a right. + - BYOC is now defined as a right so that some users are allowed to submit jobs with BYOC whereas some are not. + +This right system makes it easy to write simple policies that only use command categories. It also makes it possible to write policies to control individual commands. When both categories and commands are used, command-based control takes precedence over category-based control. + +See :ref:`command_categories` for command categories. + +Controls and Conditions +^^^^^^^^^^^^^^^^^^^^^^^ +A *control* is a set of one or more conditions that is specified in the permission matrix. Conditions specify relationships among the subject user, the site, and the job submitter. The following are supported relationships: + + - The user belongs to the site's organization (user org = site org) + - The user is the job submitter (user name = submitter name) + - The user and the job submitter are in the same org (user org = submitter org) + - The user is a specified person (user name = specified name) + - The user is in a specified org (user org = specified org) + +Keep in mind that the relationship is always relative to the subject user - we check to see whether the user's name or org has the right relationship with the site or job submitter. + +Since conditions need to be expressed in the policy definition file (authorization.json), some concise and consistent notations are needed. The following are the notations for these conditions: + +.. 
csv-table::
+    :header: Notation,Condition,Examples
+    :widths: 15, 20, 15
+
+    o:site,The user belongs to the site's organization,
+    n:submitter,The user is the job submitter,
+    o:submitter,The user and the job submitter belong to the same org,
+    n:,The user is a specified person,n:john@nvidia.com
+    o:,The user is in a specified org,o:nvidia
+
+The words "site" and "submitter" are reserved.
+
+In addition, two words are used for extreme conditions:
+
+ - Any user is allowed: any
+ - No user is allowed: none
+
+See :ref:`sample_auth_policy` for an example policy.
+
+Policy Evaluation
+^^^^^^^^^^^^^^^^^
+Policy evaluation answers the question: is the user allowed to execute this command?
+
+The following is the evaluation algorithm:
+
+ - If a control is defined for this command and user role, then this control will be evaluated;
+ - Otherwise, if the command belongs to a category and a control is defined for the category and user role, then this control will be evaluated;
+ - Otherwise, return False
+
+As a shorthand, if the control is the same for all rights for a role, you can specify a control for the role without explicitly specifying rights one by one. For example, this is used for the "project_admin" role since this role can do everything.
+
+Command Authorization Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+We know that users operate NVFLARE systems with admin commands via the FLARE Console. But when a user issues a command, how does authorization happen
+throughout the system?
+
+If the command only involves the Server, then the server's authorization policy is evaluated and
+enforced. If the command involves FL clients, then the command will be sent to those clients without any authorization evaluation on the server.
+When a client receives the command, it will evaluate its own authorization policy. The client will execute the command only if it passes authorization.
+It is therefore possible that some clients accept the command while others do not.
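The evaluation algorithm and condition notations above can be sketched in a few lines of Python. This is a simplified illustration, not NVFLARE's actual authorization module; the policy, user names, and orgs are made up for the example:

```python
# Simplified sketch of the evaluation algorithm described above.
# Conditions use the documented notations: "o:site", "n:submitter",
# "o:submitter", "n:<name>", "o:<org>", plus "any" and "none".

def meets(cond, user, site_org, submitter=None):
    # user and submitter are (name, org) tuples.
    name, org = user
    if cond == "any":
        return True
    if cond == "none":
        return False
    if cond == "o:site":
        return org == site_org
    if cond == "n:submitter":
        return submitter is not None and name == submitter[0]
    if cond == "o:submitter":
        return submitter is not None and org == submitter[1]
    if cond.startswith("n:"):
        return name == cond[2:]
    if cond.startswith("o:"):
        return org == cond[2:]
    return False

def authorize(policy, role, command, category, user, site_org, submitter=None):
    # Command-level control takes precedence over category-level control;
    # if neither is defined, the default is to deny.
    role_policy = policy.get(role, {})
    control = role_policy.get(command, role_policy.get(category))
    if control is None:
        return False
    conds = control if isinstance(control, list) else [control]
    return any(meets(c, user, site_org, submitter) for c in conds)

policy = {"lead": {"manage_job": "n:submitter", "view": "any",
                   "shell_commands": "none", "ls": "o:site"}}
# "ls" has its own control, so it overrides the "shell_commands" category:
print(authorize(policy, "lead", "ls", "shell_commands", ("amy", "nvidia"), "nvidia"))    # True
print(authorize(policy, "lead", "grep", "shell_commands", ("amy", "nvidia"), "nvidia"))  # False
```

A role-level shorthand (like ``"project_admin": "any"``) would simply apply the same control to every right of that role.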
+
+If a client rejects the command, it will return an "authorization denied" error to the server.
+
+Job Submission
+""""""""""""""
+Job submission is a special and important function in NVFLARE. The researcher uses the "submit_job" command to submit a job. But the job
+is not executed until it is scheduled and deployed later. Note that when the job is scheduled, the user may not even be online.
+
+Job authorization will be done in two places. When the job is submitted, only the Server will evaluate the "submit_job" right. If allowed,
+the job will be accepted into the Job Store. When the job is later scheduled for execution, all sites (FL Server and Clients) involved in
+the job will evaluate "submit_job" again based on their own authorization policies. If the job comes with custom code, the "byoc" right will
+also be evaluated. The job will be rejected if either right fails.
+
+Hence it is quite possible that the job is accepted at submission time, but cannot run due to authorization errors from FL clients.
+
+You may ask why we don't check authorization with each involved FL client at the time of job submission. There are three considerations:
+
+1) This would make the system more complicated, since the server would need to interact with the clients
+2) At the time of submission, some or all of the FL clients may not even be online
+3) A job's clients could be open-ended, in that the job will be deployed to all available clients. The list of available clients could be different by the time the job is scheduled for execution.
+
+Job Management Commands
+"""""""""""""""""""""""
+There are multiple commands (clone_job, delete_job, download_job, etc.) in the "manage_job" category. Such commands are executed on the Server only and do not involve any FL clients. Hence, even if an organization defines controls for these commands, these controls will have no effect.
+ +Job management command authorization often evaluates the relationship between the subject user and the job submitter, as shown in the examples. + +.. _command_categories: + +Command Categories +------------------ + +.. code-block:: python + + class CommandCategory(object): + + MANAGE_JOB = "manage_job" + OPERATE = "operate" + VIEW = "view" + SHELL_COMMANDS = "shell_commands" + + + COMMAND_CATEGORIES = { + AC.ABORT: CommandCategory.MANAGE_JOB, + AC.ABORT_JOB: CommandCategory.MANAGE_JOB, + AC.START_APP: CommandCategory.MANAGE_JOB, + AC.DELETE_JOB: CommandCategory.MANAGE_JOB, + AC.DELETE_WORKSPACE: CommandCategory.MANAGE_JOB, + + AC.CHECK_STATUS: CommandCategory.VIEW, + AC.SHOW_STATS: CommandCategory.VIEW, + AC.RESET_ERRORS: CommandCategory.VIEW, + AC.SHOW_ERRORS: CommandCategory.VIEW, + AC.LIST_JOBS: CommandCategory.VIEW, + + AC.SYS_INFO: CommandCategory.OPERATE, + AC.RESTART: CommandCategory.OPERATE, + AC.SHUTDOWN: CommandCategory.OPERATE, + AC.REMOVE_CLIENT: CommandCategory.OPERATE, + AC.SET_TIMEOUT: CommandCategory.OPERATE, + AC.CALL: CommandCategory.OPERATE, + + AC.SHELL_CAT: CommandCategory.SHELL_COMMANDS, + AC.SHELL_GREP: CommandCategory.SHELL_COMMANDS, + AC.SHELL_HEAD: CommandCategory.SHELL_COMMANDS, + AC.SHELL_LS: CommandCategory.SHELL_COMMANDS, + AC.SHELL_PWD: CommandCategory.SHELL_COMMANDS, + AC.SHELL_TAIL: CommandCategory.SHELL_COMMANDS, + } + + +.. _sample_auth_policy: + +Sample Policy with Explanations +------------------------------- + +This is an example authorization.json (in the local folder of the workspace for a site). + +.. 
code-block:: shell
+
+    {
+        "format_version": "1.0",
+        "permissions": {
+            "project_admin": "any",            # can do everything on my site
+            "org_admin": {
+                "submit_job": "none",          # cannot submit jobs to my site
+                "manage_job": "o:submitter",   # can only manage jobs submitted by people in the user's own org
+                "download_job": "o:submitter", # can only download jobs submitted by people in the user's own org
+                "view": "any",                 # can do commands in the "view" category
+                "operate": "o:site",           # can do commands in the "operate" category only if the user is in my org
+                "shell_commands": "o:site"     # can do shell commands only if the user is in my org
+            },
+            "lead": {
+                "submit_job": "any",           # can submit jobs to my site
+                "byoc": "o:site",              # can submit jobs with BYOC to my site only if the user is in my org
+                "manage_job": "n:submitter",   # can only manage the user's own jobs
+                "view": "any",                 # can do commands in the "view" category
+                "operate": "o:site",           # can do commands in the "operate" category only if the user is in my org
+                "shell_commands": "none",      # cannot do shell commands on my site
+                "ls": "o:site",                # can do the "ls" shell command if the user is in my org
+                "grep": "o:site"               # can do the "grep" shell command if the user is in my org
+            },
+            "member": {
+                "submit_job": [
+                    "o:site",                  # can submit jobs to my site if the user is in my org
+                    "o:orgA",                  # can submit jobs to my site if the user is in org "orgA"
+                    "n:john"                   # can submit jobs to my site if the user is "john"
+                ],
+                "byoc": "none",                # cannot submit BYOC jobs to my site
+                "manage_job": "none",          # cannot manage jobs
+                "download_job": "n:submitter", # can download the user's own jobs
+                "view": "any",                 # can do commands in the "view" category
+                "operate": "none"              # cannot do commands in the "operate" category
+            }
+        }
+    }
+
+.. 
_site_specific_auth:
+
+Site-specific Authentication and Federated Job-level Authorization
+==================================================================
+Site-specific authentication and authorization allow users to inject their own authentication and
+authorization methods into the NVFlare system. This includes FL server/client registration and authentication,
+as well as job deployment and run authorization.
+
+NVFlare provides a general-purpose, event-based pluggable authentication and authorization framework to allow for expanding functionality such as:
+
+ - exposing the app through a WAF (Web Application Firewall) or any other network element enforcing Mutual Transport Layer Security (mTLS)
+ - using a confidential certification authority to ensure the identity of each participating site and to ensure that they meet the computing requirements for confidential computing
+ - defining additional roles to manage who can submit which kinds of jobs to execute within NVFlare, identifying who submits jobs and which datasets can be accessed
+
+Users can write their own :ref:`FLComponents `, listening to the NVFlare system events at different points of the workflow,
+then easily plug in their authentication and authorization logic as needed.
+
+Assumptions and Risks
+---------------------
+By enabling customized site-specific authentication and authorization, NVFlare makes several pieces of security-related
+data available to external FL components, e.g. IDENTITY_NAME, PUBLIC_KEY, CERTIFICATE, etc. In order
+to protect this data from being compromised, it needs to be made read-only.
+
+Because authentication and authorization are performed by external pluggable processes, the results of these processes could
+prevent jobs from being deployed or run. When configuring and using these functions, users need
+to be aware of the impact and know where to plug in the authentication and authorization check.
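As a sketch of this idea, the following framework-free snippet shows a component reacting to a client-registration event and recording an authorization decision in the context. ``FLContext`` and the authenticator class here are simplified toy stand-ins for the NVFLARE classes, and the trusted-org list is an assumption made up for the example:

```python
# Toy stand-ins for the NVFLARE classes; only the pattern is real:
# a component handles an event and writes a decision into the context.

class FLContext(dict):
    """Simplified context: a dict of public properties."""

class ClientRegisterAuthenticator:
    """Rejects registration of clients from untrusted orgs (illustrative rule)."""

    TRUSTED_ORGS = {"nvidia", "orgA"}  # assumption for this example

    def handle_event(self, event_type, fl_ctx):
        if event_type == "_client_registered":  # i.e. EventType.CLIENT_REGISTERED
            ok = fl_ctx.get("USER_ORG") in self.TRUSTED_ORGS
            fl_ctx["AUTHORIZATION_RESULT"] = ok
            if not ok:
                fl_ctx["AUTHORIZATION_REASON"] = "org not trusted"

ctx = FLContext(USER_ORG="unknown-org")
ClientRegisterAuthenticator().handle_event("_client_registered", ctx)
print(ctx["AUTHORIZATION_RESULT"])  # False
```

In NVFLARE, such a component would be configured at the site, and the framework would read the recorded result to accept or reject the operation.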
+
+Event-based pluggable authentication and authorization
+-------------------------------------------------------
+The NVFlare event-based solution supports site-specific authentication and federated job-level authorization.
+Users can provide and implement any sort of additional security checks by building and plugging in FLComponents which
+listen to the appropriate events and provide custom authentication and authorization functions.
+
+.. code-block:: python
+
+    class EventType(object):
+        """Built-in system events."""
+
+        SYSTEM_START = "_system_start"
+        SYSTEM_END = "_system_end"
+        ABOUT_TO_START_RUN = "_about_to_start_run"
+        START_RUN = "_start_run"
+        ABOUT_TO_END_RUN = "_about_to_end_run"
+        END_RUN = "_end_run"
+        SWAP_IN = "_swap_in"
+        SWAP_OUT = "_swap_out"
+        START_WORKFLOW = "_start_workflow"
+        END_WORKFLOW = "_end_workflow"
+        ABORT_TASK = "_abort_task"
+        FATAL_SYSTEM_ERROR = "_fatal_system_error"
+        FATAL_TASK_ERROR = "_fatal_task_error"
+        JOB_DEPLOYED = "_job_deployed"
+        JOB_STARTED = "_job_started"
+        JOB_COMPLETED = "_job_completed"
+        JOB_ABORTED = "_job_aborted"
+        JOB_CANCELLED = "_job_cancelled"
+
+        BEFORE_PULL_TASK = "_before_pull_task"
+        AFTER_PULL_TASK = "_after_pull_task"
+        BEFORE_PROCESS_SUBMISSION = "_before_process_submission"
+        AFTER_PROCESS_SUBMISSION = "_after_process_submission"
+
+        BEFORE_TASK_DATA_FILTER = "_before_task_data_filter"
+        AFTER_TASK_DATA_FILTER = "_after_task_data_filter"
+        BEFORE_TASK_RESULT_FILTER = "_before_task_result_filter"
+        AFTER_TASK_RESULT_FILTER = "_after_task_result_filter"
+        BEFORE_TASK_EXECUTION = "_before_task_execution"
+        AFTER_TASK_EXECUTION = "_after_task_execution"
+        BEFORE_SEND_TASK_RESULT = "_before_send_task_result"
+        AFTER_SEND_TASK_RESULT = "_after_send_task_result"
+
+        CRITICAL_LOG_AVAILABLE = "_critical_log_available"
+        ERROR_LOG_AVAILABLE = "_error_log_available"
+        EXCEPTION_LOG_AVAILABLE = "_exception_log_available"
+        WARNING_LOG_AVAILABLE = "_warning_log_available"
+        INFO_LOG_AVAILABLE = 
"_info_log_available" + DEBUG_LOG_AVAILABLE = "_debug_log_available" + + PRE_RUN_RESULT_AVAILABLE = "_pre_run_result_available" + + # event types for job scheduling - server side + BEFORE_CHECK_CLIENT_RESOURCES = "_before_check_client_resources" + + # event types for job scheduling - client side + BEFORE_CHECK_RESOURCE_MANAGER = "_before_check_resource_manager" + +Additional system events +^^^^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: python + + AFTER_CHECK_CLIENT_RESOURCES = "_after_check_client_resources" + DEPLOY_JOB_TO_SERVER = "_deploy_job_to_server" + DEPLOY_JOB_TO_CLIENT = "_deploy_job_to_client" + + BEFORE_SEND_ADMIN_COMMAND = "_before_send_admin_command" + + BEFORE_CLIENT_REGISTER = "_before_client_register" + AFTER_CLIENT_REGISTER = "_after_client_register" + CLIENT_REGISTERED = "_client_registered" + SYSTEM_BOOTSTRAP = "_system_bootstrap" + + AUTHORIZE_COMMAND_CHECK = "_authorize_command_check" + + +Security check Inputs +--------------------- +Make a ``SECURITY_ITEMS`` dict available in the FLContext, which holds any security check related data. + +NVFlare standard data: + +.. code-block:: python + + IDENTITY_NAME + SITE_NAME + SITE_ORG + USER_NAME + USER_ORG + USER_ROLE + JOB_META + + +Security check Outputs +---------------------- + +.. code-block:: python + + AUTHORIZATION_RESULT + AUTHORIZATION_REASON + +NVFlare will check the ``AUTHORIZATION_RESULT`` to determine if the operations have been authorized to be performed. Before each +operation, the NVFLare platform removes any ``AUTHORIZATION_RESULT`` in the FLContext. After the authorization check process, it +looks for if these results are present in the FLContext or not. If present, it uses its TRUE/FALSE value to determine the action. +If not present, it will be treated as TRUE by default. + +Each FLComponent listening and handling the event can use the security data to generate the necessary authorization check +results as needed. 
The workflow continues only when all the FLComponents pass the security check. Any single FLComponent +that sets a FALSE value will stop the workflow execution. + +FLARE Console event support +--------------------------- +To support additional security data for site-specific customized authentication, event support is also provided for +the FLARE console. Using these events, the FLARE console can attach custom security-related data, such as +SSL certificates, to the admin commands sent to the server for the site-specific authentication check. + +.. code-block:: python + + BEFORE_ADMIN_REGISTER + AFTER_ADMIN_REGISTER + BEFORE_SENDING_COMMAND + AFTER_SENDING_COMMAND + BEFORE_RECEIVING_ADMIN_RESULT + AFTER_RECEIVING_ADMIN_RESULT + +.. note:: + + The site-specific authentication and authorization apply to both the FLARE console and :ref:`flare_api`. + +Allow more data to be sent to the server for client registration +---------------------------------------------------------------- +If the application needs to send additional data from the client to the server to perform the authentication check, the client +can set the data into the FLContext as public data. The server side can then access the data through the PEER_FL_CONTEXT. +The application can build an FLComponent that listens to EventType.CLIENT_REGISTERED to perform the needed authentication check. + + +Site-specific Security Example +------------------------------ +To use the site-specific security functions, write a custom security handler in ``local/custom/security_handler.py``, +then configure it as a component in the site's ``resources.json``. + +.. code-block:: python + + from typing import Tuple + + from nvflare.apis.event_type import EventType + from nvflare.apis.fl_component import FLComponent + from nvflare.apis.fl_constant import FLContextKey + from nvflare.apis.fl_context import FLContext + from nvflare.apis.job_def import JobMetaKey + + + class CustomSecurityHandler(FLComponent): + + def handle_event(self, event_type: str, fl_ctx: FLContext): + if event_type == EventType.AUTHORIZE_COMMAND_CHECK: + result, reason = self.authorize(fl_ctx=fl_ctx) + if not result: + fl_ctx.set_prop(FLContextKey.AUTHORIZATION_RESULT, False, sticky=False) + fl_ctx.set_prop(FLContextKey.AUTHORIZATION_REASON, reason, sticky=False) + + def authorize(self, fl_ctx: FLContext) -> Tuple[bool, str]: + command = fl_ctx.get_prop(FLContextKey.COMMAND_NAME) + if command in ["check_resources"]: + security_items = fl_ctx.get_prop(FLContextKey.SECURITY_ITEMS) + job_meta = security_items.get(FLContextKey.JOB_META) + if job_meta.get(JobMetaKey.JOB_NAME) == "FL Demo Job1": + return False, f"Not authorized to execute: {command}" + else: + return True, "" + else: + return True, "" + +In the ``local/resources.json``: + +.. code-block:: json + + { + "format_version": 2, + ... + "components": [ + { + "id": "resource_manager", + "path": "nvflare.app_common.resource_managers.gpu_resource_manager.GPUResourceManager", + "args": { + "num_of_gpus": 0, + "mem_per_gpu_in_GiB": 0 + } + }, + ... + { + "id": "security_handler", + "path": "security_handler.CustomSecurityHandler" + } + ] + } + + +With the above example, when a job named "FL Demo Job1" is scheduled to run on this client from the server, +the client will raise an authorization error and prevent the job from running. Any other job will be able to execute +on this client.
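The ``AUTHORIZATION_RESULT`` semantics described earlier (cleared before each operation, a missing result treated as TRUE, any FALSE stopping the operation) can be sketched with a small plain-Python mock. ``MockFLContext``, ``is_authorized``, and ``deny_demo_job`` are hypothetical names for illustration, not NVFlare's actual classes:

```python
# Plain-Python mock illustrating the AUTHORIZATION_RESULT check sequence.

class MockFLContext:
    def __init__(self):
        self._props = {}

    def set_prop(self, key, value, sticky=False):
        self._props[key] = value

    def get_prop(self, key, default=None):
        return self._props.get(key, default)

    def remove_prop(self, key):
        self._props.pop(key, None)


AUTHORIZATION_RESULT = "AUTHORIZATION_RESULT"


def is_authorized(fl_ctx, handlers):
    fl_ctx.remove_prop(AUTHORIZATION_RESULT)        # cleared before each operation
    for handler in handlers:
        handler(fl_ctx)                             # each component may set a veto
    return fl_ctx.get_prop(AUTHORIZATION_RESULT, True)  # absent => authorized


def deny_demo_job(fl_ctx):
    # mimics the CustomSecurityHandler above: block only "FL Demo Job1"
    if fl_ctx.get_prop("JOB_NAME") == "FL Demo Job1":
        fl_ctx.set_prop(AUTHORIZATION_RESULT, False)


ctx = MockFLContext()
ctx.set_prop("JOB_NAME", "hello-numpy-sag")
print(is_authorized(ctx, [deny_demo_job]))  # True: no handler objected

ctx.set_prop("JOB_NAME", "FL Demo Job1")
print(is_authorized(ctx, [deny_demo_job]))  # False: a single FALSE stops the operation
```

The same sequence is what the platform performs around each checked operation; the handler only needs to set a FALSE result when it wants to veto.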
diff --git a/docs/user_guide/security/serialization.rst b/docs/user_guide/security/serialization.rst new file mode 100644 index 0000000000..997804161b --- /dev/null +++ b/docs/user_guide/security/serialization.rst @@ -0,0 +1,8 @@ +.. _serialization: + +Message Serialization +===================== +NVFLARE uses a secure mechanism called FOBS (Flare OBject Serializer) for message serialization and +deserialization when exchanging data between the server and clients. + +See ``_ for usage guidelines. diff --git a/docs/user_guide/site_policy_management.rst b/docs/user_guide/security/site_policy_management.rst similarity index 85% rename from docs/user_guide/site_policy_management.rst rename to docs/user_guide/security/site_policy_management.rst index b9689c7a33..75889453b3 100644 --- a/docs/user_guide/site_policy_management.rst +++ b/docs/user_guide/security/site_policy_management.rst @@ -3,20 +3,18 @@ **************************************** Site Policy Management **************************************** -Prior to NVFLARE 2.2, all policies (resource management, authorization and privacy protection, logging configurations) can only be defined by the Project Admin during provision time; and authorization policies are centrally enforced by the FL Server. - -However, in a true federated system, FL sites could be running in different IT environments that are subject to local site policies. For example, a client may have 4 GPUs, whereas another site may have 8. Even for the same site, the computing resources could change from time to time. As discussed in the Federated Authorization document, authorization policies should be totally controlled by each site. It is not feasible for the Project Admin to define local policies for each FL client site. 
- -NVFLARE 2.2 makes it possible for each site to define its own policies in the following areas: +It is possible for each site to define its own policies in the following areas: - Resource Management: the configuration of system resources that are solely the decisions of local IT; - Authorization Policy: local authorization policy that determines what a user can or cannot do on the local site; - Privacy Policy: local policy that specifies what types of studies are allowed and how to add privacy protection to the learning results produced by the FL client on the local site. - Logging Configuration: each site can now define its own logging configuration for system generated log messages. + Workspace Structure =================== -NVFLARE's policy files are stored in the workspace. To support local site policies, a new "local" folder is added to the workspace. Here is the complete workspace structure, with the addition os the "local" folder: +NVFLARE's policy files are stored in the workspace. To support local site policies, a new "local" folder is added to the workspace. +Here is the complete workspace structure, with the addition of the "local" folder: .. code-block:: :emphasize-lines: 2-11 @@ -58,9 +56,14 @@ NVFLARE's policy files are stored in the workspace. To support local site polici config custom -Content highlighted in yellow is generated by the Provision process - the ZIP package generated by the Provision now contains two folders: startup and local. The "startup" folder contains security credentials needed for communication to the FL Server, as well as general system configuration information. The "local" folder contains default and/or samples for local policies. If the Org Admin wants to define his/her own policies, he/she can do so by creating separate files to override the default. These files are unhighlighted ones in the "local" folder. 
+Content highlighted in yellow is generated by the Provision process - the ZIP package generated by the Provision now contains two +folders: startup and local. The "startup" folder contains security credentials needed for communication to the FL Server, as well as +general system configuration information. The "local" folder contains default and/or samples for local policies. If the Org Admin +wants to define his/her own policies, he/she can do so by creating separate files to override the default. These files are unhighlighted +ones in the "local" folder. -The Org Admin can also install additional custom code in the "local/custom" folder. This makes it possible for the site to develop its own custom filters for privacy control. +The Org Admin can also install additional custom code in the "local/custom" folder. This makes it possible for the site to develop its +own custom filters for privacy control. Resource Management Policy ========================== @@ -164,7 +167,7 @@ The Org Admin can define local authorization policy in authorization.json. Privacy Management ================== -NVFLARE 2.2 comes with a security enhancement that allows each site to define its own privacy protection policy to be applied to the learning results produced by the client. +NVFLARE comes with a security enhancement that allows each site to define its own privacy protection policy to be applied to the learning results produced by the client. Note that in this discussion, data privacy protection specifically refers to this threat: the receiver (Server) of the learning results produced by a sender (Client) could discover/reconstruct the learning data by reverse engineering the learning results. @@ -175,7 +178,7 @@ As in previous versions of NVFLARE, the primary privacy protection technique is In previous versions of NVFLARE, only researchers can specify filters in the job configuration. However it may not be the best interest of the researchers to protect data privacy of FL clients. 
Protecting data privacy is the Org Admin's interest. -NVFLARE 2.2 allows the Org Admin to specify filters for data privacy protection. Unlike researcher-specified filters that are only applicable to a job, filters specified in the site's privacy policies are applicable to all jobs! This is made possible by the concept of Scope. +NVFLARE allows the Org Admin to specify filters for data privacy protection. Unlike researcher-specified filters that are only applicable to a job, filters specified in the site's privacy policies are applicable to all jobs! This is made possible by the concept of Scope. A scope can be thought of as a space within which jobs are performed. For example, depending on the purpose of the FL project, the Project Admin may decide to conduct the study in two phases. First run jobs in a "public" scope that use some publicly available datasets and with relaxed data privacy protection. After algorithms are determined, then run jobs in a "private" scope where each site's own datasets will be used with more strict data privacy protection. @@ -252,7 +255,7 @@ The scope of the job is specified with the meta key "scope". If the job doesn't Privacy Processing Rules ======================== -The following are the privacy processing rules built into NVFLARE 2.2: +The following are the privacy processing rules built into NVFLARE: If the site does not define privacy.json, then no privacy control is applied. diff --git a/docs/user_guide/security/terminologies_and_roles.rst b/docs/user_guide/security/terminologies_and_roles.rst new file mode 100644 index 0000000000..c32a8ecfd4 --- /dev/null +++ b/docs/user_guide/security/terminologies_and_roles.rst @@ -0,0 +1,83 @@ +*********************** +Terminologies and Roles +*********************** + +Terminologies +============= +For establishing background knowledge, here are a few terms. + +Project +------- +An FL study with identified participants. + +Org +--- +An organization that participates in the study. 
+ +Site +---- +The computing system that runs the NVFLARE application as part of the study. +There are two kinds of sites: Server and Clients. +Each site belongs to an organization. + +FL Server +------------ +An application running on a Server site responsible for client coordination based on federation workflows. There can be +one or more FL Servers for each project. + +FL Client +---------- +An application running on a client site that responds to the Server's task assignments and performs learning actions based +on its local data. + +Overseer +---------- +An application responsible for overseeing overall system health and enabling seamless failover of FL servers. This +component is only needed for High Availability. + +User +----- +A human that participates in the FL project. + +.. _nvflare_roles: + +Role +------ +A role defines a type of user with certain privileges for system operations. Each user is assigned a role in the +project. There are four defined roles: Project Admin, Org Admin, Lead Researcher, and Member Researcher. + +.. _project_admin_role: + +Project Admin Role +^^^^^^^^^^^^^^^^^^^^ +The Project Admin is responsible for provisioning the participants and coordinating personnel from all sites for the project. +When using the Dashboard UI, the Project Admin is the administrator for the site and is responsible for inputting the +values to set up the project in the beginning and then approving the users and client sites while making edits if necessary. + +The Project Admin is also responsible for the management of the FL Server. + +There is only one Project Admin for each project. + +Org Admin Role +^^^^^^^^^^^^^^^^^^^^ +This role is responsible for the management of the sites of his/her organization. + +Lead Researcher Role +^^^^^^^^^^^^^^^^^^^^^^^ +This role can be configured for increased privileges for an organization for a scientist who works +with other researchers to ensure the success of the project.
+ +Member Researcher Role +^^^^^^^^^^^^^^^^^^^^^^^ +This role can be configured for another level of privileges for a scientist who works with the Lead Researcher +to make sure his/her site is properly prepared for the project. + +FLARE Console (previously called Admin Client) +---------------------------------------------- +A console application running on a user's machine that allows the user to perform NVFLARE system operations with a +command line interface. + +Provisioning Tool +----------------- +The tool used by the Project Admin to provision all participating sites and users of the project. The output of the +Provisioning tool enables all participants (sites and users) to securely communicate with each other. diff --git a/docs/user_guide/security/unsafe_component_detection.rst b/docs/user_guide/security/unsafe_component_detection.rst new file mode 100644 index 0000000000..5fcdcf182d --- /dev/null +++ b/docs/user_guide/security/unsafe_component_detection.rst @@ -0,0 +1,94 @@ +************************** +Unsafe Component Detection +************************** +NVFLARE is based on a componentized architecture in which FL jobs are performed by components that are configured in configuration +files. These components are created at the beginning of job execution. To address the issue of components potentially being unsafe +and leaking sensitive information, NVFLARE uses an event-based solution. + +NVFLARE has a very powerful and flexible event mechanism that allows custom code to be plugged into defined moments of the system +workflow (e.g. start/end of the job, before/after a task is executed, etc.). At such moments, NVFLARE fires events and invokes +:ref:`fl_component` objects that handle these events. + +The ``BEFORE_BUILD_COMPONENT`` event type allows a custom FLComponent to detect unsafe job components during configuration processing. This event +type is fired before the configuration processor starts to build a job component (executor, filter, etc.).
+ +Detect Unsafe Job Components +============================ +To detect unsafe job components, the user simply needs to create a custom FLComponent object that handles this event, +as shown in the following ComponentChecker example: + +.. code-block:: python + + from nvflare.apis.event_type import EventType + from nvflare.apis.fl_component import FLComponent + from nvflare.apis.fl_constant import FLContextKey + from nvflare.apis.fl_context import FLContext + from nvflare.apis.fl_exception import UnsafeComponentError + + class ComponentChecker(FLComponent): + + def handle_event(self, event_type: str, fl_ctx: FLContext): + prop_keys = fl_ctx.get_prop_keys() + if event_type == EventType.BEFORE_BUILD_COMPONENT: + print(f"ComponentChecker: fl_ctx props: {prop_keys}") + comp_config = fl_ctx.get_prop(FLContextKey.COMPONENT_CONFIG) + print(f"Comp Config: {comp_config}") + # this demo rejects every component; a real checker would + # inspect comp_config before deciding to raise + raise UnsafeComponentError("client encountered bad component") + + +The important points are: + + - The class must extend FLComponent. + - It defines the handle_event method, following the exact signature. + - It checks that the event_type is ``EventType.BEFORE_BUILD_COMPONENT``. + - It checks the component being built based on the information provided in the fl_ctx. There are many properties in fl_ctx; the most important is ``COMPONENT_CONFIG``, a dict of the component's configuration data. The fl_ctx also has ``WORKSPACE_OBJECT``, which allows you to access any file in the job's workspace. + - If any issue is detected with the component to be built, raise the ``UnsafeComponentError`` exception with a meaningful message. + +The following properties in the fl_ctx could be helpful too: + +``FLContextKey.COMPONENT_NODE`` - This gives you information about the component's location in the config structure (which can be viewed as a tree). + +``FLContextKey.CONFIG_CTX`` - This gives you information about the entire config structure.
+ +``FLContextKey.CURRENT_JOB_ID`` - The ID of the current job. + +``FLContextKey.JOB_META`` - This is a dict that contains meta information (e.g. job submitter's name, org and role) about the current job. + +``FLContextKey.WORKSPACE_OBJECT`` - This object provides many convenience methods to determine the paths of files in the workspace + +Install Your Component Checker +============================== +Once you define your component checker (you can name your class any way you want - does not have to be ComponentChecker), you need +to install it to your FL site(s). + +First of all, your custom code could be included as part of your FL docker, depending on how you manage the docker. If this is not +possible, then you can include it in the FL site's ``/local/custom`` folder. + +Second, include this custom component in your site's ``job_resources.json``, as shown here: + +.. code-block:: json + + { + "format_version": 2, + "components": [ + { + "id": "comp_checker", + "path": "comp_auth.ComponentChecker" + } + ] + } + +Your site's workspace should look like this: + +.. code-block:: + + workspace_root + local + resources.json + job_resources.json + ... + custom + comp_auth.py + startup + ... + diff --git a/examples/advanced/README.md b/examples/advanced/README.md index 5a16376f14..fc4acbb7ef 100644 --- a/examples/advanced/README.md +++ b/examples/advanced/README.md @@ -1,8 +1,8 @@ # NVFlare advanced examples -We introduce advanced examples in this folder. +This folder contains advanced examples for NVFlare. -Please make sure you set up virtual environment and Jupyterlab follows [example root readme](../README.md) +Please make sure you set up a virtual environment and install JupyterLab following the [example root readme](../README.md). Please also install "./requirements.txt" in each example folder. @@ -36,6 +36,20 @@ Please also install "./requirements.txt" in each example folder. 
* [Federated Learning for Prostate Segmentation from Multi-source Data](./prostate/README.md) * Example of training a multi-institutional prostate segmentation model using [FedAvg](https://arxiv.org/abs/1602.05629), [FedProx](https://arxiv.org/abs/1812.06127), and [Ditto](https://arxiv.org/abs/2012.04221). +## Finance +* [Financial Application with Federated XGBoost Methods](./finance/README.md) + * Illustrates the use of NVFlare on a financial application using XGBoost to train a model in a federated manner. + +## Swarm Learning +* [Swarm Learning](./swarm_learning/README.md) + * Example of swarm learning with NVIDIA FLARE using PyTorch with the CIFAR-10 dataset. + +## Vertical Federated Learning +* [Vertical Federated Learning](./vertical_federated_learning/README.md) + * Example of running split learning using the CIFAR-10 dataset. +* [Vertical Federated XGBoost](./vertical_xgboost/README.md) + * Example of vertical federated learning with NVIDIA FLARE on tabular data. + ## Federated Statistics * [Federated Statistic Overview](./federated-statistics/README.md) * Discuss the overall federated statistics features @@ -48,9 +62,19 @@ Please also install "./requirements.txt" in each example folder. * [Federated Policies](./federated-policies/README.rst) * Discuss the federated site policies for authorization, resource and data privacy management +## Custom Authentication +* [Custom Authentication](./custom_authentication/README.rst) + * Example demonstrating custom authentication policy + +## Job-level Authorization +* [Job-level Authorization](./job-level-authorization/README.md) + * Example demonstrating job-level authorization policy + ## Experiment tracking * [Hello PyTorch with TensorBoard Streaming](./experiment-tracking/tensorboard/README.md) * Example building upon [Hello PyTorch](../hello-world/hello-pt/README.md) showcasing the [TensorBoard](https://tensorflow.org/tensorboard) streaming capability from the clients to the server. 
+* [Experiment Tracking with MLflow and Weights and Biases](./experiment-tracking/README.md) + * Example showing the use of the Writers and Receivers in NVFlare to write to different experiment tracking systems. ## Federated Learning Hub diff --git a/examples/advanced/job-level-authorization/README.md b/examples/advanced/job-level-authorization/README.md new file mode 100644 index 0000000000..d3481207f5 --- /dev/null +++ b/examples/advanced/job-level-authorization/README.md @@ -0,0 +1,82 @@ +# Example for Job-level Authorization + +## Overview + +The purpose of this example is to demonstrate the following features of NVFlare: + +1. Run NVFlare in secure mode +2. Demonstrate job-level authorization policy + +## System Requirements + +1. Install Python and set up a virtual environment, +``` +python3 -m venv nvflare-env +source nvflare-env/bin/activate +``` +2. Install NVFlare +``` +pip install nvflare +``` +3. The example is part of the NVFlare source code. The source code can be obtained like this, +``` +git clone https://github.com/NVIDIA/NVFlare.git +``` +4. TLS requires domain names. Please add the following line in the `/etc/hosts` file, +``` +127.0.0.1 server1 +``` + +### Setup + +``` +cd NVFlare/examples/advanced/job-level-authorization +./setup.sh +``` + +All the startup kits will be generated in this folder, +``` +/tmp/nvflare/poc/job-level-authorization/prod_00 +``` + +Note that the "workspace" folder is removed every time `setup.sh` is run. Please do not save customized files in this folder. + +### Starting NVFlare + +This script will start up the server and 2 clients, +``` +nvflare poc start +``` + +### Logging in with the Admin Console + +For example, to log in as the `super@a.org` user: + +``` +cd /tmp/nvflare/poc/job-level-authorization/prod_00/super@a.org +./startup/fl_admin.sh +``` + +At the prompt, enter the user email `super@a.org` + +The setup.sh has copied the jobs folder to the workspace folder.
+So jobs can be submitted like this; type the following commands in the admin console: + +``` +submit_job ../../job1 +submit_job ../../job2 +``` + +## Participants + +### Site +* `server1`: NVFlare server +* `site_a`: Site_a has a CustomSecurityHandler set up which does not allow the job "FL Demo Job1" to run. Jobs with any other name will be able to deploy and run on site_a. +* `site_b`: Site_b does not have the extra security handling code. It allows any job to be deployed and run. + +### Jobs + +* job1: The job is called `hello-numpy-sag`. site_a will allow this job to run. +* job2: The job is called `FL Demo Job1`. site_a will block this job from running. + + diff --git a/examples/advanced/job-level-authorization/README.rst b/examples/advanced/job-level-authorization/README.rst deleted file mode 100644 index 6979cc8f79..0000000000 --- a/examples/advanced/job-level-authorization/README.rst +++ /dev/null @@ -1,85 +0,0 @@ -Example for Federated Policies -============================== - - -Overview -------- - -The purpose of this example is to demonstrate following features of NVFlare, - -1. Run NVFlare in secure mode -2. Demonstrate job-level authorization policy - -System Requirements ------------------- - -1. Install Python and Virtual Environment, -:: - python3 -m venv nvflare-env - source nvflare-env/bin/activate - -2. Install NVFlare -:: - pip install nvflare - -3. The example is part of the NVFlare source code. The source code can be obtained like this, -:: - git clone https://github.com/NVIDIA/NVFlare.git - -4. TLS requires domain names. Please add following line in :code:`/etc/hosts` file, -:: - 127.0.0.1 server1 - - -Setup -_____ - -:: - cd NVFlare/examples/advanced/job-level-authorization - ./setup.sh -All the startup kits will be generated in this folder, -:: - /tmp/nvflare/poc/job-level-authorization/prod_00 - -.. note:: - :code:`workspace` folder is removed everytime :code:`setup.sh` is run. Please do not save customized - files in this folder.
- -Starting NVFlare -________________ - -This script will start up the server and 2 clients, -:: - nvflare poc start - -Logging with Admin Console -__________________________ - -For example, this is how to login as :code:`super@a.org` user, -:: - cd /tmp/nvflare/poc/job-level-authorization/prod_00/super@a.org - ./startup/fl_admin.sh -At the prompt, enter the user email :code:`super@a.org` - -The setup.sh has copied the jobs folder to the workspace folder. -So jobs can be submitted like this, type the following command in the admin console: - -:: - submit_job ../../job1 - submit_job ../../job2 - -Participants ------------- -Site -____ -* :code:`server1`: NVFlare server -* :code:`site_a`: Site_a has a CustomSecurityHandler set up which does not allow the job "FL Demo Job1" to run. Any other named jobs will be able to deploy and run on site_a. -* :code:`site_b`: Site_b does not have the extra security handling codes. It allows any job to be deployed and run. - -Jobs -____ - -* job1: The job is called :code:`hello-numpy-sag`. site_a will allow this job to run. -* job2: The job is called :code:`FL Demo Job1`. site_a will block this job to run. - - diff --git a/examples/hello-world/README.md b/examples/hello-world/README.md index 4c2f375c8c..286a8fe111 100644 --- a/examples/hello-world/README.md +++ b/examples/hello-world/README.md @@ -19,15 +19,20 @@ Before you run the notebook, the following preparation work must be done: ## Hello World Examples ### Easier ML/DL to FL transition -* [ML to FL](./ml-to-fl/README.md): Showcase how to convert existing ML/DL codes to a NVFlare job. +* [ML to FL](./ml-to-fl/README.md): Showcases how to convert existing ML/DL code to an NVFlare job. + +### Step by step examples +* [Step by step examples](./step-by-step/readme.md): Shows specific techniques and workflows and what needs to be changed for each. 
### Workflows * [Hello Scatter and Gather](./hello-numpy-sag/README.md) - * Example using "[ScatterAndGather](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.workflows.scatter_and_gather.html)" controller workflow. + * Example using [ScatterAndGather](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.workflows.scatter_and_gather.html) controller workflow. * [Hello Cross-Site Validation](./hello-numpy-cross-val/README.md) * Example using [CrossSiteModelEval](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.workflows.cross_site_model_eval.html) controller workflow. * [Hello Cyclic Weight Transfer](./hello-cyclic/README.md) * Example using [CyclicController](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.workflows.cyclic_ctl.html) controller workflow to implement [Cyclic Weight Transfer](https://pubmed.ncbi.nlm.nih.gov/29617797/). +* [Hello Client Controlled Workflows](./hello-ccwf/README.md) + * Example using [Client Controlled Workflows](https://nvflare.readthedocs.io/en/main/programming_guide/controllers/client_controlled_workflows.html). ### Deep Learning * [Hello PyTorch](./hello-pt/README.md) diff --git a/examples/hello-world/hello-ccwf/README.md b/examples/hello-world/hello-ccwf/README.md new file mode 100644 index 0000000000..cb4fe495ea --- /dev/null +++ b/examples/hello-world/hello-ccwf/README.md @@ -0,0 +1,26 @@ +# Hello Client Controlled Workflow (CCWF) + +[Client Controlled Workflows](https://nvflare.readthedocs.io/en/main/programming_guide/controllers/client_controlled_workflows.html) are managed +by logic from clients. This example shows the components used in a job for a client controlled workflow. + +### 1. Install NVIDIA FLARE + +Follow the [Installation](https://nvflare.readthedocs.io/en/main/quickstart.html) instructions. + +### 2. 
Run the experiment + +Use the NVFlare simulator to run the example: + +``` +nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 hello-ccwf/jobs/numpy-swcse +``` + +### 3. Access the logs and results + +You can find the running logs and results inside the simulator's workspace/simulate_job: + +```bash +$ ls /tmp/nvflare/simulate_job/ +app_server app_site-1 app_site-2 log.txt + +``` diff --git a/examples/hello-world/step-by-step/readme.md b/examples/hello-world/step-by-step/readme.md index 3cf5b0786d..c2bd31a2a6 100644 --- a/examples/hello-world/step-by-step/readme.md +++ b/examples/hello-world/step-by-step/readme.md @@ -1,9 +1,9 @@ # Step-by-Step Examples -When give a machine learning problem, we probably wonder, where do we start to formulate the federated learning problem. +When given a machine learning problem, we may wonder where to start in formulating it as a federated learning problem. -* What does the data look like ? -* How do we compare global statistics with the site's local data statistics ? +* What does the data look like? +* How do we compare global statistics with the site's local data statistics? * How to formulate the federated algorithms * https://developer.download.nvidia.com/healthcare/clara/docs/federated_traditional_machine_learning_algorithms.pdf * Given the formulation, how to convert the existing machine learning or deep learning code to Federated learning code. @@ -22,8 +22,8 @@ The images in CIFAR-10 are of size 3x32x32, i.e. 3-channel color images of 32x32 ![image](cifar10/data/cifar10.png) -We will use using [pytorch](https://pytorch.org/) deep learning framework to illustrate how to formulate, and convert the deep learning training -program to federated learning training program. The example will include +We will use the [pytorch](https://pytorch.org/) deep learning framework to illustrate how to formulate and convert the deep learning training +program to a federated learning training program. 
The example will include: * Federated Histogram analysis with Federated Statistics * Scatter and Gather (SAG) workflow with NVFLARE Client APIs @@ -35,17 +35,11 @@ program to federated learning training program. The example will include ## Tabular HIGGs dataset -With HIGGs Dataset, we like to demonstrate traditional machine learning techniques in federated learning. -These include: +With the HIGGs Dataset, we would like to demonstrate traditional machine learning techniques in federated learning. +These include: * Federated Statistics for tabular data * Federated Linear and Logistic Regression * Federated K-Means * Federated SVM with non-linear kernel * Federated (Horizontal) XGBoost
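The federated histogram analysis mentioned above can be illustrated with a toy sketch. This is plain Python, not the NVFlare federated-statistics API: each site bins its local data, and only the bin counts, never the raw values, leave the site for aggregation.

```python
def local_histogram(values, bin_edges):
    """Count how many local values fall into each [edge[i], edge[i+1]) bin."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts


def global_histogram(site_histograms):
    """Aggregate by element-wise summation of the per-site bin counts."""
    return [sum(bins) for bins in zip(*site_histograms)]


edges = [0, 10, 20, 30]
site_1 = local_histogram([1, 5, 12, 25], edges)   # [2, 1, 1]
site_2 = local_histogram([8, 15, 16, 29], edges)  # [1, 2, 1]
print(global_histogram([site_1, site_2]))         # [3, 3, 2]
```

Because summation is exact, the global histogram is identical to the one that would be computed if all the data were in one place, which is what makes histograms a natural fit for federated statistics.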