diff --git a/notebooks/responsible_ai/privacy/solutions/privacy_dpsgd.ipynb b/notebooks/responsible_ai/privacy/solutions/privacy_dpsgd.ipynb
new file mode 100644
index 00000000..f5f29584
--- /dev/null
+++ b/notebooks/responsible_ai/privacy/solutions/privacy_dpsgd.ipynb
@@ -0,0 +1,535 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "KwDK47gfLsYf"
+   },
+   "source": [
+    "# Implement Differential Privacy with TensorFlow Privacy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Learning Objectives\n",
+    "\n",
+    "* Learn how to wrap existing optimizers (e.g., SGD, Adam) into their differentially private counterparts using TensorFlow Privacy\n",
+    "* Understand the hyperparameters introduced by differentially private machine learning\n",
+    "* Measure the privacy guarantee provided using the analysis tools included in TensorFlow Privacy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "00fQV7e0Unz3"
+   },
+   "source": [
+    "## Overview"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "vsCUvXP0W4j2"
+   },
+   "source": [
+    "[Differential privacy](https://en.wikipedia.org/wiki/Differential_privacy) (DP) is a framework for measuring the privacy guarantees provided by an algorithm. Through the lens of differential privacy, you can design machine learning algorithms that responsibly train models on private data. Learning with differential privacy provides measurable guarantees of privacy, helping to mitigate the risk of exposing sensitive training data in machine learning. Intuitively, a model trained with differential privacy should not be affected by any single training example, or small set of training examples, in its dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "6vd8qUwEW5pP"
+   },
+   "source": [
+    "The basic idea of this approach, called differentially private stochastic gradient descent (DP-SGD), is to modify the gradients\n",
+    "used in stochastic gradient descent (SGD), which lies at the core of almost all deep learning algorithms. Models trained with DP-SGD provide provable differential privacy guarantees for their input data. There are two modifications made to the vanilla SGD algorithm:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "TUphKzYu01O9"
+   },
+   "source": [
+    "1. First, the sensitivity of each gradient needs to be bounded. In other words, you need to limit how much each individual training point sampled in a minibatch can influence gradient computations and the resulting updates applied to model parameters. This can be done by *clipping* each gradient computed on each training point.\n",
+    "2. Second, *random noise* is sampled and added to the clipped gradients, making it statistically impossible to tell whether or not a particular data point was included in the training dataset by comparing the updates SGD applies with and without that data point.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "jXU7MZhhW-aL"
+   },
+   "source": [
+    "This tutorial uses [tf.keras](https://www.tensorflow.org/guide/keras) to train a convolutional neural network (CNN) to recognize handwritten digits with the DP-SGD optimizer provided by the TensorFlow Privacy library. TensorFlow Privacy provides code that wraps an existing TensorFlow optimizer to create a variant that implements DP-SGD."
+   ]
+  },
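+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To build intuition for these two modifications, the next cell is a minimal, self-contained NumPy sketch of a single DP-SGD step on a toy linear model. It is illustrative only and is not the TensorFlow Privacy implementation used later in this notebook; the names `l2_norm_clip` and `noise_multiplier` mirror the hyperparameters introduced below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative sketch of one DP-SGD step (not the TensorFlow Privacy\n",
+    "# implementation): clip each per-example gradient, sum, add Gaussian\n",
+    "# noise calibrated to the clipping bound, then average and update.\n",
+    "import numpy as np\n",
+    "\n",
+    "rng = np.random.default_rng(0)\n",
+    "\n",
+    "l2_norm_clip = 1.0  # clipping bound C\n",
+    "noise_multiplier = 0.5  # noise stddev = noise_multiplier * C\n",
+    "learning_rate = 0.25\n",
+    "\n",
+    "# Toy linear regression: per-example gradient of 0.5 * (w.x - y)^2 is (w.x - y) x.\n",
+    "w = np.zeros(3)\n",
+    "x = rng.normal(size=(8, 3))  # a minibatch of 8 examples\n",
+    "y = rng.normal(size=8)\n",
+    "per_example_grads = (x @ w - y)[:, None] * x  # shape (8, 3)\n",
+    "\n",
+    "# 1. Clip each per-example gradient to L2 norm at most l2_norm_clip.\n",
+    "norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)\n",
+    "clipped = per_example_grads / np.maximum(1.0, norms / l2_norm_clip)\n",
+    "\n",
+    "# 2. Add Gaussian noise to the summed clipped gradients, then average.\n",
+    "noise = rng.normal(scale=noise_multiplier * l2_norm_clip, size=w.shape)\n",
+    "noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(x)\n",
+    "\n",
+    "w -= learning_rate * noisy_mean_grad"
+   ]
+  },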
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ijJYKVc05DYX"
+   },
+   "source": [
+    "## Setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "r56BqqyEqA16",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "!pip install --user tensorflow-privacy==0.8.12"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "CKuHPYQCsV-x"
+   },
+   "source": [
+    "Begin by importing the necessary libraries:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import warnings\n",
+    "\n",
+    "os.environ[\"TF_CPP_MIN_LOG_LEVEL\"] = \"2\"\n",
+    "warnings.filterwarnings(\"ignore\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "id": "ef56gCUqrdVn",
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "2024-03-11 18:23:24.342772: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
+      "2024-03-11 18:23:24.342907: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
+      "2024-03-11 18:23:24.349126: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
+     ]
+    }
+   ],
+   "source": [
+    "import numpy as np\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "tf.get_logger().setLevel(\"ERROR\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "r_fVhfUyeI3d"
+   },
+   "source": [
+    "Import TensorFlow Privacy."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "id": "RseeuA7veIHU",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import tensorflow_privacy\n",
+    "from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "mU1p8N7M5Mmn"
+   },
+   "source": [
+    "## Load and pre-process the dataset\n",
+    "\n",
+    "Load the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset and prepare the data for training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "id": "_1ML23FlueTr",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "train, test = tf.keras.datasets.mnist.load_data()\n",
+    "train_data, train_labels = train\n",
+    "test_data, test_labels = test\n",
+    "\n",
+    "train_data = np.array(train_data, dtype=np.float32) / 255\n",
+    "test_data = np.array(test_data, dtype=np.float32) / 255\n",
+    "\n",
+    "train_data = train_data.reshape(train_data.shape[0], 28, 28, 1)\n",
+    "test_data = test_data.reshape(test_data.shape[0], 28, 28, 1)\n",
+    "\n",
+    "train_labels = np.array(train_labels, dtype=np.int32)\n",
+    "test_labels = np.array(test_labels, dtype=np.int32)\n",
+    "\n",
+    "train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=10)\n",
+    "test_labels = tf.keras.utils.to_categorical(test_labels, num_classes=10)\n",
+    "\n",
+    "assert train_data.min() == 0.0\n",
+    "assert train_data.max() == 1.0\n",
+    "assert test_data.min() == 0.0\n",
+    "assert test_data.max() == 1.0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "xVDcswOCtlr3"
+   },
+   "source": [
+    "## Define the hyperparameters\n",
+    "Set the learning model hyperparameter values."
+   ]
+  },
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qXNp_25y7JP2" + }, + "source": [ + "DP-SGD has three general hyperamater and three privacy-specific hyperparameters that you must tune:\n", + "\n", + "**General hyperparameters**\n", + "\n", + "1. `epochs` (int) - This refers to the one entire passing of training data through the algorithm. Larger epoch increase the privacy risks since the model is trained on a same data point for multiple times.\n", + "2. `batch_size` (int) - Batch size affects different aspects of DP-SGD training. For instance, increasing the batch size could reduce the amount of noise added during training under the same privacy guarantee, which reduces the training variance.\n", + "3. `learning_rate` (float) - This hyperparameter already exists in vanilla SGD. The higher the learning rate, the more each update matters. If the updates are noisy (such as when the additive noise is large compared to the clipping threshold), a low learning rate may help the training procedure converge. \n", + "\n", + "**Privacy-specific hyperparameters**\n", + "1. `l2_norm_clip` (float) - The maximum Euclidean (L2) norm of each gradient that is applied to update model parameters. This hyperparameter is used to bound the optimizer's sensitivity to individual training points. \n", + "2. `noise_multiplier` (float) - Ratio of the standard deviation to the clipping norm (The amount of noise sampled and added to gradients during training). Generally, more noise results in better privacy (often, but not necessarily, at the expense of lower utility).\n", + "3. `microbatches` (int) - Each batch of data is split in smaller units called microbatches. By default, each microbatch should contain a single training example. This allows us to clip gradients on a per-example basis rather than after they have been averaged across the minibatch. This in turn decreases the (negative) effect of clipping on signal found in the gradient and typically maximizes utility. However, computational overhead can be reduced by increasing the size of microbatches to include more than one training examples. The average gradient across these multiple training examples is then clipped. The total number of examples consumed in a batch, i.e., one step of gradient descent, remains the same. The number of microbatches should evenly divide the batch size. \n", + "\n", + "\n", + "Use the hyperparameter values below to obtain a reasonably accurate model (95% test accuracy):" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "pVw_r2Mq7ntd", + "tags": [] + }, + "outputs": [], + "source": [ + "epochs = 1\n", + "batch_size = 32\n", + "learning_rate = 0.25\n", + "\n", + "l2_norm_clip = 1.0\n", + "noise_multiplier = 0.5\n", + "num_microbatches = 32 # Same as the batch size (i.e. no microbatch)\n", + "\n", + "if batch_size % num_microbatches != 0:\n", + " raise ValueError(\n", + " \"Batch noise_multipliere should be an integer multiple of the number of microbatches\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wXAmHcNOmHc5" + }, + "source": [ + "## Build the model\n", + "\n", + "Define a convolutional neural network as the learning model. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "oCOo8aOLmFta", + "tags": [] + }, + "outputs": [], + "source": [ + "model = tf.keras.Sequential(\n", + " [\n", + " tf.keras.layers.Conv2D(\n", + " 16,\n", + " 8,\n", + " strides=2,\n", + " padding=\"same\",\n", + " activation=\"relu\",\n", + " input_shape=(28, 28, 1),\n", + " ),\n", + " tf.keras.layers.MaxPool2D(2, 1),\n", + " tf.keras.layers.Conv2D(\n", + " 32, 4, strides=2, padding=\"valid\", activation=\"relu\"\n", + " ),\n", + " tf.keras.layers.MaxPool2D(2, 1),\n", + " tf.keras.layers.Flatten(),\n", + " tf.keras.layers.Dense(32, activation=\"relu\"),\n", + " tf.keras.layers.Dense(10, activation=\"softmax\"),\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FT4lByFg-I_r" + }, + "source": [ + "Define the optimizer and loss function for the learning model. Compute the loss as a vector of losses per-example rather than as the mean over a minibatch to support gradient manipulation over each training point. " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "bqBvjCf5-ZXy", + "tags": [] + }, + "outputs": [], + "source": [ + "optimizer = tensorflow_privacy.DPKerasSGDOptimizer(\n", + " l2_norm_clip=l2_norm_clip,\n", + " noise_multiplier=noise_multiplier,\n", + " num_microbatches=num_microbatches,\n", + " learning_rate=learning_rate,\n", + ")\n", + "\n", + "loss = tf.keras.losses.CategoricalCrossentropy(\n", + " reduction=tf.losses.Reduction.NONE\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LI_3nXzEGmrP" + }, + "source": [ + "## Train the model\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "z4iV03VqG1Bo", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1875/1875 [==============================] - 168s 89ms/step - loss: 0.7247 - accuracy: 0.8371 - val_loss: 0.8313 - val_accuracy: 0.8877\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.compile(optimizer=optimizer, loss=loss, metrics=[\"accuracy\"])\n", + "\n", + "model.fit(\n", + " train_data,\n", + " train_labels,\n", + " epochs=epochs,\n", + " validation_data=(test_data, test_labels),\n", + " batch_size=batch_size,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0kkzQH2LXNjF" + }, + "source": [ + "## Measure the differential privacy guarantee\n", + "\n", + "Perform a privacy analysis to measure the DP guarantee achieved by a training algorithm. Knowing the level of DP achieved enables the objective comparison of two training runs to determine which of the two is more privacy-preserving. At a high level, the privacy analysis measures how much a potential adversary can improve their guess about properties of any individual training point by observing the outcome of the training procedure (e.g., model updates and parameters). \n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TL7_lX5sHCTI" + }, + "source": [ + "This guarantee is sometimes referred to as the **privacy budget**. A lower privacy budget bounds more tightly an adversary's ability to improve their guess. This ensures a stronger privacy guarantee. 
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "TL7_lX5sHCTI"
+   },
+   "source": [
+    "This guarantee is sometimes referred to as the **privacy budget**. A lower privacy budget more tightly bounds an adversary's ability to improve their guess, which means a stronger privacy guarantee. Intuitively, this is because it is harder for a single training point to affect the outcome of learning: for instance, the information contained in the training point cannot be memorized by the ML algorithm, and the privacy of the individual who contributed this training point to the dataset is preserved.\n",
+    "\n",
+    "In this tutorial, the privacy analysis is performed in the framework of Rényi Differential Privacy (RDP), a relaxation of pure DP proposed in [this paper](https://arxiv.org/abs/1702.07476) that is particularly well suited for DP-SGD.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "wUEk25pgmnm-"
+   },
+   "source": [
+    "Two metrics are used to express the DP guarantee of an ML algorithm:\n",
+    "\n",
+    "1. Delta ($\delta$) - Bounds the probability of the privacy guarantee not holding. A rule of thumb is to set it to be less than the inverse of the size of the training dataset. In this tutorial, it is set to $10^{-5}$ as the MNIST dataset has 60,000 training points.\n",
+    "2. Epsilon ($\epsilon$) - This is the privacy budget. It measures the strength of the privacy guarantee (or the maximum tolerance for revealing information about the input data) by bounding how much the probability of a particular model output can vary when a single training point is included or excluded: for any two datasets $d$ and $d'$ differing in one training point, a randomized training algorithm $M$ satisfies $\Pr[M(d) \in S] \le e^{\epsilon} \Pr[M(d') \in S] + \delta$ for every set $S$ of possible outputs. A smaller value for $\epsilon$ implies a better privacy guarantee. However, the $\epsilon$ value is only an upper bound, and a large value could still mean good privacy in practice.\n",
+    "\n",
+    "For more detail about the mathematical definition of $(\epsilon, \delta)$-differential privacy, see the original [DP-SGD paper](https://arxiv.org/pdf/1607.00133.pdf)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "PczVdKsGyRQM"
+   },
+   "source": [
+    "TensorFlow Privacy provides a tool, `compute_dp_sgd_privacy`, to compute the value of $\epsilon$ given a fixed value of $\delta$ and the following hyperparameters from the training process:\n",
+    "\n",
+    "1. The total number of points in the training data, `n`.\n",
+    "2. The `batch_size`.\n",
+    "3. The `noise_multiplier`.\n",
+    "4. The number of `epochs` of training."
+   ]
+  },
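+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check (derived by hand from the hyperparameters above, not an output of the tool), these inputs determine the per-step sampling ratio $q = 32/60000 \approx 0.00053$ and the total number of SGD steps, $1 \times 60000/32 = 1875$, which matches the `1875/1875` progress shown in the training log. Together with the `noise_multiplier`, these are the quantities the RDP accountant uses to compute $\epsilon$."
+   ]
+  },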
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "id": "ws8-nVuVDgtJ",
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "DP-SGD performed over 60000 examples with 32 examples per iteration, noise\n",
+      "multiplier 0.5 for 1 epochs without microbatching, and no bound on number of\n",
+      "examples per user.\n",
+      "\n",
+      "This privacy guarantee protects the release of all model checkpoints in addition\n",
+      "to the final model.\n",
+      "\n",
+      "Example-level DP with add-or-remove-one adjacency at delta = 1e-05 computed with\n",
+      "RDP accounting:\n",
+      " Epsilon with each example occurring once per epoch: 10.726\n",
+      " Epsilon assuming Poisson sampling (*): 3.800\n",
+      "\n",
+      "No user-level privacy guarantee is possible without a bound on the number of\n",
+      "examples per user.\n",
+      "\n",
+      "(*) Poisson sampling is not usually done in training pipelines, but assuming\n",
+      "that the data was randomly shuffled, it is believed the actual epsilon should be\n",
+      "closer to this value than the conservative assumption of an arbitrary data\n",
+      "order.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "dpsgd_statement = compute_dp_sgd_privacy.compute_dp_sgd_privacy_statement(\n",
+    "    number_of_examples=train_data.shape[0],\n",
+    "    batch_size=batch_size,\n",
+    "    noise_multiplier=noise_multiplier,\n",
+    "    used_microbatching=False,\n",
+    "    num_epochs=epochs,\n",
+    "    delta=1e-5,\n",
+    ")\n",
+    "\n",
+    "print(dpsgd_statement)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "c-KyttEWFRDc"
+   },
+   "source": [
+    "The tool reports the $\epsilon$ values for the hyperparameters chosen above at $\delta=10^{-5}$: a conservative bound assuming each example occurs exactly once per epoch, and a tighter estimate assuming Poisson sampling."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Copyright 2024 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "collapsed_sections": [],
+   "name": "classification_privacy.ipynb",
+   "provenance": [],
+   "toc_visible": true
+  },
+  "environment": {
+   "kernel": "python3",
+   "name": "tf2-cpu.2-11.m114",
+   "type": "gcloud",
+   "uri": "gcr.io/deeplearning-platform-release/tf2-cpu.2-11:m114"
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (Local)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}