`__
+
.. raw:: html
@@ -359,7 +368,7 @@
:card_description: This tutorial covers how to run quantized and fused models on a Raspberry Pi 4 at 30 fps.
:image: _static/img/thumbnails/cropped/realtime_rpi.png
:link: intermediate/realtime_rpi.html
- :tags: TorchScript,Model Optimization,Image/Video,Quantization
+ :tags: TorchScript,Model-Optimization,Image/Video,Quantization
.. customcarditem::
:header: Autograd in C++ Frontend
@@ -475,6 +484,13 @@
:link: advanced/static_quantization_tutorial.html
:tags: Quantization
+.. customcarditem::
+ :header: Grokking PyTorch Intel CPU Performance from First Principles
+ :card_description: A case study on the TorchServe inference framework optimized with Intel® Extension for PyTorch.
+ :image: _static/img/thumbnails/cropped/generic-pytorch-logo.png
+ :link: intermediate/torchserve_with_ipex
+ :tags: Model-Optimization,Production
+
.. Parallel-and-Distributed-Training
.. customcarditem::
@@ -592,6 +608,14 @@
:link: intermediate/torchrec_tutorial.html
:tags: TorchRec,Recommender
+.. customcarditem::
+ :header: Exploring TorchRec sharding
+ :card_description: This tutorial covers the sharding schemes of embedding tables by using the EmbeddingPlanner and DistributedModelParallel API.
+ :image: _static/img/thumbnails/torchrec.png
+ :link: advanced/sharding.html
+ :tags: TorchRec,Recommender
+
+
.. End of tutorial card section
.. raw:: html
@@ -831,6 +855,7 @@
intermediate/dynamic_quantization_bert_tutorial
intermediate/quantized_transfer_learning_tutorial
advanced/static_quantization_tutorial
+ intermediate/torchserve_with_ipex
.. toctree::
:maxdepth: 2
@@ -868,4 +893,5 @@
:hidden:
:caption: Recommendation Systems
- intermediate/torchrec_tutorial
\ No newline at end of file
+ intermediate/torchrec_tutorial
+ advanced/sharding
diff --git a/intermediate_source/FSDP_adavnced_tutorial.rst b/intermediate_source/FSDP_adavnced_tutorial.rst
new file mode 100644
index 000000000..1adbf9722
--- /dev/null
+++ b/intermediate_source/FSDP_adavnced_tutorial.rst
@@ -0,0 +1,602 @@
+Advanced Fully Sharded Data Parallel (FSDP) Tutorial
+=====================================================
+
+**Author**: `Hamid Shojanazeri `__, `Less Wright `__, `Rohan Varma `__, `Yanli Zhao `__
+
+
+This tutorial introduces more advanced features of Fully Sharded Data Parallel (FSDP) as part of the PyTorch 1.12 release. To get familiar with FSDP, please refer to the `FSDP getting started tutorial `__.
+
+In this tutorial, we fine-tune a HuggingFace (HF) T5 model with FSDP for text summarization as a working example.
+
+The example uses the WikiHow dataset and, for simplicity, we will showcase training on a single node: a P4dn instance with 8 A100 GPUs. We will soon publish a blog post on large-scale FSDP training on a multi-node cluster; please stay tuned for that on the PyTorch medium channel.
+
+FSDP is a production-ready package with a focus on ease of use, performance, and long-term support.
+One of the main benefits of FSDP is reducing the memory footprint on each GPU. This enables training of larger models with lower total memory compared to DDP, and leverages the overlap of computation and communication to train models efficiently.
+This reduced memory pressure can be leveraged to either train larger models or increase batch size, potentially improving overall training throughput.
+You can read more about PyTorch FSDP `here `__.
+
+
+FSDP Features in This Tutorial
+------------------------------
+* Transformer Auto Wrap Policy
+* Mixed Precision
+* Initializing FSDP Model on Device
+* Sharding Strategy
+* Backward Prefetch
+* Model Checkpoint Saving via Streaming to CPU
+
+
+
+Recap on How FSDP Works
+-----------------------
+
+At a high level, FSDP works as follows:
+
+*In constructor*
+
+* Shard model parameters and each rank only keeps its own shard
+
+*In forward pass*
+
+* Run `all_gather` to collect all shards from all ranks to recover the full parameter for this FSDP unit
+* Run forward computation
+* Discard non-owned parameter shards it has just collected to free memory
+
+*In backward pass*
+
+* Run `all_gather` to collect all shards from all ranks to recover the full parameter in this FSDP unit
+* Run backward computation
+* Discard non-owned parameters to free memory.
+* Run reduce_scatter to sync gradients
+
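+To make this flow concrete, here is a minimal, self-contained sketch (not part of the tutorial script, and not how FSDP is implemented internally, which operates on flattened parameters per FSDP unit) that mimics one unit with raw collectives. It assumes a hypothetical file name ``fsdp_flow_sketch.py`` launched with ``torchrun`` on a host with at least 2 GPUs:
+
+.. code-block:: python
+
+    # fsdp_flow_sketch.py -- launch with: torchrun --nproc_per_node 2 fsdp_flow_sketch.py
+    import torch
+    import torch.distributed as dist
+
+    dist.init_process_group("nccl")
+    rank, world_size = dist.get_rank(), dist.get_world_size()
+    device = torch.device("cuda", rank % torch.cuda.device_count())
+    torch.cuda.set_device(device)
+
+    # constructor: each rank keeps only its own shard of the parameter
+    shard = torch.full((4,), float(rank), device=device)
+
+    # forward/backward: all_gather recovers the full parameter for this unit
+    gathered = [torch.empty(4, device=device) for _ in range(world_size)]
+    dist.all_gather(gathered, shard)
+    full_param = torch.cat(gathered)
+
+    # ... forward/backward computation on full_param would run here ...
+    full_grad = torch.ones_like(full_param)  # stand-in for the computed gradient
+
+    # discard the non-owned shards it has just collected, to free memory
+    del gathered, full_param
+
+    # backward: reduce_scatter leaves each rank with the summed gradient of its own shard
+    grad_shard = torch.empty(4, device=device)
+    dist.reduce_scatter(grad_shard, list(full_grad.chunk(world_size)), op=dist.ReduceOp.SUM)
+
+    dist.destroy_process_group()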
+
+Fine-tuning HF T5
+-----------------
+HF T5 pre-trained models are available in four different sizes, ranging from small with 60 million parameters to XXL with 11 billion parameters. In this tutorial, we demonstrate fine-tuning a T5 3B model with FSDP for text summarization using the WikiHow dataset.
+The main focus of this tutorial is to highlight the different available features in FSDP that are helpful for training large-scale models above 3B parameters. Also, we cover specific features for transformer-based models. The code for this tutorial is available in `PyTorch Examples `__.
+
+
+*Setup*
+
+1.1 Install PyTorch Nightlies
+
+We will install PyTorch nightlies, since some of the features, such as activation checkpointing, are only available in nightlies and will be added to the next PyTorch release after 1.12.
+
+.. code-block:: bash
+
+ pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu113/torch_nightly.html
+
+1.2 Dataset Setup
+
+Please create a `data` folder, download the WikiHow dataset from `wikihowAll.csv `__ and `wikihowSep.csv `__, and place them in the `data` folder.
+We will use the WikiHow dataset from `summarization_dataset `__.
+
+Next, we add the following code snippets to a Python script “T5_training.py”. Note - The full source code for this tutorial is available in `PyTorch examples `__.
+
+1.3 Import necessary packages:
+
+.. code-block:: python
+
+ import os
+ import argparse
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import torch.optim as optim
+ from transformers import AutoTokenizer, GPT2TokenizerFast
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
+ import functools
+ from torch.optim.lr_scheduler import StepLR
+ import torch.nn.functional as F
+ import torch.distributed as dist
+ import torch.multiprocessing as mp
+ from torch.nn.parallel import DistributedDataParallel as DDP
+ from torch.utils.data.distributed import DistributedSampler
+ from transformers.models.t5.modeling_t5 import T5Block
+
+ from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
+ checkpoint_wrapper,
+ CheckpointImpl,
+ apply_activation_checkpointing_wrapper)
+
+ from torch.distributed.fsdp import (
+ FullyShardedDataParallel as FSDP,
+ MixedPrecision,
+ BackwardPrefetch,
+ ShardingStrategy,
+ FullStateDictConfig,
+ StateDictType,
+ )
+ from torch.distributed.fsdp.wrap import (
+ transformer_auto_wrap_policy,
+ enable_wrap,
+ wrap,
+ )
+ from functools import partial
+ from torch.utils.data import DataLoader
+ from pathlib import Path
+ from summarization_dataset import *
+ from transformers.models.t5.modeling_t5 import T5Block
+ from typing import Type
+ import time
+ import tqdm
+    from datetime import datetime
+    # used below in the BFloat16 support check (see the "Mixed Precision" section)
+    from distutils.version import LooseVersion
+    from torch.cuda import nccl
+
+1.4 Distributed training setup.
+Here we use two helper functions to initialize the processes for distributed training and to clean up after training completion.
+In this tutorial, we are going to use torch elastic, via `torchrun `__, which will set the worker `RANK` and `WORLD_SIZE` automatically.
+
+.. code-block:: python
+
+ def setup():
+ # initialize the process group
+ dist.init_process_group("nccl")
+
+ def cleanup():
+ dist.destroy_process_group()
+
+2.1 Set up the HuggingFace T5 model:
+
+.. code-block:: python
+
+ def setup_model(model_name):
+ model = T5ForConditionalGeneration.from_pretrained(model_name)
+ tokenizer = T5Tokenizer.from_pretrained(model_name)
+ return model, tokenizer
+
+We also add a couple of helper functions here, one for creating a date-time stamp and one for formatting memory metrics.
+
+.. code-block:: python
+
+ def get_date_of_run():
+ """create date and time for file save uniqueness
+ example: 2022-05-07-08:31:12_PM'
+ """
+ date_of_run = datetime.now().strftime("%Y-%m-%d-%I:%M:%S_%p")
+ print(f"--> current date and time of run = {date_of_run}")
+ return date_of_run
+
+ def format_metrics_to_gb(item):
+ """quick function to format numbers to gigabyte and round to 4 digit precision"""
+ metric_num = item / g_gigabyte
+ metric_num = round(metric_num, ndigits=4)
+ return metric_num
+
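+Note that ``format_metrics_to_gb`` references a module-level constant ``g_gigabyte`` that is not shown above; a reasonable definition, consistent with converting bytes to gigabytes, is:
+
+.. code-block:: python
+
+    g_gigabyte = 1024**3  # bytes per gigabyte, used by format_metrics_to_gb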
+
+2.2 Define a train function:
+
+.. code-block:: python
+
+ def train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=None):
+ model.train()
+ local_rank = int(os.environ['LOCAL_RANK'])
+ fsdp_loss = torch.zeros(2).to(local_rank)
+
+ if sampler:
+ sampler.set_epoch(epoch)
+ if rank==0:
+ inner_pbar = tqdm.tqdm(
+ range(len(train_loader)), colour="blue", desc="r0 Training Epoch"
+ )
+ for batch in train_loader:
+ for key in batch.keys():
+ batch[key] = batch[key].to(local_rank)
+ optimizer.zero_grad()
+ output = model(input_ids=batch["source_ids"],attention_mask=batch["source_mask"],labels=batch["target_ids"] )
+ loss = output["loss"]
+ loss.backward()
+ optimizer.step()
+ fsdp_loss[0] += loss.item()
+ fsdp_loss[1] += len(batch)
+ if rank==0:
+ inner_pbar.update(1)
+
+ dist.all_reduce(fsdp_loss, op=dist.ReduceOp.SUM)
+ train_accuracy = fsdp_loss[0] / fsdp_loss[1]
+
+
+ if rank == 0:
+ inner_pbar.close()
+ print(
+ f"Train Epoch: \t{epoch}, Loss: \t{train_accuracy:.4f}"
+ )
+ return train_accuracy
+
+2.3 Define a validation function:
+
+.. code-block:: python
+
+ def validation(model, rank, world_size, val_loader):
+ model.eval()
+ correct = 0
+ local_rank = int(os.environ['LOCAL_RANK'])
+ fsdp_loss = torch.zeros(3).to(local_rank)
+ if rank == 0:
+ inner_pbar = tqdm.tqdm(
+ range(len(val_loader)), colour="green", desc="Validation Epoch"
+ )
+ with torch.no_grad():
+ for batch in val_loader:
+ for key in batch.keys():
+ batch[key] = batch[key].to(local_rank)
+ output = model(input_ids=batch["source_ids"],attention_mask=batch["source_mask"],labels=batch["target_ids"])
+ fsdp_loss[0] += output["loss"].item() # sum up batch loss
+ fsdp_loss[1] += len(batch)
+
+ if rank==0:
+ inner_pbar.update(1)
+
+ dist.all_reduce(fsdp_loss, op=dist.ReduceOp.SUM)
+ val_loss = fsdp_loss[0] / fsdp_loss[1]
+ if rank == 0:
+ inner_pbar.close()
+ print(f"Validation Loss: {val_loss:.4f}")
+ return val_loss
+
+
+2.4 Define a distributed train function that wraps the model in FSDP:
+
+
+.. code-block:: python
+
+
+ def fsdp_main(args):
+
+ model, tokenizer = setup_model("t5-base")
+
+ local_rank = int(os.environ['LOCAL_RANK'])
+ rank = int(os.environ['RANK'])
+ world_size = int(os.environ['WORLD_SIZE'])
+
+
+ dataset = load_dataset('wikihow', 'all', data_dir='data/')
+ print(dataset.keys())
+ print("Size of train dataset: ", dataset['train'].shape)
+ print("Size of Validation dataset: ", dataset['validation'].shape)
+
+
+ #wikihow(tokenizer, type_path, num_samples, input_length, output_length, print_text=False)
+ train_dataset = wikihow(tokenizer, 'train', 1500, 512, 150, False)
+ val_dataset = wikihow(tokenizer, 'validation', 300, 512, 150, False)
+
+ sampler1 = DistributedSampler(train_dataset, rank=rank, num_replicas=world_size, shuffle=True)
+ sampler2 = DistributedSampler(val_dataset, rank=rank, num_replicas=world_size)
+
+ setup()
+
+
+ train_kwargs = {'batch_size': args.batch_size, 'sampler': sampler1}
+ test_kwargs = {'batch_size': args.test_batch_size, 'sampler': sampler2}
+ cuda_kwargs = {'num_workers': 2,
+ 'pin_memory': True,
+ 'shuffle': False}
+ train_kwargs.update(cuda_kwargs)
+ test_kwargs.update(cuda_kwargs)
+
+ train_loader = torch.utils.data.DataLoader(train_dataset,**train_kwargs)
+ val_loader = torch.utils.data.DataLoader(val_dataset, **test_kwargs)
+
+ t5_auto_wrap_policy = functools.partial(
+ transformer_auto_wrap_policy,
+ transformer_layer_cls={
+ T5Block,
+ },
+ )
+ sharding_strategy: ShardingStrategy = ShardingStrategy.SHARD_GRAD_OP #for Zero2 and FULL_SHARD for Zero3
+ torch.cuda.set_device(local_rank)
+
+
+ #init_start_event = torch.cuda.Event(enable_timing=True)
+ #init_end_event = torch.cuda.Event(enable_timing=True)
+
+ #init_start_event.record()
+
+ bf16_ready = (
+ torch.version.cuda
+ and torch.cuda.is_bf16_supported()
+ and LooseVersion(torch.version.cuda) >= "11.0"
+ and dist.is_nccl_available()
+ and nccl.version() >= (2, 10)
+ )
+
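+    # bfSixteen is the bfloat16 MixedPrecision policy shown in the
+    # "Mixed Precision" section later in this tutorial.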
+ if bf16_ready:
+ mp_policy = bfSixteen
+ else:
+ mp_policy = None # defaults to fp32
+
+ # model is on CPU before input to FSDP
+ model = FSDP(model,
+ auto_wrap_policy=t5_auto_wrap_policy,
+ mixed_precision=mp_policy,
+ #sharding_strategy=sharding_strategy,
+ device_id=torch.cuda.current_device())
+
+ optimizer = optim.AdamW(model.parameters(), lr=args.lr)
+
+ scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
+ best_val_loss = float("inf")
+ curr_val_loss = float("inf")
+ file_save_name = "T5-model-"
+
+ if rank == 0:
+ time_of_run = get_date_of_run()
+ dur = []
+ train_acc_tracking = []
+ val_acc_tracking = []
+ training_start_time = time.time()
+
+ if rank == 0 and args.track_memory:
+ mem_alloc_tracker = []
+ mem_reserved_tracker = []
+
+ for epoch in range(1, args.epochs + 1):
+ t0 = time.time()
+ train_accuracy = train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)
+ if args.run_validation:
+ curr_val_loss = validation(model, rank, world_size, val_loader)
+ scheduler.step()
+
+ if rank == 0:
+
+ print(f"--> epoch {epoch} completed...entering save and stats zone")
+
+ dur.append(time.time() - t0)
+ train_acc_tracking.append(train_accuracy.item())
+
+ if args.run_validation:
+ val_acc_tracking.append(curr_val_loss.item())
+
+ if args.track_memory:
+ mem_alloc_tracker.append(
+ format_metrics_to_gb(torch.cuda.memory_allocated())
+ )
+ mem_reserved_tracker.append(
+ format_metrics_to_gb(torch.cuda.memory_reserved())
+ )
+ print(f"completed save and stats zone...")
+
+ if args.save_model and curr_val_loss < best_val_loss:
+
+ # save
+ if rank == 0:
+ print(f"--> entering save model state")
+
+ save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
+ with FSDP.state_dict_type(
+ model, StateDictType.FULL_STATE_DICT, save_policy
+ ):
+ cpu_state = model.state_dict()
+ #print(f"saving process: rank {rank} done w state_dict")
+
+
+ if rank == 0:
+ print(f"--> saving model ...")
+ currEpoch = (
+ "-" + str(epoch) + "-" + str(round(curr_val_loss.item(), 4)) + ".pt"
+ )
+ print(f"--> attempting to save model prefix {currEpoch}")
+ save_name = file_save_name + "-" + time_of_run + "-" + currEpoch
+ print(f"--> saving as model name {save_name}")
+
+ torch.save(cpu_state, save_name)
+
+ if curr_val_loss < best_val_loss:
+
+ best_val_loss = curr_val_loss
+ if rank==0:
+ print(f"-->>>> New Val Loss Record: {best_val_loss}")
+
+ dist.barrier()
+ cleanup()
+
+
+2.5 Parse the arguments and set the main function:
+
+.. code-block:: python
+
+
+ if __name__ == '__main__':
+ # Training settings
+ parser = argparse.ArgumentParser(description='PyTorch T5 FSDP Example')
+ parser.add_argument('--batch-size', type=int, default=4, metavar='N',
+ help='input batch size for training (default: 64)')
+ parser.add_argument('--test-batch-size', type=int, default=4, metavar='N',
+ help='input batch size for testing (default: 1000)')
+ parser.add_argument('--epochs', type=int, default=2, metavar='N',
+ help='number of epochs to train (default: 3)')
+ parser.add_argument('--lr', type=float, default=.002, metavar='LR',
+ help='learning rate (default: .002)')
+ parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
+ help='Learning rate step gamma (default: 0.7)')
+ parser.add_argument('--no-cuda', action='store_true', default=False,
+ help='disables CUDA training')
+ parser.add_argument('--seed', type=int, default=1, metavar='S',
+ help='random seed (default: 1)')
+ parser.add_argument('--track_memory', action='store_false', default=True,
+ help='track the gpu memory')
+ parser.add_argument('--run_validation', action='store_false', default=True,
+ help='running the validation')
+ parser.add_argument('--save-model', action='store_false', default=True,
+ help='For Saving the current Model')
+ args = parser.parse_args()
+
+ torch.manual_seed(args.seed)
+
+ fsdp_main(args)
+
+
+To run the training using torchrun:
+
+.. code-block:: bash
+
+ torchrun --nnodes 1 --nproc_per_node 4 T5_training.py
+
+.. _transformer_wrapping_policy:
+
+Transformer Wrapping Policy
+---------------------------
+As discussed in the `previous tutorial `__, ``auto_wrap_policy`` is one of the FSDP features that makes it easy to automatically shard a given model and put the model, optimizer, and gradient shards into distinct FSDP units.
+
+For some architectures, such as Transformer encoder-decoders, some parts of the model, such as the embedding table, are shared between the encoder and decoder.
+In this case, we need to place the embedding table in the outer FSDP unit so that it can be accessed from both the encoder and the decoder. In addition, by registering the layer class for a transformer, the sharding plan can be made much more communication efficient. In PyTorch 1.12, FSDP added this support and now we have a wrapping policy for transformers.
+
+It can be created as follows, where ``T5Block`` represents the T5 transformer layer class (holding MHSA and FFN).
+
+
+.. code-block:: python
+
+ t5_auto_wrap_policy = functools.partial(
+ transformer_auto_wrap_policy,
+ transformer_layer_cls={
+ T5Block,
+ },
+ )
+ torch.cuda.set_device(local_rank)
+
+
+ model = FSDP(model,
+        auto_wrap_policy=t5_auto_wrap_policy)
+
+To see the wrapped model, you can simply print the model and visually inspect the sharding and the FSDP units, as in the sketch below.
+
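+For example, printing only on rank 0 to avoid duplicated output (a minimal sketch, reusing ``rank`` from section 2.4):
+
+.. code-block:: python
+
+    if rank == 0:
+        print(model)  # shows nested FullyShardedDataParallel units wrapping each T5Block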
+
+Mixed Precision
+---------------
+FSDP supports flexible mixed precision training, allowing for arbitrary reduced-precision types (such as fp16 or bfloat16). Currently, BFloat16 is only available natively on Ampere GPUs, so you need to confirm native support before you use it. On V100s, for example, BFloat16 can still be run, but because it runs non-natively, it can result in significant slowdowns.
+
+To check if BFloat16 is natively supported, you can use the following:
+
+.. code-block:: python
+
+ bf16_ready = (
+ torch.version.cuda
+ and torch.cuda.is_bf16_supported()
+ and LooseVersion(torch.version.cuda) >= "11.0"
+ and dist.is_nccl_available()
+ and nccl.version() >= (2, 10)
+ )
+
+One of the advantages of mixed precision in FSDP is providing granular control over different precision levels for parameters, gradients, and buffers, as follows:
+
+.. code-block:: python
+
+ fpSixteen = MixedPrecision(
+ param_dtype=torch.float16,
+ # Gradient communication precision.
+ reduce_dtype=torch.float16,
+ # Buffer precision.
+ buffer_dtype=torch.float16,
+ )
+
+ bfSixteen = MixedPrecision(
+ param_dtype=torch.bfloat16,
+ # Gradient communication precision.
+ reduce_dtype=torch.bfloat16,
+ # Buffer precision.
+ buffer_dtype=torch.bfloat16,
+ )
+
+ fp32_policy = MixedPrecision(
+ param_dtype=torch.float32,
+ # Gradient communication precision.
+ reduce_dtype=torch.float32,
+ # Buffer precision.
+ buffer_dtype=torch.float32,
+ )
+
+Note that if a certain type (parameter, reduce, buffer) is not specified, it will not be cast at all.
+
+This flexibility allows users fine-grained control, such as having only gradient communication happen in reduced precision while all parameter and buffer computation is done in full precision. This is potentially useful in cases where intra-node communication is the main bottleneck and parameters/buffers must be in full precision to avoid accuracy issues. This can be done with the following policy:
+
+.. code-block:: python
+
+ grad_bf16 = MixedPrecision(reduce_dtype=torch.bfloat16)
+
+
+In section 2.4, we simply add the relevant mixed precision policy to the FSDP wrapper:
+
+
+.. code-block:: python
+
+ model = FSDP(model,
+ auto_wrap_policy=t5_auto_wrap_policy,
+ mixed_precision=bfSixteen)
+
+In our experiments, we have observed up to a 4x speedup by using BFloat16 for training, along with a memory reduction of approximately 30% in some experiments, which can be used to increase batch sizes.
+
+
+Initializing FSDP Model on Device
+---------------------------------
+In 1.12, FSDP supports a `device_id` argument meant to initialize the input CPU module on the device given by `device_id`. This is useful when the entire model does not fit on a single GPU, but fits in the host's CPU memory. When `device_id` is specified, FSDP will move the model to the specified device on a per-FSDP-unit basis, avoiding GPU OOM issues while initializing several times faster than CPU-based initialization:
+
+.. code-block:: python
+
+ torch.cuda.set_device(local_rank)
+
+ model = FSDP(model,
+ auto_wrap_policy=t5_auto_wrap_policy,
+ mixed_precision=bfSixteen,
+ device_id=torch.cuda.current_device())
+
+
+
+Sharding Strategy
+-----------------
+By default, the FSDP sharding strategy is set to fully shard the model parameters, gradients, and optimizer states across all ranks (also termed Zero3 sharding). If you are interested in the Zero2 sharding strategy, where only the optimizer states and gradients are sharded, FSDP supports this by passing ``ShardingStrategy.SHARD_GRAD_OP``, instead of ``ShardingStrategy.FULL_SHARD``, to the FSDP initialization as follows:
+
+.. code-block:: python
+
+ torch.cuda.set_device(local_rank)
+
+ model = FSDP(model,
+ auto_wrap_policy=t5_auto_wrap_policy,
+ mixed_precision=bfSixteen,
+ device_id=torch.cuda.current_device(),
+        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP) # Zero2; use FULL_SHARD for Zero3
+
+This will reduce the communication overhead in FSDP: in this case, full parameters are kept after the forward pass and through the backward pass.
+
+This saves an all_gather during the backward pass, so there is less communication, at the cost of a higher memory footprint. Note that the full model parameters are freed at the end of the backward pass and an all_gather will happen again on the next forward pass.
+
+Backward Prefetch
+-----------------
+The backward prefetch setting controls the timing of when the next FSDP unit's parameters should be requested. By setting it to `BACKWARD_PRE`, the next FSDP unit's params can begin to be requested and arrive sooner, before the computation of the current unit finishes. This overlaps the `all_gather` communication and gradient computation, which can increase training speed in exchange for slightly higher memory consumption. It can be utilized in the FSDP wrapper in section 2.4 as follows:
+
+.. code-block:: python
+
+ torch.cuda.set_device(local_rank)
+
+ model = FSDP(model,
+ auto_wrap_policy=t5_auto_wrap_policy,
+ mixed_precision=bfSixteen,
+ device_id=torch.cuda.current_device(),
+ backward_prefetch = BackwardPrefetch.BACKWARD_PRE)
+
+`backward_prefetch` has two modes, `BACKWARD_PRE` and `BACKWARD_POST`. `BACKWARD_POST` means that the next FSDP unit's params will not be requested until the current FSDP unit's processing is complete, thus minimizing memory overhead. In some cases, using `BACKWARD_PRE` can increase model training speed by up to 2-10%, with even higher speed improvements noted for larger models.
+
+Model Checkpoint Saving, by streaming to the Rank0 CPU
+------------------------------------------------------
+To save model checkpoints using FULL_STATE_DICT saving, which saves the model in the same fashion as a local model, PyTorch 1.12 offers a few utilities to support saving larger models.
+
+First, a FullStateDictConfig can be specified, allowing the state_dict to be populated on rank 0 only and offloaded to the CPU.
+
+When using this configuration, FSDP will all_gather the model parameters, offloading them to the CPU one by one, only on rank 0. When the state_dict is finally saved, it will only be populated on rank 0 and contain CPU tensors. This avoids potential OOM for models that are larger than a single GPU's memory and allows users to checkpoint models whose size is roughly the available CPU RAM on the user's machine.
+
+This feature can be run as follows:
+
+.. code-block:: python
+
+ save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
+ with FSDP.state_dict_type(
+ model, StateDictType.FULL_STATE_DICT, save_policy
+ ):
+ cpu_state = model.state_dict()
+ if rank == 0:
+ save_name = file_save_name + "-" + time_of_run + "-" + currEpoch
+ torch.save(cpu_state, save_name)
+
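+Because the saved state_dict has the same keys as a local (unwrapped) model, it can later be restored into a fresh copy of the model as usual. A minimal sketch, assuming ``save_name`` refers to the checkpoint file produced above:
+
+.. code-block:: python
+
+    # load on CPU, e.g. on a single process, before any FSDP wrapping
+    model, tokenizer = setup_model("t5-base")
+    state_dict = torch.load(save_name, map_location="cpu")
+    model.load_state_dict(state_dict)
+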
+Summary
+-------
+In this tutorial, we have introduced many of the new FSDP features available in PyTorch 1.12 and used HF T5 as the running example.
+Using the proper wrapping policy, especially for transformer models, along with mixed precision and backward prefetch should speed up your training runs. Also, features such as initializing the model on device and checkpoint saving via streaming to CPU should help to avoid OOM errors when dealing with large models.
+
+We are actively working to add new features to FSDP for the next release. If you have feedback, feature requests, questions, or are encountering issues using FSDP, please feel free to contact us by opening an issue in the `PyTorch GitHub repository `__.
diff --git a/intermediate_source/FSDP_tutorial.rst b/intermediate_source/FSDP_tutorial.rst
index d51f38800..421e966ee 100644
--- a/intermediate_source/FSDP_tutorial.rst
+++ b/intermediate_source/FSDP_tutorial.rst
@@ -3,6 +3,8 @@ Getting Started with Fully Sharded Data Parallel(FSDP)
**Author**: `Hamid Shojanazeri `__, `Yanli Zhao `__, `Shen Li `__
+.. note::
+ View the source code for this tutorial in `github `__.
Training AI models at a large scale is a challenging task that requires a lot of compute power and resources.
It also comes with considerable engineering complexity to handle the training of these very large models.
@@ -33,13 +35,13 @@ At high level FDSP works as follow:
*In forward path*
-* Run allgather to collect all shards from all ranks to recover the full parameter in this FSDP unit
+* Run all_gather to collect all shards from all ranks to recover the full parameter in this FSDP unit
* Run forward computation
* Discard parameter shards it has just collected
*In backward path*
-* Run allgather to collect all shards from all ranks to recover the full parameter in this FSDP unit
+* Run all_gather to collect all shards from all ranks to recover the full parameter in this FSDP unit
* Run backward computation
* Run reduce_scatter to sync gradients
* Discard parameters.
@@ -153,7 +155,7 @@ We add the following code snippets to a python script “FSDP_mnist.py”.
ddp_loss[0] += loss.item()
ddp_loss[1] += len(data)
- dist.reduce(ddp_loss, 0, op=dist.ReduceOp.SUM)
+ dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM)
if rank == 0:
print('Train Epoch: {} \tLoss: {:.6f}'.format(epoch, ddp_loss[0] / ddp_loss[1]))
@@ -174,7 +176,7 @@ We add the following code snippets to a python script “FSDP_mnist.py”.
ddp_loss[1] += pred.eq(target.view_as(pred)).sum().item()
ddp_loss[2] += len(data)
- dist.reduce(ddp_loss, 0, op=dist.ReduceOp.SUM)
+ dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM)
if rank == 0:
test_loss = ddp_loss[0] / ddp_loss[2]
diff --git a/intermediate_source/ddp_tutorial.rst b/intermediate_source/ddp_tutorial.rst
index 7406612da..fcdf92461 100644
--- a/intermediate_source/ddp_tutorial.rst
+++ b/intermediate_source/ddp_tutorial.rst
@@ -6,6 +6,9 @@
**번역**: `조병근 `_
+.. note::
+ 이 튜토리얼의 소스 코드는 `GitHub `__ 에서 확인할 수 있습니다.
+
선수과목(Prerequisites):
- `PyTorch 분산 처리 개요 <../beginner/dist_overview.html>`__
@@ -56,7 +59,7 @@ checkpointing 모델 및 DDP와 모델 병렬 처리의 결합을 포함한 추
기본적인 사용법
---------------
-DDP 모듈을 생성하기 전에 우선 작업 그룹을 올바르게 설정해야 합니다. 자세한 내용은
+DDP 모듈을 생성하기 전에 반드시 우선 작업 그룹을 올바르게 설정해야 합니다. 자세한 내용은
`PYTORCH로 분산 어플리케이션 개발하기 `__\에서 확인할 수 있습니다.
.. code:: python
@@ -167,7 +170,7 @@ DDP를 사용할 때, 최적의 방법은 모델을 한 작업에만 저장하
이는 모든 작업이 같은 매개변수로부터 시작되고 변화도는
역전파 전달로 동기화되므로 옵티마이저(optimizer)는
매개변수를 동일한 값으로 계속 설정해야 하기 때문에 정확합니다. 이러한 최적화를 사용하는 경우,
-저장이 완료되기 전에 읽어오는 작업을 시작하지 않도록 해야 합니다. 게다가, 모듈을 읽어올 때,
+저장이 완료되기 전에 불러오는 어떠한 작업도 시작하지 않도록 해야 합니다. 더불어, 모듈을 읽어올 때
작업이 다른 기기에 접근하지 않도록 적절한 ``map_location`` 인자를 제공해야합니다.
``map_location``\값이 없을 경우, ``torch.load``\는 먼저 모듈을 CPU에 읽어온 다음 각 매개변수가
저장된 위치로 복사하여 동일한 장치를 사용하는 동일한 기기에서 모든 작업을 발생시킵니다.
@@ -182,9 +185,6 @@ DDP를 사용할 때, 최적의 방법은 모델을 한 작업에만 저장하
model = ToyModel().to(rank)
ddp_model = DDP(model, device_ids=[rank])
- loss_fn = nn.MSELoss()
- optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
-
CHECKPOINT_PATH = tempfile.gettempdir() + "/model.checkpoint"
if rank == 0:
# 모든 작업은 같은 매개변수로부터 시작된다고 생각해야 합니다.
@@ -199,10 +199,13 @@ DDP를 사용할 때, 최적의 방법은 모델을 한 작업에만 저장하
ddp_model.load_state_dict(
torch.load(CHECKPOINT_PATH, map_location=map_location))
+ loss_fn = nn.MSELoss()
+ optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
+
optimizer.zero_grad()
outputs = ddp_model(torch.randn(20, 10))
labels = torch.randn(20, 5).to(rank)
- loss_fn = nn.MSELoss()
+
loss_fn(outputs, labels).backward()
optimizer.step()
@@ -215,10 +218,10 @@ DDP를 사용할 때, 최적의 방법은 모델을 한 작업에만 저장하
cleanup()
모델 병렬 처리를 활용한 DDP
----------------------------
+------------------------------
-DDP는 다중 – GPU 모델에서도 작동합니다.
-다중 – GPU 모델을 활용한 DDP는 대용량의 데이터를 가진 대용량 모델을 학습시킬 때 특히 유용합니다.
+DDP는 다중 GPU 모델에서도 작동합니다.
+다중 GPU 모델을 활용한 DDP는 대용량의 데이터를 가진 대용량 모델을 학습시킬 때 특히 유용합니다.
.. code:: python
@@ -272,3 +275,76 @@ DDP는 다중 – GPU 모델에서도 작동합니다.
run_demo(demo_basic, world_size)
run_demo(demo_checkpoint, world_size)
run_demo(demo_model_parallel, world_size)
+
+Initialize DDP with torch.distributed.run/torchrun
+--------------------------------------------------------------------
+
+We can leverage PyTorch Elastic to simplify the DDP code and initialize the job more easily.
+Let's still use the ToyModel example and create a file named ``elastic_ddp.py``.
+
+.. code:: python
+
+ import torch
+ import torch.distributed as dist
+ import torch.nn as nn
+ import torch.optim as optim
+
+ from torch.nn.parallel import DistributedDataParallel as DDP
+
+ class ToyModel(nn.Module):
+ def __init__(self):
+ super(ToyModel, self).__init__()
+ self.net1 = nn.Linear(10, 10)
+ self.relu = nn.ReLU()
+ self.net2 = nn.Linear(10, 5)
+
+ def forward(self, x):
+ return self.net2(self.relu(self.net1(x)))
+
+ def demo_basic():
+ dist.init_process_group("nccl")
+ rank = dist.get_rank()
+ print(f"Start running basic DDP example on rank {rank}.")
+
+ # create model and move it to GPU with id rank
+ device_id = rank % torch.cuda.device_count()
+ model = ToyModel().to(device_id)
+ ddp_model = DDP(model, device_ids=[device_id])
+
+ loss_fn = nn.MSELoss()
+ optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
+
+ optimizer.zero_grad()
+ outputs = ddp_model(torch.randn(20, 10))
+ labels = torch.randn(20, 5).to(device_id)
+ loss_fn(outputs, labels).backward()
+     optimizer.step()
+     dist.destroy_process_group()
+
+ if __name__ == "__main__":
+ demo_basic()
+
+One can then run a `torch elastic/torchrun`__ command
+on all nodes to initialize the DDP job created above:
+
+.. code:: bash
+
+    torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py
+
+We are running the DDP script on two hosts, with 8 processes on each host; that is, we
+are running it on 16 GPUs. Note that ``$MASTER_ADDR`` must be the same across all nodes.
+
+Here torchrun will launch 8 processes and invoke ``elastic_ddp.py``
+on each process on the node it is launched on, but the user also needs to apply cluster
+management tools like SLURM to actually run this command on 2 nodes.
+
+For example, on a SLURM-enabled cluster, we can write a script to run the command above
+and set ``MASTER_ADDR`` as:
+
+.. code:: bash
+
+    export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
+
+Then we can just run this script using the SLURM command: ``srun --nodes=2 ./torchrun_script.sh``.
+Of course, this is just an example; you can choose your own cluster scheduling tools
+to initiate the torchrun job.
+
+For more information about Elastic run, please see the
+`quick start document `__.
diff --git a/intermediate_source/dist_pipeline_parallel_tutorial.rst b/intermediate_source/dist_pipeline_parallel_tutorial.rst
index 7bc978898..da57a5c56 100644
--- a/intermediate_source/dist_pipeline_parallel_tutorial.rst
+++ b/intermediate_source/dist_pipeline_parallel_tutorial.rst
@@ -2,11 +2,14 @@ Distributed Pipeline Parallelism Using RPC
==========================================
**Author**: `Shen Li `_
+.. note::
+ View the source code for this tutorial in `github `__.
+
Prerequisites:
- `PyTorch Distributed Overview <../beginner/dist_overview.html>`__
-- `Single-Machine Model Parallel Best Practices `__
-- `Getting started with Distributed RPC Framework `__
+- `Single-Machine Model Parallel Best Practices `__
+- `Getting started with Distributed RPC Framework `__
- RRef helper functions:
`RRef.rpc_sync() `__,
`RRef.rpc_async() `__, and
diff --git a/intermediate_source/dist_tuto.rst b/intermediate_source/dist_tuto.rst
index 292aad7cc..1686bd914 100644
--- a/intermediate_source/dist_tuto.rst
+++ b/intermediate_source/dist_tuto.rst
@@ -3,6 +3,9 @@ PyTorch로 분산 어플리케이션 개발하기
**Author**: `Séb Arnold `_
**번역**: `박정환 `_
+.. note::
+ 이 튜토리얼의 소스 코드는 `GitHub `__ 에서 확인할 수 있습니다.
+
선수과목(Prerequisites):
- `PyTorch Distributed Overview <../beginner/dist_overview.html>`__
diff --git a/intermediate_source/memory_format_tutorial.py b/intermediate_source/memory_format_tutorial.py
index a8aa9d015..cd92877b5 100644
--- a/intermediate_source/memory_format_tutorial.py
+++ b/intermediate_source/memory_format_tutorial.py
@@ -28,7 +28,7 @@
"""
######################################################################
-# Channels last 메모리 형식은 오직 4D NCWH Tensors에서만 실행할 수 있습니다.
+# Channels last 메모리 형식은 오직 4D NCHW Tensors에서만 실행할 수 있습니다.
#
######################################################################
@@ -147,9 +147,10 @@
######################################################################
# 성능 향상
# -------------------------------------------------------------------------------------------
-# 정밀도를 줄인(reduced precision ``torch.float16``) 상태에서 Tensor Cores를 지원하는 Nvidia의 하드웨어에서
-# 가장 의미심장한 성능 향상을 보였습니다. `AMP (Automated Mixed Precision)` 학습 스크립트를 활용하여
-# 연속적인 형식에 비해 Channels last 방식이 22% 이상의 성능 향승을 확인할 수 있었습니다.
+# Channels last 메모리 형식 최적화는 GPU와 CPU에서 모두 사용 가능합니다.
+# GPU에서는 정밀도를 줄인(reduced precision ``torch.float16``) 상태에서 Tensor Cores를 지원하는 Nvidia의
+# 하드웨어에서 가장 의미심장한 성능 향상을 보였습니다. `AMP (Automated Mixed Precision)` 학습 스크립트를
+# 활용하여 연속적인 형식에 비해 Channels last 방식이 22% 이상의 성능 향상을 확인할 수 있었습니다.
# 이 때, Nvidia가 제공하는 AMP를 사용했습니다. https://github.com/NVIDIA/apex
#
# ``python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 ./data``
@@ -232,6 +233,11 @@
# ``alexnet``, ``mnasnet0_5``, ``mnasnet0_75``, ``mnasnet1_0``, ``mnasnet1_3``, ``mobilenet_v2``, ``resnet101``, ``resnet152``, ``resnet18``, ``resnet34``, ``resnet50``, ``resnext50_32x4d``, ``shufflenet_v2_x0_5``, ``shufflenet_v2_x1_0``, ``shufflenet_v2_x1_5``, ``shufflenet_v2_x2_0``, ``squeezenet1_0``, ``squeezenet1_1``, ``vgg11``, ``vgg11_bn``, ``vgg13``, ``vgg13_bn``, ``vgg16``, ``vgg16_bn``, ``vgg19``, ``vgg19_bn``, ``wide_resnet101_2``, ``wide_resnet50_2``
#
+######################################################################
+# 아래 목록의 모델들은 Channels last 형식을 전적으로 지원하며 Intel(R) Xeon(R) Ice Lake (또는 최신) CPU에서 26%-76% 성능 향상을 보여줍니다:
+# ``alexnet``, ``densenet121``, ``densenet161``, ``densenet169``, ``googlenet``, ``inception_v3``, ``mnasnet0_5``, ``mnasnet1_0``, ``resnet101``, ``resnet152``, ``resnet18``, ``resnet34``, ``resnet50``, ``resnext101_32x8d``, ``resnext50_32x4d``, ``shufflenet_v2_x0_5``, ``shufflenet_v2_x1_0``, ``squeezenet1_0``, ``squeezenet1_1``, ``vgg11``, ``vgg11_bn``, ``vgg13``, ``vgg13_bn``, ``vgg16``, ``vgg16_bn``, ``vgg19``, ``vgg19_bn``, ``wide_resnet101_2``, ``wide_resnet50_2``
+#
+
######################################################################
# 기존 모델들 변환하기
# --------------------------
diff --git a/intermediate_source/named_tensor_tutorial.py b/intermediate_source/named_tensor_tutorial.py
deleted file mode 100644
index 349416040..000000000
--- a/intermediate_source/named_tensor_tutorial.py
+++ /dev/null
@@ -1,545 +0,0 @@
-# -*- coding: utf-8 -*-
-"""
-(prototype) Introduction to Named Tensors in PyTorch
-*******************************************************
-**Author**: `Richard Zou `_
-
-Named Tensors aim to make tensors easier to use by allowing users to associate
-explicit names with tensor dimensions. In most cases, operations that take
-dimension parameters will accept dimension names, avoiding the need to track
-dimensions by position. In addition, named tensors use names to automatically
-check that APIs are being used correctly at runtime, providing extra safety.
-Names can also be used to rearrange dimensions, for example, to support
-"broadcasting by name" rather than "broadcasting by position".
-
-This tutorial is intended as a guide to the functionality that will
-be included with the 1.3 launch. By the end of it, you will be able to:
-
-- Create Tensors with named dimensions, as well as remove or rename those
- dimensions
-- Understand the basics of how operations propagate dimension names
-- See how naming dimensions enables clearer code in two key areas:
- - Broadcasting operations
- - Flattening and unflattening dimensions
-
-Finally, we'll put this into practice by writing a multi-head attention module
-using named tensors.
-
-Named tensors in PyTorch are inspired by and done in collaboration with
-`Sasha Rush `_.
-Sasha proposed the original idea and proof of concept in his
-`January 2019 blog post `_.
-
-Basics: named dimensions
-========================
-
-PyTorch now allows Tensors to have named dimensions; factory functions
-take a new `names` argument that associates a name with each dimension.
-This works with most factory functions, such as
-
-- `tensor`
-- `empty`
-- `ones`
-- `zeros`
-- `randn`
-- `rand`
-
-Here we construct a tensor with names:
-"""
-
-import torch
-imgs = torch.randn(1, 2, 2, 3, names=('N', 'C', 'H', 'W'))
-print(imgs.names)
-
-######################################################################
-# Unlike in
-# `the original named tensors blogpost `_,
-# named dimensions are ordered: ``tensor.names[i]`` is the name of the ``i`` th
-# dimension of ``tensor``.
-#
-# There are two ways to rename a ``Tensor``'s dimensions:
-
-# Method #1: set the .names attribute (this changes name in-place)
-imgs.names = ['batch', 'channel', 'width', 'height']
-print(imgs.names)
-
-# Method #2: specify new names (this changes names out-of-place)
-imgs = imgs.rename(channel='C', width='W', height='H')
-print(imgs.names)
-
-######################################################################
-# The preferred way to remove names is to call ``tensor.rename(None)``:
-
-imgs = imgs.rename(None)
-print(imgs.names)
-
-######################################################################
-# Unnamed tensors (tensors with no named dimensions) still work as
-# normal and do not have names in their ``repr``.
-
-unnamed = torch.randn(2, 1, 3)
-print(unnamed)
-print(unnamed.names)
-
-######################################################################
-# Named tensors do not require that all dimensions be named.
-
-imgs = torch.randn(3, 1, 1, 2, names=('N', None, None, None))
-print(imgs.names)
-
-######################################################################
-# Because named tensors can coexist with unnamed tensors, we need a nice way to
-# write named tensor-aware code that works with both named and unnamed tensors.
-# Use ``tensor.refine_names(*names)`` to refine dimensions and lift unnamed
-# dims to named dims. Refining a dimension is defined as a "rename" with the
-# following constraints:
-#
-# - A ``None`` dim can be refined to have any name
-# - A named dim can only be refined to have the same name.
-
-imgs = torch.randn(3, 1, 1, 2)
-named_imgs = imgs.refine_names('N', 'C', 'H', 'W')
-print(named_imgs.names)
-
-# Refine the last two dims to 'H' and 'W'. In Python 2, use the string '...'
-# instead of ...
-named_imgs = imgs.refine_names(..., 'H', 'W')
-print(named_imgs.names)
-
-
-def catch_error(fn):
- try:
- fn()
- assert False
- except RuntimeError as err:
- err = str(err)
- if len(err) > 180:
- err = err[:180] + "..."
- print(err)
-
-
-named_imgs = imgs.refine_names('N', 'C', 'H', 'W')
-
-# Tried to refine an existing name to a different name
-catch_error(lambda: named_imgs.refine_names('N', 'C', 'H', 'width'))
-
-######################################################################
-# Most simple operations propagate names. The ultimate goal for named tensors
-# is for all operations to propagate names in a reasonable, intuitive manner.
-# Support for many common operations has been added at the time of the 1.3
-# release; here, for example, is ``.abs()``:
-
-print(named_imgs.abs().names)
-
-######################################################################
-# Accessors and Reduction
-# -----------------------
-#
-# One can use dimension names to refer to dimensions instead of the positional
-# dimension. These operations also propagate names. Indexing (basic and
-# advanced) has not been implemented yet but is on the roadmap. Using the
-# ``named_imgs`` tensor from above, we can do:
-
-output = named_imgs.sum('C') # Perform a sum over the channel dimension
-print(output.names)
-
-img0 = named_imgs.select('N', 0) # get one image
-print(img0.names)
-
-######################################################################
-# Name inference
-# --------------
-#
-# Names are propagated on operations in a two step process called
-# **name inference**:
-#
-# 1. **Check names**: an operator may perform automatic checks at runtime that
-# check that certain dimension names must match.
-# 2. **Propagate names**: name inference propagates output names to output
-# tensors.
-#
-# Let's go through the very small example of adding 2 one-dim tensors with no
-# broadcasting.
-
-x = torch.randn(3, names=('X',))
-y = torch.randn(3)
-z = torch.randn(3, names=('Z',))
-
-######################################################################
-# **Check names**: first, we will check whether the names of these two tensors
-# *match*. Two names match if and only if they are equal (string equality) or
-# at least one is ``None`` (``None`` is essentially a special wildcard name).
-# The only one of these three that will error, therefore, is ``x + z``:
-
-catch_error(lambda: x + z)
-
-######################################################################
-# **Propagate names**: *unify* the two names by returning the most refined name
-# of the two. With ``x + y``, ``X`` is more refined than ``None``.
-
-print((x + y).names)
-
-######################################################################
-# Most name inference rules are straightforward but some of them can have
-# unexpected semantics. Let's go through a couple you're likely to encounter:
-# broadcasting and matrix multiply.
-#
-# Broadcasting
-# ^^^^^^^^^^^^
-#
-# Named tensors do not change broadcasting behavior; they still broadcast by
-# position. However, when checking two dimensions for if they can be
-# broadcasted, PyTorch also checks that the names of those dimensions match.
-#
-# This results in named tensors preventing unintended alignment during
-# operations that broadcast. In the below example, we apply a
-# ``per_batch_scale`` to ``imgs``.
-
-imgs = torch.randn(2, 2, 2, 2, names=('N', 'C', 'H', 'W'))
-per_batch_scale = torch.rand(2, names=('N',))
-catch_error(lambda: imgs * per_batch_scale)
-
-######################################################################
-# Without ``names``, the ``per_batch_scale`` tensor is aligned with the last
-# dimension of ``imgs``, which is not what we intended. We really wanted to
-# perform the operation by aligning ``per_batch_scale`` with the batch
-# dimension of ``imgs``.
-# See the new "explicit broadcasting by names" functionality for how to
-# align tensors by name, covered below.
-#
-# Matrix multiply
-# ^^^^^^^^^^^^^^^
-#
-# ``torch.mm(A, B)`` performs a dot product between the second dim of ``A``
-# and the first dim of ``B``, returning a tensor with the first dim of ``A``
-# and the second dim of ``B``. (other matmul functions, such as
-# ``torch.matmul``, ``torch.mv``, and ``torch.dot``, behave similarly).
-
-markov_states = torch.randn(128, 5, names=('batch', 'D'))
-transition_matrix = torch.randn(5, 5, names=('in', 'out'))
-
-# Apply one transition
-new_state = markov_states @ transition_matrix
-print(new_state.names)
-
-######################################################################
-# As you can see, matrix multiply does not check if the contracted dimensions
-# have the same name.
-#
-# Next, we'll cover two new behaviors that named tensors enable: explicit
-# broadcasting by names and flattening and unflattening dimensions by names
-#
-# New behavior: Explicit broadcasting by names
-# --------------------------------------------
-#
-# One of the main complaints about working with multiple dimensions is the need
-# to ``unsqueeze`` "dummy" dimensions so that operations can occur.
-# For example, in our per-batch-scale example before, with unnamed tensors
-# we'd do the following:
-
-imgs = torch.randn(2, 2, 2, 2) # N, C, H, W
-per_batch_scale = torch.rand(2) # N
-
-correct_result = imgs * per_batch_scale.view(2, 1, 1, 1) # N, C, H, W
-incorrect_result = imgs * per_batch_scale.expand_as(imgs)
-assert not torch.allclose(correct_result, incorrect_result)
-
-######################################################################
-# We can make these operations safer (and easily agnostic to the number of
-# dimensions) by using names. We provide a new ``tensor.align_as(other)``
-# operation that permutes the dimensions of tensor to match the order specified
-# in ``other.names``, adding one-sized dimensions where appropriate
-# (``tensor.align_to(*names)`` works as well):
-
-imgs = imgs.refine_names('N', 'C', 'H', 'W')
-per_batch_scale = per_batch_scale.refine_names('N')
-
-named_result = imgs * per_batch_scale.align_as(imgs)
-# note: named tensors do not yet work with allclose
-assert torch.allclose(named_result.rename(None), correct_result)
-
-######################################################################
-# New behavior: Flattening and unflattening dimensions by names
-# -------------------------------------------------------------
-#
-# One common operation is flattening and unflattening dimensions. Right now,
-# users perform this using either ``view``, ``reshape``, or ``flatten``; use
-# cases include flattening batch dimensions to send tensors into operators that
-# must take inputs with a certain number of dimensions (i.e., conv2d takes 4D
-# input).
-#
-# To make these operation more semantically meaningful than view or reshape, we
-# introduce a new ``tensor.unflatten(dim, namedshape)`` method and update
-# ``flatten`` to work with names: ``tensor.flatten(dims, new_dim)``.
-#
-# ``flatten`` can only flatten adjacent dimensions but also works on
-# non-contiguous dims. One must pass into ``unflatten`` a **named shape**,
-# which is a list of ``(dim, size)`` tuples, to specify how to unflatten the
-# dim. It is possible to save the sizes during a ``flatten`` for ``unflatten``
-# but we do not yet do that.
-
-imgs = imgs.flatten(['C', 'H', 'W'], 'features')
-print(imgs.names)
-
-imgs = imgs.unflatten('features', (('C', 2), ('H', 2), ('W', 2)))
-print(imgs.names)
-
-######################################################################
-# Autograd support
-# ----------------
-#
-# Autograd currently ignores names on all tensors and just treats them like
-# regular tensors. Gradient computation is correct but we lose the safety that
-# names give us. It is on the roadmap to introduce handling of names to
-# autograd.
-
-x = torch.randn(3, names=('D',))
-weight = torch.randn(3, names=('D',), requires_grad=True)
-loss = (x - weight).abs()
-grad_loss = torch.randn(3)
-loss.backward(grad_loss)
-
-correct_grad = weight.grad.clone()
-print(correct_grad) # Unnamed for now. Will be named in the future
-
-weight.grad.zero_()
-grad_loss = grad_loss.refine_names('C')
-loss = (x - weight).abs()
-# Ideally we'd check that the names of loss and grad_loss match, but we don't
-# yet
-loss.backward(grad_loss)
-
-print(weight.grad) # still unnamed
-assert torch.allclose(weight.grad, correct_grad)
-
-######################################################################
-# Other supported (and unsupported) features
-# ------------------------------------------
-#
-# `See here `_ for a
-# detailed breakdown of what is supported with the 1.3 release.
-#
-# In particular, we want to call out three important features that are not
-# currently supported:
-#
-# - Saving or loading named tensors via ``torch.save`` or ``torch.load``
-# - Multi-processing via ``torch.multiprocessing``
-# - JIT support; for example, the following will error
-
-imgs_named = torch.randn(1, 2, 2, 3, names=('N', 'C', 'H', 'W'))
-
-
-@torch.jit.script
-def fn(x):
- return x
-
-
-catch_error(lambda: fn(imgs_named))
-
-######################################################################
-# As a workaround, please drop names via ``tensor = tensor.rename(None)``
-# before using anything that does not yet support named tensors.
-#
-# Longer example: Multi-head attention
-# --------------------------------------
-#
-# Now we'll go through a complete example of implementing a common
-# PyTorch ``nn.Module``: multi-head attention. We assume the reader is already
-# familiar with multi-head attention; for a refresher, check out
-# `this explanation `_
-# or
-# `this explanation `_.
-#
-# We adapt the implementation of multi-head attention from
-# `ParlAI `_; specifically
-# `here `_.
-# Read through the code at that example; then, compare with the code below,
-# noting that there are four places labeled (I), (II), (III), and (IV), where
-# using named tensors enables more readable code; we will dive into each of
-# these after the code block.
-
-import torch.nn as nn
-import torch.nn.functional as F
-import math
-
-
-class MultiHeadAttention(nn.Module):
- def __init__(self, n_heads, dim, dropout=0):
- super(MultiHeadAttention, self).__init__()
- self.n_heads = n_heads
- self.dim = dim
-
- self.attn_dropout = nn.Dropout(p=dropout)
- self.q_lin = nn.Linear(dim, dim)
- self.k_lin = nn.Linear(dim, dim)
- self.v_lin = nn.Linear(dim, dim)
- nn.init.xavier_normal_(self.q_lin.weight)
- nn.init.xavier_normal_(self.k_lin.weight)
- nn.init.xavier_normal_(self.v_lin.weight)
- self.out_lin = nn.Linear(dim, dim)
- nn.init.xavier_normal_(self.out_lin.weight)
-
- def forward(self, query, key=None, value=None, mask=None):
- # (I)
- query = query.refine_names(..., 'T', 'D')
- self_attn = key is None and value is None
- if self_attn:
- mask = mask.refine_names(..., 'T')
- else:
- mask = mask.refine_names(..., 'T', 'T_key') # enc attn
-
- dim = query.size('D')
- assert dim == self.dim, \
- f'Dimensions do not match: {dim} query vs {self.dim} configured'
- assert mask is not None, 'Mask is None, please specify a mask'
- n_heads = self.n_heads
- dim_per_head = dim // n_heads
- scale = math.sqrt(dim_per_head)
-
- # (II)
- def prepare_head(tensor):
- tensor = tensor.refine_names(..., 'T', 'D')
- return (tensor.unflatten('D', [('H', n_heads), ('D_head', dim_per_head)])
- .align_to(..., 'H', 'T', 'D_head'))
-
- assert value is None
- if self_attn:
- key = value = query
- elif value is None:
- # key and value are the same, but query differs
- key = key.refine_names(..., 'T', 'D')
- value = key
- dim = key.size('D')
-
- # Distinguish between query_len (T) and key_len (T_key) dims.
- k = prepare_head(self.k_lin(key)).rename(T='T_key')
- v = prepare_head(self.v_lin(value)).rename(T='T_key')
- q = prepare_head(self.q_lin(query))
-
- dot_prod = q.div_(scale).matmul(k.align_to(..., 'D_head', 'T_key'))
- dot_prod.refine_names(..., 'H', 'T', 'T_key') # just a check
-
- # (III)
- attn_mask = (mask == 0).align_as(dot_prod)
- dot_prod.masked_fill_(attn_mask, -float(1e20))
-
- attn_weights = self.attn_dropout(F.softmax(dot_prod / scale,
- dim='T_key'))
-
- # (IV)
- attentioned = (
- attn_weights.matmul(v).refine_names(..., 'H', 'T', 'D_head')
- .align_to(..., 'T', 'H', 'D_head')
- .flatten(['H', 'D_head'], 'D')
- )
-
- return self.out_lin(attentioned).refine_names(..., 'T', 'D')
-
-######################################################################
-# **(I) Refining the input tensor dims**
-
-def forward(self, query, key=None, value=None, mask=None):
- # (I)
- query = query.refine_names(..., 'T', 'D')
-
-######################################################################
-# The ``query = query.refine_names(..., 'T', 'D')`` serves as enforcable documentation
-# and lifts input dimensions to being named. It checks that the last two dimensions
-# can be refined to ``['T', 'D']``, preventing potentially silent or confusing size
-# mismatch errors later down the line.
-#
-# **(II) Manipulating dimensions in prepare_head**
-
-# (II)
-def prepare_head(tensor):
- tensor = tensor.refine_names(..., 'T', 'D')
- return (tensor.unflatten('D', [('H', n_heads), ('D_head', dim_per_head)])
- .align_to(..., 'H', 'T', 'D_head'))
-
-######################################################################
-# The first thing to note is how the code clearly states the input and
-# output dimensions: the input tensor must end with the ``T`` and ``D`` dims
-# and the output tensor ends in ``H``, ``T``, and ``D_head`` dims.
-#
-# The second thing to note is how clearly the code describes what is going on.
-# prepare_head takes the key, query, and value and splits the embedding dim into
-# multiple heads, finally rearranging the dim order to be ``[..., 'H', 'T', 'D_head']``.
-# ParlAI implements ``prepare_head`` as the following, using ``view`` and ``transpose``
-# operations:
-
-def prepare_head(tensor):
- # input is [batch_size, seq_len, n_heads * dim_per_head]
- # output is [batch_size * n_heads, seq_len, dim_per_head]
- batch_size, seq_len, _ = tensor.size()
- tensor = tensor.view(batch_size, tensor.size(1), n_heads, dim_per_head)
- tensor = (
- tensor.transpose(1, 2)
- .contiguous()
- .view(batch_size * n_heads, seq_len, dim_per_head)
- )
- return tensor
-
-######################################################################
-# Our named tensor variant uses ops that, though more verbose, have more
-# semantic meaning than ``view`` and ``transpose`` and includes enforcable
-# documentation in the form of names.
-#
-# **(III) Explicit broadcasting by names**
-
-def ignore():
- # (III)
- attn_mask = (mask == 0).align_as(dot_prod)
- dot_prod.masked_fill_(attn_mask, -float(1e20))
-
-######################################################################
-# ``mask`` usually has dims ``[N, T]`` (in the case of self attention) or
-# ``[N, T, T_key]`` (in the case of encoder attention) while ``dot_prod``
-# has dims ``[N, H, T, T_key]``. To make ``mask`` broadcast correctly with
-# ``dot_prod``, we would usually `unsqueeze` dims ``1`` and ``-1`` in the case
-# of self attention or ``unsqueeze`` dim ``1`` in the case of encoder
-# attention. Using named tensors, we simply align ``attn_mask`` to ``dot_prod``
-# using ``align_as`` and stop worrying about where to ``unsqueeze`` dims.
-#
-# **(IV) More dimension manipulation using align_to and flatten**
-
-def ignore():
- # (IV)
- attentioned = (
- attn_weights.matmul(v).refine_names(..., 'H', 'T', 'D_head')
- .align_to(..., 'T', 'H', 'D_head')
- .flatten(['H', 'D_head'], 'D')
- )
-
-######################################################################
-# Here, as in (II), ``align_to`` and ``flatten`` are more semantically
-# meaningful than ``view`` and ``transpose`` (despite being more verbose).
-#
-# Running the example
-# -------------------
-
-n, t, d, h = 7, 5, 2 * 3, 3
-query = torch.randn(n, t, d, names=('N', 'T', 'D'))
-mask = torch.ones(n, t, names=('N', 'T'))
-attn = MultiHeadAttention(h, d)
-output = attn(query, mask=mask)
-# works as expected!
-print(output.names)
-
-######################################################################
-# The above works as expected. Furthermore, note that in the code we
-# did not mention the name of the batch dimension at all. In fact,
-# our ``MultiHeadAttention`` module is agnostic to the existence of batch
-# dimensions.
-
-query = torch.randn(t, d, names=('T', 'D'))
-mask = torch.ones(t, names=('T',))
-output = attn(query, mask=mask)
-print(output.names)
-
-######################################################################
-# Conclusion
-# ----------
-#
-# Thank you for reading! Named tensors are still very much in development;
-# if you have feedback and/or suggestions for improvement, please let us
-# know by creating `an issue `_.
diff --git a/intermediate_source/process_group_cpp_extension_tutorial.rst b/intermediate_source/process_group_cpp_extension_tutorial.rst
index da70fb62b..de029cb8e 100644
--- a/intermediate_source/process_group_cpp_extension_tutorial.rst
+++ b/intermediate_source/process_group_cpp_extension_tutorial.rst
@@ -3,6 +3,8 @@ Customize Process Group Backends Using Cpp Extensions
**Author**: `Feng Tian `__, `Shen Li `__
+.. note::
+ View the source code for this tutorial on `GitHub `__.
Prerequisites:
diff --git a/intermediate_source/reinforcement_q_learning.py b/intermediate_source/reinforcement_q_learning.py
index 4dd9801ec..96e86601c 100644
--- a/intermediate_source/reinforcement_q_learning.py
+++ b/intermediate_source/reinforcement_q_learning.py
@@ -5,7 +5,7 @@
**Author**: `Adam Paszke `_
**Translated by**: `황성수 `_
-This tutorial shows how to use PyTorch to train a DQN (Deep Q Learning) agent on the `OpenAI Gym `__
+This tutorial shows how to use PyTorch to train a DQN (Deep Q Learning) agent on the `OpenAI Gym `__
CartPole-v0 task.
@@ -14,7 +14,7 @@
The agent has to choose one of two actions, moving the cart to the left or to the right,
so that the attached pole stays upright.
You can find an official leaderboard with various algorithms and visualizations
-on the `Gym website `__.
+on the `Gym website `__.
.. figure:: /_static/img/cartpole.gif
:alt: cartpole
@@ -40,7 +40,7 @@
**Packages**
First, let's import the packages we need. To start with, we need
-`gym `__ for the environment.
+`gym `__ for the environment.
(Install it with `pip install gym`.)
We also use the following from PyTorch:
diff --git a/intermediate_source/rpc_async_execution.rst b/intermediate_source/rpc_async_execution.rst
index d2b4ff29f..68158e3b5 100644
--- a/intermediate_source/rpc_async_execution.rst
+++ b/intermediate_source/rpc_async_execution.rst
@@ -2,6 +2,8 @@ Implementing Batch RPC Processing Using Asynchronous Executions
===============================================================
**Author**: `Shen Li `_
+.. note::
+ View the source code for this tutorial on `GitHub `__.
Prerequisites:
@@ -190,7 +192,7 @@ implement batch RPC applications using the
`@rpc.functions.async_execution `__
decorator. In the next section, we re-implement the reinforcement learning
example in the previous
-`Getting started with Distributed RPC Framework `__
+`Getting started with Distributed RPC Framework `__
tutorial using batch processing, and demonstrate its impact on the training
speed.
@@ -264,7 +266,7 @@ which will be presented shortly, and this function will be decorated with
self.select_action = Agent.select_action_batch if batch else Agent.select_action
Compared to the previous tutorial
-`Getting started with Distributed RPC Framework `__,
+`Getting started with Distributed RPC Framework `__,
observers behave a little differently. Instead of exiting when the environment
is stopped, it always runs ``n_steps`` iterations in every episode. When the
environment returns, the observer simply resets the environment and start over
@@ -520,4 +522,4 @@ Learn More
- `Batch-Updating Parameter Server Source Code `__
- `Batch-Processing CartPole Solver `__
- `Distributed Autograd `__
-- `Distributed Pipeline Parallelism `__
\ No newline at end of file
+- `Distributed Pipeline Parallelism `__
diff --git a/intermediate_source/rpc_param_server_tutorial.rst b/intermediate_source/rpc_param_server_tutorial.rst
index 0d5d57b12..6d74f82a2 100644
--- a/intermediate_source/rpc_param_server_tutorial.rst
+++ b/intermediate_source/rpc_param_server_tutorial.rst
@@ -4,6 +4,9 @@ Implementing a Parameter Server Using Distributed RPC Framework
**Author**\ : `Rohan Varma `_
+.. note::
+ View the source code for this tutorial on `GitHub `__.
+
Prerequisites:
- `PyTorch Distributed Overview <../beginner/dist_overview.html>`__
@@ -13,7 +16,7 @@ This tutorial walks through a simple example of implementing a parameter server
Using the Distributed RPC Framework, we'll build an example where multiple trainers use RPC to communicate with the same parameter server and use `RRef `_ to access states on the remote parameter server instance. Each trainer will launch its dedicated backward pass in a distributed fashion through stitching of the autograd graph across multiple nodes using distributed autograd.
-**Note**\ : This tutorial covers the use of the Distributed RPC Framework, which is useful for splitting a model onto multiple machines, or for implementing a parameter-server training strategy where network trainers fetch parameters hosted on a different machine. If instead you are looking for replicating your model across many GPUs, please see the `Distributed Data Parallel tutorial `_. There is also another `RPC tutorial `_ that covers reinforcement learning and RNN use cases.
+**Note**\ : This tutorial covers the use of the Distributed RPC Framework, which is useful for splitting a model onto multiple machines, or for implementing a parameter-server training strategy where network trainers fetch parameters hosted on a different machine. If instead you are looking to replicate your model across many GPUs, please see the `Distributed Data Parallel tutorial `_. There is also another `RPC tutorial `_ that covers reinforcement learning and RNN use cases.
Let's start with the familiar: importing our required modules and defining a simple ConvNet that will train on the MNIST dataset. The below network is largely adopted from the network defined in the `pytorch/examples repo `_.
diff --git a/intermediate_source/rpc_tutorial.rst b/intermediate_source/rpc_tutorial.rst
index 9ab52c718..aaaa6022b 100644
--- a/intermediate_source/rpc_tutorial.rst
+++ b/intermediate_source/rpc_tutorial.rst
@@ -2,6 +2,8 @@ Getting Started with Distributed RPC Framework
=================================================
**Author**: `Shen Li `_
+.. note::
+ View the source code for this tutorial on `GitHub `__.
Prerequisites:
diff --git a/intermediate_source/torchserve_with_ipex.rst b/intermediate_source/torchserve_with_ipex.rst
new file mode 100644
index 000000000..caef69267
--- /dev/null
+++ b/intermediate_source/torchserve_with_ipex.rst
@@ -0,0 +1,394 @@
+Grokking PyTorch Intel CPU performance from first principles
+============================================================
+
+A case study on the TorchServe inference framework optimized with `Intel® Extension for PyTorch* `_.
+
+Authors: Min Jean Cho, Mark Saroufim
+
+Reviewers: Ashok Emani, Jiong Gong
+
+Getting strong out-of-the-box performance for deep learning on CPUs can be tricky, but it's much easier if you're aware of the main problems that affect performance, how to measure them, and how to solve them.
+
+TL;DR
+
++-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
+| Problem | How to measure it | Solution |
++-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
+| Bottlenecked GEMM execution units | - `Imbalance or Serial Spinning `_ | Avoid using logical cores by setting thread affinity to physical cores via core pinning |
+| | - `Front-End Bound `_ | |
+| | - `Core Bound `_ | |
++-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
+| Non Uniform Memory Access (NUMA) | - Local vs. remote memory access | Avoid cross-socket computation by setting thread affinity to a specific socket via core pinning |
+| | - `UPI Utilization `_ | |
+| | - Latency in memory accesses | |
+| | - Thread migration | |
++-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
+
+*GEMM (General Matrix Multiply)* operations run on fused-multiply-add (FMA) or dot-product (DP) execution units, which become a bottleneck and cause delays in thread waiting/*spinning at synchronization* barriers when *hyperthreading* is enabled - using logical cores causes insufficient concurrency for all working threads, as each logical thread *contends for the same core resources*. Instead, if we use 1 thread per physical core, we avoid this contention. So we generally recommend *avoiding logical cores* by setting CPU *thread affinity* to physical cores via *core pinning*.
+
+Multi-socket systems have *Non-Uniform Memory Access (NUMA)*, a shared memory architecture that describes the placement of main memory modules with respect to processors. If a process is not NUMA-aware, slow *remote memory* is frequently accessed when *threads migrate* across sockets via the *Intel Ultra Path Interconnect (UPI)* at runtime. We address this problem by setting CPU *thread affinity* to a specific socket via *core pinning*.
+
+With these principles in mind, proper CPU runtime configuration can significantly boost out-of-the-box performance.
+
+In this blog, we'll walk you through the important runtime configurations you should be aware of from the `CPU Performance Tuning Guide `_, explain how they work, how to profile them, and how to integrate them within a model serving framework like `TorchServe `_ via an easy-to-use `launch script `_ which we’ve `integrated `_ :superscript:`1` natively.
+
+We’ll explain all of these ideas :strong:`visually` from :strong:`first principles` with lots of :strong:`profiles`, and show you how we applied what we learned to improve out-of-the-box CPU performance on TorchServe.
+
+1. The feature has to be explicitly enabled by setting *cpu_launcher_enable=true* in *config.properties*.
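+
+For example, a minimal *config.properties* enabling the launcher integration could
+contain just the line below (a sketch; all other TorchServe settings keep their
+defaults):
+
+.. code::
+
+    cpu_launcher_enable=true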
+
+Avoid logical cores for deep learning
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Avoiding logical cores for deep learning workloads generally improves performance. To understand this, let us take a step back to GEMM.
+
+:strong:`Optimizing GEMM optimizes deep learning`
+
+The majority of time in deep learning training or inference is spent on millions of repeated GEMM operations, which are at the core of fully connected layers. Fully connected layers have been used for decades since the multi-layer perceptron (MLP) `proved to be a universal approximator of any continuous function `_. Any MLP can be entirely represented as GEMM. And even a convolution can be represented as a GEMM by using a `Toeplitz matrix `_.
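+
+To make this concrete, here is a minimal sketch (not part of the original case
+study) showing that the forward pass of a fully connected layer is exactly a
+matrix multiply plus a bias, i.e. a GEMM:
+
+.. code:: python
+
+    import torch
+    import torch.nn as nn
+
+    linear = nn.Linear(in_features=64, out_features=32)
+    x = torch.randn(8, 64)
+
+    # nn.Linear computes y = x @ W^T + b, which is a GEMM
+    y_module = linear(x)
+    y_gemm = x @ linear.weight.t() + linear.bias
+    print(torch.allclose(y_module, y_gemm, atol=1e-6))  # True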
+
+Returning to the original topic, most GEMM operators benefit from not using hyperthreading, because the majority of time in deep learning training or inference is spent on millions of repeated GEMM operations running on fused-multiply-add (FMA) or dot-product (DP) execution units shared by hyperthreading cores. With hyperthreading enabled, OpenMP threads will contend for the same GEMM execution units.
+
+.. figure:: /_static/img/torchserve-ipex-images/1_.png
+ :width: 70%
+ :align: center
+
+If 2 logical threads run GEMM at the same time, they will be sharing the same core resources and become front-end bound, such that the overhead from this front-end bound is greater than the gain from running both logical threads at the same time.
+
+Therefore we generally recommend avoiding logical cores for deep learning workloads to achieve good performance. The launch script by default uses physical cores only; however, users can easily experiment with logical vs. physical cores by simply toggling the ``--use_logical_core`` launch script knob.
+
+:strong:`Exercise`
+
+We'll use the following example of feeding ResNet50 with a dummy tensor:
+
+.. code:: python
+
+    import torch
+    import torchvision.models as models
+    import time
+
+    model = models.resnet50(pretrained=False)
+    model.eval()
+    data = torch.rand(1, 3, 224, 224)
+
+    # warm up
+    for _ in range(100):
+        model(data)
+
+    start = time.time()
+    for _ in range(100):
+        model(data)
+    end = time.time()
+    print('Inference took {:.2f} ms on average'.format((end - start) / 100 * 1000))
+
+Throughout the blog, we'll use `Intel® VTune™ Profiler `_ to profile and verify optimizations. And we'll run all exercises on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. The CPU information is shown in Figure 2.1.
+
+The environment variable ``OMP_NUM_THREADS`` is used to set the number of threads for parallel regions. We'll compare ``OMP_NUM_THREADS=2`` with (1) use of logical cores and (2) use of physical cores only.
+
+(1) Both OpenMP threads trying to utilize the same GEMM execution units shared by hyperthreading cores (0, 56)
+
+We can visualize this by running the ``htop`` command on Linux as shown below.
+
+.. figure:: /_static/img/torchserve-ipex-images/2.png
+ :width: 100%
+ :align: center
+
+
+.. figure:: /_static/img/torchserve-ipex-images/3.png
+ :width: 100%
+ :align: center
+
+We notice that the Spin Time is flagged, and Imbalance or Serial Spinning contributed to the majority of it - 4.980 seconds out of the 8.982 seconds total. The Imbalance or Serial Spinning when using logical cores is due to insufficient concurrency of working threads as each logical thread contends for the same core resources.
+
+The Top Hotspots section of the execution summary indicates that ``__kmp_fork_barrier`` took 4.589 seconds of CPU time - during 9.33% of the CPU execution time, threads were just spinning at this barrier due to thread synchronization.
+
+(2) Each OpenMP thread utilizing GEMM execution units in respective physical cores (0,1)
+
+
+.. figure:: /_static/img/torchserve-ipex-images/4.png
+ :width: 80%
+ :align: center
+
+
+.. figure:: /_static/img/torchserve-ipex-images/5.png
+ :width: 80%
+ :align: center
+
+We first note that the execution time dropped from 32 seconds to 23 seconds by avoiding logical cores. While there's still some non-negligible Imbalance or Serial Spinning, we note a relative improvement from 4.980 seconds to 3.887 seconds.
+
+By not using logical threads (instead, using 1 thread per physical core), we avoid logical threads contending for the same core resources. The Top Hotspots section also indicates a relative improvement of ``__kmp_fork_barrier`` time from 4.589 seconds to 3.530 seconds.
+
+Local memory access is always faster than remote memory access
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We generally recommend binding a process to a local socket such that the process does not migrate across sockets. The goal of doing so is to utilize the high-speed cache on local memory and to avoid remote memory access, which can be ~2x slower.
+
+
+.. figure:: /_static/img/torchserve-ipex-images/6.png
+ :width: 80%
+ :align: center
+
+Figure 1. Two-socket configuration
+
+Figure 1 shows a typical two-socket configuration. Notice that each socket has its own local memory. Sockets are connected to each other via the Intel Ultra Path Interconnect (UPI), which allows each socket to access the local memory of the other socket, called remote memory. Local memory access is always faster than remote memory access.
+
+.. figure:: /_static/img/torchserve-ipex-images/7.png
+ :width: 50%
+ :align: center
+
+Figure 2.1. CPU information
+
+Users can get their CPU information by running the ``lscpu`` command on their Linux machine. Figure 2.1. shows an example of ``lscpu`` execution on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. Notice that there are 28 cores per socket and 2 threads per core (i.e., hyperthreading is enabled). In other words, there are 28 logical cores in addition to the 28 physical cores, giving a total of 56 cores per socket. And there are 2 sockets, giving a total of 112 cores (``Thread(s) per core`` x ``Core(s) per socket`` x ``Socket(s)``).
+
+.. figure:: /_static/img/torchserve-ipex-images/8.png
+ :width: 100%
+ :align: center
+
+Figure 2.2. CPU information
+
+The 2 sockets are mapped to 2 NUMA nodes (NUMA node 0, NUMA node 1) respectively. Physical cores are indexed prior to logical cores. As shown in Figure 2.2., the first 28 physical cores (0-27) and the first 28 logical cores (56-83) on the first socket are on NUMA node 0. And the second 28 physical cores (28-55) and the second 28 logical cores (84-111) on the second socket are on NUMA node 1. Cores on the same socket share local memory and last level cache (LLC) which is much faster than cross-socket communication via Intel UPI.
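+
+For this machine's layout, mapping a core index to its NUMA node can be written
+as a small helper (illustrative only; it simply encodes the numbering described
+above and is not a general-purpose utility):
+
+.. code:: python
+
+    def numa_node_of(core_id, cores_per_socket=28, sockets=2):
+        # physical cores 0-27 and logical cores 56-83  -> node 0,
+        # physical cores 28-55 and logical cores 84-111 -> node 1
+        physical_total = cores_per_socket * sockets
+        return (core_id % physical_total) // cores_per_socket
+
+    print(numa_node_of(0), numa_node_of(56))    # 0 0
+    print(numa_node_of(28), numa_node_of(84))   # 1 1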
+
+Now that we understand NUMA, cross-socket (UPI) traffic, and local vs. remote memory access in multi-processor systems, let's profile and verify our understanding.
+
+:strong:`Exercise`
+
+We'll reuse the ResNet50 example above.
+
+As we did not pin threads to processor cores of a specific socket, the operating system periodically schedules threads on processor cores located in different sockets.
+
+.. figure:: /_static/img/torchserve-ipex-images/9.gif
+ :width: 100%
+ :align: center
+
+Figure 3. CPU usage of a non-NUMA-aware application. 1 main worker thread was launched, then it launched as many threads as there are physical cores (56) on all cores, including logical cores.
+
+(Aside: If the number of threads is not set by `torch.set_num_threads `_, the default number of threads is the number of physical cores in a hyperthreading enabled system. This can be verified by `torch.get_num_threads `_. Hence we see above about half of the cores busy running the example script.)
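+
+As a quick check (a minimal sketch, not part of the original exercise), the
+intra-op thread count can be inspected and overridden from Python:
+
+.. code:: python
+
+    import torch
+
+    # defaults to the number of physical cores on a hyperthreading-enabled system
+    print(torch.get_num_threads())
+
+    # explicitly use 2 intra-op threads instead
+    torch.set_num_threads(2)
+    print(torch.get_num_threads())  # 2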
+
+.. figure:: /_static/img/torchserve-ipex-images/10.png
+ :width: 100%
+ :align: center
+
+Figure 4. Non-Uniform Memory Access Analysis graph
+
+
+Figure 4 compares local vs. remote memory access over time. We verify the usage of remote memory, which could result in sub-optimal performance.
+
+:strong:`Set thread affinity to reduce remote memory access and cross-socket (UPI) traffic`
+
+Pinning threads to cores on the same socket helps maintain locality of memory access. In this example, we'll pin to the physical cores on the first NUMA node (0-27). With the launch script, users can easily experiment with NUMA node configurations by simply toggling the ``--node_id`` launch script knob.
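+
+If you want to reproduce this pinning without the launch script, one option (a
+sketch assuming a Linux machine with the core numbering shown above) is to
+restrict the process's CPU affinity before the OpenMP threads are spawned:
+
+.. code:: python
+
+    import os
+    import torch
+
+    # allow this process, and the threads it spawns, to run only on the
+    # physical cores of the first NUMA node (cores 0-27 on this machine)
+    os.sched_setaffinity(0, set(range(28)))
+    torch.set_num_threads(28)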
+
+Let's visualize the CPU usage now.
+
+.. figure:: /_static/img/torchserve-ipex-images/11.gif
+ :width: 100%
+ :align: center
+
+Figure 5. CPU usage of NUMA-aware application
+
+1 main worker thread was launched, then it launched threads on all physical cores on the first NUMA node.
+
+.. figure:: /_static/img/torchserve-ipex-images/12.png
+ :width: 100%
+ :align: center
+
+Figure 6. Non-Uniform Memory Access Analysis graph
+
+As shown in Figure 6, almost all memory accesses are now local accesses.
+
+Efficient CPU usage with core pinning for multi-worker inference
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When running multi-worker inference, cores are overlapped (or shared) between workers, causing inefficient CPU usage. To address this problem, the launch script equally divides the number of available cores by the number of workers such that each worker is pinned to its assigned cores during runtime.
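+
+Conceptually, the assignment looks like the following sketch (illustrative only;
+the actual logic lives inside the launch script):
+
+.. code:: python
+
+    num_physical_cores = 56   # 28 cores per socket x 2 sockets on this machine
+    num_workers = 4
+    cores_per_worker = num_physical_cores // num_workers  # 14
+
+    for worker_id in range(num_workers):
+        start = worker_id * cores_per_worker
+        end = start + cores_per_worker - 1
+        print(f"worker {worker_id}: cores {start}-{end}")
+    # worker 0: cores 0-13,  worker 1: cores 14-27,
+    # worker 2: cores 28-41, worker 3: cores 42-55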
+
+:strong:`Exercise with TorchServe`
+
+For this exercise, let's apply the CPU performance tuning principles and recommendations that we have discussed so far to `TorchServe apache-bench benchmarking `_.
+
+We'll use ResNet50 with 4 workers, concurrency 100, requests 10,000. All other parameters (e.g., batch_size, input, etc) are the same as the `default parameters `_.
+
+We'll compare the following three configurations:
+
+(1) default TorchServe setting (no core pinning)
+
+(2) `torch.set_num_threads `_ = ``number of physical cores / number of workers`` (no core pinning)
+
+(3) core pinning via the launch script
+
+After this exercise, we will have verified with a real TorchServe use case that avoiding logical cores and preferring local memory access via core pinning improve performance.
+
+1. Default TorchServe setting (no core pinning)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The `base_handler `_ doesn't explicitly set `torch.set_num_threads `_. Hence the default number of threads is the number of physical CPU cores, as described `here `_. Users can check the number of threads with `torch.get_num_threads `_ in the base_handler. Each of the 4 main worker threads launches as many threads as there are physical cores (56), launching a total of 56x4 = 224 threads, which is more than the total number of cores, 112. Therefore cores are guaranteed to be heavily overlapped, with high logical core utilization - multiple workers use multiple cores at the same time. Furthermore, because threads are not affinitized to specific CPU cores, the operating system periodically schedules threads to cores located in different sockets.
+
+1. CPU usage
+
+.. figure:: /_static/img/torchserve-ipex-images/13.png
+ :width: 100%
+ :align: center
+
+4 main worker threads were launched, then each launched as many threads as there are physical cores (56), on all cores including logical cores.
+
+2. Core Bound stalls
+
+.. figure:: /_static/img/torchserve-ipex-images/14.png
+ :width: 80%
+ :align: center
+
+We observe a very high Core Bound stall of 88.4%, decreasing pipeline efficiency. Core Bound stalls indicate sub-optimal use of available execution units in the CPU. For example, several GEMM instructions in a row competing for fused-multiply-add (FMA) or dot-product (DP) execution units shared by hyperthreading cores could cause Core Bound stalls. And as described in the previous section, use of logical cores amplifies this problem.
+
+
+.. figure:: /_static/img/torchserve-ipex-images/15.png
+ :width: 40%
+ :align: center
+
+.. figure:: /_static/img/torchserve-ipex-images/16.png
+ :width: 50%
+ :align: center
+
+An empty pipeline slot not filled with micro-ops (uOps) is attributed to a stall. For example, without core pinning CPU time may not be spent effectively on compute but on other operations like thread scheduling by the Linux kernel. We see above that ``__sched_yield`` contributed to the majority of the Spin Time.
+
+3. Thread Migration
+
+Without core pinning, the scheduler may migrate a thread executing on one core to a different core. Thread migration can disassociate the thread from data that has already been fetched into the caches, resulting in longer data access latencies. This problem is exacerbated in NUMA systems when a thread migrates across sockets. Data that was fetched into the high-speed cache on local memory then becomes remote memory, which is much slower.
+
+.. figure:: /_static/img/torchserve-ipex-images/17.png
+ :width: 50%
+ :align: center
+
+Generally the total number of threads should be less than or equal to the total number of threads supported by the core. In the above example, we notice a large number of threads executing on core_51 instead of the expected 2 threads (since hyperthreading is enabled in Intel(R) Xeon(R) Platinum 8180 CPUs). This indicates thread migration.
+
+.. figure:: /_static/img/torchserve-ipex-images/18.png
+ :width: 80%
+ :align: center
+
+Additionally, notice that thread (TID:97097) was executing on a large number of CPU cores, indicating CPU migration. For example, this thread was executing on cpu_81, then migrated to cpu_14, then migrated to cpu_5, and so on. Furthermore, note that this thread migrated cross socket back and forth many times, resulting in very inefficient memory access. For example, this thread executed on cpu_70 (NUMA node 0), then migrated to cpu_100 (NUMA node 1), then migrated to cpu_24 (NUMA node 0).
+
+4. Non Uniform Memory Access Analysis
+
+.. figure:: /_static/img/torchserve-ipex-images/19.png
+ :width: 100%
+ :align: center
+
+Comparing local vs. remote memory access over time, we observe that about half (51.09%) of the memory accesses were remote, indicating a sub-optimal NUMA configuration.
+
+2. torch.set_num_threads = ``number of physical cores / number of workers`` (no core pinning)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For an apples-to-apples comparison with the launcher's core pinning, we'll set the number of threads to the number of physical cores divided by the number of workers (the launcher does this internally). Add the following code snippet to the `base_handler `_:
+
+.. code:: python
+
+    torch.set_num_threads(num_physical_cores // num_workers)
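+
+(``num_physical_cores`` and ``num_workers`` are placeholders here; the number of
+physical cores could, for example, be obtained with ``psutil.cpu_count(logical=False)``.)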
+
+As before, since core pinning is not used, these threads are not affinitized to specific CPU cores, and the operating system periodically schedules threads on cores located in different sockets.
+
+1. CPU usage
+
+.. figure:: /_static/img/torchserve-ipex-images/20.gif
+ :width: 100%
+ :align: center
+
+4 main worker threads were launched, then each launched ``num_physical_cores/num_workers`` (14) threads on all cores, including logical cores.
+
+2. Core Bound stalls
+
+.. figure:: /_static/img/torchserve-ipex-images/21.png
+ :width: 80%
+ :align: center
+
+Although the percentage of Core Bound stalls has decreased from 88.4% to 73.5%, it is still very high.
+
+.. figure:: /_static/img/torchserve-ipex-images/22.png
+ :width: 40%
+ :align: center
+
+.. figure:: /_static/img/torchserve-ipex-images/23.png
+ :width: 50%
+ :align: center
+
+3. Thread Migration
+
+.. figure:: /_static/img/torchserve-ipex-images/24.png
+ :width: 75%
+ :align: center
+
+Similar to before, without core pinning the thread (TID:94290) was executing on a large number of CPU cores, indicating CPU migration. We again notice cross-socket thread migration, resulting in very inefficient memory access. For example, this thread executed on cpu_78 (NUMA node 0), then migrated to cpu_108 (NUMA node 1).
+
+4. Non Uniform Memory Access Analysis
+
+.. figure:: /_static/img/torchserve-ipex-images/25.png
+ :width: 100%
+ :align: center
+
+Although this is an improvement over the original 51.09%, 40.45% of memory accesses are still remote, indicating a sub-optimal NUMA configuration.
+
+3. launcher core pinning
+~~~~~~~~~~~~~~~~~~~~~~~~
+The launcher internally distributes the physical cores equally among the workers and binds each worker to its share. As a reminder, the launcher by default uses physical cores only. In this example, the launcher will bind worker 0 to cores 0-13 (NUMA node 0), worker 1 to cores 14-27 (NUMA node 0), worker 2 to cores 28-41 (NUMA node 1), and worker 3 to cores 42-55 (NUMA node 1). Doing so ensures that cores are not overlapped among workers and avoids logical core usage.
+
+1. CPU usage
+
+.. figure:: /_static/img/torchserve-ipex-images/26.gif
+ :width: 100%
+ :align: center
+
+4 main worker threads were launched, then each launched ``num_physical_cores/num_workers`` (14) threads affinitized to the assigned physical cores.
+
+2. Core Bound stalls
+
+.. figure:: /_static/img/torchserve-ipex-images/27.png
+ :width: 80%
+ :align: center
+
+Core Bound stalls have decreased significantly, from the original 88.4% to 46.2% - almost a 2x improvement.
+
+.. figure:: /_static/img/torchserve-ipex-images/28.png
+ :width: 40%
+ :align: center
+
+.. figure:: /_static/img/torchserve-ipex-images/29.png
+ :width: 50%
+ :align: center
+
+We verify that with core binding, most CPU time is effectively used on compute - the Spin Time is only 0.256 s.
+
+3. Thread Migration
+
+.. figure:: /_static/img/torchserve-ipex-images/30.png
+ :width: 100%
+ :align: center
+
+We verify that `OMP Primary Thread #0` was bound to the assigned physical cores (42-55) and did not migrate across sockets.
+
+4. Non Uniform Memory Access Analysis
+
+.. figure:: /_static/img/torchserve-ipex-images/31.png
+ :width: 100%
+ :align: center
+
+Now almost all memory accesses (89.52%) are local accesses.
+
+Conclusion
+~~~~~~~~~~
+
+In this blog, we've shown that properly setting your CPU runtime configuration can significantly boost out-of-the-box CPU performance.
+
+We have walked through some general CPU performance tuning principles and recommendations:
+
+- In a hyperthreading enabled system, avoid logical cores by setting thread affinity to physical cores only via core pinning.
+- In a multi-socket system with NUMA, avoid cross-socket remote memory access by setting thread affinity to a specific socket via core pinning.
+
+We have visually explained these ideas from first principles and have verified the performance boost with profiling. Finally, we have applied all of our learnings to TorchServe to boost out-of-the-box TorchServe CPU performance.
+
+These principles can be automatically configured via an easy-to-use launch script which has already been integrated into TorchServe.
+
+For interested readers, please check out the following documents:
+
+- `CPU specific optimizations `_
+- `Maximize Performance of Intel® Software Optimization for PyTorch* on CPU `_
+- `Performance Tuning Guide `_
+- `Launch Script Usage Guide `_
+- `Top-down Microarchitecture Analysis Method `_
+- `Configuring oneDNN for Benchmarking `_
+- `Intel® VTune™ Profiler `_
+- `Intel® VTune™ Profiler User Guide `_
+
+And stay tuned for follow-up posts on optimized kernels on CPU via `Intel® Extension for PyTorch* `_ and advanced launcher configurations such as the memory allocator.
+
+Acknowledgement
+~~~~~~~~~~~~~~~
+
+We would like to thank Ashok Emani (Intel) and Jiong Gong (Intel) for their immense guidance and support, and thorough feedback and reviews throughout many steps of this blog. We would also like to thank Hamid Shojanazeri (Meta), Li Ning (AWS) and Jing Xu (Intel) for helpful feedback in code review. And Suraj Subramanian (Meta) and Geeta Chauhan (Meta) for helpful feedback on the blog.
diff --git a/prototype_source/fx_numeric_suite_tutorial.py b/prototype_source/fx_numeric_suite_tutorial.py
new file mode 100644
index 000000000..ac43ae49e
--- /dev/null
+++ b/prototype_source/fx_numeric_suite_tutorial.py
@@ -0,0 +1,231 @@
+# -*- coding: utf-8 -*-
+"""
+PyTorch FX Numeric Suite Core APIs Tutorial
+===========================================
+
+Introduction
+------------
+
+Quantization is good when it works, but it is difficult to know what is wrong
+when it does not reach the accuracy we expect. Debugging quantization accuracy
+issues is hard and time-consuming.
+
+One important step of debugging is to measure the statistics of the float model
+and its corresponding quantized model to know where they differ most.
+We built a suite of numeric tools called PyTorch FX Numeric Suite Core APIs in
+PyTorch quantization to enable the measurement of statistics between the
+quantized module and the float module and support quantization debugging efforts.
+Even for a quantized model with good accuracy, PyTorch FX Numeric Suite Core
+APIs can still be used as a profiling tool to better understand the
+quantization error within the model and provide guidance for further
+optimization.
+
+PyTorch FX Numeric Suite Core APIs currently support models quantized through
+both static quantization and dynamic quantization, with unified APIs.
+
+In this tutorial we will use MobileNetV2 as an example to show how to use
+PyTorch FX Numeric Suite Core APIs to measure the statistics between a
+statically quantized model and its float counterpart.
+
+Setup
+^^^^^
+We’ll start by doing the necessary imports:
+"""
+
+##############################################################################
+
+# Imports and util functions
+
+import copy
+import torch
+import torchvision
+import torch.quantization
+import torch.ao.ns._numeric_suite_fx as ns
+import torch.quantization.quantize_fx as quantize_fx
+
+import matplotlib.pyplot as plt
+from tabulate import tabulate
+
+torch.manual_seed(0)
+plt.style.use('seaborn-whitegrid')
+
+
+# a simple line graph
+def plot(xdata, ydata, xlabel, ylabel, title):
+ _ = plt.figure(figsize=(10, 5), dpi=100)
+ plt.xlabel(xlabel)
+ plt.ylabel(ylabel)
+ plt.title(title)
+ ax = plt.axes()
+ ax.plot(xdata, ydata)
+ plt.show()
+
+##############################################################################
+# Then we load the pretrained float MobileNetV2 model, and quantize it.
+
+
+# create float model
+mobilenetv2_float = torchvision.models.quantization.mobilenet_v2(
+ pretrained=True, quantize=False).eval()
+
+# create quantized model
+qconfig_dict = {
+ '': torch.quantization.get_default_qconfig('fbgemm'),
+ # adjust the qconfig to make the results more interesting to explore
+ 'module_name': [
+ # turn off quantization for the first couple of layers
+ ('features.0', None),
+ ('features.1', None),
+ # use MinMaxObserver for `features.17`, this should lead to worse
+ # weight SQNR
+ ('features.17', torch.quantization.default_qconfig),
+ ]
+}
+# Note: quantization APIs are inplace, so we save a copy of the float model for
+# later comparison to the quantized model. This is done throughout the
+# tutorial.
+mobilenetv2_prepared = quantize_fx.prepare_fx(
+ copy.deepcopy(mobilenetv2_float), qconfig_dict)
+datum = torch.randn(1, 3, 224, 224)
+mobilenetv2_prepared(datum)
+# Note: there is a long standing issue that we cannot copy.deepcopy a
+# quantized model. Since quantization APIs are inplace and we need to use
+# different copies of the quantized model throughout this tutorial, we call
+# `convert_fx` on a copy, so we have access to the original `prepared_model`
+# later. This is done throughout the tutorial.
+mobilenetv2_quantized = quantize_fx.convert_fx(
+ copy.deepcopy(mobilenetv2_prepared))
+
+##############################################################################
+# 1. Compare the weights of float and quantized models
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+# The first analysis we can do is comparing the weights of the fp32 model and
+# the int8 model by calculating the SQNR between each pair of weights.
+#
+# The `extract_weights` API can be used to extract weights from linear,
+# convolution and LSTM layers. It works for dynamic quantization as well as
+# PTQ/QAT.
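+#
+# As a rough intuition (a sketch, not part of the Numeric Suite API itself),
+# SQNR is commonly defined as ``20 * log10(||x|| / ||x - x_q||)`` in decibels,
+# so higher values mean the quantized tensor is closer to the original:
+#
+# .. code:: python
+#
+#     import torch
+#
+#     x = torch.randn(16)
+#     x_q = torch.dequantize(torch.quantize_per_tensor(x, 0.1, 0, torch.qint8))
+#     sqnr_db = 20 * torch.log10(torch.norm(x) / torch.norm(x - x_q))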
+
+# Note: when comparing weights in models with Conv-BN for PTQ, we need to
+# compare weights after Conv-BN fusion for a proper comparison. Because of
+# this, we use `prepared_model` instead of `float_model` when comparing
+# weights.
+
+# Extract conv and linear weights from corresponding parts of two models, and
+# save them in `wt_compare_dict`.
+mobilenetv2_wt_compare_dict = ns.extract_weights(
+ 'fp32', # string name for model A
+ mobilenetv2_prepared, # model A
+ 'int8', # string name for model B
+ mobilenetv2_quantized, # model B
+)
+
+# calculate SQNR between each pair of weights
+ns.extend_logger_results_with_comparison(
+ mobilenetv2_wt_compare_dict, # results object to modify inplace
+ 'fp32', # string name of model A (from previous step)
+ 'int8', # string name of model B (from previous step)
+ torch.ao.ns.fx.utils.compute_sqnr, # tensor comparison function
+ 'sqnr', # the name to use to store the results under
+)
+
+# massage the data into a format easy to graph and print
+mobilenetv2_wt_to_print = []
+for idx, (layer_name, v) in enumerate(mobilenetv2_wt_compare_dict.items()):
+ mobilenetv2_wt_to_print.append([
+ idx,
+ layer_name,
+ v['weight']['int8'][0]['prev_node_target_type'],
+ v['weight']['int8'][0]['values'][0].shape,
+ v['weight']['int8'][0]['sqnr'][0],
+ ])
+
+# plot the SQNR between fp32 and int8 weights for each layer
+plot(
+ [x[0] for x in mobilenetv2_wt_to_print],
+ [x[4] for x in mobilenetv2_wt_to_print],
+ 'idx',
+ 'sqnr',
+ 'weights, idx to sqnr'
+)
+
+##############################################################################
+# Also print out the SQNR, so we can inspect the layer name and type:
+
+print(tabulate(
+ mobilenetv2_wt_to_print,
+ headers=['idx', 'layer_name', 'type', 'shape', 'sqnr']
+))
+
+##############################################################################
+# 2. Compare activations API
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^
+# The second tool allows for comparison of activations between float and
+# quantized models at corresponding locations for the same input.
+#
+# .. figure:: /_static/img/compare_output.png
+#
+# The `add_loggers`/`extract_logger_info` API can be used to extract
+# activations from any layer with a `torch.Tensor` return type. It works for
+# dynamic quantization as well as PTQ/QAT.
+
+# Compare unshadowed activations
+
+# Create a new copy of the quantized model, because we cannot `copy.deepcopy`
+# a quantized model.
+mobilenetv2_quantized = quantize_fx.convert_fx(
+ copy.deepcopy(mobilenetv2_prepared))
+mobilenetv2_float_ns, mobilenetv2_quantized_ns = ns.add_loggers(
+ 'fp32', # string name for model A
+ copy.deepcopy(mobilenetv2_prepared), # model A
+ 'int8', # string name for model B
+ mobilenetv2_quantized, # model B
+ ns.OutputLogger, # logger class to use
+)
+
+# feed data through network to capture intermediate activations
+mobilenetv2_float_ns(datum)
+mobilenetv2_quantized_ns(datum)
+
+# extract intermediate activations
+mobilenetv2_act_compare_dict = ns.extract_logger_info(
+ mobilenetv2_float_ns, # model A, with loggers (from previous step)
+ mobilenetv2_quantized_ns, # model B, with loggers (from previous step)
+ ns.OutputLogger, # logger class to extract data from
+ 'int8', # string name of model to use for layer names for the output
+)
+
+# add SQNR comparison
+ns.extend_logger_results_with_comparison(
+ mobilenetv2_act_compare_dict, # results object to modify inplace
+ 'fp32', # string name of model A (from previous step)
+ 'int8', # string name of model B (from previous step)
+ torch.ao.ns.fx.utils.compute_sqnr, # tensor comparison function
+ 'sqnr', # the name to use to store the results under
+)
+
+# massage the data into a format easy to graph and print
+mobilenet_v2_act_to_print = []
+for idx, (layer_name, v) in enumerate(mobilenetv2_act_compare_dict.items()):
+ mobilenet_v2_act_to_print.append([
+ idx,
+ layer_name,
+ v['node_output']['int8'][0]['prev_node_target_type'],
+ v['node_output']['int8'][0]['values'][0].shape,
+ v['node_output']['int8'][0]['sqnr'][0]])
+
+# plot the SQNR between fp32 and int8 activations for each layer
+plot(
+ [x[0] for x in mobilenet_v2_act_to_print],
+ [x[4] for x in mobilenet_v2_act_to_print],
+ 'idx',
+ 'sqnr',
+ 'unshadowed activations, idx to sqnr',
+)
+
+##############################################################################
+# Also print out the SQNR, so we can inspect the layer name and type:
+print(tabulate(
+ mobilenet_v2_act_to_print,
+ headers=['idx', 'layer_name', 'type', 'shape', 'sqnr']
+))
diff --git a/recipes_source/recipes/loading_data_recipe.py b/recipes_source/recipes/loading_data_recipe.py
index 0442e85f1..f58bbd899 100644
--- a/recipes_source/recipes/loading_data_recipe.py
+++ b/recipes_source/recipes/loading_data_recipe.py
@@ -68,16 +68,14 @@
#
# The YesNo dataset in ``torchaudio`` consists of 60 audio clips of one person
# saying yes or no in Hebrew. Each audio clip is 8 words long.
-# (`Learn more `__).
+# ( `Learn more `__ ).
#
# Create the YesNo dataset using the ``torchaudio.datasets.YESNO`` class.
torchaudio.datasets.YESNO(
- root,
+ root='./',
url='http://www.openslr.org/resources/1/waves_yesno.tar.gz',
folder_in_archive='waves_yesno',
- download=False,
- transform=None,
- target_transform=None)
+ download=True)
###########################################################################
# Each data item is a tuple of the form (waveform, sample_rate, labels).
@@ -87,9 +85,7 @@
# The other parameters are optional, and you can see their default values in the
# example above. The following parameters are also available.
#
-# * ``download``: If True, downloads the dataset files from the internet and saves them in the root folder. Files that already exist are not downloaded again.
-# * ``transform``: Lets you transform the data so that it can be concatenated and loaded in a denormalized form for training. Each library supports a variety of transforms, and more will be added over time.
-# * ``target_transform``: A function or transform for transforming the target data.
+# * ``download``: If True, downloads the dataset files from the internet and saves them in the root folder. Files that already exist are not downloaded again.
#
# Now let's take a look at the YesNo data:
diff --git a/recipes_source/recipes/tuning_guide.py b/recipes_source/recipes/tuning_guide.py
index 9094da055..86d0d4cf4 100644
--- a/recipes_source/recipes/tuning_guide.py
+++ b/recipes_source/recipes/tuning_guide.py
@@ -137,9 +137,9 @@ def fused_gelu(x):
# Support for ``channels_last`` is experimental, but it's expected to work for
# standard computer vision models (e.g. ResNet-50, SSD). To convert models to
# ``channels_last`` format follow
-# `Channels Last Memory Format Tutorial `_.
+# `Channels Last Memory Format Tutorial `_.
# The tutorial includes a section on
-# `converting existing models `_.
+# `converting existing models `_.
###############################################################################
# Checkpoint intermediate buffers
@@ -236,6 +236,43 @@ def fused_gelu(x):
# export LD_PRELOAD=:$LD_PRELOAD
+###############################################################################
+# Use oneDNN Graph with TorchScript for inference
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# oneDNN Graph can significantly boost inference performance. It fuses compute-intensive operations such as convolution and matmul with their neighboring operations.
+# Currently, it is supported as an experimental feature for the Float32 data type.
+# oneDNN Graph receives the model’s graph and identifies candidates for operator fusion with respect to the shape of the example input.
+# A model should be JIT-traced using an example input.
+# Speed-up would then be observed after a couple of warm-up iterations for inputs with the same shape as the example input.
+# The example code snippets below are for resnet50, but they can easily be extended to use oneDNN Graph with custom models as well.
+
+# Only this extra line of code is required to use oneDNN Graph
+torch.jit.enable_onednn_fusion(True)
+
+###############################################################################
+# Using the oneDNN Graph API requires just one extra line of code.
+# If you are using oneDNN Graph, please avoid calling ``torch.jit.optimize_for_inference``.
+
+# sample input should be of the same shape as expected inputs
+sample_input = [torch.rand(32, 3, 224, 224)]
+# Using resnet50 from TorchVision in this example for illustrative purposes,
+# but the line below can indeed be modified to use custom models as well.
+model = getattr(torchvision.models, "resnet50")().eval()
+# Tracing the model with example input
+traced_model = torch.jit.trace(model, sample_input)
+# Invoking torch.jit.freeze
+traced_model = torch.jit.freeze(traced_model)
+
+###############################################################################
+# Once a model is JIT-traced with a sample input, it can then be used for inference after a couple of warm-up runs.
+
+with torch.no_grad():
+ # a couple of warmup runs
+ traced_model(*sample_input)
+ traced_model(*sample_input)
+ # speedup would be observed after warmup runs
+ traced_model(*sample_input)
+
###############################################################################
# Train a model on CPU with PyTorch DistributedDataParallel(DDP) functionality
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -326,7 +363,7 @@ def fused_gelu(x):
# * native PyTorch AMP is available starting from PyTorch 1.6:
# `documentation `_,
# `examples `_,
-# `tutorial `_
+# `tutorial `_
#
#
diff --git a/requirements-noplot.txt b/requirements-noplot.txt
index 44540fc8f..4b761567c 100644
--- a/requirements-noplot.txt
+++ b/requirements-noplot.txt
@@ -15,7 +15,8 @@ torchvision
torchtext
torchaudio
torchdata
-#functorch
+# Functorch is not needed, as intermediate_source/forward_ad_usage.py is not rendered
+# functorch
# PyTorch Theme
pytorch-sphinx-theme @ https://github.com/PyTorchKorea/pytorch_sphinx_theme/archive/master.zip
diff --git a/requirements.txt b/requirements.txt
index 8c3e0b482..5a26253ab 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -19,7 +19,8 @@ torchvision
torchtext
torchaudio
torchdata
-functorch
+# Functorch is not needed, as intermediate_source/forward_ad_usage.py is not rendered
+# functorch
PyHamcrest
bs4
awscliv2==2.1.1
@@ -42,6 +43,6 @@ scikit-image
scipy
pillow
wget
-gym
+gym==0.24.0
gym-super-mario-bros==7.3.0
timm