Releases: huggingface/accelerate

v0.31.0: Better support for sharded state dict with FSDP and Bugfixes

07 Jun 15:27

Core

  • Set timeout default to PyTorch defaults based on backend by @muellerzr in #2758
  • Fix duplicate elements in split_between_processes by @hkunzhe in #2781
  • Add Elastic Launch Support to notebook_launcher by @yhna940 in #2788
  • Fix wrong use of sync_gradients when implementing sync_each_batch by @fabianlim in #2790
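
The property at stake in the split_between_processes fix is that every process receives a disjoint slice of the inputs. The sketch below illustrates one way to get such a split in plain Python. `split_evenly` is a hypothetical helper written for this note, not Accelerate's implementation, and it assumes that leftover items go to the lowest-ranked processes.

```python
def split_evenly(items, num_processes, process_index):
    """Return the contiguous slice of `items` owned by `process_index`.

    Hypothetical helper: the first `len(items) % num_processes` ranks
    each take one extra item, so no element is duplicated or dropped.
    """
    base, extra = divmod(len(items), num_processes)
    start = process_index * base + min(process_index, extra)
    end = start + base + (1 if process_index < extra else 0)
    return items[start:end]


items = list(range(10))
shards = [split_evenly(items, 4, rank) for rank in range(4)]
# Concatenating the shards reproduces the input exactly once
flat = [x for shard in shards for x in shard]
```

Checking that the concatenated shards equal the original list is a cheap way to catch both duplicated and dropped elements.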

FSDP

Megatron

  • Upgrade Hugging Face's Megatron to NVIDIA's Megatron when using MegatronLMPlugin by @zhangsheng377 in #2501

Full Changelog: v0.30.1...v0.31.0

v0.30.1: Bugfixes

10 May 17:47

Patchfix

  • Fix duplicate environment variable check in multi-cpu condition thanks to @yhna940 in #2752
  • Fix issue where missing values in the SageMaker config prevented launching, in #2753
  • Fix CPU OMP num threads setting thanks to @jiqing-feng in #2755
  • Fix FSDP checkpoints being unable to resume when using offloading and sharded weights, caused by a CUDA OOM when loading the optimizer and model, in #2762
  • Fix an incorrect conditional check when configuring enable_cpu_affinity thanks to @statelesshz in #2748
  • Fix stacklevel in logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions thanks to @luowyang in #2730
  • Fix support for multiple optimizers when using LOMO thanks to @younesbelkada in #2745
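
The stacklevel fix concerns which frame a log record is attributed to when logging goes through a wrapper. The sketch below shows the effect using only the standard library. `log_wrapper` and `ListHandler` are hypothetical stand-ins for illustration and are not Accelerate's logger code.

```python
import logging


class ListHandler(logging.Handler):
    """Collect the function name recorded for each log call."""

    def __init__(self):
        super().__init__()
        self.func_names = []

    def emit(self, record):
        self.func_names.append(record.funcName)


logger = logging.getLogger("stacklevel_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def log_wrapper(msg, stacklevel=1):
    # stacklevel=1 attributes the record to this wrapper;
    # stacklevel=2 attributes it to whoever called the wrapper.
    logger.info(msg, stacklevel=stacklevel)


def user_code():
    log_wrapper("attributed to the wrapper", stacklevel=1)
    log_wrapper("attributed to the caller", stacklevel=2)


user_code()
```

After `user_code()` runs, `handler.func_names` holds `log_wrapper` for the first call and `user_code` for the second, which is the difference between logging the wrapper's call site and the user's.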

Full Changelog: v0.30.0...v0.30.1

v0.30.0: Advanced optimizer support, MoE DeepSpeed support, add upcasting for FSDP, and more

03 May 15:29

Core

  • We've simplified the tqdm wrapper to make it fully passthrough. There is no longer any need to write tqdm(main_process_only, *args); it is now just tqdm(*args), and you can pass in is_main_process as a kwarg.
  • We've added support for advanced optimizer usage:
  • Enable BF16 autocast to everything during FP8 and enable FSDP by @muellerzr in #2655
  • Support dataloader send_to_device calls to use non-blocking by @drhead in #2685
  • Allow gather_for_metrics to be more flexible by @SunMarc in #2710
  • Add CANN version info to the accelerate env command for NPU by @statelesshz in #2689
  • Add MLU rng state setter by @ArthurinRUC in #2664
  • Device-agnostic testing for hooks, utils, and big_modeling by @statelesshz in #2602

Documentation

  • Through collaboration between @fabianlim (lead contributor), @stas00, @pacman100, and @muellerzr, we have a new concept guide for FSDP and DeepSpeed that details how the two interoperate and explains clearly how each of them works. This was a monumental effort by @fabianlim to ensure that everything is as accurate as possible for users. I highly recommend visiting this new documentation, available here
  • New distributed inference examples have been added thanks to @SunMarc in #2672
  • Fixed some docs for using internal trackers by @brentyi in #2650

DeepSpeed

  • Accelerate can now handle MoE models when using deepspeed, thanks to @pacman100 in #2662
  • Allow "auto" for gradient clipping in YAML by @regisss in #2649
  • Introduce a DeepSpeed-specific Docker image by @muellerzr in #2707. To use it, pull the gpu-deepspeed tag: docker pull huggingface/accelerate:cuda-deepspeed-nightly

Megatron

Big Modeling

  • Add strict arg to load_checkpoint_and_dispatch by @SunMarc in #2641
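
A strict load generally means the parameter names in the checkpoint must match the model's exactly. The sketch below shows that check in plain Python, assuming this is the intended meaning of the new argument. `check_checkpoint_keys` is a hypothetical helper for illustration, not the code path used by load_checkpoint_and_dispatch.

```python
def check_checkpoint_keys(model_keys, checkpoint_keys, strict=False):
    """Hypothetical helper: compare parameter names and raise on any
    mismatch when `strict` is True; otherwise just report them."""
    missing = sorted(set(model_keys) - set(checkpoint_keys))
    unexpected = sorted(set(checkpoint_keys) - set(model_keys))
    if strict and (missing or unexpected):
        raise RuntimeError(
            f"missing keys: {missing}, unexpected keys: {unexpected}"
        )
    return missing, unexpected


model_keys = ["linear.weight", "linear.bias"]
checkpoint_keys = ["linear.weight", "extra.bias"]

# Non-strict: mismatches are reported but loading could continue
missing, unexpected = check_checkpoint_keys(model_keys, checkpoint_keys)

# Strict: the same mismatch is treated as an error
try:
    check_checkpoint_keys(model_keys, checkpoint_keys, strict=True)
    strict_failed = False
except RuntimeError:
    strict_failed = True
```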

Bug Fixes

  • Fix up state with xla + performance regression by @muellerzr in #2634
  • Parenthesis on xpu_available by @muellerzr in #2639
  • Fix is_train_batch_min type in DeepSpeedPlugin by @yhna940 in #2646
  • Fix backend check by @jiqing-feng in #2652
  • Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
  • Block AMP for MPS device by @SunMarc in #2699
  • Fixed issue when doing multi-gpu training with bnb when the first gpu is not used by @SunMarc in #2714
  • Fixup free_memory to deal with garbage collection by @muellerzr in #2716
  • Fix sampler serialization failing by @SunMarc in #2723
  • Fix deepspeed offload device type in the arguments to be more accurate by @yhna940 in #2717
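
The sampler fix in #2694 rests on a simple requirement: when a shuffled dataset is sharded across GPUs, every process has to shuffle with the same random state, otherwise the shards no longer partition the data. The sketch below demonstrates this with the standard library only. `shard_indices` is a hypothetical helper and the strided sharding scheme is an assumption made for illustration.

```python
import random


def shard_indices(num_samples, num_processes, process_index, seed):
    """Hypothetical helper: shuffle indices with `seed`, then take every
    `num_processes`-th index starting at `process_index`."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return indices[process_index::num_processes]


# Same seed on every rank: shards are disjoint and cover the dataset
synced = [shard_indices(8, 2, rank, seed=42) for rank in range(2)]
synced_covered = sorted(synced[0] + synced[1])

# Different seed per rank: shards come from different permutations,
# so a sample may appear on both ranks or on neither
unsynced = [shard_indices(8, 2, rank, seed=rank) for rank in range(2)]
```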

Full Changelog

Full Changelog: https://github.com/huggingface/acce...

v0.29.3: Patchfix

17 Apr 15:46
  • Fixes issue with backend refactor not working on CPU-based distributed environments by @jiqing-feng: #2670
  • Fixes issue where load_checkpoint_and_dispatch needs a strict argument by @SunMarc: #2641

Full Changelog: v0.29.2...v0.29.3

v0.29.2: Patchfix

09 Apr 12:04
  • Fixes xpu missing parenthesis #2639
  • Fixes XLA and performance degradation on init with the state #2634

v0.29.1: Patchfix

05 Apr 17:09

Fixed an import that caused the accelerate CLI to fail when pytest wasn't installed

v0.29.0: NUMA affinity control, MLU Support, and DeepSpeed Improvements

05 Apr 14:27

Core

  • Accelerate can now optimize NUMA affinity, which can help increase throughput on NVIDIA multi-GPU systems. To enable it, either follow the prompt during accelerate config, set the ACCELERATE_CPU_AFFINITY=1 environment variable, or call set_numa_affinity manually:
from accelerate.utils import set_numa_affinity

# For GPU 0
set_numa_affinity(0)

Big thanks to @stas00 for the recommendation, request, and feedback during development

  • Allow for setting deterministic algorithms in set_seed by @muellerzr in #2569
  • Fixed the test script for TPU v2/v3 by @vanbasten23 in #2542
  • Cambricon MLU device support introduced by @huismiling in #2552
  • A big refactor was performed to the PartialState and AcceleratorState to allow for easier future-proofing and simplification of adding new devices by @muellerzr in #2576
  • Fixed a reproducibility issue in distributed environments with Dataloader shuffling when using BatchSamplerShard by @universuen in #2584
  • notebook_launcher can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in #2561

Big Model Inference

  • Add log message for RTX 4000 series when performing multi-gpu inference with device_map which can lead to hanging by @SunMarc in #2557
  • Fix load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in #2588

DeepSpeed

  • Fix issue with the mapping of main_process_ip and master_addr when not using standard as the DeepSpeed launcher by @asdfry in #2495
  • Improve DeepSpeed env file generation by checking for bad keys, by @muellerzr and @ricklamers in #2565
  • We now support custom DeepSpeed env files. As with plain DeepSpeed, set the path with the DS_ENV_FILE environment variable by @muellerzr in #2566
  • Resolve ZeRO-3 Initialization Failure in already-started distributed environments by @sword865 in #2578
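
As a rough sketch of the custom env file workflow, the snippet below writes a file of KEY=VALUE lines (the format DeepSpeed uses for its env file) and points DS_ENV_FILE at it. The variable names, values, and file name are hypothetical placeholders, and the launch step itself is not shown.

```python
import os
import tempfile

# Hypothetical variables to forward to every node
env_vars = {"NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": "eth0"}

# Write one KEY=VALUE pair per line
env_dir = tempfile.mkdtemp()
env_path = os.path.join(env_dir, "my_deepspeed_env")
with open(env_path, "w") as f:
    for key, value in env_vars.items():
        f.write(f"{key}={value}\n")

# Point the launcher at the custom file instead of the default location
os.environ["DS_ENV_FILE"] = env_path
```

Setting the variable in the shell before launching has the same effect as setting it from Python here.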

Full Changelog: v0.28.0...v0.29.0

v0.28.0: DataLoaderConfig, XLA improvements, FSDP + QLORA foundations, Gradient Synchronization Tweaks, and Bug Fixes

12 Mar 16:58

Core

  • Introduce a DataLoaderConfiguration and begin deprecation of arguments in the Accelerator
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)
  • Allow gradients to be synced each data batch while performing gradient accumulation, useful when training in FSDP by @fabianlim in #2531
from accelerate import Accelerator, GradientAccumulationPlugin

# Accumulate over 2 steps while still syncing gradients on every batch
plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)

Torch XLA

  • Support for XLA on the GPU by @anw90 in #2176
  • Enable gradient accumulation on TPU in #2453

FSDP

  • Enable downstream FSDP + QLORA support by allowing configuration of buffer precision by @pacman100 in #2544

launch changes

Full Changelog: v0.27.2...v0.28.0

v0.27.0: PyTorch 2.2.0 Support, PyTorch-Native Pipeline Parallelism, DeepSpeed XPU support, and Bug Fixes

09 Feb 16:30

PyTorch 2.2.0 Support

We've ensured that the newly released PyTorch 2.2.0 introduces no breaking changes for Accelerate.

PyTorch-Native Pipeline Parallel Inference

With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so no need to use Megatron or DeepSpeed)! This supports automatic model-weight splitting to each device using a similar API to device_map="auto". This is still under heavy development; however, the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.

Requires pippy version 0.2.0 or later (pip install torchpippy -U)

Example usage (combined with accelerate launch or torchrun):

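import torch
from accelerate import PartialState, prepare_pippy
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
# `input` is assumed to be an existing batch of token IDs on the CPU,
# e.g. a torch.int64 tensor of shape (batch_size, sequence_length)
model = prepare_pippy(model, split_points="auto", example_args=(input,))
# Move the batch to the first GPU before running the pipeline
input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)
# The outputs are only on the final process by default
# You can pass in `gather_outputs=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)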

DeepSpeed

This release provides support for utilizing DeepSpeed on XPU devices thanks to @faaany

Full Changelog: v0.26.1...v0.27.0

v0.26.1: Patch Release

11 Jan 15:26

What's Changed

  • Raise error when using batches of different sizes with dispatch_batches=True by @SunMarc in #2325

Full Changelog: v0.26.0...v0.26.1