Releases: huggingface/accelerate
v0.31.0: Better support for sharded state dict with FSDP and Bugfixes
Core
- Set `timeout` default to PyTorch defaults based on backend by @muellerzr in #2758
- fix duplicate elements in split_between_processes by @hkunzhe in #2781
- Add Elastic Launch Support to `notebook_launcher` by @yhna940 in #2788
- Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
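To illustrate what the `split_between_processes` fix listed above guards against, here is a minimal sketch of splitting a list across processes so that no element is handed out twice. It is plain Python and not Accelerate's implementation; the function name `split_for_process` and its parameters are hypothetical.

```python
def split_for_process(items, num_processes, process_index):
    """Return the contiguous slice of `items` owned by `process_index`.

    The first `extras` processes each take one additional element, so the
    slices cover every element exactly once with no overlap.
    """
    per_process, extras = divmod(len(items), num_processes)
    start = process_index * per_process + min(process_index, extras)
    end = start + per_process + (1 if process_index < extras else 0)
    return items[start:end]


# Five items split over two processes.
shards = [split_for_process(list(range(5)), 2, rank) for rank in range(2)]
# shards == [[0, 1, 2], [3, 4]]
```

When there are more processes than items, the trailing processes simply receive an empty slice in this sketch.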
FSDP
- Introduce shard-merging util for FSDP by @muellerzr in #2772
- Enable sharded state dict + offload to cpu resume by @muellerzr in #2762
- Enable config for fsdp activation checkpointing by @helloworld1 in #2779
Megatron
- Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501
What's Changed
- Add feature to allow redirecting std streams into log files when using torchrun as the launcher. by @lyuwen in #2740
- Update modeling.py by adding try-catch section to skip the unavailable devices by @MeVeryHandsome in #2681
- Fixed the problem of incorrect conditional judgment statement when configuring enable_cpu_affinity by @statelesshz in #2748
- Fix stacklevel in `logging` to log the actual user call site (instead of the call site inside the logger wrapper) of log functions by @luowyang in #2730
- LOMO / FIX: Support multiple optimizers by @younesbelkada in #2745
- Fix max_memory assignment by @SunMarc in #2751
- Fix duplicate environment variable check in multi-cpu condition by @yhna940 in #2752
- Simplify CLI args validation and ensure CLI args take precedence over config file. by @Iain-S in #2757
- Fix sagemaker config by @muellerzr in #2753
- fix cpu omp num threads set by @jiqing-feng in #2755
- Revert "Simplify CLI args validation and ensure CLI args take precedence over config file." by @muellerzr in #2763
- Enable sharded cpu resume by @muellerzr in #2762
- Sets default to PyTorch defaults based on backend by @muellerzr in #2758
- optimize get_module_leaves speed by @BBuf in #2756
- fix minor typo by @TemryL in #2767
- Fix small edge case in get_module_leaves by @SunMarc in #2774
- Skip deepspeed test by @SunMarc in #2776
- Enable config for fsdp activation checkpointing by @helloworld1 in #2779
- Add arg from CLI to fix failing test by @muellerzr in #2783
- Skip tied weights disk offload test by @SunMarc in #2782
- Introduce shard-merging util for FSDP by @muellerzr in #2772
- FIX / FSDP : Guard fsdp utils for earlier PyTorch versions by @younesbelkada in #2794
- Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501
- Fixup CLI test by @muellerzr in #2796
- fix duplicate elements in split_between_processes by @hkunzhe in #2781
- Add Elastic Launch Support to `notebook_launcher` by @yhna940 in #2788
- Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
- Fix type in accelerator.py by @qgallouedec in #2800
- fix comet ml test by @SunMarc in #2804
- New template by @muellerzr in #2808
- Fix access error for torch.mps when using torch==1.13.1 on macOS by @SunMarc in #2806
- 4-bit quantization meta device bias loading bug by @SunMarc in #2805
- State dictionary retrieval from offloaded modules by @blbadger in #2619
- add cuda dep for a test by @SunMarc in #2820
- Remove out-dated xpu device check code in `get_balanced_memory` by @faaany in #2826
- Fix DeepSpeed config validation error by changing `stage3_prefetch_bucket_size` value to an integer by @adk9 in #2814
- Improve test speeds by up to 30% in multi-gpu settings by @muellerzr in #2830
- monitor-interval, take 2 by @muellerzr in #2833
- Optimize the megatron plugin by @zhangsheng377 in #2822
- fix fstr format by @Jintao-Huang in #2810
New Contributors
- @lyuwen made their first contribution in #2740
- @MeVeryHandsome made their first contribution in #2681
- @luowyang made their first contribution in #2730
- @Iain-S made their first contribution in #2757
- @BBuf made their first contribution in #2756
- @TemryL made their first contribution in #2767
- @helloworld1 made their first contribution in #2779
- @hkunzhe made their first contribution in #2781
- @adk9 made their first contribution in #2814
- @Jintao-Huang made their first contribution in #2810
Full Changelog: v0.30.1...v0.31.0
v0.30.1: Bugfixes
Patchfix
- Fix duplicate environment variable check in multi-cpu condition thanks to @yhna940 in #2752
- Fix issue with missing values in the SageMaker config leading to not being able to launch in #2753
- Fix CPU OMP num threads setting thanks to @jiqing-feng in #2755
- Fix FSDP checkpoint unable to resume when using offloading and sharded weights due to CUDA OOM when loading the optimizer and model #2762
- Fixed the problem of incorrect conditional judgment statement when configuring enable_cpu_affinity thanks to @statelesshz in #2748
- Fix stacklevel in logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions thanks to @luowyang in #2730
- Fix support for multiple optimizers when using LOMO thanks to @younesbelkada in #2745
Full Changelog: v0.30.0...v0.30.1
v0.30.0: Advanced optimizer support, MoE DeepSpeed support, add upcasting for FSDP, and more
Core
- We've simplified the `tqdm` wrapper to make it fully passthrough, no need to have `tqdm(main_process_only, *args)`, it is now just `tqdm(*args)` and you can pass in `is_main_process` as a kwarg.
- We've added support for advanced optimizer usage:
- Schedule free optimizer introduced by Meta by @muellerzr in #2631
- LOMO optimizer introduced by OpenLMLab by @younesbelkada in #2695
- Enable BF16 autocast to everything during FP8 and enable FSDP by @muellerzr in #2655
- Support dataloader send_to_device calls to use non-blocking by @drhead in #2685
- allow gather_for_metrics to be more flexible by @SunMarc in #2710
- Add `cann` version info to command accelerate env for NPU by @statelesshz in #2689
- Add MLU rng state setter by @ArthurinRUC in #2664
- device agnostic testing for hooks&utils&big_modeling by @statelesshz in #2602
Documentation
- Through collaboration between @fabianlim (lead contributor), @stas00, @pacman100, and @muellerzr we have a new concept guide out for FSDP and DeepSpeed, explicitly detailing how the two interoperate and explaining fully and clearly how each of them works. This was a monumental effort by @fabianlim to ensure that everything is as accurate as possible for users. I highly recommend visiting this new documentation.
- New distributed inference examples have been added thanks to @SunMarc in #2672
- Fixed some docs for using internal trackers by @brentyi in #2650
DeepSpeed
- Accelerate can now handle MoE models when using deepspeed, thanks to @pacman100 in #2662
- Allow "auto" for gradient clipping in YAML by @regisss in #2649
- Introduce a `deepspeed`-specific Docker image by @muellerzr in #2707. To use, pull the `gpu-deepspeed` tag: `docker pull huggingface/accelerate:cuda-deepspeed-nightly`
Megatron
- Megatron plugin can support NPU by @zhangsheng377 in #2667
Big Modeling
Bug Fixes
- Fix up state with xla + performance regression by @muellerzr in #2634
- Parenthesis on xpu_available by @muellerzr in #2639
- Fix `is_train_batch_min` type in DeepSpeedPlugin by @yhna940 in #2646
- Fix backend check by @jiqing-feng in #2652
- Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
- Block AMP for MPS device by @SunMarc in #2699
- Fixed issue when doing multi-gpu training with bnb when the first gpu is not used by @SunMarc in #2714
- Fixup `free_memory` to deal with garbage collection by @muellerzr in #2716
- Fix sampler serialization failing by @SunMarc in #2723
- Fix deepspeed offload device type in the arguments to be more accurate by @yhna940 in #2717
Full Changelog
- Schedule free optimizer support by @muellerzr in #2631
- Fix up state with xla + performance regression by @muellerzr in #2634
- Parenthesis on xpu_available by @muellerzr in #2639
- add third-party device prefix to `execution_device` by @faaany in #2612
- add strict arg to load_checkpoint_and_dispatch by @SunMarc in #2641
- device agnostic testing for hooks&utils&big_modeling by @statelesshz in #2602
- Docs fix for using internal trackers by @brentyi in #2650
- Allow "auto" for gradient clipping in YAML by @regisss in #2649
- Fix `is_train_batch_min` type in DeepSpeedPlugin by @yhna940 in #2646
- Don't use deprecated `Repository` anymore by @Wauplin in #2658
- Fix test_from_pretrained_low_cpu_mem_usage_measured failure by @yuanwu2017 in #2644
- Add MLU rng state setter by @ArthurinRUC in #2664
- fix backend check by @jiqing-feng in #2652
- Megatron plugin can support NPU by @zhangsheng377 in #2667
- Revert "fix backend check" by @muellerzr in #2669
- `tqdm`: `*args` should come ahead of `main_process_only` by @rb-synth in #2654
- Handle MoE models with DeepSpeed by @pacman100 in #2662
- Fix deepspeed moe test with version check by @pacman100 in #2677
- Pin DS...again.. by @muellerzr in #2679
- fix backend check by @jiqing-feng in #2670
- Deprecate tqdm args + slight logic tweaks by @muellerzr in #2673
- Enable BF16 autocast to everything during FP8 + some tweaks to enable FSDP by @muellerzr in #2655
- Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
- Simplify test logic by @pacman100 in #2697
- Add source code for DataLoader Animation by @muellerzr in #2696
- Block AMP for MPS device by @SunMarc in #2699
- Do a pip freeze during workflows by @muellerzr in #2704
- add cann version info to command accelerate env by @statelesshz in #2689
- Add version checks for the import of DeepSpeed moe utils by @pacman100 in #2705
- Change dataloader send_to_device calls to non-blocking by @drhead in #2685
- add distributed examples by @SunMarc in #2672
- Add diffusers to req by @muellerzr in #2711
- fix bnb multi gpu training by @SunMarc in #2714
- allow gather_for_metrics to be more flexible by @SunMarc in #2710
- Add Upcasting for FSDP in Mixed Precision. Add Concept Guide for FSDP and DeepSpeed. by @fabianlim in #2674
- Segment out a deepspeed docker image by @muellerzr in #2707
- Fixup `free_memory` to deal with garbage collection by @muellerzr in #2716
- fix sampler serialization by @SunMarc in #2723
- Fix sampler failing test by @SunMarc in #2728
- Docs: Fix build main documentation by @SunMarc in #2729
- Fix Documentation in FSDP and DeepSpeed Concept Guide by @fabianlim in #2725
- Fix deepspeed offload device type by @yhna940 in #2717
- FEAT: Add LOMO optimizer by @younesbelkada in #2695
- Fix tests on main by @muellerzr in #2739
New Contributors
- @brentyi made their first contribution in #2650
- @regisss made their first contribution in #2649
- @yhna940 made their first contribution in #2646
- @Wauplin made their first contribution in #2658
- @ArthurinRUC made their first contribution in #2664
- @jiqing-feng made their first contribution in #2652
- @zhangsheng377 made their first contribution in #2667
- @rb-synth made their first contribution in #2654
- @drhead made their first contribution in #2685
Full Changelog: https://github.com/huggingface/acce...
v0.29.3: Patchfix
- Fixes issue with backend refactor not working on CPU-based distributed environments by @jiqing-feng: #2670
- Fixes issue where `load_checkpoint_and_dispatch` needs a `strict` argument by @SunMarc: #2641
Full Changelog: v0.29.2...v0.29.3
v0.29.2: Patchfix
v0.29.1: Patchfix
Fixed an import which would cause running accelerate CLI to fail if pytest wasn't installed
v0.29.0: NUMA affinity control, MLU Support, and DeepSpeed Improvements
Core
- Accelerate can now optimize NUMA affinity, which can help increase throughput on NVIDIA multi-GPU systems. To enable it, either follow the prompt during `accelerate config`, set the `ACCELERATE_CPU_AFFINITY=1` env variable, or set it manually using the following:
from accelerate.utils import set_numa_affinity
# For GPU 0
set_numa_affinity(0)
Big thanks to @stas00 for the recommendation, request, and feedback during development
- Allow for setting deterministic algorithms in `set_seed` by @muellerzr in #2569
- Fixed the test script for TPU v2/v3 by @vanbasten23 in #2542
- Cambricon MLU device support introduced by @huismiling in #2552
- A big refactor was performed to the PartialState and AcceleratorState to allow for easier future-proofing and simplification of adding new devices by @muellerzr in #2576
- Fixed a reproducibility issue in distributed environments with Dataloader shuffling when using `BatchSamplerShard` by @universuen in #2584
- `notebook_launcher` can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in #2561
Big Model Inference
- Add log message for RTX 4000 series when performing multi-gpu inference with device_map which can lead to hanging by @SunMarc in #2557
- Fix `load_checkpoint_in_model` behavior when unexpected keys are in the checkpoint by @fxmarty in #2588
DeepSpeed
- Fix issue with the mapping of `main_process_ip` and `master_addr` when not using standard as deepspeed launcher by @asdfry in #2495
- Improve deepspeed env gen by checking for bad keys, by @muellerzr and @ricklamers in #2565
- We now support custom deepspeed env files. Like normal `deepspeed`, set it with the `DS_ENV_FILE` environmental variable by @muellerzr in #2566
- Resolve ZeRO-3 Initialization Failure in already-started distributed environments by @sword865 in #2578
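As a rough illustration of the custom env file support mentioned above, the sketch below writes a small env file and points `DS_ENV_FILE` at it. It assumes the file uses one `KEY=VALUE` pair per line, as DeepSpeed's own `.deepspeed_env` file does; the helper `write_deepspeed_env`, the file name `custom_deepspeed_env`, and the NCCL variables are illustrative choices, not something taken from these release notes.

```python
import os
import tempfile


def write_deepspeed_env(variables, directory, filename="custom_deepspeed_env"):
    """Write `variables` as KEY=VALUE lines (assumed format) and return the path."""
    path = os.path.join(directory, filename)
    with open(path, "w") as f:
        for key, value in variables.items():
            f.write(f"{key}={value}\n")
    return path


# Illustrative variables only; use whatever your cluster actually needs.
env_dir = tempfile.mkdtemp()
env_file = write_deepspeed_env(
    {"NCCL_IB_DISABLE": "1", "NCCL_SOCKET_IFNAME": "eth0"}, env_dir
)

# Must be set before the launcher starts so that it picks up the custom file.
os.environ["DS_ENV_FILE"] = env_file
```

In practice the variable would more often be exported in the shell that runs the launch command; setting it from Python is shown here only to keep the sketch self-contained.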
What's Changed
- Fix test_script.py on TPU v2/v3 by @vanbasten23 in #2542
- Add mapping `main_process_ip` and `master_addr` when not using standard as deepspeed launcher by @asdfry in #2495
- split_between_processes for Dataset by @geronimi73 in #2433
- Include working driver check by @muellerzr in #2558
- 🚨🚨🚨Move to using tags rather than latest for docker images and consolidate image repos 🚨 🚨🚨 by @muellerzr in #2554
- Add Cambricon MLU accelerator support by @huismiling in #2552
- Add NUMA affinity control for NVIDIA GPUs by @muellerzr in #2535
- Add log message for RTX 4000 series when performing multi-gpu inference with device_map by @SunMarc in #2557
- Improve deepspeed env gen by @muellerzr in #2565
- Allow for setting deterministic algorithms by @muellerzr in #2569
- Unpin deepspeed by @muellerzr in #2570
- Rm uv install by @muellerzr in #2577
- Allow for custom deepspeed env files by @muellerzr in #2566
- [docs] Missing functions from API by @stevhliu in #2580
- Update data_loader.py to Ensure Reproducibility in Multi-Process Environments with Dataloader Shuffle by @universuen in #2584
- Refactor affinity and make it stateful by @muellerzr in #2579
- Refactor and improve model estimator tool by @muellerzr in #2581
- Fix `load_checkpoint_in_model` behavior when unexpected keys are in the checkpoint by @fxmarty in #2588
- Guard stateful objects by @muellerzr in #2572
- Expound PartialState docstring by @muellerzr in #2589
- [docs] Fix kwarg docstring by @stevhliu in #2590
- Allow notebook_launcher to launch to multiple GPUs from Colab by @StefanTodoran in #2561
- Fix warning log for unused checkpoint keys by @fxmarty in #2594
- Resolve ZeRO-3 Initialization Failure in Pre-Set Torch Distributed Environments (huggingface/transformers#28803) by @sword865 in #2578
- Refactor PartialState and AcceleratorState by @muellerzr in #2576
- Allow for force unwrapping by @muellerzr in #2595
- Pin hub for tests by @muellerzr in #2608
- Default false for trust_remote_code by @muellerzr in #2607
- fix llama example for pippy by @SunMarc in #2616
- Fix links in Quick Tour by @muellerzr in #2617
- Link to bash in env reporting by @muellerzr in #2623
- Unpin hub by @muellerzr in #2625
New Contributors
- @asdfry made their first contribution in #2495
- @geronimi73 made their first contribution in #2433
- @huismiling made their first contribution in #2552
- @universuen made their first contribution in #2584
- @StefanTodoran made their first contribution in #2561
- @sword865 made their first contribution in #2578
Full Changelog: v0.28.0...v0.29.0
v0.28.0: DataLoaderConfig, XLA improvements, FSDP + QLORA foundations, Gradient Synchronization Tweaks, and Bug Fixes
Core
- Introduce a `DataLoaderConfiguration` and begin deprecation of arguments in the `Accelerator`
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)
- Allow gradients to be synced each data batch while performing gradient accumulation, useful when training in FSDP by @fabianlim in #2531
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# Accumulate over 2 steps, but still sync gradients on every batch
plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)
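The difference `sync_each_batch` makes can be sketched as a per-batch schedule. The following is a simplified model in plain Python, not Accelerate's code: the function `accumulation_schedule` is hypothetical, and it assumes the final batch always triggers an optimizer step. The optimizer still steps only at accumulation boundaries; what changes is whether gradients are reduced across processes on the batches in between.

```python
def accumulation_schedule(num_batches, num_steps, sync_each_batch=False):
    """Return (batch_number, reduce_gradients, optimizer_step) for each batch.

    `optimizer_step` is True on accumulation boundaries (every `num_steps`
    batches, and on the last batch). `reduce_gradients` is True on those
    boundaries, or on every batch when `sync_each_batch` is set.
    """
    schedule = []
    for batch in range(1, num_batches + 1):
        is_boundary = (batch % num_steps == 0) or (batch == num_batches)
        reduce_gradients = is_boundary or sync_each_batch
        schedule.append((batch, reduce_gradients, is_boundary))
    return schedule


default = accumulation_schedule(num_batches=4, num_steps=2)
every_batch = accumulation_schedule(num_batches=4, num_steps=2, sync_each_batch=True)
```

With `num_steps=2`, `default` reduces gradients on batches 2 and 4 only, while `every_batch` reduces on all four; both step the optimizer on batches 2 and 4.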
Torch XLA
FSDP
- Support downstream FSDP + QLORA support through tweaks by allowing configuration of buffer precision by @pacman100 in #2544
`launch` changes
What's Changed
- Fix model metadata issue check by @muellerzr in #2435
- Use py 3.9 by @muellerzr in #2436
- Fix seedable sampler logic and expound docs by @muellerzr in #2434
- Fix tied_pointers_to_remove type by @fxmarty in #2439
- Make test assertions more idiomatic by @akx in #2420
- Prefer `is_torch_tensor` over `hasattr` for torch.compile. by @PhilJd in #2387
- Enable more Ruff lints & fix issues by @akx in #2419
- Fix warning when dispatching model by @SunMarc in #2442
- Make torch xla available on GPU by @anw90 in #2176
- Include pippy_file_path by @muellerzr in #2444
- [Big deprecation] Introduces a `DataLoaderConfig` by @muellerzr in #2441
- Check for None by @muellerzr in #2452
- Fix the pytest version to be less than 8.0.1 by @BenjaminBossan in #2461
- Fix wrong `is_namedtuple` implementation by @fxmarty in #2475
- Use grad-accum on TPU by @muellerzr in #2453
- Add pre-commit configuration by @akx in #2451
- Replace `os.path.sep.join` path manipulations with a helper by @akx in #2446
- DOC: Fixes to Accelerator docstring by @BenjaminBossan in #2443
- Context manager fixes by @akx in #2450
- Fix TPU with new `XLA` device type by @will-cromar in #2467
- Free mps memory by @SunMarc in #2483
- [FIX] allow `Accelerator` to detect distributed type from the "LOCAL_RANK" env variable for XPU by @faaany in #2473
- Fix CI tests due to pathlib issues by @muellerzr in #2491
- Remove all cases of torchrun in tests and centralize as `accelerate launch` by @muellerzr in #2498
- Fix link typo by @SunMarc in #2503
- [docs] Accelerator API by @stevhliu in #2465
- Docstring fixup by @muellerzr in #2504
- [docs] Divide training and inference by @stevhliu in #2466
- add custom dtype INT2 by @SunMarc in #2505
- quanto compatibility for cpu/disk offload by @SunMarc in #2481
- [docs] Quicktour by @stevhliu in #2456
- Check if hub down by @muellerzr in #2506
- Remove offline stuff by @muellerzr in #2509
- Fixed 0MiB bug in convert_file_size_to_int by @StoyanStAtanasov in #2507
- Fix edge case in infer_auto_device_map when dealing with buffers by @SunMarc in #2511
- [docs] Fix typos by @omahs in #2490
- fix typo in launch.py (`----main_process_port` to `--main_process_port`) by @DerrickWang005 in #2516
- Add copyright + some ruff lint things by @muellerzr in #2523
- Don't manage `PYTORCH_NVML_BASED_CUDA_CHECK` when calling `accelerate.utils.imports.is_cuda_available()` by @luiscape in #2524
- Quanto compatibility with QBitsTensor by @SunMarc in #2526
- Remove unnecessary `env=os.environ.copy()`s by @akx in #2449
- Launch mpirun from accelerate launch for multi-CPU training by @dmsuehir in #2493
- Enable using dash or underscore for CLI args by @muellerzr in #2527
- Update the default behavior of `zero_grad(set_to_none=None)` to align with PyTorch by @yongchanghao in #2472
- Update link to dynamo/compile doc by @WarmongeringBeaver in #2533
- Check if the buffers fit GPU memory after device map auto inferred by @notsyncing in #2412
- [Refactor] Refactor send_to_device to treat tensor-like first by @vmoens in #2438
- Overdue email change... by @muellerzr in #2534
- [docs] Troubleshoot by @stevhliu in #2538
- Remove extra double-dash in error message by @drscotthawley in #2541
- Allow Gradients to be Synced Each Data Batch While Performing Gradient Accumulation by @fabianlim in #2531
- Update FSDP mixed precision setter to enable fsdp+qlora by @pacman100 in #2544
- Use uv instead of pip install for github CI by @muellerzr in #2546
New Contributors
- @anw90 made their first contribution in #2176
- @StoyanStAtanasov made their first contribution in #2507
- @omahs made their first contribution in #2490
- @DerrickWang005 made their first contribution in #2516
- @luiscape made their first contribution in #2524
- @dmsuehir made their first contribution in #2493
- @yongchanghao made their first contribution in #2472
- @WarmongeringBeaver made their first contribution in #2533
- @vmoens made their first contribution in #2438
- @drscotthawley made their first contribution in #2541
- @fabianlim made their first contribution in #2531
Full Changelog: v0.27.2...v0.28.0
v0.27.0: PyTorch 2.2.0 Support, PyTorch-Native Pipeline Parallelism, DeepSpeed XPU support, and Bug Fixes
PyTorch 2.2.0 Support
With the latest release of PyTorch 2.2.0, we've verified that there are no breaking changes for Accelerate.
PyTorch-Native Pipeline Parallel Inference
With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so no need to use Megatron or DeepSpeed)! This supports automatic model-weight splitting to each device using a similar API to `device_map="auto"`. This is still under heavy development, however the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.
Requires `pippy` of version 0.2.0 or later (`pip install torchpippy -U`)
Example usage (combined with `accelerate launch` or `torchrun`):
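import torch
from transformers import AutoModelForSequenceClassification
from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()
# Example input used to trace the model (illustrative shape: batch of 2, 1024 tokens)
input = torch.randint(low=0, high=model.config.vocab_size, size=(2, 1024), dtype=torch.int64)
model = prepare_pippy(model, split_points="auto", example_args=(input,))
input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)
# The outputs are only on the final process by default
# You can pass in `gather_outputs=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)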
DeepSpeed
This release provides support for utilizing DeepSpeed on XPU devices thanks to @faaany
What's Changed
- Convert model.hf_device_map back to Dict by @SunMarc in #2326
- Fix model memory issue by @muellerzr in #2327
- Fixed typos in readme files of docs folder. by @rishit5 in #2329
- Disable P2P in just the 4000 series by @muellerzr in #2332
- Avoid duplicating memory for tied weights in `dispatch_model`, and in forward with offloading by @fxmarty in #2330
- Show DeepSpeed option when multi-XPU is selected in `accelerate config` by @faaany in #2346
by @faaany in #2346 - FIX: add oneCCL environment variable for non-MPI launcher (accelerate launch) by @faaany in #2339
- device agnostic test_accelerator/test_multigpu by @wangshuai09 in #2343
- Fix mpi4py/failing deepspeed test issues by @muellerzr in #2353
- Fix `block_size` picking in `megatron_lm_gpt_pretraining` example. by @nilq in #2342
- Fix dispatch_model with tied weights test on T4 by @fxmarty in #2354
- bugfix to allow usage of TE or MSAMP in `FP8RecipeKwargs` by @sudhakarsingh27 in #2355
- Pin DeepSpeed until patch by @muellerzr in #2366
- Remove init_hook_kwargs by @fxmarty in #2365
- device agnostic optimizer testing by @statelesshz in #2363
- `add_hook_to_module` and `remove_hook_from_module` compatibility with fx.GraphModule by @fxmarty in #2369
- Adding `requires_grad` to `kwargs` when registering empty parameters. by @BlackSamorez in #2376
- Add `adapter_only` option to `save_fsdp_model` and `load_fsdp_model` to only save/load PEFT weights by @AjayP13 in #2321
- device agnostic cli/data_loader/grad_sync/kwargs_handlers/memory_utils testing by @wangshuai09 in #2356
- Fix batch_size sanity check logic for `split_batches` by @izhx in #2344
- Pin Torch version to <2.2.0 by @Rocketknight1 in #2394
- Address PEP-632 deprecation of distutils by @AieatAssam in #2388
- [don't merge yet] unpin torch by @ydshieh in #2406
- Revert "[don't merge yet] unpin torch" by @muellerzr in #2407
- Fix CI due to pytest by @muellerzr in #2408
- Added activateEnviroment.sh to readme by @TJ-Solergibert in #2409
- Fix XPU inference by @notsyncing in #2383
- Fix the size of int and bool type when computing module size by @notsyncing in #2411
- Adding Local SGD support for NPU by @statelesshz in #2415
- Unpin torch by @muellerzr in #2418
- Use Ruff for formatting too by @akx in #2400
- torch-native pipeline parallelism for big models by @muellerzr in #2345
- Update FSDP docs by @pacman100 in #2430
- Make output end up on all GPUs at the end by @muellerzr in #2423
- Migrate pippy examples over and run tests by @muellerzr in #2424
- [FIX] fix the wrong `nproc_per_node` in the multi gpu test by @faaany in #2422
- Fix fp8 things by @muellerzr in #2403
- [FIX] allow `Accelerator` to prepare models in eval mode for XPU&CPU by @faaany in #2426
- [Fix] make all tests pass on XPU by @faaany in #2427
New Contributors
- @rishit5 made their first contribution in #2329
- @faaany made their first contribution in #2346
- @wangshuai09 made their first contribution in #2343
- @nilq made their first contribution in #2342
- @BlackSamorez made their first contribution in #2376
- @AjayP13 made their first contribution in #2321
- @Rocketknight1 made their first contribution in #2394
- @AieatAssam made their first contribution in #2388
- @ydshieh made their first contribution in #2406
- @notsyncing made their first contribution in #2383
- @akx made their first contribution in #2400
Full Changelog: v0.26.1...v0.27.0
v0.26.1: Patch Release
What's Changed
Full Changelog: v0.26.0...v0.26.1