Summary
This release significantly improves GPU utilization with a new tensor transaction mechanism that batches sync operations (see the sketch below) and with simultaneous reads of multiple bindings on CubeCL runtimes. It also includes performance optimizations such as mixed precision support for matrix multiplication and convolution operations, as well as notable GEMM improvements.
Backend capabilities have been expanded with a new remote backend for distributed computing, improved SPIR-V support, custom operation fusion, and an experimental fused matrix multiplication.
Training components have been expanded with semantic segmentation and object detection datasets and new training metrics, while an async metric processor improves training performance.
As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations and enhanced documentation.
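For illustration, the transaction mechanism groups several tensor reads into one batched sync instead of one blocking read per tensor. A minimal sketch, assuming a builder-style `Transaction` API as described in #2521:

```rust
use burn::tensor::{backend::Backend, Tensor, TensorData, Transaction};

// Read two tensors back from the device with a single sync point
// instead of synchronizing once per tensor.
fn read_both<B: Backend>(input: Tensor<B, 2>, target: Tensor<B, 1>) -> (TensorData, TensorData) {
    let [input_data, target_data] = Transaction::default()
        .register(input)
        .register(target)
        .execute() // one batched sync for both reads
        .try_into()
        .expect("two registered tensors yield two results");
    (input_data, target_data)
}
```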
Module & Tensor
- Add warning in docstring for indices bound checks (#2462) @laggui
- Add `remainder` op for tensor (#2427) @med1844
- Add float cast tensor op (#2483 #2511 #2538 #2586 #2671) @laggui
- Add step learning rate scheduler (#2423) @towerpark
- Add tensor split operator (#2490) @agelas (see the sketch after this list)
- Add tensor transaction mechanism to batch multiple sync operations (#2521) @nathanielsimard
- [Breaking] Make .init() method of LR schedulers return Result (#2527) @towerpark
- Make optimizer state public (#2561) @ArthurBrussee
- Accept function pointer or closure for freq scaling (#2634) @laggui
- Change pad value w/ ElementConversion (#2653) @laggui
- Add checks for even padding when kernel size is even (#2677) @laggui
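A combined sketch of a few of the additions above. The `remainder` and `split` signatures follow the PyTorch-style conventions implied by the PR titles, and the scheduler config name is hypothetical; treat this as an assumption rather than exact API:

```rust
use burn::tensor::{backend::Backend, Tensor};

fn new_tensor_ops<B: Backend>(device: &B::Device) {
    // `remainder` (#2427): element-wise remainder between two tensors.
    let a = Tensor::<B, 1>::from_floats([5.0, -5.0, 7.5], device);
    let b = Tensor::<B, 1>::from_floats([3.0, 3.0, 2.0], device);
    let _rem = a.remainder(b);

    // `split` (#2490): chunks of size 2 along dimension 0
    // (PyTorch-like `split(split_size, dim)` signature assumed).
    let x = Tensor::<B, 2>::ones([5, 3], device);
    let _parts: Vec<Tensor<B, 2>> = x.split(2, 0);
}

// [Breaking] LR scheduler `.init()` now returns a `Result` (#2527), so a
// bad config surfaces as an error instead of a panic. Hypothetical names:
//
//   let scheduler = StepLrSchedulerConfig::new(1e-2, 30)
//       .init()
//       .expect("valid scheduler config");
```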
Bug Fixes
- Fix unsqueeze dims with multiple trailing negative indices (#2496) @laggui
- Fix one_hot implementation for Int Tensors (#2501) @maun
- Fix tensor prod and prod dim containing nan values (#2515) @quinton11
- Expose ItemLazy to be able to implement for custom types (#2525) @laggui
- Check nonzero stride, dilation and groups (#2540) @laggui
- Module derive types should inherit visibility (#2610) @laggui
- Add dropout prob check (#2695) @laggui
Backends
- Add remote Backend (#2463) @nathanielsimard (see the sketch after this list)
- Add support for custom operations fusion (#2486) @ArthurBrussee
- [Breaking] Remove precision bridge (#2538) @laggui
- Add fused matmul under fusion experimental feature flag (#2622 #2690) @nathanielsimard
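A rough sketch of the remote backend idea from #2463: one process exposes a local backend over the network, and another uses it as an ordinary device. The module paths (`burn::server::start`, `RemoteDevice`, `RemoteBackend`), constructors, and the websocket address are assumptions based on the PR, not verified API:

```rust
// Server process: expose a local backend (Wgpu here) on port 3000.
// `burn::server::start` and its signature are assumed from the PR.
fn run_server() {
    burn::server::start::<burn::backend::Wgpu>(Default::default(), 3000);
}

// Client process: use the remote server like any other device.
fn run_client() {
    use burn::backend::{remote::RemoteDevice, RemoteBackend};
    use burn::tensor::Tensor;

    let device = RemoteDevice::new("ws://localhost:3000"); // assumed constructor
    let tensor = Tensor::<RemoteBackend, 2>::ones([32, 32], &device);
    println!("{tensor}");
}
```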
Bug Fixes
- Prevent various OOB accesses and discontiguous buffer bugs (#2467) @wingertge
- Fix autodiff memory management by verifying parent nodes' existence (#2488) @jnamika
- Fix burn remote deadlock + burn fusion draining (#2492) @nathanielsimard
- Remove dtype rewrite (#2528) @ArthurBrussee
- Fix reduce autotune key no anchor (#2696) @nathanielsimard
Documentation & Examples
- Add wgpu-spirv and hip-jit features to text-classification example (#2422) @syl20bnr
- Add tensor basic ops examples (#2468) @quinton11
- Add segmentation mask to burn book (#2495) @anthonytorlucci
- Add numeric tensor examples (#2514) @quinton11
- Add module mapper book examples (#2621 #2632) @laggui
Fixes
- Fix output dim in embedding nn docstring (#2452) @getumen
- Fix tri mask ops return docstring (#2517) @laggui
- Fix the incorrect link in contributor-books (#2583) @tiruka
- Fix the broken WGSL link in the README (#2607) @korbexmachina
- Fix module visitor and mapper trait definition in the book (#2609) @laggui
- Fix load_file usage to keep using model (#2672) @laggui
- Don't mention a fixed candle bug (#2689) @kitterion
ONNX Support
- Format all type names (#2436) @samolego
- Add ONNX op Random Normal Like (#2441) @tiruka
- Add ONNX op Random Uniform Like (#2448) @tiruka
- Infer convolution kernel shape from weight (#2544) @laggui
Enhancements
- Improve ndarray tensor creation from memory (#2439) @nathanielsimard
- Don't attempt naive reduction when reduce_dim is too high (#2414) @ArthurBrussee
- Add more type support for burn-jit (#2454) @wingertge
- Rewrite legacy `cpa` kernels (#2455) @wingertge
- Implicit GEMM optimizations/bug fixes (#2499) @wingertge
- Add custom NCHW to NHWC kernel for implicit GEMM (optimization) (#2530) @wingertge
- Support 8-bit bool for JitBackend (#2526) @wingertge
- Implicit GEMM rewrite optimization (#2545) @wingertge
- Fix autotune error handling (#2670) @nathanielsimard
- Use float intrinsics for deform_conv2d backward, fix into_data for padded tensors (#2681) @wingertge
Refactoring
- Migrate to `cubecl` IR refactor (#2418) @wingertge
- DefaultDevice should be an alias of BestAvailable (#2443) @ArthurBrussee
- Replace crates by dependi (#2477) @vincentmasse
- Refactor quantization tensor data representation (#2479) @laggui
- Use alias for more consistent typing (#2497) @loganbnielsen
- Add `QTensorOps` docs + refactor tests to simplify inputs (#2557) @laggui
- Update for rust 1.83 (#2562 #2605) @laggui
- Matmul + CubeCL Update (#2551) @nathanielsimard
- Migrate matmul autotune to macro and fix accelerated (#2584) @wingertge
- Refactor jit quantized tensor representation (#2604) @laggui
- [Breaking] Fix alignment issue of TensorData bytes (#2416) @WorldSEnder
- Refactor quantized bytes representation (#2627) @laggui
- Update to new cubecl with improved compilation times (#2654) @nathanielsimard
- Refactor unary + binary kernels (#2665) @nathanielsimard
- Import code from github-device-flow crate for burnbench (#2667) @syl20bnr
- Fix web examples and conflicting feature flags w/ `default-features = false` (#2691) @laggui
- Use cubecl reduce w/ autotune (#2673) @maxtremblay
Miscellaneous
- Use core::error::Error for no-std (#2346) @antimora
- Update deny.toml to follow the spec changes of cargo-deny (#2408) @tiruka
- Add segmentation mask to ImageFolderDataset (#2426) @anthonytorlucci
- Add ROC AUC metric (#2466) @vincentmasse
- Async Processor: run train metrics & dashboard on another thread (#2482) @nathanielsimard
- Add precision classification metric (#2293) @tsanona (see the sketch at the end of this section)
- Add test int one_hot and change ops docs in the book (#2519) @tsanona
- Add option to request manual quit on tui (#2489) @vincentmasse
- Reduce log spam (#2556) @ArthurBrussee
- Add `ImageDatasetItem` image path field (#2558) @wangjiawen2013
- Fix xtask command with last version (#2566 #2582) @syl20bnr
- Remove duplicate jit conv2d test (#2581) @tiruka
- Relax Fn requirements for param map (#2620) @ArthurBrussee
- Extend ImageFolderDataset to support import of COCO detection (#2612) @jin-eld
- Add recall metric (#2518) @tsanona
- Propagate audio feature flag (#2633) @laggui
- Add F-score metric (#2648) @tsanona
- Implement benchmark for reduce kernel (#2692) @maxtremblay
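Several classification metrics landed this cycle (precision, recall, F-score, ROC AUC). A minimal sketch of wiring two of them into a learner; the metric type names and `default()` constructors are assumptions drawn from the PR titles, so adapt them to the released API:

```rust
use burn::train::metric::{PrecisionMetric, RecallMetric};
use burn::train::LearnerBuilder;

// Inside your training setup: attach the new metrics to both the
// training and validation loops (type names assumed).
let learner = LearnerBuilder::new(artifact_dir)
    .metric_train_numeric(PrecisionMetric::default())
    .metric_valid_numeric(PrecisionMetric::default())
    .metric_train_numeric(RecallMetric::default())
    .metric_valid_numeric(RecallMetric::default())
    .build(model, optim, lr_scheduler);
```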