Releases: tracel-ai/burn
v0.16.0
Summary
This release significantly enhances GPU utilization through a new tensor transaction mechanism for batched sync operations and simultaneous reads of multiple bindings for CubeCL runtimes. It also includes multiple performance optimizations like mixed precision support for matrix multiplication and convolution operations, as well as notable GEMM improvements.
Backend capabilities have been expanded with a new remote backend for distributed computing, improved SPIR-V support, custom operations fusion and an experimental fused matrix multiplication.
Training components have been expanded to support semantic segmentation and object detection datasets, new training metrics and improved training performance thanks to an async metric processor.
As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations and enhanced documentation.
Module & Tensor
- Add warning in docstring for indices bound checks (#2462) @laggui
- Add remainder op for tensor (#2427) @med1844
- Add float cast tensor op (#2483 #2511 #2538 #2586 #2671) @laggui
- Add step learning rate scheduler (#2423) @towerpark
- Add tensor split operator (#2490) @agelas
- Add tensor transaction mechanism to batch multiple sync operations (#2521) @nathanielsimard
- [Breaking] Make .init() method of LR schedulers return Result (#2527) @towerpark
- Make optimizer state public (#2561) @ArthurBrussee
- Accept function pointer or closure for freq scaling (#2634) @laggui
- Change pad value w/ ElementConversion (#2653) @laggui
- Add checks for even padding when kernel size is even (#2677) @laggui
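A step learning-rate scheduler (#2423) keeps the rate constant for a fixed number of iterations and then multiplies it by a decay factor. The Rust sketch below assumes the common definition lr(t) = initial_lr * gamma^floor(t / step_size). The StepDecay type, its parameter names and its validation rules are hypothetical and are not Burn's scheduler API. Its constructor returns Result, loosely in the spirit of the breaking change in #2527 that made scheduler .init() fallible.

```rust
/// Hypothetical step-decay scheduler (not Burn's scheduler type):
/// lr(t) = initial_lr * gamma^(t / step_size), using integer division.
#[derive(Debug, Clone)]
pub struct StepDecay {
    initial_lr: f64,
    step_size: usize,
    gamma: f64,
    iteration: usize,
}

impl StepDecay {
    /// Validates the configuration up front and reports problems as Err
    /// instead of panicking later during training.
    pub fn new(initial_lr: f64, step_size: usize, gamma: f64) -> Result<Self, String> {
        if initial_lr.is_nan() || initial_lr <= 0.0 {
            return Err(format!("initial_lr must be positive, got {initial_lr}"));
        }
        if step_size == 0 {
            return Err("step_size must be at least 1".to_string());
        }
        if gamma.is_nan() || gamma <= 0.0 || gamma > 1.0 {
            return Err(format!("gamma must be in (0, 1], got {gamma}"));
        }
        Ok(Self { initial_lr, step_size, gamma, iteration: 0 })
    }

    /// Returns the learning rate for the current iteration, then advances.
    /// The first call returns initial_lr unchanged.
    pub fn step(&mut self) -> f64 {
        let decays = (self.iteration / self.step_size) as i32;
        self.iteration += 1;
        self.initial_lr * self.gamma.powi(decays)
    }
}
```

With initial_lr = 0.1, step_size = 2 and gamma = 0.5, successive calls yield 0.1, 0.1, 0.05, 0.05, 0.025: integer division produces the piecewise-constant plateaus that give the scheduler its name.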
Bug Fixes
- Fix unsqueeze dims with multiple trailing negative indices (#2496) @laggui
- Fix one_hot implementation for Int Tensors (#2501) @maun
- Fix tensor prod and prod dim containing nan values (#2515) @quinton11
- Expose ItemLazy to be able to implement for custom types (#2525) @laggui
- Check nonzero stride, dilation and groups (#2540) @laggui
- Module derive types should inherit visibility (#2610) @laggui
- Add dropout prob check (#2695) @laggui
Backends
- Add remote Backend (#2463) @nathanielsimard
- Add support for custom operations fusion (#2486) @ArthurBrussee
- [Breaking] Remove precision bridge (#2538) @laggui
- Add fused matmul under fusion experimental feature flag (#2622 #2690) @nathanielsimard
Bug Fixes
- Prevent various OOB accesses and discontiguous buffer bugs (#2467) @wingertge
- Fix autodiff memory management by verifying parent nodes' existence (#2488) @jnamika
- Fix burn remote deadlock + burn fusion draining (#2492) @nathanielsimard
- Remove dtype rewrite (#2528) @ArthurBrussee
- Fix reduce autotune key no anchor (#2696) @nathanielsimard
Documentation & Examples
- Add wgpu-spirv and hip-jit features to text-classification example (#2422) @syl20bnr
- Add tensor basic ops examples (#2468) @quinton11
- Add segmentation mask to burn book (#2495) @anthonytorlucci
- Add numeric tensor examples (#2514) @quinton11
- Add module mapper book examples (#2621 #2632) @laggui
Fixes
- Fix output dim in embedding nn docstring (#2452) @getumen
- Fix tri mask ops return docstring (#2517) @laggui
- Fix the incorrect link in contributor-books (#2583) @tiruka
- Fix the broken WGSL link in the README (#2607) @korbexmachina
- Fix module visitor and mapper trait definition in the book (#2609) @laggui
- Fix load_file usage to keep using model (#2672) @laggui
- Don't mention a fixed candle bug (#2689) @kitterion
ONNX Support
- Format all type names (#2436) @samolego
- Add ONNX op Random Normal Like (#2441) @tiruka
- Add ONNX op Random Uniform Like (#2448) @tiruka
- Infer convolution kernel shape from weight (#2544) @laggui
Enhancements
- Improve ndarray tensor creation from memory (#2439) @nathanielsimard
- Don't attempt naive reduction when reduce_dim is too high (#2414) @ArthurBrussee
- Add more type support for burn-jit (#2454) @wingertge
- Rewrite legacy cpa kernels (#2455) @wingertge
- Implicit GEMM optimizations/bug fixes (#2499) @wingertge
- Add custom NCHW to NHWC kernel for implicit GEMM (optimization) (#2530) @wingertge
- Support 8-bit bool for JitBackend (#2526) @wingertge
- Implicit gemm rewrite optimization (#2545) @wingertge
- Fix autotune error handling (#2670) @nathanielsimard
- Use float intrinsics for deform_conv2d backward, fix into_data for padded tensors (#2681) @wingertge
Refactoring
- Migrate to cubecl IR refactor (#2418) @wingertge
- DefaultDevice should be an alias of BestAvailable (#2443) @ArthurBrussee
- Replace crates by dependi (#2477) @vincentmasse
- Refactor quantization tensor data representation (#2479) @laggui
- Use alias for more consistent typing (#2497) @loganbnielsen
- Add QTensorOps docs + refactor tests to simplify inputs (#2557) @laggui
- Update for Rust 1.83 (#2562 #2605) @laggui
- Matmul + CubeCL Update (#2551) @nathanielsimard
- Migrate matmul autotune to macro and fix accelerated (#2584) @wingertge
- Refactor jit quantized tensor representation (#2604) @laggui
- [Breaking] Fix alignment issue of TensorData bytes (#2416) @WorldSEnder
- Refactor quantized bytes representation (#2627) @laggui
- Update to new cubecl with improved compilation times (#2654) @nathanielsimard
- Refactor unary + binary kernels (#2665) @nathanielsimard
- Import code from github-device-flow crate for burnbench (#2667) @syl20bnr
- Fix web examples and conflicting feature flags w/ default-features = false (#2691) @laggui
- Use cubecl reduce w/ autotune (#2673) @maxtremblay
Miscellaneous
- Use core::error::Error for no-std (#2346) @antimora
- Update deny.toml to follow the spec changes of cargo-deny (#2408) @tiruka
- Add segmentation mask to ImageFolderDataset (#2426) @anthonytorlucci
- Add ROC AUC metric (#2466) @vincentmasse
- Async Processor: run train metrics & dashboard on another thread (#2482) @nathanielsimard
- Add precision classification metric (#2293) @tsanona
- Add test int one_hot and change ops docs in the book (#2519) @tsanona
- Add option to request manual quit on tui (#2489) @vincentmasse
- Reduce log spam (#2556) @ArthurBrussee
- Add ImageDatasetItem image path field (#2558) @wangjiawen2013
- Fix xtask command with latest version (#2566 #2582) @syl20bnr
- Remove duplicate jit conv2d test (#2581) @tiruka
- Relax Fn requirements for param map (#2620) @ArthurBrussee
- Extend ImageFolderDataset to support import of COCO detection (#2612) @jin-eld
- Add recall metric (#2518) @tsanona
- Propagate audio feature flag (#2633) @laggui
- Add F-score metric (#2648) @tsanona
- Implement benchmark for reduce kernel (#2692) @maxtremblay
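The precision (#2293), recall (#2518) and F-score (#2648) metrics follow standard definitions. For a single binary class they reduce to the counting sketch below. The Counts type and the function names are hypothetical, the code works on plain slices rather than Burn tensors, and returning 0.0 for an undefined ratio is a convention chosen here, not necessarily Burn's.

```rust
/// Confusion counts for a binary classifier (positive class = true).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Counts {
    pub true_pos: usize,
    pub false_pos: usize,
    pub false_neg: usize,
}

/// Tallies predictions against targets; true negatives are not needed here.
pub fn count(predictions: &[bool], targets: &[bool]) -> Counts {
    assert_eq!(predictions.len(), targets.len(), "length mismatch");
    let mut c = Counts::default();
    for (&p, &t) in predictions.iter().zip(targets) {
        match (p, t) {
            (true, true) => c.true_pos += 1,
            (true, false) => c.false_pos += 1,
            (false, true) => c.false_neg += 1,
            (false, false) => {}
        }
    }
    c
}

fn ratio(num: usize, den: usize) -> f64 {
    if den == 0 { 0.0 } else { num as f64 / den as f64 }
}

/// Precision = TP / (TP + FP).
pub fn precision(c: Counts) -> f64 {
    ratio(c.true_pos, c.true_pos + c.false_pos)
}

/// Recall = TP / (TP + FN).
pub fn recall(c: Counts) -> f64 {
    ratio(c.true_pos, c.true_pos + c.false_neg)
}

/// F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1.
pub fn f_beta(c: Counts, beta: f64) -> f64 {
    let (p, r) = (precision(c), recall(c));
    let b2 = beta * beta;
    let denom = b2 * p + r;
    if denom == 0.0 { 0.0 } else { (1.0 + b2) * p * r / denom }
}
```

A beta above 1 weights recall more heavily than precision, and a beta below 1 does the opposite.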
v0.15.0
Summary
This release brings major performance improvements to tensor operations, particularly in matrix multiplication and convolution, along with experimental ROCm/HIP and SPIR-V support enabled by CubeCL runtimes. It also introduces foundational features for multi-backend compatibility and adds new quantization operations.
Support for ONNX models has been expanded, with additional operators and bug fixes for better operator coverage.
As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations, and enhanced documentation.
Module & Tensor
- Remove copy restriction for const generic modules (#2222) @laggui
- Add deform_conv2d as implemented in torchvision (#2147) @wingertge
- Add dim checks on output rank for unsqueeze and stack (#2331) @laggui
- Add Softmin (#2358) @NoahSchiro
- Add round, floor, ceil for float tensor (#2372) @med1844
- Make tensor sync (#2392) @kingwingfly
- Add tensor.one_hot int operation (#2413) @tsanona
- [Breaking] Change LR schedulers to return the initial LR at first .step() (#2337) @towerpark
- Move LrSchedule generic to make it easier to use (#2309) @ArthurBrussee
- Add quantization ops default implementation (#2125 #2275 #2301) @laggui
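Softmin (#2358) is softmax applied to the negated input, so smaller values receive larger weights. A minimal sketch over a plain f64 slice, assuming the usual definition softmin(x)_i = exp(-x_i) / sum_j exp(-x_j); the function is illustrative and is not Burn's activation API.

```rust
/// Softmin over a 1-D slice: softmin(x) = softmax(-x).
/// Shifting by the minimum keeps every exponent <= 0, so exp() cannot overflow.
pub fn softmin(x: &[f64]) -> Vec<f64> {
    let min = x.iter().copied().fold(f64::INFINITY, f64::min);
    let exps: Vec<f64> = x.iter().map(|&v| (-(v - min)).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

The shift does not change the result, because the common factor exp(min) cancels between numerator and denominator; it only keeps inputs such as [1000.0, 1001.0] from underflowing to 0/0.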
Bug Fixes
- Avoid 0 denominator in interpolate frac (#2224) @laggui
- Nonzero should return an empty vec for zero tensors (#2212) @laggui
- Change ndarray mask_where implementation to correctly deal with NaNs (#2272) @laggui
- Fix mask_where broadcasted input (#2381) @laggui
- Make powf broadcastable (#2398) @laggui
Backends
- Add candle CudaDevice and MetalDevice to avoid creating a new unique device each time (#2290) @laggui
- Add fusion mixed precision (#2247) @nathanielsimard
- Add SPIR-V compiler backend to burn-wgpu (#2386) @wingertge
- Add burn-hip (#2399) @syl20bnr
- Add BackendRouter to handle multiple backends on the way to distributed (#2353 #2419) @laggui
Bug Fixes
- Fix autodiff memory leak (#2347) @nathanielsimard
- Fix autodiff abs NaN when output is 0 (#2249) @AsherJingkongChen
Documentation & Examples
- Add documentation for custom cubecl kernels, update some outdated docs (#2404) @wingertge
- Add comments to burn fusion (#2130) @cBournhonesque
- Improve doc for burn-tch (#2288) @kingwingfly
- Improve regression example (#2405) @laggui
- Create CITATION.cff (#2231) @antimora
- Enable doc_auto_cfg to show feature-req-hint in docs.rs (#2271) @kingwingfly
Fixes
- Fix tensor data elem type conversion in book (#2211) @laggui
- Fix target convert in batcher and align guide imports (#2215) @laggui
- Fix huber loss documentation (#2232) @kingwingfly
- Fix debugger settings doc in contributor book (#2223) @tiruka
- Fixed raspberry pi pico example not compiling (#2220) @BjornTheProgrammer
- Fixed path in book (#2262) @mehmetalianil
- Fix unresolved import regression (#2285) @tiruka
- Fix burn book links (#2303 #2327) @laggui @tiruka
- Contributor Book: Fix the link of primitive types in the "Serialization" page (#2362) @towerpark
- Fix simple regression batch targets (#2379) @wangjiawen2013
- Fix xtask args which are unmodified when upgrading xtask commands (#2364) @tiruka
ONNX Support
- Add gather support for multi-dim indices (rank > 1) (#2199) @alteredoxide
- Allow onnx-import expand op with non-const shapes (#2189) @hexd0t
- Improve ONNX import tensor shape tracking (#2213) @hexd0t
- Add missing output padding to conv transpose ONNX (#2216) @laggui
- Fix ONNX where op for scalar inputs (#2218) @hexd0t
- simplify scope tracking in burn-import (#2207) @skewballfox
- Add onnx op trilu (#2323) @tiruka
- Add ConvTranspose1d ONNX op (#2349) @tiruka
Enhancements
- Improve slice kernel performance (#2252) @nathanielsimard
- Fix burn-jit conv2d excessive loop unrolling (#2263) @AsherJingkongChen
- Introduce autotuning to conv2d and conv_transpose2d with a new im2col/GEMM algorithm (#2287) @wingertge
- Further data locality optimizations for implicit GEMM (#2300) @wingertge
- Add utility methods to split gradients to GradientParams (#2311) @ArthurBrussee
- Add bounds checking to implicit GEMM to allow arbitrary input shapes (#2354) @wingertge
- Initialize accumulator to bias for implicit GEMM to save an expensive float_add (#2383) @wingertge
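The im2col/GEMM approach listed above lowers a convolution to one matrix multiplication. Every k x k input patch is copied into a column, and the flattened filters become the rows of the left-hand matrix. The CPU reference sketch below assumes a single image, a channels-height-width layout, stride 1, no padding and no dilation. The function names are hypothetical, and it is not Burn's GPU kernel.

```rust
/// Lowers a (channels, height, width) input into a row-major column matrix
/// of shape (channels * k * k, out_h * out_w). Stride 1, no padding.
pub fn im2col(input: &[f32], channels: usize, height: usize, width: usize, k: usize) -> (Vec<f32>, usize, usize) {
    assert!(k >= 1 && k <= height && k <= width, "kernel must fit inside the input");
    assert_eq!(input.len(), channels * height * width);
    let (out_h, out_w) = (height - k + 1, width - k + 1);
    let cols = out_h * out_w;
    let mut matrix = vec![0.0; channels * k * k * cols];
    for c in 0..channels {
        for ky in 0..k {
            for kx in 0..k {
                // Row index matches the flattened (c, ky, kx) order of a filter.
                let row = (c * k + ky) * k + kx;
                for oy in 0..out_h {
                    for ox in 0..out_w {
                        let src = (c * height + oy + ky) * width + ox + kx;
                        matrix[row * cols + oy * out_w + ox] = input[src];
                    }
                }
            }
        }
    }
    (matrix, out_h, out_w)
}

/// Naive GEMM: (m x k) * (k x n) -> (m x n), all row-major.
pub fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                out[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    out
}

/// conv2d as im2col + GEMM. A weight of shape (out_channels, channels, k, k),
/// flattened row-major, already is the (out_channels x channels*k*k) matrix.
/// The result has shape (out_channels, out_h, out_w).
pub fn conv2d_im2col(input: &[f32], weight: &[f32], channels: usize, height: usize, width: usize, out_channels: usize, k: usize) -> Vec<f32> {
    assert_eq!(weight.len(), out_channels * channels * k * k);
    let (cols, out_h, out_w) = im2col(input, channels, height, width, k);
    matmul(weight, &cols, out_channels, channels * k * k, out_h * out_w)
}
```

The explicit column matrix costs about k * k times the input's memory. Implicit GEMM, mentioned in the neighboring entries, computes the same product but derives each patch index on the fly instead of materializing the matrix.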
Refactoring
- Select kernel from CPA to CubeCL (#2168) @mepatrick73
- Migrate cubecl macro (#2266) @wingertge
- Remove primitives const D generic (#2298) @laggui
- Refactor elemwise fusion (#2344) @nathanielsimard
- Refactor Adaptive Avg Pool to CubeCL (#2351) @nathanielsimard
- Refactor pooling kernels (#2356) @nathanielsimard
- Refactor burn-tensor: Split conv backward ops to allow conditional gradient computation (#2278) @AsherJingkongChen
Miscellaneous
- Fix panic messages being invisible in tui mode (#2226) @PaulWagener
- Refactor xtask to use tracel-xtask and refactor CI workflow (#2063) @syl20bnr
- Automatic minimum rust version in README (#2227) @syl20bnr
- Set MSRV to 1.81 (#2388) @nathanielsimard
- Don't panic when the progress is > 1.0 (#2229) @PaulWagener
- Fix compile for dataset crate with vision feature (#2228) @PaulWagener
- Update CI workflow for last version of setup-linux action (#2248) @syl20bnr
- [CI] Fix llvmpipe, lavapipe install for valgrind and vulnerabilities (#2264) @syl20bnr
- Use CliMetricsRenderer when not in a terminal (#2307) @lancelet
- Update rusqlite and associated libraries (#2328) @paulirotta
- Fix missing fusion feature flag @nathanielsimard
- Move conv autotune under feature flag (except key) (#2330) @laggui
- Add should_run for convs instead of panicking (#2403) @ArthurBrussee
- Make changes for latest ratatui version (#2421) @laggui
- Add Windows/WindowsIterator/WindowsDataset (#2338) @NicoZweifel
v0.14.0
Summary
This release marks the debut of our CubeCL integration, which brings cross-platform GPU programming capabilities directly to Rust.
With CubeCL now supporting both CUDA and WebGPU, Burn benefits from a new CUDA backend that can be enabled using the cuda-jit feature.
Please note that this backend is still considered experimental, and some operations, particularly those related to vision, may experience issues.
Additionally, this release features significant enhancements to ONNX support, including bug fixes, new operators, and improvements in code generation.
As always, it also includes numerous bug fixes, performance enhancements, new tensor operations, and improved documentation.
Burn 0.14.0 introduces a new tensor data format that significantly enhances serialization and deserialization speeds and introduces Quantization, a new Beta feature included in this release. The format is not compatible with previous versions of Burn, but you can migrate your previously saved records using this guide.
Module & Tensor
- (@laggui) Add RoPE init_with_frequency_scaling (#2194)
- (@laggui) Add 0-dim tensor checks for creation ops and validate TensorData shape w/ num values (#2137)
- (@wingertge): Add Hard sigmoid activation function (#2112)
- (@antimora): Add is_nan and contains_nan tensor ops (#2088)
- (@laggui) Convert compatible prelu weights to rank 1 (#2054)
- (@laggui) Refactor tensor quantization for q_* ops (#2025)
- (@RuelYasa): Adding burn::nn::Sigmoid (#2031)
- (@laggui) Module weight quantization (#2000)
- (@louisfd): Cube: Matmul tiling (#1994)
- (@antimora): Enhance slice operation to support more range variation (#1989)
- (@laggui) Add static tensor quantization (#1963)
- (@johnhuichen): Enable negative starts and ends for slice op (#1981)
- (@booti386): Implement 3D and transposed 3D convolutions. (#1945)
- (@antimora): Print module - implement module display for remaining modules (part2) (#1933)
- (@antimora): Print model structure like with PyTorch - Part 1 (#1912)
- (@DieracDelta): Tanh nn wrapper (#1903)
- (@laggui) Implement Element for bool (#1878)
- (@LilDojd) Feat: Add movedim tensor operator (#1876)
- (@ArthurBrussee): Make autodiff compile on wasm (#1889)
- (@ArthurBrussee): Make Param.id public (#1859)
- (@kantic) Remainder operator (#1726)
- (@McArthur-Alford) Indices Operator (#1735)
- (@laggui) Add seq start position when applying RoPE encoding (#1796)
- (@JachymPutta): Adding max import (#1769)
- (@agelas): Feat/squeeze dims (#1779)
- (@wcshds) Implement bidirectional LSTM (#1035)
- (@agelas): Feat/remainder (#1597)
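Hard sigmoid (#2112) replaces the logistic curve with a clamped line, which is cheaper to evaluate. This sketch assumes the ONNX HardSigmoid formulation max(0, min(1, alpha * x + beta)), whose ONNX defaults are alpha = 0.2 and beta = 0.5. The function names are hypothetical, and whether Burn's module uses the same defaults is not stated here.

```rust
/// Hard sigmoid: a piecewise-linear approximation of the logistic sigmoid,
/// hard_sigmoid(x) = max(0, min(1, alpha * x + beta)).
pub fn hard_sigmoid(x: f32, alpha: f32, beta: f32) -> f32 {
    // f32::clamp returns NaN for a NaN input, so invalid values stay visible.
    (alpha * x + beta).clamp(0.0, 1.0)
}

/// Applies hard_sigmoid element-wise, as an activation layer would.
pub fn hard_sigmoid_all(xs: &[f32], alpha: f32, beta: f32) -> Vec<f32> {
    xs.iter().map(|&x| hard_sigmoid(x, alpha, beta)).collect()
}
```

With alpha = 0.2 and beta = 0.5, the output saturates at 0 for x <= -2.5 and at 1 for x >= 2.5, and passes through 0.5 at x = 0 like the logistic sigmoid.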
Bug Fixes
- (@laggui) Fix root-mean-square precision issue (#2193)
- (@laggui) Fix indices dim check in gather_update_outputs (#2149)
- (@antimora): Fix #2091 bug (in-place after expand) (#2114)
- (@laggui) Fix aggregation results slice (#2110)
- (@nathanielsimard): Fix: fusion auto bound checks (#2087)
- (@laggui) Extend [min, max] range to ensure zero-point (#2055)
- (@agelas): Bug/Remove Squeeze Panic for Multiple Dimensions (#2035)
- (@nathanielsimard): Fix wgsl remainder definition (#1979)
- (@laggui) Fix output tensor dtype (#1938)
- (@femshima): feat: Make RetroForward public (#1905)
- (@laggui) Fix conv2d_weight_grad_groups (#1891)
- (@nathanielsimard): Fix select assign backward (#1739)
- (@louisfd): Fix repeat for dims > 1 (#1713)
- (@nathanielsimard): Fix lstm batch size bug (#1695)
- (@antimora): Reshape bug fix (#1684)
- (@antimora) Fix bug: Filling tensor containing f32::NEG_INFINITY will result in NaN for burn-ndarray (#2095)
ONNX Support
- (@hexd0t): Allow ONNX scalar greater/less with scalar (#2146)
- (@hexd0t): Implement ONNX Gather for scalar indices (#2141)
- (@mepatrick73): feat: adding shape support for gather ONNX operation (#2128)
- (@mepatrick73): ONNX Tile operation (#2092)
- (@cBournhonesque): Add onnx mean (#2119)
- (@mepatrick73): Repeat operation (#2090)
- (@antimora): Add 1d and 2d modules for interpolate with scaling (also fix ONNX Resize op) (#2081)
- (@johnhuichen): Implement ONNX Pad Operator (#2007)
- (@hexd0t, @antimora): Implement ONNX ConstantOfShape (#1815)
- (@johnhuichen): Add subtract tensor from scalar for ONNX sub op (#1964)
- (@Dirleye): Add ReduceProd ONNX Import (#1955)
- (@JachymPutta) feat: added reduce min onnx import (#1894)
- (@mosure): feat: resize onnx import (#1863)
- (@JachymPutta) feat: added slice onnx import (#1856)
- (@skewballfox): Optimize argument handling and improve ONNX graph building (#1857)
- (@JachymPutta) feat: add sum onnx import (#1846)
- (@agelas): Feat/gather import (#1843)
- (@JachymPutta): feat: expand onnx import (#1813)
- (@JachymPutta): feat: added range onnx import (#1834)
- (@will-maclean): Feature/onnx argmax (#1814)
- (@hexd0t): Feat: Implement ONNX RandomUniform + RandomNormal in burn-import (#1806)
- (@JachymPutta): feat: Greater + GreaterOrEqual onnx import (#1801)
- (@JachymPutta): feat: Less + LessOrEqual onnx import (#1800)
- (@JachymPutta): feat: added min onnx import (#1778)
- (@agelas): Squeeze Onnx Import (#1753)
- (@Arjun31415): Added ONNX AvgPool1d (#1744)
- (@Arjun31415): Add MaxPool1d ONNX Op(#1725)
- (@AntBlo) Add reduce sum onnx ops to burn imports (#1723)
- (@Arjun31415): PReLu ONNX import (#1721)
- (@antimora): Update SUPPORTED-ONNX-OPS.md (#1717)
- (@antimora): ONNX debug improvements (#1712)
- (@antimora): Skip updating shape for linear if not present (#1700)
- (@laggui) Remove leaky relu ONNX file (#1697)
- (@antimora): ONNX support for scalar unsqueeze (#1690)
- (@laggui) Add layer norm onnx op support (#1680)
- (@antimora): Fix reshape bug (support for opset version 1) (#1667)
- (@wufniks) Add sign ONNX op import support (#1663)
- (@laggui) Add where onnx op support (#1653)
- (@laggui) Add matmul ONNX op support (#1638)
- (@laggui) Add reduce max ONNX op support (#1636)
- (@laggui) Add shape ONNX op support (#1639)
- (@laggui) [ONNX] Add not op and extend cast support to tensors (#1634)
- (@laggui) Add reduce mean ONNX op support (#1637)
- (@antimora): Update SUPPORTED-ONNX-OPS.md (#1641)
- (@laggui) Add sin onnx op support (#1633)
Bug Fixes
- (@mepatrick73) Tensor type indent fix (#2196)
- (@mepatrick73) pad-input-fix: adding support for pads as attributes (#2195)
- (@hexd0t) Fix ONNX Gather codegen for Shape input (#2148)
- (@mepatrick73): bug fix: adding bounds checking to pad ONNX inputs (#2120)
- (@laggui) Fix checks_channels_div_groups condition and ONNX conv import with groups (#2051)
- (@nathanielsimard): Support linear 1d (#1682)
- (@laggui) Fix ONNX and PyTorch import section links in burn book (#1681)
- (@antimora): Fix bug 1645 (Unsqueeze OpSet 11) (#1661)
- (@laggui) Fix transpose onnx op (permute) (#1657)
Enhancements
- (@laggui) Add scientific notation formatting for small metric values (#2136)
- (@ArthurBrussee): Always derive Cube features from adapter (#1958)
- (@mepatrick73, @nathanielsimard): Dynamic memory management preset + updated wgpu buffer memory management (#1962)
- (@mepatrick73): Feat/fixed chunk alloc by class (#1960)
- (@ArthurBrussee): Consistent sync/async handling, allow more functions to be async for wasm. (#1936)
- (@varonroy): Replaced str with Path (#1919)
- (@louisfd, @nathanielsimard): New autodiff graph memory management strategy (#1698)
- (@syl20bnr): Move HandleContainer and Tensor Ops descriptions from burn-fusion to burn-tensor (#1654)
- (@NicoZweifel) WindowDataset/windows function (#1553)
- (@antimora): Improve pickle (CandleTensor) conversions to NestedValue (#1944)
Refactoring
- (@mepatrick73) Scatter kernel from cpa to cubecl (#2169)
- (@nathanielsimard): Refactor binary op (#2085)
- (@omahs): Fix typos (#2098)
- (@nathanielsimard): Refactor/jit/unary (#1965)
- (@skewballfox): Separating ONNX parsing from burn-import (#1921)
- (@laggui) Refactor tensor data (#1916)
- (@ArthurBrussee): Remove GraphicsAPI generic for WgpuRuntime (#1888)
- (@skewballfox): add dependency management for python (#1887)
- (@louisfd): refactor reduce into separate traits (#1798)
- (@nathanielsimard): Refactor/jit fusion (#1750)
- (@nathanielsimard): Refactor/burn compute (#1580)
Documentation & Examples
- (@nathanielsimard) Enable cuda-jit in burn-core + in text classification example (#2160)
- (@cBournhonesque): Add comments for matmul kernel (#2138)
- (@laggui) Fix inner backend typo in book guide (#2135)
- (@antimora): Improve ONNX import book section (#2059)
- (@antimora): Update slice documentation (#2024)
- (@syl20bnr): Remove mention of example in backend section of the book (#2014)
- (@laggui) Fix image-classification-web + autotune flag usage (#2011)
- (@nathanielsimard): Cube/doc/readme (#1904)
- (@laggui, @syl20bnr) Add models and examples reference (#1966)
- (@antimora): Print module part3 - Update book (#1940)
- (@towerpark): Book: Fix the link to burn-train in "Learner" page (#1920)
- (@nathanielsimard): Doc: Improve module to_device/fork docs (#1901)
- (@jwric, @ThierryCantin-Demers, @mepatrick73): Add documentation to burn core nn (#1746)
- (@towerpark): Book: Fix typos in the name of MessagePack format (#1868)
- (@Zirconium409122, @kantic): Remainder operator doc (#1836)
- (@nathanielsimard): Fix wasm examples (#1824)
- (@eltociear) docs: update README.md (#1810)
- (@agelas): Contributor Book: Onnx to Burn Conversion (#1771)
- (@benbaarber): update ARCHITECTURE.md links to project architecture section in contributor book (#1759)
- (@jwric): Add hidden code snippets to guide example in Burn book [redo] (#1742)
- (@mepatrick73): Fixing various syntax errors in the Burn book (#1740)
- (@ThierryCantin-Demers) Add indentation to project architecture in contributing book (#1738)
- (@AntBlo) Add info about enabling debugging for new cont...
v0.13.2
v0.13.1
Bugfix
Fix autodiff memory leak and improve performance with a new graph memory management strategy (#1698) @nathanielsimard @louisfd
Fix inplace fused operations (#1682) @nathanielsimard
Improvements
Linear 1D support, helpful for ONNX support (#1682) @nathanielsimard
Upgrade wgpu to 0.19.4 (#1692) @nathanielsimard
v0.13.0
Burn 0.13 is a significant release introducing numerous new features and performance enhancements. One major change is the removal of the Sync trait implementation from most Burn types (see Core User APIs). Additionally, the release introduces several new tensor operations, module features and optimizers, as well as improvements to the autodiff backend. Notably, a new bridge mechanism facilitates runtime switching between backends, and significant work has been done on the Just-in-Time and Wgpu backends. The release also addresses numerous bug fixes, documentation improvements, infrastructure updates, CI enhancements, and miscellaneous changes to improve code quality and usability.
Core User APIs
A major change in this release is that most Burn types, such as modules, optimizers, and tensors, no longer implement the Sync trait. This change should not impact users of the Learner struct for model training. However, it may affect those who implemented their own training loop or inference server. While modules, optimizers and tensors can be sent to other threads, they cannot be accessed concurrently by multiple threads. This aligns with Burn's workflow, where each tensor operation requires an owned version of the tensor. The change was made to safely reduce the number of locks needed when modifying the state of the autodiff graph, fusion state, allocation cache, and various other use cases. While not all locks have been removed, the type signature no longer poses a problem for follow-up optimizations. Note that the same tensor can still be sent to multiple threads without copying the underlying data; however, each thread must receive its own clone of the tensor. (#1575) @nathanielsimard
Tensor
- Support signed value for Tensor::arange (#1238) @Nikaidou-Shinku
- Add Tensor::unsqueeze_dims op (#1236) @skewballfox
- Add support for Any, All operations to Tensor (#1342) @ashdtu
- Add not_equal and not_equal_elem tensor ops (#1374) @laggui
- Element-wise min/max between a pair of tensors (#1385) @boondocklabs
- Add is_close and all_close tensor operators (#1389) @antimora
- Interpolate tensor operation (Inference Only) (#1246) @Nikaidou-Shinku @antimora @ashdtu
- Autodiff/training support for Nearest Interpolation (#1414) @Nikaidou-Shinku @ashdtu @antimora
- Add argwhere and nonzero boolean tensor ops (#1394) @laggui
- Add bool() op for numerical tensor (#1402) @antimora
- Tensor permute operator (#1410) @antimora
- Add sign tensor operator (#1446) @antimora
- Rename diagonal to eye tensor op and add missing entry for diagonal to Book tensor section (#1449) @antimora
- Add prod and prod_dim tensor ops (#1460) @antimora
- Add tril_mask, triu_mask and diag_mask ops (#1479) @antimora
- Add flip tensor operator (#1468) @carrotflakes
- Add tensor sorting operations (#1488) (#1494) @laggui
- Add topk tensor operation (#1497) @laggui
- Tensor expand operator (#1508) @antimora
- Provide Tensor Padding Helpers (#960) (#1097) @jcmullwh @antimora
- Move log_sigmoid to activation ops (#1558) @laggui
- Add repeat autodiff and fusion support (#1600) @louisfd
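The is_close and all_close operators (#1389) compare floating-point values within a tolerance. The sketch assumes the NumPy/PyTorch-style rule |a - b| <= atol + rtol * |b|, applied to plain slices. The rule, the NaN and infinity handling, and the function signatures are assumptions here rather than a description of Burn's implementation.

```rust
/// Element-wise closeness test in the NumPy/PyTorch style:
/// |a - b| <= atol + rtol * |b|.
/// Exact equality is checked first so equal infinities count as close;
/// NaN is never close to anything, including NaN.
pub fn is_close(a: &[f64], b: &[f64], rtol: f64, atol: f64) -> Vec<bool> {
    assert_eq!(a.len(), b.len(), "shapes must match");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x == y || (x - y).abs() <= atol + rtol * y.abs())
        .collect()
}

/// all_close reduces is_close with a logical AND.
pub fn all_close(a: &[f64], b: &[f64], rtol: f64, atol: f64) -> bool {
    is_close(a, b, rtol, atol).into_iter().all(|c| c)
}
```

The rule is asymmetric: the relative tolerance scales with |b| only, so swapping the arguments can change the answer for values near the boundary.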
Module
- Feature Addition: PRelu Module (#1328) @Arjun31415
- Implement Instance Normalization (#1321) @tushushu
- Add enum module support (#1337) @laggui
- Make the parameters of conv1d and conv2d public. (#1245) @Arjun31415
- Parameters are now lazily initialized, so you no longer need to implement both the init and init_with(record) methods for training/inference. (#1539) @nathanielsimard
- Support multilabel binary cross entropy (#1571)
- Implement Huber loss (#1444) @WorldSEnder
- Feat: Add Leaky Relu Model (#1467) @Arjun31415
- Feat/swiglu (#1507) @ashdtu
- Feat: transformer rotary positional encoding to transformer modules (#1604) @ashdtu
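Huber loss (#1444) is quadratic for small residuals and linear for large ones, which limits the pull of outliers compared with mean squared error. The sketch uses the standard definition with threshold delta and a mean reduction; the function names are hypothetical and are not Burn's loss API.

```rust
/// Huber loss for one residual r = prediction - target:
/// 0.5 * r^2 when |r| <= delta, otherwise delta * (|r| - 0.5 * delta).
/// Both branches equal 0.5 * delta^2 at |r| = delta, so the loss is continuous.
pub fn huber(residual: f64, delta: f64) -> f64 {
    let a = residual.abs();
    if a <= delta {
        0.5 * a * a
    } else {
        delta * (a - 0.5 * delta)
    }
}

/// Mean Huber loss over paired predictions and targets.
pub fn huber_mean(predictions: &[f64], targets: &[f64], delta: f64) -> f64 {
    assert_eq!(predictions.len(), targets.len(), "length mismatch");
    assert!(!predictions.is_empty(), "need at least one element");
    let total: f64 = predictions
        .iter()
        .zip(targets)
        .map(|(p, t)| huber(p - t, delta))
        .sum();
    total / predictions.len() as f64
}
```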
Optimizer
- Add linear learning rate scheduler (#1443) @astral4
- Exponential learning rate scheduler (#1481) @rubenjr0
- Cosine Annealing learning rate scheduler with cold restarts (#1481) @rubenjr0
- Add Rank0 variant to AdaptorRecordV1 and AdaptorRecordItemV1 (#1442) @carrotflakes
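The cosine annealing scheduler listed above decays the learning rate along half a cosine period and then restarts from the maximum. The sketch assumes a fixed restart period and the common formula lr(t) = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(pi * (t mod period) / period)). The function and its parameters are hypothetical and are not Burn's scheduler configuration.

```rust
use std::f64::consts::PI;

/// Cosine annealing with a fixed restart period (hypothetical helper):
/// lr(t) = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(pi * (t mod period) / period)).
/// At t = 0, period, 2 * period, ... the rate jumps back to max_lr.
pub fn cosine_annealing(t: usize, period: usize, max_lr: f64, min_lr: f64) -> f64 {
    assert!(period > 0, "period must be at least 1");
    let phase = (t % period) as f64 / period as f64;
    min_lr + 0.5 * (max_lr - min_lr) * (1.0 + (PI * phase).cos())
}
```

Halfway through a period the rate sits at the midpoint (max_lr + min_lr) / 2, and it approaches min_lr just before each restart.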
Train
- Add multi-label classification dataset and metric (#1572) @laggui
- Add learner training summary (#1591) @laggui
Backend
This release also introduces the backend bridge, a new mechanism for runtime switching between backends.
While an improvement, it remains compatible with previous methods of supporting mixed precision. (#1529) @nathanielsimard
JIT
Significant effort has been devoted over the past few months to refactoring the previous Wgpu backend into a shader-agnostic Just-in-Time backend.
All lower-level dependencies have been abstracted into the Just-in-Time Runtime trait, requiring a compiler, compute server, and storage.
The bulk of this work was carried out by @nathanielsimard and @louisfd.
Commits: #1274 #1280 #1313 #1340 #1356 #1359 #1378 #1391 #1396 #1398 #1417 #1429 #1423 #1424 #1433 #1456 #1474 #1457 #1480 #1472 #1493 #1509 #1530 #1528 #1541 #1550 #1569
Wgpu
- Enable burn-fusion by default. (#1223) @nathanielsimard
- Feature/autotune int ops (#1136) @agelas
- Add runtime options in Wgpu init methods. (#1505) @nathanielsimard
- Decent speedup of transposed convolution @louisfd
Autodiff
Extensive work has also been undertaken on Burn's autodiff backend.
The backend now supports gradient checkpointing to reduce memory usage and has been refactored into a client/server architecture.
These updates result in significantly less blocking when tracking gradients, enhancing performance particularly on smaller models.
Furthermore, various bugs have been fixed where some graph nodes weren't used, potentially truncating the autodiff graph.
Overall, these changes make the autodiff process more reliable and efficient. (#1575) (#1358) @louisfd @nathanielsimard
Candle
Data
- Add an image folder dataset implementation. (#1232) (#1132) @laggui
- Add burn::data::network::downloader. (#1283) @laggui
Import
- [PyTorchRecorder] Allow multiple pattern matches in chain. (#1269) @laggui
- [PyTorchRecorder] Pytorch config extraction (#1323) @antimora
- [PyTorchRecorder] Pass top-level key to extract state_dict (#1300) @antimora
- [PyTorchRecorder] print debug option (#1425) @antimora
- [PyTorchRecorder] Truncate debug display for NestedValue (#1428) @antimora
- [PyTorchRecorder] Support for non-contiguous indexes in PyTorchFileRecorder keys (#1432) @antimora
- [PyTorchRecorder] Add Enum module support (#1436) @antimora
- [ONNX] Parser rewrite (#1296) @skewballfox
Benchmarks
We have implemented a system that enables the comparison of backends across a variety of tasks.
Currently, most of these tasks consist of micro-benchmarks, but we plan to expand the range of benchmarks in the future.
To ensure Burn's portability and performance across different devices, the community can run and upload benchmarks! 🔥
- Created the burnbench CLI. (#1260) @syl20bnr
- Added GitHub authentication to the burnbench CLI. (#1285) @syl20bnr
- Updated GitHub App ID with the official application. (#1397) @syl20bnr
- Implemented benchmark upload functionality to the server. (#1381) @syl20bnr
- Compiled benchmarks in a dedicated target directory. (#1435) @syl20bnr
- Enhanced benchmark result presentation with a neat table and attempted to run every benchmark. (#1464) @akhildevelops
- Improved access token refreshing and displayed authenticated user name. (#1483) @syl20bnr
- Added system information to benchmark results. (#1495) @syl20bnr
- Included Operating System information in benchmark results. (#1531) @syl20bnr
- Fixed automatic fusion activation issue with Wgpu. (#1542) @syl20bnr
- Tweaked and added kinds to Gelu benchmark names. (#1533) @syl20bnr
- Ensured backend names in JSON reports match the burnbench CLI. (#1375) @errordeveloper @syl20bnr
- Added 'all' choice to --benches and --backends options. (#1567) @syl20bnr
- Revamped burnbench output for improved readability and compactness. (#1568) @syl20bnr
- Added URL to browse results on the burn.dev website. (#1573) @syl20bnr
Bug Fix
- Fix the pow backward pass when one of the tensors wasn't tracking the gradients. (#1225) (#1224) @nathanielsimard
- Fix batch norm on the LibTorch backend when the aggregation was on the same device. (#1226) @nathanielsimard
- Fix training dashboard metrics switch on macOS & Linux (#1228) @nathanielsimard
- Fix a bug introduced in (#1138) where arithmetic could fail on usize type. (#1287) @louisfd
- [PyTorchRecorder] Fix out of memory bug (#1270) (#1286) @antimora
- [PyTorchRecorder] Fix chain pattern matching when multiple patterns are provided (#1273) @laggui
- Fix LogEventStore end epoch log (#1314) @laggui
- Huggingface dataset importer: check that pa_type is valid before checking if is_binary (#1354) @laggui
- Fix implicit casting of bool in wgpu backend (#1391) @louisfd
- Fix switched arguments in reshape_args_usize check (#1409) @jackdarlison
- Fix tch view data corruption (#1434) @nathanielsimard
- Missing Debug derive for Group Norm Config (#1482) @Arjun31415
- Numerically stable log_sigmoid (#1548) @laggui
- Fix pytorch recorder adapt_linear when using autodiff backend (#1576) @laggui
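The numerically stable log_sigmoid fix (#1548) targets large negative inputs, where the direct formula overflows. Below is a sketch of one common stable rewriting, log sigmoid(x) = min(x, 0) - ln(1 + e^(-|x|)), next to the naive version. The function names are hypothetical, and whether #1548 uses exactly this expression is an assumption.

```rust
/// Naive version: ln(1 / (1 + e^-x)). For large negative x, e^-x overflows
/// to infinity and the result becomes -inf instead of approximately x.
pub fn log_sigmoid_naive(x: f64) -> f64 {
    (1.0 / (1.0 + (-x).exp())).ln()
}

/// Stable version: log sigmoid(x) = min(x, 0) - ln(1 + e^(-|x|)).
/// e^(-|x|) always lies in (0, 1], so it cannot overflow, and ln_1p keeps
/// precision when that term is tiny.
pub fn log_sigmoid(x: f64) -> f64 {
    x.min(0.0) - (-x.abs()).exp().ln_1p()
}
```

Both forms agree for moderate inputs. At x = -1000 the naive version returns -inf while the stable one returns -1000.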
Infrastructure
The minimum Rust version has been updated to 1.75. (#1297) @syl20bnr
Docs
- Improve the doc feature flags for docs.rs (#1212) @syl20bnr
- Include the backends in the documentation (#1229) @nathanielsimard
- Started the burn developer book. (#1184) @skewballfox @syl20bnr @antimora
- Update TORCH_CUDA_VERSION usage. (#1284) @laggui
- fix(book): add missing device parameter to model.init(). (#1302) @apertureless
- fix(book): add missing second parameter to CrossEntropyLoss constructor (#1301) @apertureless
- docs(book-&-examples):...
v0.12.1
Bugfix
Fix wgpu performance issue: revert to wgpu 0.18.0 #1221 @nathanielsimard
Fix problem with batch norm on LibTorch backend #1226 @nathanielsimard
Fix docs build #1212 #1229 @syl20bnr @nathanielsimard
Fix training dashboard metrics switch #1228 @nathanielsimard
Chores
Put all dependency versions in the workspace #1210 @nathanielsimard
v0.12.0
This release highlights an optimized Wgpu Backend, clearer examples and documentation, and numerous bug fixes.
Notably, breaking changes in device management mandate explicit device specification to prevent potential bugs.
Additionally, the new PyTorch recorder simplifies model porting by enabling automatic import of PyTorch's weights.
We also put a lot of effort into improving our CI infrastructure for enhanced reliability, efficiency, and scalability.
Changes
Tensor & Module API
- Added support for generic modules #1147 @nathanielsimard
- Added support for tuple modules #1186 @varonroy
- Enabled loading PyTorch .pt (weights/states) files directly into a module's record, currently available on Linux & macOS #1085 @antimora
- Added mish and softplus activation functions #1071 @pacowong
- Improved chunk performance in backends #1032 @Kelvinyu1117
- [Breaking] Added the device as an argument for tensor operations that require it, replacing the previous optional device usage #1081 #518 #1110 @kpot
  - Code update involves either using Default::default for the same behavior or specifying the desired device.
- Allowed raw tensors to be serialized/deserialized directly with serde #1041 @jmacglashan
- [Breaking] Forced the choice of the device for deserialization #1160 #1165 @nathanielsimard
- Added element-wise pow operation #1133 @skewballfox
- Refactored the tensor backend API names #1174 @skewballfox
- [Breaking] Changed the default recorder to NamedMpkFileRecorder #1161 #1151 @laggui
  - After a bit of exploration, we removed all compression because it added too much overhead.
Examples & Documentation
- Updated the text-classification example #1044 @nathanielsimard
- Fixed import and type redefinitions in mnist-web-inference #1100 @syl20bnr
- Fixed documentation of Tensor::stack #1105 @PonasKovas
- Fixed some typos in links in the burn-book #1127 @laggui
- Added an example for a custom CSV dataset #1129 #1082 @laggui
- Fixed missing ticks in Burn book and removed unused example dependency #1144 @laggui
- Added a new example for regression problems #1150 #1148 @ashdtu
- Added model saving and loading examples in the book #1164 #1156 @laggui
- Added Rust concept notes and explanations to the Burn Book #1169 #1155 @laggui
- Fixed Jupyter notebook and ONNX IR example #1170 @unrenormalizable
- Added a custom mnist dataset, removing the Python dependency for running the guide and the mnist example #1176 #1157 @laggui
- Updated documentation and book sections on PyTorch import #1180 @antimora
- Updated burn-book with improved tensor documentation #1183 #1103 @ashdtu
- Updated burn-book with a new dataset transforms section #1183 #1154 @ashdtu
- Update CONTRIBUTING.md with code guidelines. #1134 @syl20bnr
- Fixed documentation of Multi Head Attention #1205 @ashdtu
Wgpu Backend
- Optimized the repeat operation with a new kernel #1068 @louisfd
- Improved reduce autotune by adding the stride to the autotune key #1070 @louisfd
- Refactored binary operations to use the new JIT compiler IR #1078 @nathanielsimard
- Added persistent cache for autotune #1087 @syl20bnr
Fusion
- Refactored burn-fusion, making it possible to eventually save the JIT state #1104 @nathanielsimard
- Improved fusion in the Wgpu backend with caching #1069 @nathanielsimard
- Supported fusing int operations with burn-fusion #1093 @nathanielsimard
- Supported automatic vectorization of operations fused with burn-fusion in WGPU #1123 #1111 @nathanielsimard
- Supported automatically executing in-place operations fused with burn-fusion in WGPU #1128 #1124 @nathanielsimard
- Heavily refactored burn-fusion to better reflect the stream optimization process #1135 @nathanielsimard
- Heavily refactored burn-fusion to save all execution plans for any trigger #1143 @nathanielsimard
- Supported multiple concurrent optimization streams #1149 #1117 @nathanielsimard
- Supported overlapping optimization builders #1162 @nathanielsimard
- Supported fusing ones, zeroes, and full operations #1159 @nathanielsimard
- Supported autotuning fused element-wise kernels #1188 #1112 @nathanielsimard
Infra
- Support testing Accelerate (macOS) on the burn-ndarray backend #1050 @dcvz
- Improved CI output by introducing groups #1024 @dcvz
- Updated scheduled CI tasks #1028 @Luni-4
- Added support for Windows Pipeline #925 @Luni-4
- Fixed CI for testing the wgpu backend by pinning versions #1120 @syl20bnr
- Fixed burn-compute build command with no-std #1109 @syl20bnr
- Temporarily disabled unnecessary steps on Windows runners to save CI time #1107 @syl20bnr
- Refactored serialization of backend comparison benchmarks #1131 @syl20bnr
- Fixed doc build on docs.rs #1168 @syl20bnr
- Added cargo xtask commands for dependencies and vulnerabilities checks #1181 #965 @syl20bnr
- Added cargo xtask command to manage books #1192 @syl20bnr
Chore
- Shared some properties across the cargo workspace #1039 @dcvz
- Formatted the codebase with nightly where the stable version falls short #1017 @AlexErrant
- Improved panic messages on the web #1051 @dcvz
- Used web-time in wasm #1060 @sigma-andex
- Refactored some tensor tests #1089 @nathanielsimard
- Made Embedding weights public #1094 @unrenormalizable
- Updated candle version and added support for slice_assign #1095 @louisfd
- Records no longer require Debug and Clone #1137 @nathanielsimard
- Removed cargo warning #1108 @syl20bnr
- Updated wgpu version to 0.19.0 #1166 @nathanielsimard
- Added tests for Slice assign vs Cat in LSTM backward #1146 @louisfd
- Updated xtask publish task #1189 @Luni-4
- Enable dependabot daily #1195 @Luni-4
- Updated Ratatui version #1204 @nathanielsimard
- Updated tch version #1206 @laggui
Bug Fixes
- Fixed a slice issue in the LibTorch backend that could corrupt tensors' data #1064 #1055 @nathanielsimard
- Fixed issues with tensor stack and reshape on ndarray #1053 #1058 @AuruTus
- Fixed multithread progress aggregation in dataloader #1083 #1063 @louisfd
- Resolved a numerical bug with tanh on MacOS with Wgpu #1086 #1090 @louisfd
- Fixed burn-fusion, where only operations followed by a sync were fused #1093 @nathanielsimard
- Removed the requirement for users to add serde as a dependency for Burn #1091 @nathanielsimard
- Fixed transformer prenorm on the residual path #1054 @Philonoist
- Fixed conv2d initialization by supporting fan_out #1138 @laggui
- Resolved the problem of sigmoid gradient generating NaN #1140 #1139 @wcshds
- Fixed FullPrecisionSettings type for integers #1163 @laggui
- Fixed batchnorm not working properly when training on multiple devices #1167 @wcshds
- Fixed powf function in WGPU, with new tests #1193 #1207 @skewballfox @louisfd
- Fixed regex in PyTorch Recorder #1196 @antimora
v0.11.1
Burn v0.11.1 fixes a few bugs in the recent v0.11.0 release.
Bugfixes
Fix concurrency issue in burn-fusion, related to freeing tensors that are never read @nathanielsimard
Fix typos in the book @shanmo
Fix Readme @nathanielsimard
Fix docs build @dcvz
Thanks
Thanks to all aforementioned contributors.
v0.11.0
The main feature of Burn v0.11.0 is automatic kernel fusion, which is still in active development but already usable. Many enhancements and new features have been added throughout the framework for better efficiency and reliability.
Warnings:
- There are some breaking changes, see below.
- The organization has been renamed from burn-rs to tracel-ai.
Changes
Overall changes
- [Breaking] Refactor backend names @nathanielsimard
- [Breaking] Updated the feature flags of burn to improve usability @nathanielsimard
- Update of Burn's Readme @nathanielsimard @louisfd
Burn Fusion
- Innovative automatic kernel fusion algorithm @nathanielsimard
- Relative computation graph cache @nathanielsimard
Burn Core
- GroupNorm module @dcvz
- Allow for int and bool constant tensors in modules @nathanielsimard
- Quiet softmax in transformers @wbrickner
Burn Tensor
- New operators in tensor API: unsqueeze_dim, narrow, stack, chunk, tril, triu @dcvz
- Recip operation support on all backends @gzsombor
- Implement DoubleEndedIterator for DimIter @wcshds
Burn Compute
- Major Autotune refactor @louisfd
Burn Import
- ONNX Support for Gather @CohenAriel
- ONNX Support for Cos, Exp, Gelu, Log, Neg @antimora
- ONNX Support for ConvTranspose2D @npatsakula @antimora
- ONNX Support for Sqrt @edmondop
- Support count_include_pad attr in avg_pool2d ONNX @antimora
Burn Train
- Add warmup consideration for estimated training time @nathanielsimard
Burn WGPU
- New Matmul kernels @louisfd
- New Reduce kernel @louisfd
- Add Autotune capabilities to Matmul and Reduce @louisfd
- Support of kernel fusion for element-wise operations @nathanielsimard @louisfd
Burn Candle
Backend Comparison
- Custom Gelu benchmarks @nathanielsimard
- Persistence of results in JSON @louisfd
Bugfixes
- Allow arbitrary precision threshold for float equality assertion @meteor-lsw
- Update serde_rusqlite to the new version with MIT/Apache2 license @antimora
- Fix SQLite database tests on Windows @syl20bnr
- Fix max_dim and min_dim tensor operations @gzsombor
- Fix inplace double binary broadcasting in the LibTorch backend @nathanielsimard
Documentation
- Add Python details in the Book's getting started @antimora
Continuous Integration
- Add test coverage @Luni-4
- Speedup typos check @Luni-4
- Dependency checks @Luni-4
- Vulnerability checks @Luni-4
Thanks
Thanks to all aforementioned contributors.