v1.2.12
What's Changed
-
Merged PR 2953: Workaround debug mode failures with dimension argument ordering. [Lisa Ong]
- Order dimension arguments after Array args to avoid this lowering issue in Debug mode (until Debug mode is fixed):
test_all_dynamic_sizes_static_unroll_matmul_llvm.mlir:236:28: error: use of value '%7' expects different type than prior uses: 'i64' vs '!llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>'
    %42 = llvm.insertvalue %7, %41[3, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
                           ^
/Users/lisaong/work/staging/Accera/build/lib.macosx-11.1-arm64-3.10/test_acccgen/test_all_dynamic_sizes_static_unroll_matmul/_tmp/test_all_dynamic_sizes_static_unroll_matmul/test_all_dynamic_sizes_static_unroll_matmul_llvm.mlir:201:5: note: prior use here
    %7 = llvm.insertvalue %arg6, %6[4, 1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
    ^
- Enable DEV_MODE tests in one CI pipeline so that we can catch these in the future
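The reordering described in this entry can be pictured as a stable partition of the function arguments: array arguments keep their relative order and come first, dimension arguments keep their relative order and come last. The sketch below is illustrative only; the `Arg` class, its `kind` field, and `order_dimensions_last` are hypothetical names and are not part of the Accera API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arg:
    """Hypothetical stand-in for a function argument (not an Accera type)."""
    name: str
    kind: str  # "array" or "dimension" (assumed labels for this sketch)


def order_dimensions_last(args):
    """Stable partition: non-dimension args first, dimension args last."""
    arrays = [a for a in args if a.kind != "dimension"]
    dims = [a for a in args if a.kind == "dimension"]
    return arrays + dims


mixed = [
    Arg("M", "dimension"),
    Arg("A", "array"),
    Arg("N", "dimension"),
    Arg("B", "array"),
]
ordered = order_dimensions_last(mixed)
print([a.name for a in ordered])
```

Because the partition is stable, callers only need to learn one rule (arrays, then dimensions) rather than a new order per function.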
-
Merged PR 2950: [Release] Rev docs to v1.2.12. [Lisa Ong]
In preparation for 1.2.12 release EOW
-
Merged PR 2946: Fix hierarchical partial fusing. [Mason Remy]
Fix hierarchical partial fusing
Index attributes in fragment predicate ops weren't being updated when
fusion mapped old indices to new fused indices. This fix is a
quick change that recursively walks predicates and updates their index
attributes manually.
In the future we could use SymbolicIndexOps and rely on
BlockAndValueMapping replacements in clone. However, this would also
require that we don't create as many duplicate SymbolicIndexOps for the
same Index.
-
Merged PR 2942: Hold onto intermediate split indices when fusing. [Mason Remy]
Hold onto intermediate split indices when fusing
When we split a loop multiple times, the outer index references the
inner intermediate split indices in affine expressions, even if those
indices get further split and are no longer loop indices. We have been
discarding them because they aren't loop indices or dimension indices,
but they wound up getting re-added to the transformed domain by
serialization, and this led to fusion bugs.
-
Merged PR 2834: match and rewrite a pattern to vectorize int16 matmul. [JUBI TANEJA]
This rewrite rule matches the jj and kk loops in int16 matmul, where an outer loop jj {0..8} is followed by an inner loop kk {0..2}. It vectorizes the jj and kk loops and replaces each affine op with a vectorized op. At the end, it generates the vpmaddwd instruction for MatMul.
-
Merged PR 2918: Support vectorization and static size caching for split dynamic range. [Mason Remy]
Support vectorization and static size caching for split dynamic range loops
-
Merged PR 2914: Support static loop splits of dynamic sized ranges. [Mason Remy]
Support static loop splits of dynamic sized ranges
This change creates a specialization of the AffineConstraintsHelper that
works with Loopnest concepts and uses that in LoopNestBuilder to update
the loop split generation.
-
Merged PR 2911: Support dynamic ranges in ScheduledLoopOp. [Mason Remy]
Support dynamic ranges in ScheduledLoopOp
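For readers new to the idea behind this group of dynamic-range entries, one common way to split a loop whose extent is only known at runtime by a compile-time constant is to run a main region of full-size blocks followed by a cleanup loop for the remainder. The sketch below illustrates that general technique under that assumption; it does not show how Accera's LoopNestBuilder or AffineConstraintsHelper are implemented, and `iterate_split` is a hypothetical helper.

```python
def iterate_split(n, split_size):
    """Visit 0..n-1 using a static inner split of a runtime-sized range.

    n          -- range extent, known only at runtime
    split_size -- split factor, assumed known at compile time
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")

    visited = []
    # Largest multiple of split_size that fits in n: the "main" region.
    main_end = (n // split_size) * split_size

    # Main region: the inner loop always runs exactly split_size times,
    # so it has a static trip count and could be unrolled or vectorized.
    for outer in range(0, main_end, split_size):
        for inner in range(split_size):
            visited.append(outer + inner)

    # Cleanup region: fewer than split_size iterations remain.
    for i in range(main_end, n):
        visited.append(i)

    return visited


print(iterate_split(10, 4))
```

The main region is where static-size optimizations apply; the cleanup loop handles whatever the runtime size leaves over.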
-
Merged PR 2907: Implement initial affine constraint helper for dynamic size loop. [Mason Remy]
Implement initial affine constraint helper for dynamic size loop handling
Implements a wrapper around mlir::FlatAffineValueConstraints and a set
of low-level tests using it that enable static-sized splitting of
dynamic loop ranges.
-
Merged PR 2935: Remove thread coarsening factor > 4 from GPU benchmarks. [Captain Jack Sparrow]
Remove thread coarsening factor > 4 from GPU benchmarks
-
Merged PR 2932: Upgrade to CUDA 11.8. [Captain Jack Sparrow]
Upgrade to CUDA 11.8
-
Merged PR 2931: Update to ROCm 5.3. [Captain Jack Sparrow]
Update to ROCm 5.3
-
Merged PR 2926: Plumb parameter usages to emitted HAT files. [Lisa Ong]
-
Merged PR 2927: Reduce benchmark configs using thread coarsening. [Captain Jack Sparrow]
Reduce benchmark configs using thread coarsening
-
Merged PR 2925: Add optional optimization hint for number of thread blocks per SM. [Captain Jack Sparrow]
Add optional optimization hint for number of thread blocks per SM
Related work items: #3736