v1.2.23
What's Changed
- Merged PR 3131: Set masked load/store inbounds flag to true. [Mason Remy]

  The mask we generate, as well as the rest of our infrastructure, prevents out-of-bounds accesses when used properly. Therefore, for performance reasons, we don't want MLIR to generate runtime bounds checking.
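  Why the mask makes runtime bounds checks redundant can be sketched in plain Python (NumPy used for illustration; `masked_load` is a hypothetical scalar model, not part of the codebase):

  ```python
  import numpy as np

  def masked_load(buf, mask):
      # Scalar model of a masked vector load: a lane is only dereferenced
      # when its mask bit is set, so a mask derived from the valid length
      # can never read out of bounds -- no separate runtime check needed.
      out = np.zeros(len(mask), dtype=buf.dtype)
      for lane in range(len(mask)):
          if mask[lane]:
              out[lane] = buf[lane]  # only in-bounds lanes are touched
      return out

  data = np.arange(5, dtype=np.int32)   # only 5 valid elements
  mask = np.arange(8) < len(data)       # 8-lane vector, 5 active lanes
  print(masked_load(data, mask))        # inactive lanes stay zero
  ```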
- Merged PR 3130: Recognize and simplify always true EQ and NE CmpOps. [Mason Remy]

  These would already get simplified after converting to the builtin dialects, but this change makes the simplification happen earlier in the lowering.
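  The idea behind the rewrite can be sketched in Python (a simplified model of the fold, not the actual MLIR pattern; `simplify_cmp` is hypothetical):

  ```python
  def simplify_cmp(pred, lhs, rhs):
      # Fold an integer comparison whose result is statically known: when
      # both operands are the same value, "eq" is always true and "ne" is
      # always false, so the CmpOp can be replaced by a constant before
      # later lowering stages run.
      if lhs is rhs:
          if pred == "eq":
              return True
          if pred == "ne":
              return False
      return None  # result not statically known; keep the CmpOp

  x = object()  # stands in for an SSA value
  print(simplify_cmp("eq", x, x))         # always true -> fold to constant
  print(simplify_cmp("ne", x, x))         # always false -> fold to constant
  print(simplify_cmp("eq", x, object()))  # unknown -> cannot fold
  ```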
- Merged PR 3129: Optimize 1-row horizontal i16->i32 sum reduction. [Mason Remy]
- Merged PR 3118: Vectorize accumulation of results of two masked load ops. [JUBI TANEJA]

  This PR vectorizes a pattern that occurs in MMIF where there are two conditional loads, followed by an accumulation operation, and a conditional store. It vectorizes the following DSL:

  ```python
  N_input = 8
  N_output = 5
  Input = Array(role=Role.INPUT, element_type=ScalarType.int32, shape=(N_input,))
  Output = Array(role=Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(N_output,))
  nest = Nest(shape=(N_input,))
  i, = nest.get_indices()

  @nest.iteration_logic
  def _nest():
      def store_value():
          Output[i] += Input[i]

      _If(i < N_output, store_value)
  ```

  This produces the following assembly. We are looking for `vpmaskmovd` instructions, which correspond to vector.transfer_read/vector.transfer_write ops in MLIR:

  ```
  0000000000000030 <test_vectorized_masked_accumulate_3e5de44f3dcca64e>:
    30: c5 fd 6f 05 00 00 00  vmovdqa 0x0(%rip),%ymm0  # 38 <test_vectorized_masked_accumulate_3e5de44f3dcca64e+0x8>
    37: 00
    38: c4 e2 7d 8c 0e        vpmaskmovd (%rsi),%ymm0,%ymm1
    3d: c4 e2 7d 8c 17        vpmaskmovd (%rdi),%ymm0,%ymm2
    42: c5 ed fe c9           vpaddd %ymm1,%ymm2,%ymm1
    46: c4 e2 7d 8e 0e        vpmaskmovd %ymm1,%ymm0,(%rsi)
    4b: c5 f8 77              vzeroupper
    4e: c3                    retq
  ```
- Merged PR 3126: [test] Adds more tests for vectorized transpose. [Kern Handa]
- Merged PR 3121: [nfc] Separate bounds checking into separate pass file. [Mason Remy]

  This removes the bounds checking code from ExecutionPlanToAffineLoweringPass and creates a separate pass file for it. There is no change in when and where the checking occurs (currently it only happens for caching-generated loads and stores). In a future change we will further separate the pass, run it at a different phase of the lowering, and plumb controls for enabling/disabling it through to the DSL.
- Merged PR 3122: Fix reinterpret_cast output memref shape. [Mason Remy]
- Merged PR 3115: Normalize AffineForOps to have unit stride and begin at 0. [Mason Remy]
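  A loop with arbitrary begin and stride is equivalent to a unit-stride loop starting at 0 via an affine remapping of the induction variable; a minimal Python sketch of that normalization (assuming a positive step):

  ```python
  def normalized(begin, end, step):
      # Rewrite "for i in [begin, end) step 'step'" as a unit-stride loop
      # from 0: run trip_count iterations j = 0, 1, ... and recover the
      # original induction variable as i = begin + j * step.
      trip_count = (end - begin + step - 1) // step  # ceil division, step > 0
      for j in range(trip_count):
          yield begin + j * step

  # The normalized form visits exactly the same indices as the original loop.
  print(list(normalized(3, 17, 4)))  # [3, 7, 11, 15]
  ```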
- Merged PR 3117: Vectorize horizontal multi-dim sum reductions. [Mason Remy]

  Recognizes and vectorizes these sum reductions:
  - 4x16xi16 -> 4x1xi32
  - 4x8xi32 -> 4x1xi32
  - 4x8xf32 -> 4x1xf32
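  The shapes above describe per-row (horizontal) sums, with a widened i32 accumulator for the i16 case; in NumPy terms (illustrative only, not the generated code):

  ```python
  import numpy as np

  # A 4x16 i16 block reduced along each row into a 4x1 i32 column,
  # matching the 4x16xi16 -> 4x1xi32 case above. Accumulating in i32
  # matters because a row sum can exceed the i16 range.
  a = np.full((4, 16), 3000, dtype=np.int16)
  row_sums = a.sum(axis=1, dtype=np.int32, keepdims=True)  # shape (4, 1)
  print(row_sums.ravel())  # each row sums to 48000, which overflows i16
  ```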
- Merged PR 3099: Adds pattern rewriting for AVX2 vectorized transpose. [Kern Handa]
Full Changelog: v1.2.22...v1.2.23