
v1.2.23

@masonremy masonremy released this 02 Mar 22:04
· 7 commits to main since this release

What's Changed

  • Merged PR 3131: Set masked load/store inbounds flag to true. [Mason
    Remy]

    Set masked load/store inbounds flag to true

    The mask we generate, as well as the rest of our infrastructure,
    prevents out-of-bounds accesses when used properly. Therefore, for
    performance reasons, we don't want MLIR to generate runtime bounds
    checks.

  • Merged PR 3130: Recognize and simplify always true EQ and NE CmpOps.
    [Mason Remy]

    Recognize and simplify always true EQ and NE CmpOps

    These would already get simplified after conversion to the builtin
    dialects, but this change makes the simplification happen earlier in
    the lowering.
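
    A minimal sketch of the idea (illustrative Python, not Accera's actual
    rewrite code; the function name and operand encoding are hypothetical):
    fold an EQ or NE comparison whose operands are identical, since its
    result is statically known.

```python
# Hypothetical sketch of folding statically-known integer comparisons.
# Operands are represented as SSA value names; identical names mean the
# comparison result is known without evaluating anything at runtime.

def simplify_cmp(op, lhs, rhs):
    """Fold an EQ/NE comparison whose outcome is statically known."""
    if lhs == rhs:            # same operand on both sides
        if op == "eq":
            return True       # x == x is always true
        if op == "ne":
            return False      # x != x is always false
    return None               # not statically known; leave the op in place

print(simplify_cmp("eq", "v0", "v0"))  # True
print(simplify_cmp("ne", "v0", "v0"))  # False
```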

  • Merged PR 3129: Optimize 1-row horizontal i16->i32 sum reduction.
    [Mason Remy]

    Optimize 1-row horizontal i16->i32 sum reduction
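
    For context, a NumPy illustration (sample values are assumptions, not
    Accera code) of why this reduction widens i16 to i32 before summing:
    the per-row total can exceed the int16 range even when every element
    fits in it.

```python
# Sum one row of 16 int16 values into a single int32 accumulator.
# Widening before the reduction prevents intermediate-sum overflow.
import numpy as np

row = np.full(16, 30000, dtype=np.int16)          # 16 x i16 input row
total = row.astype(np.int32).sum(dtype=np.int32)  # widen, then reduce
print(total)  # 480000 -- far beyond int16's maximum of 32767
```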

  • Merged PR 3118: vectorize accumulation of results of two masked load
    ops. [JUBI TANEJA]

    This PR vectorizes a pattern that occurs in MMIF: two conditional loads, followed by an accumulation operation and a conditional store. Vectorizing the following DSL:

            N_input = 8
            N_output = 5
            Input = Array(role=Role.INPUT, element_type=ScalarType.int32, shape=(N_input, ))
            Output = Array(role=Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(N_output, ))
            nest = Nest(shape=(N_input, ))
            i, = nest.get_indices()
    
            @nest.iteration_logic
            def _nest():
    
                def store_value():
                    Output[i] += Input[i]
    
                _If(i < N_output, store_value)
    

    It produces the assembly below. The vpmaskmovd instructions correspond to the vector.transfer_read/vector.transfer_write ops in MLIR.

    0000000000000030 <test_vectorized_masked_accumulate_3e5de44f3dcca64e>:
      30:   c5 fd 6f 05 00 00 00    vmovdqa 0x0(%rip),%ymm0        # 38 <test_vectorized_masked_accumulate_3e5de44f3dcca64e+0x8>
      37:   00
      38:   c4 e2 7d 8c 0e          vpmaskmovd (%rsi),%ymm0,%ymm1
      3d:   c4 e2 7d 8c 17          vpmaskmovd (%rdi),%ymm0,%ymm2
      42:   c5 ed fe c9             vpaddd %ymm1,%ymm2,%ymm1
      46:   c4 e2 7d 8e 0e          vpmaskmovd %ymm1,%ymm0,(%rsi)
      4b:   c5 f8 77                vzeroupper
      4e:   c3                      retq
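
    A scalar NumPy emulation of the vectorized pattern above (illustrative
    only; the input values are made up): lanes where the mask is off are
    neither read nor written, matching the vpmaskmovd semantics.

```python
# Emulate masked load -> add -> masked store for the DSL example above:
# for i in range(8): if i < 5: Output[i] += Input[i]
import numpy as np

N_input, N_output = 8, 5
Input = np.arange(1, N_input + 1, dtype=np.int32)   # [1..8]
Output = np.zeros(N_output, dtype=np.int32)

mask = np.arange(N_input) < N_output      # [T T T T T F F F]
lanes = np.zeros(N_input, dtype=np.int32)
lanes[:N_output] = Output                 # masked load of Output
lanes[mask] += Input[mask]                # masked load of Input + add
Output[:] = lanes[:N_output]              # masked store back
print(Output)  # [1 2 3 4 5]
```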
    
  • Merged PR 3126: [test] Adds more tests for vectorized transpose. [Kern
    Handa]

    [test] Adds more tests for vectorized transpose

  • Merged PR 3121: [nfc] Separate bounds checking into separate pass
    file. [Mason Remy]

    [nfc] Separate bounds checking into separate pass file

    This removes the bounds checking code from
    ExecutionPlanToAffineLoweringPass and creates a separate pass file for
    it. There is no change in when and where the checking occurs (currently
    it only happens for caching-generated loads and stores).

    In a future change we will further separate the pass, run it at a
    different phase of the lowering, and plumb controls for
    enabling/disabling it through to the DSL.

  • Merged PR 3122: Fix reinterpret_cast output memref shape. [Mason Remy]

    Fix reinterpret_cast output memref shape

  • Merged PR 3115: Normalize AffineForOps to have unit stride and begin
    at 0. [Mason Remy]

    Normalize AffineForOps to have unit stride and begin at 0
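
    The transformation can be sketched in plain Python (illustrative, not
    Accera's pass; the helper name is made up): a loop over
    range(lo, hi, step) becomes a unit-stride loop starting at 0, with the
    original induction variable recovered as lo + j * step in the body.

```python
# Normalize `for i = lo to hi step s` into `for j = 0 to n step 1`.

def normalized_trip_count(lo, hi, step):
    """Iteration count of range(lo, hi, step) for step > 0."""
    return max(0, (hi - lo + step - 1) // step)

lo, hi, step = 3, 17, 4
orig = list(range(lo, hi, step))                   # original loop values
norm = [lo + j * step                              # recovered inside body
        for j in range(normalized_trip_count(lo, hi, step))]
print(orig == norm)  # True
```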

  • Merged PR 3117: Vectorize horizontal multi-dim sum reductions. [Mason
    Remy]

    Vectorize horizontal multi-dim sum reductions

    Recognizes and vectorizes these sum reductions:
    4x16xi16 -> 4x1xi32
    4x8xi32 -> 4x1xi32
    4x8xf32 -> 4x1xf32
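
    A NumPy illustration of the first shape listed above (4x16xi16 ->
    4x1xi32; sample values are assumptions, not Accera code): each row is
    reduced along the last axis, widening to i32.

```python
# Row-wise sum reduction with dtype widening: 4x16xi16 -> 4x1xi32.
import numpy as np

a = np.full((4, 16), 1000, dtype=np.int16)        # 4x16xi16 input
r = a.sum(axis=1, keepdims=True, dtype=np.int32)  # -> 4x1xi32
print(r.shape, r.dtype, int(r[0, 0]))  # (4, 1) int32 16000
```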

  • Merged PR 3099: Adds pattern rewriting for AVX2 vectorized transpose.
    [Kern Handa]

Full Changelog: v1.2.22...v1.2.23