Releases: microsoft/Accera

v1.2.19

03 Feb 07:34

What's Changed

  • Merged PR 3069: Set target device features on module and check when
    matching avx2/512 ops. [Mason Remy]

    Set target device features on module and check when matching avx2/512 ops

  • Merged PR 3060: Adds support for sqrt op in acc-translate. [Kern
    Handa]

Full Changelog: v1.2.18...v1.2.19

v1.2.18

26 Jan 00:22

What's Changed

  • Merged PR 3055: Move value unrolling to after function inlining and
    loop simplification. [Mason Remy]

    Move value unrolling to after function inlining and loop simplification

    This enables dynamically-sized inner functions that get inlined into
    statically-sized regions to have loop unrolling affect their
    actually-statically-sized loops when possible

  • Merged PR 3053: Add package.build flags for building with higher-
    precision FP vector ops. [Mason Remy]

    Add package.build flags for building with higher-precision FP vector ops

    Setting this new flag prevents a vmulps -> vaddps sequence
    from being contracted into a vfmaddps
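
To illustrate why contracting a multiply and an add into a fused multiply-add can change results, the sketch below compares the two sequences in plain Python. It emulates the single rounding of an FMA in double precision using exact rational arithmetic; it is not Accera code, and the helper names and operands are assumptions chosen for illustration.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate an FMA: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding to the nearest double


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Unfused sequence: the product is rounded before the add."""
    product = a * b      # first rounding
    return product + c   # second rounding


# Hypothetical operands whose exact product is 1 - 2**-60.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = multiply_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused, fused)
```

With these operands the stored product rounds to 1.0, so the unfused sequence yields 0.0, while the fused form keeps the -2**-60 term. Code that must match a non-fused reference bit for bit therefore needs contraction disabled.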

  • Merged PR 3052: Place heap allocations at the top level of the
    function. [Mason Remy]

    Place heap allocations at the top level of the function

  • Merged PR 3050: [non-func, API] Change Nest.get_shape() to always
    return a list. [Captain Jack Sparrow]

    Change Nest.get_shape() to always return a list

  • Merged PR 3030: Include acc-translate whenever accera is installed.
    [Lisa Ong]

    Perhaps a longer-term fix is to merge the accera-gpu package into accera-compilers so we have one less package to maintain.

    However, that adds constraints to the binary size of acc-opt (to not push us past the 100MB PyPI hard limit), so punting until we have cycles for this.

  • Merged PR 3035: [nfc] Adds my machine to targets.py. [Kern Handa]

Full Changelog: v1.2.17...v1.2.18

v1.2.17

18 Jan 01:02

What's Changed

  • Merged PR 3029: Work around constraint resolution issues with dynamic
    split size 1. [Mason Remy]

    Work around constraint resolution issues with dynamic split size 1

Full Changelog: v1.2.16...v1.2.17

v1.2.16

16 Jan 02:05

What's Changed

  • Merged PR 3027: Hack required to use Array as output element argument
    (Dimension) [Captain Jack Sparrow]

  • Merged PR 3025: Add arg name and size string required for hat
    metadata. [Captain Jack Sparrow]

    Add arg name and size string required for hat metadata

  • Merged PR 3017: Output array supports gather function. [Denny Sun]

    Add the dsl test for gather function.

Full Changelog: v1.2.15...v1.2.16

v1.2.15

12 Jan 03:22

What's Changed

  • Merged PR 3018: Use VS 17.4.3-built binaries. This is in a separate
    channel to allow older ve... [Mason Remy]

    Use VS 17.4.3-built binaries. This is in a separate channel to allow older versions to keep working

  • Merged PR 3012: Correctness check for output array support for range
    node. [Denny Sun]

    Successful correctness check means output array support can work end to end.

  • Merged PR 3015: Update hatlib version to support floating type as
    function arg. [Denny Sun]

    Update hatlib version to support floating type as function arg

  • Merged PR 3010: Disable BinOp simplification for floating types.
    [Captain Jack Sparrow]

    Disable BinOp simplification for floating types

  • Merged PR 3013: Apply major version in docs. [Lisa Ong]

    Removes the need to update docs versions every time we release

  • Merged PR 2981: Prologue and Epilogue op support with tensorization
    and caching. [Captain Jack Sparrow]

    • Add optional prologue and epilogue support for tensorization
    • Supported gemm parameters with fragment ops are: {alpha: 1, beta: any} and {alpha: >1, beta: 0}
    • ReLU, SET, and SCALE added as standard fragment ops

    Related work items: #3704
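
As a reference for the gemm parameters mentioned above, here is a minimal pure-Python sketch of a GEMM followed by an elementwise epilogue. It assumes the common convention out = epilogue(alpha * (A @ B) + beta * C); the function names are hypothetical, and how Accera's fragment ops map onto this formula is not specified in these notes.

```python
from typing import Callable, List

Matrix = List[List[float]]


def relu(x: float) -> float:
    """Elementwise ReLU, used here as an example epilogue op."""
    return x if x > 0.0 else 0.0


def gemm_with_epilogue(
    a: Matrix,
    b: Matrix,
    c: Matrix,
    alpha: float = 1.0,
    beta: float = 0.0,
    epilogue: Callable[[float], float] = lambda x: x,
) -> Matrix:
    """Reference for out = epilogue(alpha * (A @ B) + beta * C)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = sum(a[i][k] * b[k][j] for k in range(inner))
            out[i][j] = epilogue(alpha * acc + beta * c[i][j])
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B == A
c = [[0.5, 0.5], [0.5, 0.5]]
result = gemm_with_epilogue(a, b, c, alpha=1.0, beta=2.0, epilogue=relu)
print(result)
```

The example uses alpha = 1 with a non-zero beta, which is one of the two parameter combinations listed as supported.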

Full Changelog: v1.2.14...v1.2.15

v1.2.14

15 Dec 09:29

What's Changed

  • Merged PR 3001: [test] Expect failures on macos for x86 intrinsics
    tests. [Lisa Ong]

    macOS does not support x86 and x86 AVX intrinsics

  • Merged PR 3000: Expect failures for macos in vpmaddwd tests. [Lisa
    Ong]

  • Merged PR 2994: Bump hatlib to 0.0.32. [Lisa Ong]

  • Merged PR 2997: Support more casting cases in vpmaddwd matcher. [Mason
    Remy]

    Support more casting cases in vpmaddwd matcher

  • Merged PR 2996: [release] bump docs to 1.2.14 for next release. [Lisa
    Ong]

    bump docs to 1.2.14 for next release

Full Changelog: v1.2.13...v1.2.14

v1.2.13

14 Dec 10:10

What's Changed

  • Merged PR 2987: Add support for max/min/round ops and vectorizing
    those ops. [Mason Remy]

    Add support for max/min/round ops and vectorizing those ops

  • Merged PR 2963: Control TEMP array allocation location. [Mason Remy]

    Control TEMP array allocation location

  • Merged PR 2962: Expand vpmaddwd matching and add intrinsic call.
    [Mason Remy]

    Expand vpmaddwd matching and add intrinsic call

    Matches more vpmaddwd cases and creates a pathway to invoking the LLVM
    intrinsic directly.

  • Merged PR 2961: Match more vectorization patterns and support
    vectorized cast. [Mason Remy]

    Match more vectorization patterns and support vectorized cast

    Tries to match and rewrite vectorization patterns:

    • 2-loop interleaving store -> vector shuffle and store
    • simple horizontal reductions (not always efficient currently)
    • vectorized casts

    Makes vectorization of non-innermost loops perform a per-op "in-place"
    unroll and vectorize the innermost loop.
    TODO: update documentation to describe this behavior better
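
The first pattern in the list above can be pictured with a small pure-Python sketch: a two-loop interleaving store, and the same result expressed as one shuffle of whole rows followed by a single store. This only illustrates the data movement and is not the compiler's rewrite; the function names are hypothetical.

```python
from typing import List


def interleave_scalar(rows: List[List[int]]) -> List[int]:
    """Two nested loops that store element j of row r at out[j * R + r]."""
    num_rows, width = len(rows), len(rows[0])
    out = [0] * (num_rows * width)
    for j in range(width):
        for r in range(num_rows):
            out[j * num_rows + r] = rows[r][j]
    return out


def interleave_shuffle(rows: List[List[int]]) -> List[int]:
    """Same result as a shuffle of whole rows followed by one store."""
    return [value for lane in zip(*rows) for value in lane]


rows = [[0, 1, 2, 3], [10, 11, 12, 13]]
scalar = interleave_scalar(rows)
shuffled = interleave_shuffle(rows)
print(scalar)
```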

  • Merged PR 2960: Enable marking functions as no-inline-into. [Mason
    Remy]

    Enable marking functions as no-inline-into

    Functions marked no-inline-into
    won't inline calls to other functions within their body. This
    is a useful compiler performance (not emitted code performance)
    optimization when we have many nested function calls.

  • Merged PR 2986: [output array] Emit range function with input_output
    type arguments. [Denny Sun]

    Instead of the output type, we use input_output to generate two functions for the Range function.
    With this change, Accera can successfully generate code for the Range function.

    # Generate functions like:
    # get_size(float start, float limit, float delta, int64_t* output_dim);
    # get_array(int64_t input_dim, float* output, float start, float delta);
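
The two-function pattern above (query the size first, then fill a caller-allocated buffer) can be sketched in Python as follows. These are illustrative analogues of the C signatures in the comment, not the emitted code; the element-count formula is an assumption that follows the ONNX Range operator definition.

```python
import math
from typing import List


def get_size(start: float, limit: float, delta: float) -> int:
    """Number of elements in the half-open range [start, limit) with step delta."""
    if delta == 0:
        raise ValueError("delta must be non-zero")
    return max(math.ceil((limit - start) / delta), 0)


def get_array(size: int, start: float, delta: float) -> List[float]:
    """Fill an output of the precomputed size with start + i * delta."""
    return [start + i * delta for i in range(size)]


n = get_size(0.0, 5.0, 2.0)        # caller learns how much to allocate
values = get_array(n, 0.0, 2.0)    # then fills the buffer of that size
print(n, values)
```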
    
  • Merged PR 2959: Improved affine for op range simplification. [Mason
    Remy]

    Improved affine for op range simplification

    Add range value / constant-cmp-result patterns and affine for op range
    simplifications to the affine simplification pass and run it after
    inlining functions.
    When inlining a dynamically-sized function into a statically-sized
    function, this change is useful for resolving the dynamic ranges to
    constants and pruning dynamic-range loops that are not needed given the
    specific constant value being used.

  • Merged PR 2958: Hack to erase loops in a nest to support nest-of-nest
    or overfused. [Mason Remy]

    Hack to erase loops in a nest to support nest-of-nest or overfused
    scenarios

    This change enables an action plan to erase loops. Typically this would
    be used when an outer nest traverses tiles and invokes an inner nest (or
    multiple nests) which operate within each tile. The outer nest still
    needs to cover the full iteration space, however after splitting by the
    tile sizes a user will not want the outer nest to perform the inner
    loops

  • Merged PR 2985: [release] Rev docs to 1.2.13. [Lisa Ong]

  • Merged PR 2983: Increase timeouts of GPU benchmarks. [Captain Jack
    Sparrow]

    Increase timeouts of GPU benchmarks

  • Merged PR 2982: Work around bug with redundant splits of dynamic
    dimensions. [Mason Remy]

    Work around bug with redundant splits of dynamic dimensions

  • Merged PR 2972: Build both static and dynamic binaries by default, put
    both in aux dependencies. [Kern Handa]

  • Merged PR 2975: Updates llc/opt build flags to enable more
    optimizations by default. [Kern Handa]

    Updates llc/opt build flags to enable more optimizations by default

  • Merged PR 2977: Updates CMake to do FindPython before pybind11 config.
    [Kern Handa]

    Updates CMake to do FindPython before pybind11 config

  • Merged PR 2955: Reduce Linux PR runtime to under 60mins. [Lisa Ong]

    Filter DEV_MODE reruns to dsl_tests.py; this is not comprehensive and is a best effort.

Full Changelog: v1.2.12...v1.2.13

v1.2.12

21 Nov 03:02

What's Changed

  • Merged PR 2953: Workaround debug mode failures with dimension argument
    ordering. [Lisa Ong]

    • Order dimension arguments after Array args to avoid this lowering issue in Debug mode (until Debug mode is fixed)
    test_all_dynamic_sizes_static_unroll_matmul_llvm.mlir:236:28: error: use of value '%7' expects different type than prior uses: 'i64' vs '!llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>'
        %42 = llvm.insertvalue %7, %41[3, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
                               ^
    /Users/lisaong/work/staging/Accera/build/lib.macosx-11.1-arm64-3.10/test_acccgen/test_all_dynamic_sizes_static_unroll_matmul/_tmp/test_all_dynamic_sizes_static_unroll_matmul/test_all_dynamic_sizes_static_unroll_matmul_llvm.mlir:201:5: note: prior use here
        %7 = llvm.insertvalue %arg6, %6[4, 1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
        ^
    
    • Enable DEV_MODE tests in one CI pipeline so that we can catch these in the future
  • Merged PR 2950: [Release] Rev docs to v1.2.12. [Lisa Ong]

    In preparation for 1.2.12 release EOW

  • Merged PR 2946: Fix hierarchical partial fusing. [Mason Remy]

    Fix hierarchical partial fusing

    Index attributes in fragment predicate ops weren't getting updated as
    part of fusion mapping old indices to new fused indices. This fix is a
    quick change to recursively walk predicates and update their index
    attributes manually.
    In the future we could use SymbolicIndexOps and rely on
    BlockAndValueMapping replacements in clone, however this will also
    require that we don't create as many duplicate SymbolicIndexOps for the
    same Index

  • Merged PR 2942: Hold onto intermediate split indices when fusing.
    [Mason Remy]

    Hold onto intermediate split indices when fusing

    When we split a loop multiple times, the outer index references the
    inner intermediate split indices in affine expressions, even if those
    indices get further split and are no longer loop indices. We have been
    discarding them because they aren't loop indices or dimension indices,
    but they wound up getting re-added to the transformed domain by
    serialization and this led to fusion bugs.

  • Merged PR 2834: match and rewrite a pattern to vectorize int16 matmul.
    [JUBI TANEJA]

    This rewrite rule matches the jj and kk loops in int16 matmul, where an outer loop jj {0..8} is followed by an inner loop kk {0..2}. It vectorizes the jj and kk loops and replaces each affine op with a vectorized op. At the end, it generates a vpmaddwd instruction for MatMul.
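
For readers unfamiliar with vpmaddwd, the following pure-Python sketch models what the instruction computes on one vector: adjacent pairs of int16 products are summed into int32 lanes. It is a reference model only, with hypothetical helper names, and says nothing about how the matcher itself is implemented.

```python
from typing import List


def to_int32(x: int) -> int:
    """Wrap an integer to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def pmaddwd(a: List[int], b: List[int]) -> List[int]:
    """Multiply int16 lanes pairwise and add adjacent products into int32 lanes."""
    if len(a) != len(b) or len(a) % 2 != 0:
        raise ValueError("inputs must have the same, even length")
    for v in a + b:
        if not -32768 <= v <= 32767:
            raise ValueError("lanes must fit in int16")
    return [
        to_int32(a[i] * b[i] + a[i + 1] * b[i + 1])
        for i in range(0, len(a), 2)
    ]


a = [1, 2, 3, 4, -5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
result = pmaddwd(a, b)
print(result)
```

Each output lane is a two-element dot product, which is why an inner kk loop of length 2 maps naturally onto this instruction.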

  • Merged PR 2918: Support vectorization and static size caching for
    split dynamic range. [Mason Remy]

    Support vectorization and static size caching for split dynamic range
    loops

  • Merged PR 2914: Support static loop splits of dynamic sized ranges.
    [Mason Remy]

    Support static loop splits of dynamic sized ranges

    This change creates a specialization of the AffineConstraintsHelper that
    works with Loopnest concepts and uses that in LoopNestBuilder to update
    the loop split generation
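
Conceptually, a static split of a dynamically sized range yields a main region of full, fixed-size tiles plus a remainder whose length is only known at runtime. The sketch below shows that partitioning in plain Python; it is an assumption-laden illustration with a hypothetical function name, not the AffineConstraintsHelper logic.

```python
from typing import List, Tuple


def split_dynamic_range(n: int, split_size: int) -> Tuple[List[range], range]:
    """Split [0, n) into full tiles of a static size plus a remainder range."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    full_end = (n // split_size) * split_size
    full_tiles = [range(s, s + split_size) for s in range(0, full_end, split_size)]
    remainder = range(full_end, n)
    return full_tiles, remainder


# n is only known at runtime; the split size 4 is a compile-time constant.
tiles, rest = split_dynamic_range(10, 4)
visited = [i for tile in tiles for i in tile] + list(rest)
print(tiles, rest)
```

Because every full tile has a statically known trip count, the inner loop over a tile can be unrolled or vectorized even though the outer extent is dynamic.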

  • Merged PR 2911: Support dynamic ranges in ScheduledLoopOp. [Mason
    Remy]

    Support dynamic ranges in ScheduledLoopOp

  • Merged PR 2907: Implement initial affine constraint helper for dynamic
    size loop. [Mason Remy]

    Implement initial affine constraint helper for dynamic size loop
    handling

    Implements a wrapper around mlir::FlatAffineValueConstraints and a set
    of low-level tests using it that enable static-sized splitting of
    dynamic loop ranges

  • Merged PR 2935: Remove thread coarsening factor > 4 from GPU
    benchmarks. [Captain Jack Sparrow]

    Remove thread coarsening factor > 4 from GPU benchmarks

  • Merged PR 2932: Upgrade to CUDA 11.8. [Captain Jack Sparrow]

    Upgrade to CUDA 11.8

  • Merged PR 2931: Update to ROCm 5.3. [Captain Jack Sparrow]

    Update to ROCm 5.3

  • Merged PR 2926: Plumb parameter usages to emitted HAT files. [Lisa
    Ong]

  • Merged PR 2927: Reduce benchmark configs using thread coarsening.
    [Captain Jack Sparrow]

    Reduce benchmark configs using thread coarsening

  • Merged PR 2925: Add optional optimization hint for number of thread
    blocks per SM. [Captain Jack Sparrow]

    Add optional optimization hint for number of thread blocks per SM

    Related work items: #3736

v1.2.11

18 Oct 07:45

What's Changed

  • Update vcpkg by @AtariDreams in #52
  • Merged PR 2924: Update hatlib dependency in setup.cfg, add comment.
    [Lisa Ong]

  • Merged PR 2922: [Github] Update vcpkg. [Lisa Ong]

  • Merged PR 2910: Updates hatlib dependency to 0.0.29. [Kern Handa]

  • Merged PR 2905: Fix internal param name in GPU benchmarks. [Captain
    Jack Sparrow]

    Fix internal param name in GPU benchmarks

  • Merged PR 2902: Increase ROCm baseline benchmark timeout to 10 hours.
    [Captain Jack Sparrow]

    • Increase ROCm baseline benchmark timeout to 10 hours
    • Add category to the gemm input for classification
  • Merged PR 2901: Increase ROCm baseline timeout to 7 hours. [Captain
    Jack Sparrow]

    Increase ROCm baseline timeout to 7 hours

  • Merged PR 2900: Prune gemm benchmark input for big sizes by removing
    NT and TT configs. [Captain Jack Sparrow]

    • Prune gemm benchmark input for big sizes by removing NT and TT configs
    • Disable verification for resnet sizes
    • Fix baseline tagging for pytorch
  • Merged PR 2896: Dynamic shared memory allocation support. [Captain
    Jack Sparrow]

    • Add optional param in plan.cache for memory offset
    • Add optional param in schedule.create_plan for total dynamic memory size in bytes
    • Update benchmarks to allow dynamic shared memory usage

    Related work items: #3735

  • Merged PR 2898: Add pytorch gemm implementation for GPU benchmark
    baselines. [Ritwik Das]

    Add pytorch gemm implementation for GPU benchmark baselines

  • Merged PR 2897: Generalize partial dynamic size support. [Mason Remy]

    Generalize partial dynamic size support

    Plumbs through mappings from arrays to which args provide the dimension
    sizes for those arrays more generically.

    This also generalizes dynamic size support beyond matmul scenarios.

    Note: due to assumptions in the debug mode plumbing, the size arguments
    still must occur first in the argument list, and a later PR should
    generalize that

  • Merged PR 2894: Add one test case for partially dynamic sized array.
    [Denny Sun]

  • Merged PR 2891: [nfc][release] Rev docs to 1.2.11. [Lisa Ong]

  • Merged PR 2882: Add tests for thread coarsening and update GPU
    benchmarks. [Ritwik Das]

    • Add tests for thread coarsening and update GPU benchmarks

    Related work items: #3684

  • Merged PR 2890: Add folding scenario for cast ops where the only
    downcasts are. [Mason Remy]

    Add folding scenario for cast ops where the only downcasts are
    internally-generated

    This is useful for converting uint8*uint8->uint8 to
    int16*int16->int32 using cache element types, as is needed in the
    vpmaddwd matmul scenario

  • Merged PR 2889: [refactoring] Prevent overloading of keyword "Tensor" -
    disambiguate with "MMAFragment". [Ritwik Das]

    Prevent overloading of keyword "Tensor" - disambiguate with "MMAFragment"

New Contributors

  • @AtariDreams made their first contribution in #52

Full Changelog: v1.2.10...v1.2.11

v1.2.10

29 Sep 01:33

What's Changed

  • Merged PR 2886: [release] Bump docs to 1.2.10, sync GH to ADO. [Lisa
    Ong]

    • Bulk docs version update

    • Bump protobuf from 3.20.1 to 3.20.2 in /accera/onnx-emitter/test (d1b87ec)

    • Also fixing a minor docs bug (errant backtick)

  • Merged PR 2884: Add DSL test for runtime size correctness. [Denny Sun]

  • Merged PR 2878: Optimize warp id calculation by forcing scalar
    registers. [Ritwik Das]

    • ROCM: use __builtin_amdgcn_readfirstlane to force scalar reg usage
    • CUDA: don't use anything special since __shfl_sync seems to generate slower code
  • Merged PR 2885: Updates python dependencies. [Kern Handa]

    Updates hatlib version

  • Merged PR 2881: Fix the runtime crash caused by incorrectly generated
    LLVM IR. [Denny Sun]

    1. Call the specific version of LLVM type converter for dynamic memory
    2. Create MemRefDescriptor from dynamic memory shape by associating the arrays with correct size arguments

    With this change, the following DSL test can succeed and pass correctness check.

            M = Dimension()
            N = Dimension()
            K = Dimension()
    
            A = Array(shape=(M, K), element_type=ScalarType.float32,
                role=Array.Role.INPUT)
    
            B = Array(shape=(K, N), element_type=ScalarType.float32,
                role=Array.Role.INPUT)
    
            C = Array(shape=(M, N),
                        element_type=ScalarType.float32,
                        role=Array.Role.INPUT_OUTPUT)
    
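            # Assumed setup, not shown in the original snippet: the nest and its indices
            nest = Nest((M, N, K))
            i, j, k = nest.get_indices()

            @nest.iteration_logic
            def _():
                C[i, j] += A[i, k] * B[k, j]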
    
            M_test = np.int64(64)
            N_test = np.int64(128)
            K_test = np.int64(32)
            A_test = np.random.random((M_test, K_test)).astype(np.float32)
            B_test = np.random.random((K_test, N_test)).astype(np.float32)
            C_test = np.random.random((M_test, N_test)).astype(np.float32)
    
            correctness_check_values = {
                "pre": [M_test, N_test, K_test, A_test, B_test, C_test],
                "post": [M_test, N_test, K_test, A_test, B_test, C_test + A_test @ B_test],
            }
    
            package = Package()    # assumed; not shown in the original snippet
            function = package.add(nest, args=(M, N, K, A, B, C), base_name="runtimesizes")
    
            with verifiers.VerifyPackage(self, "test_runtimesizes", TEST_PACKAGE_DIR) as v:
                package.build("test_runtimesizes", format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR)
                if correctness_check_values:
                    v.check_correctness(
                        function.name,
                        before=correctness_check_values["pre"],
                        after=correctness_check_values["post"],
                    )
    
  • Merged PR 2879: Fix exception in GPU baseline benchmark. [Ritwik Das]

    Fix exception in GPU baseline benchmark

  • Merged PR 2856: Enable output caching in ROCM for all MMA shapes.
    [Ritwik Das]

  • Merged PR 2876: Introduce warp bindings in CUDA. [Ritwik Das]

    • Bind indices to WARP_X/Y along with tensorization (exclusively from thread id mapping)
    • With warp bindings, the x dimension of the block is always a multiple of the warp size. For example, when dividing a 64x64 block tile into 4 subtiles of 32x32 each, where each subtile is computed by a single warp, the blockDim would be (64,2,1).
    • This is required because tensorization needs block dims to be generated differently than without it. Calculating offsets within the matrix based on warps is non-trivial, if not impossible, with just thread bindings.

    Related work items: #3726
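
The blockDim arithmetic in the example above can be reproduced with a short sketch. It assumes a warp size of 32 and that the tile's column count maps to the x dimension; both the function name and that mapping are illustrative assumptions rather than Accera's actual implementation.

```python
from typing import Tuple


def block_dim_for_warp_tiling(
    block_tile: Tuple[int, int],
    warp_tile: Tuple[int, int],
    warp_size: int = 32,
) -> Tuple[int, int, int]:
    """Derive (x, y, z) block dims when each warp computes one subtile."""
    tile_rows, tile_cols = block_tile
    warp_rows, warp_cols = warp_tile
    if tile_rows % warp_rows or tile_cols % warp_cols:
        raise ValueError("warp tile must evenly divide the block tile")
    warps_x = tile_cols // warp_cols   # warps along x
    warps_y = tile_rows // warp_rows   # warps along y
    # x counts threads (a whole number of warps); y counts warps.
    return (warps_x * warp_size, warps_y, 1)


dims = block_dim_for_warp_tiling((64, 64), (32, 32))
print(dims)
```

For the 64x64 tile split into 32x32 subtiles this gives 2 warps along x and 2 along y, so the block holds 128 threads arranged as (64, 2, 1).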

  • Merged PR 2874: Add unrolled convolution case study link (#50) [Lisa
    Ong]

    Add unrolled convolution case study link (#50)

    • Update README.md

    Add unrolled convolution case study reference link

    • Update the reference link

    Update the reference according to latest updates in the case study

  • Merged PR 2873: Convert function signature from dynamic memref type to
    llvm type. [Denny Sun]

    With this change, Accera is able to write the correct function signature of dynamic memref type to HAT file

  • Merged PR 2871: Update hatlib version. [Denny Sun]

    from 0.0.23 to 0.0.25

  • Merged PR 2870: Filter benchmark kernels based on scheduling policy.
    [Ritwik Das]

    Filter benchmark kernels based on scheduling policy

  • Merged PR 2867: [build][github] Update test path in github actions.
    [Lisa Ong]

    Fixes https://github.com/microsoft/Accera/actions/runs/3071905923

Full Changelog: v1.2.9...v1.2.10