v1.2.13
-
Merged PR 2987: Add support for max/min/round ops and vectorizing
those ops. [Mason Remy]
-
Merged PR 2963: Control TEMP array allocation location. [Mason Remy]
-
Merged PR 2962: Expand vpmaddwd matching and add intrinsic call.
[Mason Remy]
Matches more vpmaddwd cases and creates a pathway to invoking the LLVM
intrinsic directly.
-
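Illustration for PR 2962: a minimal scalar model of what the vpmaddwd instruction computes (multiply packed signed 16-bit values, then add adjacent pairs into 32-bit lanes). The function name is hypothetical and this is not Accera code; the specific patterns that the matcher recognizes are not listed in these notes.

```python
def vpmaddwd_reference(a, b):
    """Scalar model of vpmaddwd over two equal-length lists of int16 values."""
    if len(a) != len(b) or len(a) % 2 != 0:
        raise ValueError("inputs must have the same even length")
    out = []
    for i in range(0, len(a), 2):
        # Multiply element-wise, then sum each adjacent pair of products.
        total = a[i] * b[i] + a[i + 1] * b[i + 1]
        # Wrap to a signed 32-bit lane, as the instruction's result lanes do.
        total = (total + 2**31) % 2**32 - 2**31
        out.append(total)
    return out


print(vpmaddwd_reference([1, 2, 3, 4], [5, 6, 7, 8]))
```

A loop nest whose body has this multiply-and-pairwise-add shape is the kind of candidate such a rewrite would target.
-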
Merged PR 2961: Match more vectorization patterns and support
vectorized cast. [Mason Remy]
Tries to match and rewrite vectorization patterns:
- 2-loop interleaving store -> vector shuffle and store
- simple horizontal reductions (not always efficient currently)
- vectorized casts
Makes vectorization of non-innermost loops do a per-op "in-place" unroll and
vectorize the innermost loop.
TODO: update documentation to describe this behavior better
-
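Illustration for PR 2961: a rough sketch of the "2-loop interleaving store -> vector shuffle and store" idea, modeled with plain Python lists. Both function names are hypothetical, and the shuffle mask shown is an assumption about one natural encoding, not the form Accera emits.

```python
def interleave_scalar(a, b):
    """Two nested loops storing a[i] and b[i] into adjacent output slots."""
    out = [0] * (2 * len(a))
    sources = (a, b)
    for i in range(len(a)):
        for j in range(2):
            out[2 * i + j] = sources[j][i]
    return out


def interleave_shuffle(a, b):
    """Same result as one shuffle of the concatenated inputs plus one store."""
    n = len(a)
    combined = a + b  # lanes 0..n-1 come from a, lanes n..2n-1 from b
    # Mask [0, n, 1, n+1, ...] picks alternating lanes from a and b.
    mask = [lane // 2 + (lane % 2) * n for lane in range(2 * n)]
    return [combined[m] for m in mask]


a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(interleave_scalar(a, b) == interleave_shuffle(a, b))
```
-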
Merged PR 2960: Enable marking functions as no-inline-into. [Mason
Remy]
Functions marked no-inline-into won't inline calls to other functions
within their body. This is a useful compiler performance (not emitted
code performance) optimization when we have many nested function calls.
-
Merged PR 2986: [output array] Emit range function with input_output
type arguments. [Denny Sun]
Instead of using the output type, we use input_output to generate two
functions for the Range function. Now Accera can successfully generate
code for the Range function.
# Generate functions like:
# get_size(float start, float limit, float delta, int64_t* output_dim);
# get_array(int64_t input_dim, float* output, float start, float delta);
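The two-function split can be modeled in Python as a size query followed by a fill into a caller-allocated buffer. This is only a sketch: the size formula (ceiling of (limit - start) / delta, clamped at zero) is an assumption borrowed from the usual Range convention, and returning the size stands in for writing through the `output_dim` pointer.

```python
import math


def get_size(start, limit, delta):
    """Number of elements in the range (assumed formula, see lead-in)."""
    if delta == 0:
        raise ValueError("delta must be non-zero")
    return max(math.ceil((limit - start) / delta), 0)


def get_array(input_dim, output, start, delta):
    """Fill the first input_dim slots of a caller-provided buffer."""
    for i in range(input_dim):
        output[i] = start + i * delta


# The caller asks for the size first, allocates, then requests the data.
n = get_size(0.0, 5.0, 2.0)
buffer = [0.0] * n
get_array(n, buffer, 0.0, 2.0)
print(n, buffer)
```

Splitting the call this way lets the caller own the allocation even though the output size is only known at run time.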
-
Merged PR 2959: Improved affine for op range simplification. [Mason
Remy]
Add range value / constant-cmp-result patterns and affine for op range
simplifications to the affine simplification pass and run it after
inlining functions.
When inlining a dynamically-sized function into a statically-sized
function, this change is useful for resolving the dynamic ranges to
constants and pruning dynamic-range loops that are not needed given the
specific constant value being used.
-
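Illustration for PR 2959: a toy model of why resolving a dynamic range to a constant lets loops be pruned. It assumes a dynamically-sized function whose loop over `n` was split into a main loop over full tiles and a cleanup loop for the remainder; the function names and the main/cleanup structure are hypothetical and do not describe the actual pass.

```python
def loop_trip_counts(n, tile):
    """Trip counts of a split loop over extent n."""
    main_end = (n // tile) * tile
    return {
        "main": main_end // tile,   # for i in range(0, main_end, tile)
        "cleanup": n - main_end,    # for i in range(main_end, n)
    }


def loops_kept_for_constant(n, tile):
    """Loops that still execute once n is known to be a constant."""
    return [name for name, trips in loop_trip_counts(n, tile).items() if trips > 0]


# With n = 64 and tile = 16 the cleanup loop has zero trips and can be dropped.
print(loops_kept_for_constant(64, 16))
print(loops_kept_for_constant(70, 16))
```
-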
Merged PR 2958: Hack to erase loops in a nest to support nest-of-nest
or overfused scenarios. [Mason Remy]
This change enables an action plan to erase loops. Typically this would
be used when an outer nest traverses tiles and invokes an inner nest (or
multiple nests) which operate within each tile. The outer nest still
needs to cover the full iteration space; however, after splitting by the
tile sizes a user will not want the outer nest to perform the inner
loops.
-
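Illustration for PR 2958: the nest-of-nest shape described in that entry, written as plain Python loops. The function names are hypothetical, and the sketch assumes the array dimensions divide evenly by the tile sizes; it shows the intended loop structure only, not the Accera API for erasing loops.

```python
def inner_nest(data, i0, j0, tile_i, tile_j):
    """Operates only within the tile whose origin is (i0, j0)."""
    for i in range(i0, i0 + tile_i):
        for j in range(j0, j0 + tile_j):
            data[i][j] += 1


def outer_nest(data, tile_i, tile_j):
    """Traverses tile origins across the full iteration space."""
    rows, cols = len(data), len(data[0])
    for i0 in range(0, rows, tile_i):
        for j0 in range(0, cols, tile_j):
            # Splitting by the tile sizes would also leave inner i/j loops
            # here; those are the loops to erase, so that only the call to
            # the inner nest remains for each tile.
            inner_nest(data, i0, j0, tile_i, tile_j)


grid = [[0] * 4 for _ in range(4)]
outer_nest(grid, 2, 2)
print(grid)
```
-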
Merged PR 2985: [release] Rev docs to 1.2.13. [Lisa Ong]
-
Merged PR 2983: Increase timeouts of GPU benchmarks. [Captain Jack
Sparrow]
-
Merged PR 2982: Work around bug with redundant splits of dynamic
dimensions. [Mason Remy]
-
Merged PR 2972: Build both static and dynamic binaries by default, put
both in aux dependencies. [Kern Handa]
-
Merged PR 2975: Updates llc/opt build flags to enable more
optimizations by default. [Kern Handa]
-
Merged PR 2977: Updates CMake to do FindPython before pybind11 config.
[Kern Handa]
-
Merged PR 2955: Reduce Linux PR runtime to under 60mins. [Lisa Ong]
Filter DEV_MODE reruns to dsl_tests.py. This is a best effort and is not comprehensive.
Full Changelog: v1.2.12...v1.2.13