Squashed commit of the following:
commit a272d35955fe3a05d2c52f54481af40869a74849
Author: Mason Remy
Date:   Wed Dec 14 06:51:40 2022 +0000

    Merged PR 2987: Add support for max/min/round ops and vectorizing those ops

    Add support for max/min/round ops and vectorizing those ops

commit 375be08681b88df01e2e3043d5094684c134d862
Author: Mason Remy
Date:   Tue Dec 13 23:30:28 2022 +0000

    Merged PR 2963: Control TEMP array allocation location

    Control TEMP array allocation location

commit 929eeafe8263f866bacc77b958953268f58d8b8e
Author: Mason Remy
Date:   Tue Dec 13 21:56:38 2022 +0000

    Merged PR 2962: Expand vpmaddwd matching and add intrinsic call

    Expand vpmaddwd matching and add intrinsic call

    Matches more vpmaddwd cases and creates a pathway to invoking the LLVM
    intrinsic directly.
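    For context, vpmaddwd is the VEX/EVEX-encoded form of the x86 PMADDWD
    instruction. The plain-Python reference model below shows what it computes;
    it is a hedged illustration, not Accera's pattern matcher or its LLVM
    intrinsic binding, and the helper names are hypothetical.

```python
from typing import List

INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def wrap_i32(x: int) -> int:
    """Wrap a Python int to a signed 32-bit value (two's complement)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def vpmaddwd_ref(a: List[int], b: List[int]) -> List[int]:
    """Hypothetical reference model of vpmaddwd on 2*N int16 lanes.

    Output lane i is a[2i]*b[2i] + a[2i+1]*b[2i+1], computed from
    sign-extended 16-bit inputs and wrapped to 32 bits.
    """
    if len(a) != len(b) or len(a) % 2 != 0:
        raise ValueError("inputs need the same, even number of lanes")
    if any(not INT16_MIN <= v <= INT16_MAX for v in a + b):
        raise ValueError("all lanes must fit in int16")
    return [
        wrap_i32(a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1])
        for i in range(len(a) // 2)
    ]


# Usage: two int32 results from four int16 lane pairs.
print(vpmaddwd_ref([1, 2, 3, 4], [5, 6, 7, 8]))  # [17, 53]
```

    The only overflowing case is when all four inputs of a pair are -32768;
    the sum 2**31 wraps to -2147483648 (0x80000000), which the model reproduces.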

commit e47a02ed4929e8ba9a085c7870cc5e4fe9f0db62
Author: Mason Remy
Date:   Sat Dec 10 00:40:42 2022 +0000

    Merged PR 2961: Match more vectorization patterns and support vectorized cast

    Match more vectorization patterns and support vectorized cast

    Tries to match and rewrite vectorization patterns:
    - 2-loop interleaving store -> vector shuffle and store
    - simple horizontal reductions (not always efficient currently)
    - vectorized casts

    Vectorizing a non-innermost loop now performs a per-op "in-place" unroll
    and vectorizes the innermost loop.
    TODO: update the documentation to describe this behavior better.
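    The interleaving-store and horizontal-reduction patterns can be illustrated
    with plain lists. This is a hedged sketch: `shuffle` mimics the indexing of
    LLVM's `shufflevector`, and the loop shapes and function names are
    hypothetical, not the exact IR that Accera matches.

```python
from typing import Callable, List, Sequence


def shuffle(a: Sequence[int], b: Sequence[int], mask: Sequence[int]) -> List[int]:
    """Modeled on LLVM shufflevector: mask indices address the concatenation of a and b."""
    both = list(a) + list(b)
    return [both[i] for i in mask]


def interleave_loops(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Scalar form: an outer loop over i and an inner 2-iteration loop over the sources."""
    out = [0] * (2 * len(a))
    for i in range(len(a)):
        for j, src in enumerate((a, b)):
            out[2 * i + j] = src[i]
    return out


def interleave_shuffle(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Vector form: one shuffle with mask 0, n, 1, n+1, ... then one contiguous store."""
    n = len(a)
    mask = [k // 2 + (k % 2) * n for k in range(2 * n)]
    return shuffle(a, b, mask)


def horizontal_reduce(v: Sequence[int], op: Callable[[int, int], int]) -> int:
    """Reduce a power-of-two-wide vector by repeatedly folding the upper half onto the lower half."""
    v = list(v)
    if not v or len(v) & (len(v) - 1):
        raise ValueError("width must be a non-zero power of two")
    while len(v) > 1:
        half = len(v) // 2
        upper = shuffle(v, v, range(half, len(v)))
        v = [op(x, y) for x, y in zip(v[:half], upper)]
    return v[0]


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(interleave_shuffle(a, b))  # [1, 5, 2, 6, 3, 7, 4, 8]
print(horizontal_reduce(a + b, lambda x, y: x + y))  # 36
```

    The reduction in this sketch takes log2(width) shuffle-plus-op steps rather
    than width - 1 scalar ops.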

commit 628983a1a3c5f9ea42dac0cdb7db3cebcb427f43
Author: Mason Remy
Date:   Fri Dec 9 05:54:01 2022 +0000

    Merged PR 2960: Enable marking functions as no-inline-into

    Enable marking functions as no-inline-into

    Functions marked no-inline-into won't inline calls to other functions
    within their body. This is a useful compiler-performance (not
    emitted-code-performance) optimization when there are many nested
    function calls.
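    A toy model of the idea, assuming functions are flat lists of ops and that
    the mark only stops calls from being inlined into the marked function's own
    body. The op encoding and `inline_calls` are hypothetical; this is not
    Accera's or MLIR's inliner.

```python
from typing import Dict, List, Set, Tuple

Op = Tuple[str, str]  # ("call", callee_name) or (opcode, operand)


def inline_calls(funcs: Dict[str, List[Op]], no_inline_into: Set[str]) -> Dict[str, List[Op]]:
    """Recursively inline callees, except inside functions marked no-inline-into.

    Assumes an acyclic call graph. Call sites spliced in from a marked callee's
    body stay as calls (a modeling choice for this sketch).
    """
    memo: Dict[str, List[Op]] = {}

    def expand(name: str) -> List[Op]:
        if name not in memo:
            if name in no_inline_into:
                memo[name] = list(funcs[name])  # keep call sites as calls
            else:
                body: List[Op] = []
                for op in funcs[name]:
                    if op[0] == "call":
                        body.extend(expand(op[1]))  # splice the callee's expanded body
                    else:
                        body.append(op)
                memo[name] = body
        return memo[name]

    return {name: expand(name) for name in funcs}


# Each level calls the next twice, so fully inlining "top" copies "leaf" 2**3 times.
funcs = {
    "leaf": [("add", "x")],
    "mid": [("call", "leaf"), ("call", "leaf")],
    "outer": [("call", "mid"), ("call", "mid")],
    "top": [("call", "outer"), ("call", "outer")],
}
print(len(inline_calls(funcs, set())["top"]))    # 8
print(len(inline_calls(funcs, {"top"})["top"]))  # 2
```

    Marking "top" keeps its body at two call ops instead of eight copies of the
    leaf op; with deeper nesting this IR growth is exponential, which is where
    skipping the inlining saves compile time.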

commit d4404ea31cccff456a28ef6998403d228e427507
Author: Denny Sun
Date:   Fri Dec 9 00:40:16 2022 +0000

    Merged PR 2986: [output array] Emit range function with input_output type arguments

    Instead of the output type, we now use input_output to generate two functions for the Range function.
    Accera can now successfully generate code for the range function.
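    The commit does not say what the two generated functions do. One common way
    to return a variable-length result through a caller-provided (input_output)
    buffer is a size query followed by a fill; the sketch below assumes that
    split and numpy.arange-style semantics. Both function names are
    hypothetical.

```python
import math
from typing import List


def range_size(start: float, limit: float, delta: float) -> int:
    """Hypothetical size query: element count of [start, limit) with step delta."""
    if delta == 0:
        raise ValueError("delta must be non-zero")
    return max(0, math.ceil((limit - start) / delta))


def range_fill(start: float, delta: float, out: List[float]) -> None:
    """Hypothetical fill step: writes into a buffer the caller already allocated."""
    for i in range(len(out)):
        out[i] = start + i * delta


n = range_size(5, 0, -2)  # 3
buf = [0.0] * n           # the caller owns the storage
range_fill(5, -2, buf)
print(buf)  # [5, 3, 1]
```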


commit 7d867a33afc36a1a2fa68b49f507b6ad202c14ce
Author: Mason Remy <[email protected]>
Date:   Thu Dec 8 22:12:14 2022 +0000

    Merged PR 2959: Improved affine for op range simplification

    Improved affine for op range simplification

    Add range value / constant-cmp-result patterns and affine for op range
    simplifications to the affine simplification pass and run it after
    inlining functions.
    When inlining a dynamically-sized function into a statically-sized
    function, this change is useful for resolving the dynamic ranges to
    constants and pruning dynamic-range loops that are not needed given the
    specific constant value being used.
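    As an illustration, assume a dynamically-sized kernel splits its loop into
    full tiles plus a remainder. Once inlining makes the size a constant, each
    loop's bounds become constants, and a loop whose `begin < end` comparison
    folds to false can be removed. The loop representation and helper names
    are hypothetical; this is not the affine simplification pass itself.

```python
from typing import List, Tuple

Loop = Tuple[str, int, int, int]  # (label, begin, end, step), step > 0


def tiled_loops(n: int, tile: int) -> List[Loop]:
    """Loops of a hypothetical dynamically-sized kernel: full tiles, then a remainder loop."""
    full_end = (n // tile) * tile
    return [("full_tiles", 0, full_end, tile), ("remainder", full_end, n, 1)]


def prune_empty(loops: List[Loop]) -> List[Loop]:
    """With constant bounds, begin < end folds to a constant; drop loops that never run."""
    return [loop for loop in loops if loop[1] < loop[2]]


# After inlining with n = 64, the remainder loop has range [64, 64) and is pruned.
print(prune_empty(tiled_loops(64, 16)))  # [('full_tiles', 0, 64, 16)]
print(prune_empty(tiled_loops(70, 16)))  # [('full_tiles', 0, 64, 16), ('remainder', 64, 70, 1)]
```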

commit 511112c61b513c5d8d7ed4dba06ee266d5affbca
Author: Mason Remy
Date:   Thu Dec 8 17:14:00 2022 +0000

    Merged PR 2958: Hack to erase loops in a nest to support nest-of-nest or overfused

    Hack to erase loops in a nest to support nest-of-nest or overfused scenarios

    This change enables an action plan to erase loops. Typically this would
    be used when an outer nest traverses tiles and invokes an inner nest (or
    multiple nests) that operates within each tile. The outer nest still
    needs to cover the full iteration space; however, after splitting by the
    tile sizes, a user will not want the outer nest to also run the inner
    loops.
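    A plain-Python sketch of the nest-of-nest scenario: after the split, the
    outer nest keeps only its tile loops and hands each tile to an inner nest.
    The function names and the boundary handling with `min` are hypothetical
    illustration choices, not Accera's API.

```python
from typing import List, Tuple


def inner_nest(i0: int, j0: int, rows: int, cols: int, visited: List[Tuple[int, int]]) -> None:
    """Hypothetical inner nest: covers one rows x cols tile starting at (i0, j0)."""
    for i in range(rows):
        for j in range(cols):
            visited.append((i0 + i, j0 + j))


def outer_nest(m: int, n: int, tm: int, tn: int) -> List[Tuple[int, int]]:
    """Outer nest after splitting i by tm and j by tn, with the intra-tile loops erased.

    Only the tile loops remain; each iteration hands one tile to the inner nest.
    """
    visited: List[Tuple[int, int]] = []
    for i0 in range(0, m, tm):
        for j0 in range(0, n, tn):
            inner_nest(i0, j0, min(tm, m - i0), min(tn, n - j0), visited)
    return visited


points = outer_nest(5, 7, 2, 3)
print(len(points), len(set(points)))  # 35 35
```

    If the split-off inner loops were kept in the outer nest, the inner nest
    would be invoked once per inner-loop iteration, so each point would be
    visited up to tm * tn times instead of once.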

commit 5dd35c423e3878a8f490de07ca21d3ac261c6224
Author: Lisa Ong
Date:   Wed Dec 7 01:59:14 2022 +0000

    Merged PR 2985: [release] Rev docs to 1.2.13

commit b5697107f084bf910d4d77e75e67a90363855375
Author: Captain Jack Sparrow
Date:   Wed Dec 7 00:57:08 2022 +0000

    Merged PR 2983: Increase timeouts of GPU benchmarks

    Increase timeouts of GPU benchmarks

commit 05c096f116216fbc9505c7d9a6f1e88b7626411f
Author: Mason Remy
Date:   Sat Dec 3 01:25:01 2022 +0000

    Merged PR 2982: Work around bug with redundant splits of dynamic dimensions

    Work around bug with redundant splits of dynamic dimensions

commit 4056d3177c5b14987e4c5fcd4aa91ddac67c4ed1
Author: Kern Handa
Date:   Wed Nov 30 07:55:06 2022 +0000

    Merged PR 2972: Build both static and dynamic binaries by default, put both in aux dependencies

commit b79602b9cf543b0852c7e0c85e548970d5ac7fbb
Author: Kern Handa
Date:   Tue Nov 29 22:34:04 2022 +0000

    Merged PR 2975: Updates llc/opt build flags to enable more optimizations by default

    Updates llc/opt build flags to enable more optimizations by default

commit 8a856b8af10227538ebb72486bd0bfd52af98873
Author: Kern Handa
Date:   Tue Nov 29 21:49:40 2022 +0000

    Merged PR 2977: Updates CMake to do FindPython before pybind11 config

    Updates CMake to do FindPython before pybind11 config

commit 6d05fc0e8a6d1933d7507cfa8b6838c04606a798
Author: Lisa Ong
Date:   Tue Nov 22 22:34:50 2022 +0000

    Merged PR 2955: Reduce Linux PR runtime to under 60mins

    Filter DEV_MODE reruns to dsl_tests.py. This is not comprehensive; it is a best-effort reduction.
Lisa Ong committed Dec 14, 2022
1 parent 711af89 commit 6c09b4a
Showing 161 changed files with 4,582 additions and 756 deletions.
2 changes: 1 addition & 1 deletion .azure/cuda/cuda-benchmark-fp16-bert.yml
@@ -9,7 +9,7 @@ trigger: none

 jobs:
 - job: "CUDA_Benchmarking_FP16_BERT"
-timeoutInMinutes: 480
+timeoutInMinutes: 600

 pool:
 name: LinuxNVGPUPool
2 changes: 1 addition & 1 deletion .azure/linux-pr.yml
@@ -89,7 +89,7 @@ steps:
 displayName: Run all ctest targets
 workingDirectory: "$(Build.SourcesDirectory)/build"
-- bash: python -m unittest discover accera/test *.py
+- bash: python -m unittest discover accera/test dsl_tests.py
 displayName: Run tests in DEV_MODE
 workingDirectory: "$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.9"

2 changes: 1 addition & 1 deletion .azure/rocm/rocm-benchmark-fp16-bert.yml
@@ -9,7 +9,7 @@ trigger: none

 jobs:
 - job: "ROCM_Benchmarking_FP16_BERT"
-timeoutInMinutes: 540
+timeoutInMinutes: 600

 pool: LinuxAMDGPUPool

2 changes: 1 addition & 1 deletion .azure/rocm/rocm-benchmark-fp16-big.yml
@@ -9,7 +9,7 @@ trigger: none

 jobs:
 - job: "ROCM_Benchmarking_FP16_Big"
-timeoutInMinutes: 540
+timeoutInMinutes: 600

 pool: LinuxAMDGPUPool

2 changes: 1 addition & 1 deletion .azure/rocm/rocm-benchmark-fp16.yml
@@ -9,7 +9,7 @@ trigger: none

 jobs:
 - job: "ROCM_Benchmarking_FP16"
-timeoutInMinutes: 540
+timeoutInMinutes: 600

 pool: LinuxAMDGPUPool

2 changes: 1 addition & 1 deletion .azure/rocm/rocm-benchmark-fp32-bert.yml
@@ -47,7 +47,7 @@ jobs:
 export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
 export LD_LIBRARY_PATH=${ROCM_PATH}/lib
 echo "LD_LIBRARY_PATH" ${LD_LIBRARY_PATH}
-python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --category bert --type s --target 'AMD MI100' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --check
+python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --category bert --type s --target 'AMD MI100' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose
 displayName: Run fp32 benchmarks BERT
 workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
 env:
2 changes: 1 addition & 1 deletion .azure/rocm/rocm-benchmark-fp32-big.yml
@@ -9,7 +9,7 @@ trigger: none

 jobs:
 - job: "ROCM_Benchmarking_FP32_Big"
-timeoutInMinutes: 540
+timeoutInMinutes: 600

 pool: LinuxAMDGPUPool

2 changes: 1 addition & 1 deletion .azure/rocm/rocm-benchmark-fp32.yml
@@ -9,7 +9,7 @@ trigger: none

 jobs:
 - job: "ROCM_Benchmarking_FP32"
-timeoutInMinutes: 540
+timeoutInMinutes: 600

 pool: LinuxAMDGPUPool

5 changes: 4 additions & 1 deletion CMake/AddPyBind11.cmake
@@ -5,7 +5,7 @@

 include(FetchContent)

-set(PYBIND_VERSION "2.6.2" CACHE STRING "Version string to use for pybind11")
+set(PYBIND_VERSION "2.10.1" CACHE STRING "Version string to use for pybind11")

 set(FETCHCONTENT_QUIET FALSE)

@@ -16,6 +16,9 @@ FetchContent_Declare(

 FetchContent_GetProperties(pybind11)

+set(Python3_FIND_REGISTRY LAST)
+find_package(Python3 COMPONENTS Interpreter Development)
+
 if(NOT pybind11_POPULATED)
 FetchContent_Populate(pybind11)
 add_subdirectory(${pybind11_SOURCE_DIR} ${pybind11_BINARY_DIR})
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -123,7 +123,7 @@ set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)
 set(CMAKE_PLATFORM_NO_VERSIONED_SONAME ON)
 if(MSVC)
 # Set Visual Studio-specific options
-add_definitions(-DUNICODE -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS)
+add_definitions(-DUNICODE -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -D_SILENCE_NONFLOATING_COMPLEX_DEPRECATION_WARNING)
 add_compile_options(/utf-8)
 add_compile_options(/MP)
 add_compile_options(/bigobj)
1 change: 1 addition & 0 deletions accera/CMakeLists.txt
@@ -4,6 +4,7 @@
 ####################################################################################################

 set(ACCERA_LIBRARIES_DIR ${CMAKE_CURRENT_LIST_DIR})
+set(ACCERA_BIN_DIR ${CMAKE_CURRENT_BINARY_DIR})
 include_directories(${ACCERA_LIBRARIES_DIR})

 add_subdirectory(acc-opt)
1 change: 1 addition & 0 deletions accera/acc-opt/test/commandline.mlir
@@ -1,6 +1,7 @@
 // RUN: acc-opt --show-dialects | FileCheck %s
 // CHECK: Registered Dialects:
 // CHECK: accera
+// CHECK-NEXT: accintr
 // CHECK-NEXT: accln
 // CHECK-NEXT: accv
 // CHECK-NEXT: accxp
4 changes: 2 additions & 2 deletions accera/acc-opt/test/thrifty_caching.mlir
@@ -69,8 +69,8 @@ module @test_thrifty_caching_simple_input_cache attributes {llvm.data_layout = "
 // CHECK: affine.for %arg6 = 0 to 16 {
 // CHECK: %1 = affine.load %arg1[%arg5, %arg4 + %arg6] : memref<32x32xf32, #map0>
 // CHECK: affine.store %1, %0[%arg5, %arg6] : memref<32x16xf32, 3>
-// CHECK: } {accxp.access_bounds_check, beginMap = #map1, domain = #xdomain, endMap = #map2, index = #accln<"index{j,7}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{j,7}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]}
-// CHECK: } {accxp.access_bounds_check, beginMap = #map1, domain = #xdomain, endMap = #map3, index = #accln<"index{i,6}">, operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{i,6}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]}
+// CHECK: } {accxp.access_bounds_check, beginMap = #map1, endMap = #map2, index = #accln<"index{j,7}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{j,7}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]}
+// CHECK: } {accxp.access_bounds_check, beginMap = #map1, endMap = #map3, index = #accln<"index{i,6}">, operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{i,6}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]}
 // CHECK: affine.for %arg5 = 0 to 4 {
 // CHECK: affine.for %arg6 = 0 to 16 {
 // CHECK: affine.for %arg7 = 0 to 32 {