Squashed commit of the following:

commit a272d35955fe3a05d2c52f54481af40869a74849
Author: Mason Remy <[email protected]>
Date:   Wed Dec 14 06:51:40 2022 +0000

    Merged PR 2987: Add support for max/min/round ops and vectorizing those ops

    Add support for max/min/round ops and vectorizing those ops

commit 375be08681b88df01e2e3043d5094684c134d862
Author: Mason Remy <[email protected]>
Date:   Tue Dec 13 23:30:28 2022 +0000

    Merged PR 2963: Control TEMP array allocation location

    Control TEMP array allocation location

commit 929eeafe8263f866bacc77b958953268f58d8b8e
Author: Mason Remy <[email protected]>
Date:   Tue Dec 13 21:56:38 2022 +0000

    Merged PR 2962: Expand vpmaddwd matching and add intrinsic call

    Expand vpmaddwd matching and add intrinsic call

    Matches more vpmaddwd cases and creates a pathway to invoking the
    LLVM intrinsic directly.

commit e47a02ed4929e8ba9a085c7870cc5e4fe9f0db62
Author: Mason Remy <[email protected]>
Date:   Sat Dec 10 00:40:42 2022 +0000

    Merged PR 2961: Match more vectorization patterns and support vectorized cast

    Match more vectorization patterns and support vectorized cast

    Tries to match and rewrite vectorization patterns:
    - 2-loop interleaving store -> vector shuffle and store
    - simple horizontal reductions (not always efficient currently)
    - vectorized casts

    Makes vectorization of non-innermost loops do a per-op "inplace"
    unroll and vectorize the innermost loop

    TODO: update documentation to describe this behavior better

commit 628983a1a3c5f9ea42dac0cdb7db3cebcb427f43
Author: Mason Remy <[email protected]>
Date:   Fri Dec 9 05:54:01 2022 +0000

    Merged PR 2960: Enable marking functions as no-inline-into

    Enable marking functions as no-inline-into

    Functions marked no-inline-into won't inline calls to other
    functions within their body. This is a useful compiler performance
    (not emitted code performance) optimization when we have many
    nested function calls

commit d4404ea31cccff456a28ef6998403d228e427507
Author: Denny Sun <[email protected]>
Date:   Fri Dec 9 00:40:16 2022 +0000

    Merged PR 2986: [output array] Emit range function with input_output type arguments

    Instead of the output type, we use input_output to generate two
    functions for the Range function. Now Accera can successfully
    generate code for the Range function.

commit 7d867a33afc36a1a2fa68b49f507b6ad202c14ce
Author: Mason Remy <[email protected]>
Date:   Thu Dec 8 22:12:14 2022 +0000

    Merged PR 2959: Improved affine for op range simplification

    Improved affine for op range simplification

    Add range value / constant-cmp-result patterns and affine for op
    range simplifications to the affine simplification pass and run it
    after inlining functions.

    When inlining a dynamically-sized function into a statically-sized
    function, this change is useful for resolving the dynamic ranges to
    constants and pruning dynamic-range loops that are not needed given
    the specific constant value being used.

commit 511112c61b513c5d8d7ed4dba06ee266d5affbca
Author: Mason Remy <[email protected]>
Date:   Thu Dec 8 17:14:00 2022 +0000

    Merged PR 2958: Hack to erase loops in a nest to support nest-of-nest or overfused

    Hack to erase loops in a nest to support nest-of-nest or overfused
    scenarios

    This change enables an action plan to erase loops. Typically this
    would be used when an outer nest traverses tiles and invokes an
    inner nest (or multiple nests) which operate within each tile.
    The outer nest still needs to cover the full iteration space;
    however, after splitting by the tile sizes a user will not want the
    outer nest to perform the inner loops

commit 5dd35c423e3878a8f490de07ca21d3ac261c6224
Author: Lisa Ong <[email protected]>
Date:   Wed Dec 7 01:59:14 2022 +0000

    Merged PR 2985: [release] Rev docs to 1.2.13

commit b5697107f084bf910d4d77e75e67a90363855375
Author: Captain Jack Sparrow <[email protected]>
Date:   Wed Dec 7 00:57:08 2022 +0000

    Merged PR 2983: Increase timeouts of GPU benchmarks

    Increase timeouts of GPU benchmarks

commit 05c096f116216fbc9505c7d9a6f1e88b7626411f
Author: Mason Remy <[email protected]>
Date:   Sat Dec 3 01:25:01 2022 +0000

    Merged PR 2982: Work around bug with redundant splits of dynamic dimensions

    Work around bug with redundant splits of dynamic dimensions

commit 4056d3177c5b14987e4c5fcd4aa91ddac67c4ed1
Author: Kern Handa <[email protected]>
Date:   Wed Nov 30 07:55:06 2022 +0000

    Merged PR 2972: Build both static and dynamic binaries by default, put both in aux dependencies

commit b79602b9cf543b0852c7e0c85e548970d5ac7fbb
Author: Kern Handa <[email protected]>
Date:   Tue Nov 29 22:34:04 2022 +0000

    Merged PR 2975: Updates llc/opt build flags to enable more optimizations by default

    Updates llc/opt build flags to enable more optimizations by default

commit 8a856b8af10227538ebb72486bd0bfd52af98873
Author: Kern Handa <[email protected]>
Date:   Tue Nov 29 21:49:40 2022 +0000

    Merged PR 2977: Updates CMake to do FindPython before pybind11 config

    Updates CMake to do FindPython before pybind11 config

commit 6d05fc0e8a6d1933d7507cfa8b6838c04606a798
Author: Lisa Ong <[email protected]>
Date:   Tue Nov 22 22:34:50 2022 +0000

    Merged PR 2955: Reduce Linux PR runtime to under 60mins

    Filter DEV_MODE reruns to dsl_tests.py. This is not comprehensive
    and is a best effort.
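
Note on PR 2962 above: for readers unfamiliar with the instruction, vpmaddwd multiplies corresponding signed 16-bit lanes of two vectors and adds each adjacent pair of 32-bit products, so 2N int16 inputs yield N int32 outputs. The following is only a scalar reference sketch of that arithmetic in plain Python. The function name `pmaddwd_reference` is hypothetical and is not part of Accera or LLVM; it says nothing about how the matching in this commit is implemented.

```python
def pmaddwd_reference(a, b):
    """Scalar model of vpmaddwd on two equal-length lists of int16 values.

    Hypothetical helper for illustration only; not an Accera or LLVM API.
    """
    if len(a) != len(b) or len(a) % 2 != 0:
        raise ValueError("inputs must have the same even length")
    for x in (*a, *b):
        if not -32768 <= x <= 32767:
            raise ValueError(f"{x} does not fit in a signed 16-bit lane")

    out = []
    for i in range(0, len(a), 2):
        # Multiply lane-wise, then add the two adjacent 32-bit products.
        s = a[i] * b[i] + a[i + 1] * b[i + 1]
        # Wrap to signed 32-bit. Overflow only happens when all four
        # inputs of a pair are -32768.
        s = (s + 2**31) % 2**32 - 2**31
        out.append(s)
    return out
```

A loop body that computes this pattern on int16 data is the kind of code a vpmaddwd matcher would aim to recognize; the exact patterns matched by the PR are not described in the commit message.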
microsoft · Dec 14, 2022 · 6c09b4a
1 parent 711af89
commit 6c09b4a

Showing 161 changed files with 4,582 additions and 756 deletions.
diff --git a/.azure/cuda/cuda-benchmark-fp16-bert.yml b/.azure/cuda/cuda-benchmark-fp16-bert.yml
@@ -9,7 +9,7 @@ trigger: none
 
 jobs:
   - job: "CUDA_Benchmarking_FP16_BERT"
-    timeoutInMinutes: 480
+    timeoutInMinutes: 600
 
     pool:
       name: LinuxNVGPUPool

diff --git a/.azure/linux-pr.yml b/.azure/linux-pr.yml
@@ -89,7 +89,7 @@ steps:
     displayName: Run all ctest targets
     workingDirectory: "$(Build.SourcesDirectory)/build"
 
-  - bash: python -m unittest discover accera/test *.py
+  - bash: python -m unittest discover accera/test dsl_tests.py
     displayName: Run tests in DEV_MODE
     workingDirectory: "$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.9"
 

diff --git a/.azure/rocm/rocm-benchmark-fp16-bert.yml b/.azure/rocm/rocm-benchmark-fp16-bert.yml
@@ -9,7 +9,7 @@ trigger: none
 
 jobs:
   - job: "ROCM_Benchmarking_FP16_BERT"
-    timeoutInMinutes: 540
+    timeoutInMinutes: 600
 
     pool: LinuxAMDGPUPool
 

diff --git a/.azure/rocm/rocm-benchmark-fp16-big.yml b/.azure/rocm/rocm-benchmark-fp16-big.yml
@@ -9,7 +9,7 @@ trigger: none
 
 jobs:
   - job: "ROCM_Benchmarking_FP16_Big"
-    timeoutInMinutes: 540
+    timeoutInMinutes: 600
 
     pool: LinuxAMDGPUPool
 

diff --git a/.azure/rocm/rocm-benchmark-fp16.yml b/.azure/rocm/rocm-benchmark-fp16.yml
@@ -9,7 +9,7 @@ trigger: none
 
 jobs:
   - job: "ROCM_Benchmarking_FP16"
-    timeoutInMinutes: 540
+    timeoutInMinutes: 600
 
     pool: LinuxAMDGPUPool
 

diff --git a/.azure/rocm/rocm-benchmark-fp32-bert.yml b/.azure/rocm/rocm-benchmark-fp32-bert.yml
@@ -47,7 +47,7 @@ jobs:
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
           export LD_LIBRARY_PATH=${ROCM_PATH}/lib
           echo "LD_LIBRARY_PATH" ${LD_LIBRARY_PATH}
-          python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --category bert --type s --target 'AMD MI100' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --check
+          python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --category bert --type s --target 'AMD MI100' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose
         displayName: Run fp32 benchmarks BERT
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/rocm/rocm-benchmark-fp32-big.yml b/.azure/rocm/rocm-benchmark-fp32-big.yml
@@ -9,7 +9,7 @@ trigger: none
 
 jobs:
   - job: "ROCM_Benchmarking_FP32_Big"
-    timeoutInMinutes: 540
+    timeoutInMinutes: 600
 
     pool: LinuxAMDGPUPool
 

diff --git a/.azure/rocm/rocm-benchmark-fp32.yml b/.azure/rocm/rocm-benchmark-fp32.yml
@@ -9,7 +9,7 @@ trigger: none
 
 jobs:
   - job: "ROCM_Benchmarking_FP32"
-    timeoutInMinutes: 540
+    timeoutInMinutes: 600
 
     pool: LinuxAMDGPUPool
 

diff --git a/CMake/AddPyBind11.cmake b/CMake/AddPyBind11.cmake
@@ -5,7 +5,7 @@
 
 include(FetchContent)
 
-set(PYBIND_VERSION "2.6.2" CACHE STRING "Version string to use for pybind11")
+set(PYBIND_VERSION "2.10.1" CACHE STRING "Version string to use for pybind11")
 
 set(FETCHCONTENT_QUIET FALSE)
 
@@ -16,6 +16,9 @@ FetchContent_Declare(
 
 FetchContent_GetProperties(pybind11)
 
+set(Python3_FIND_REGISTRY LAST)
+find_package(Python3 COMPONENTS Interpreter Development)
+
 if(NOT pybind11_POPULATED)
     FetchContent_Populate(pybind11)
     add_subdirectory(${pybind11_SOURCE_DIR} ${pybind11_BINARY_DIR})

diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -123,7 +123,7 @@ set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)
 set(CMAKE_PLATFORM_NO_VERSIONED_SONAME ON)
 if(MSVC)
   # Set Visual Studio-specific options
-  add_definitions(-DUNICODE -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS)
+  add_definitions(-DUNICODE -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -D_SILENCE_NONFLOATING_COMPLEX_DEPRECATION_WARNING)
   add_compile_options(/utf-8)
   add_compile_options(/MP)
   add_compile_options(/bigobj)

diff --git a/accera/CMakeLists.txt b/accera/CMakeLists.txt
@@ -4,6 +4,7 @@
 ####################################################################################################
 
 set(ACCERA_LIBRARIES_DIR ${CMAKE_CURRENT_LIST_DIR})
+set(ACCERA_BIN_DIR ${CMAKE_CURRENT_BINARY_DIR})
 include_directories(${ACCERA_LIBRARIES_DIR})
 
 add_subdirectory(acc-opt)

diff --git a/accera/acc-opt/test/commandline.mlir b/accera/acc-opt/test/commandline.mlir
@@ -1,6 +1,7 @@
 // RUN: acc-opt --show-dialects | FileCheck %s
 // CHECK: Registered Dialects:
 // CHECK: accera
+// CHECK-NEXT: accintr
 // CHECK-NEXT: accln
 // CHECK-NEXT: accv
 // CHECK-NEXT: accxp

diff --git a/accera/acc-opt/test/thrifty_caching.mlir b/accera/acc-opt/test/thrifty_caching.mlir
@@ -69,8 +69,8 @@ module @test_thrifty_caching_simple_input_cache attributes {llvm.data_layout = "
 // CHECK:             affine.for %arg6 = 0 to 16 {
 // CHECK:               %1 = affine.load %arg1[%arg5, %arg4 + %arg6] : memref<32x32xf32, #map0>
 // CHECK:               affine.store %1, %0[%arg5, %arg6] : memref<32x16xf32, 3>
-// CHECK:             } {accxp.access_bounds_check, beginMap = #map1, domain = #xdomain, endMap = #map2, index = #accln<"index{j,7}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{j,7}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]}
-// CHECK:           } {accxp.access_bounds_check, beginMap = #map1, domain = #xdomain, endMap = #map3, index = #accln<"index{i,6}">, operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{i,6}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]}
+// CHECK:             } {accxp.access_bounds_check, beginMap = #map1, endMap = #map2, index = #accln<"index{j,7}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{j,7}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]}
+// CHECK:           } {accxp.access_bounds_check, beginMap = #map1, endMap = #map3, index = #accln<"index{i,6}">, operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{i,6}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]}
 // CHECK:           affine.for %arg5 = 0 to 4 {
 // CHECK:             affine.for %arg6 = 0 to 16 {
 // CHECK:               affine.for %arg7 = 0 to 32 {