From 6c09b4a5b7b32df658aea5edf66766eb4f9da828 Mon Sep 17 00:00:00 2001
From: Lisa Ong
Date: Wed, 14 Dec 2022 17:36:48 +0800
Subject: [PATCH] Squashed commit of the following:

commit a272d35955fe3a05d2c52f54481af40869a74849
Author: Mason Remy
Date:   Wed Dec 14 06:51:40 2022 +0000

    Merged PR 2987: Add support for max/min/round ops and vectorizing those ops

    Add support for max/min/round ops and vectorizing those ops

commit 375be08681b88df01e2e3043d5094684c134d862
Author: Mason Remy
Date:   Tue Dec 13 23:30:28 2022 +0000

    Merged PR 2963: Control TEMP array allocation location

    Control TEMP array allocation location

commit 929eeafe8263f866bacc77b958953268f58d8b8e
Author: Mason Remy
Date:   Tue Dec 13 21:56:38 2022 +0000

    Merged PR 2962: Expand vpmaddwd matching and add intrinsic call

    Expand vpmaddwd matching and add intrinsic call

    Matches more vpmaddwd cases and creates a pathway to invoking the LLVM
    intrinsic directly.

commit e47a02ed4929e8ba9a085c7870cc5e4fe9f0db62
Author: Mason Remy
Date:   Sat Dec 10 00:40:42 2022 +0000

    Merged PR 2961: Match more vectorization patterns and support vectorized cast

    Match more vectorization patterns and support vectorized cast

    Tries to match and rewrite vectorization patterns:
    - 2-loop interleaving store -> vector shuffle and store
    - simple horizontal reductions (not always efficient currently)
    - vectorized casts

    Makes vectorization of non-innermost loops do a per-op "inplace" unroll
    and vectorize the innermost loop

    TODO : update documentation to describe this behavior better

commit 628983a1a3c5f9ea42dac0cdb7db3cebcb427f43
Author: Mason Remy
Date:   Fri Dec 9 05:54:01 2022 +0000

    Merged PR 2960: Enable marking functions as no-inline-into

    Enable marking functions as no-inline-into

    Functions marked no-inline-into won't inline calls to other functions
    within their body. This is a useful compiler performance (not emitted
    code performance) optimization when there are many nested function calls.

commit d4404ea31cccff456a28ef6998403d228e427507
Author: Denny Sun
Date:   Fri Dec 9 00:40:16 2022 +0000

    Merged PR 2986: [output array] Emit range function with input_output type arguments

    Instead of using the output type, we use input_output to generate two
    functions for the Range function. Now Accera can successfully generate
    code for the Range function.

commit 7d867a33afc36a1a2fa68b49f507b6ad202c14ce
Author: Mason Remy
Date:   Thu Dec 8 22:12:14 2022 +0000

    Merged PR 2959: Improved affine for op range simplification

    Improved affine for op range simplification

    Add range value / constant-cmp-result patterns and affine for op range
    simplifications to the affine simplification pass and run it after
    inlining functions.

    When inlining a dynamically-sized function into a statically-sized
    function, this change is useful for resolving the dynamic ranges to
    constants and pruning dynamic-range loops that are not needed given the
    specific constant value being used.

commit 511112c61b513c5d8d7ed4dba06ee266d5affbca
Author: Mason Remy
Date:   Thu Dec 8 17:14:00 2022 +0000

    Merged PR 2958: Hack to erase loops in a nest to support nest-of-nest or overfused scenarios

    Hack to erase loops in a nest to support nest-of-nest or overfused
    scenarios

    This change enables an action plan to erase loops. Typically this would
    be used when an outer nest traverses tiles and invokes an inner nest (or
    multiple nests) which operate within each tile.
The outer nest still needs to cover the full iteration space, however after splitting by the tile sizes a user will not want the outer nest to perform the inner loops commit 5dd35c423e3878a8f490de07ca21d3ac261c6224 Author: Lisa Ong Date: Wed Dec 7 01:59:14 2022 +0000 Merged PR 2985: [release] Rev docs to 1.2.13 commit b5697107f084bf910d4d77e75e67a90363855375 Author: Captain Jack Sparrow Date: Wed Dec 7 00:57:08 2022 +0000 Merged PR 2983: Increase timeouts of GPU benchmarks Increase timeouts of GPU benchmarks commit 05c096f116216fbc9505c7d9a6f1e88b7626411f Author: Mason Remy Date: Sat Dec 3 01:25:01 2022 +0000 Merged PR 2982: Work around bug with redundant splits of dynamic dimensions Work around bug with redundant splits of dynamic dimensions commit 4056d3177c5b14987e4c5fcd4aa91ddac67c4ed1 Author: Kern Handa Date: Wed Nov 30 07:55:06 2022 +0000 Merged PR 2972: Build both static and dynamic binaries by default, put both in aux dependencies commit b79602b9cf543b0852c7e0c85e548970d5ac7fbb Author: Kern Handa Date: Tue Nov 29 22:34:04 2022 +0000 Merged PR 2975: Updates llc/opt build flags to enable more optimizations by default Updates llc/opt build flags to enable more optimizations by default commit 8a856b8af10227538ebb72486bd0bfd52af98873 Author: Kern Handa Date: Tue Nov 29 21:49:40 2022 +0000 Merged PR 2977: Updates CMake to do FindPython before pybind11 config Updates CMake to do FindPython before pybind11 config commit 6d05fc0e8a6d1933d7507cfa8b6838c04606a798 Author: Lisa Ong Date: Tue Nov 22 22:34:50 2022 +0000 Merged PR 2955: Reduce Linux PR runtime to under 60mins Filter DEV_MODE reruns to dsl_tests.py, this is not comprehensive and is a best effort. --- .azure/cuda/cuda-benchmark-fp16-bert.yml | 2 +- .azure/linux-pr.yml | 2 +- .azure/rocm/rocm-benchmark-fp16-bert.yml | 2 +- .azure/rocm/rocm-benchmark-fp16-big.yml | 2 +- .azure/rocm/rocm-benchmark-fp16.yml | 2 +- .azure/rocm/rocm-benchmark-fp32-bert.yml | 2 +- .azure/rocm/rocm-benchmark-fp32-big.yml | 2 +- .azure/rocm/rocm-benchmark-fp32.yml | 2 +- CMake/AddPyBind11.cmake | 5 +- CMakeLists.txt | 2 +- accera/CMakeLists.txt | 1 + accera/acc-opt/test/commandline.mlir | 1 + accera/acc-opt/test/thrifty_caching.mlir | 4 +- accera/acc-opt/test/value_mlir_test.cpp | 62 +- accera/acc-translate/CMakeLists.txt | 16 + .../acc-translate/src/AcceraTranslateMain.cpp | 28 + accera/acc-translate/src/CMakeLists.txt | 7 + .../acc-translate/src/Target/CMakeLists.txt | 6 + .../Target/Cpp/AcceraDialectCppPrinter.cpp | 2 +- .../Target/Cpp/AffineDialectCppPrinter.cpp | 7 +- .../src/Target/LLVMIR/CMakeLists.txt | 24 + .../LLVMIR/IntrinsicToLLVMIRTranslation.cpp | 50 + .../LLVMIR/IntrinsicToLLVMIRTranslation.h | 27 + accera/accc/accc.py | 16 +- accera/ir/CMakeLists.txt | 34 + accera/ir/include/CMakeLists.txt | 1 + accera/ir/include/Common.td | 6 + accera/ir/include/IRUtil.h | 11 +- .../ir/include/intrinsics/AcceraIntrinsics.td | 69 + .../intrinsics/AcceraIntrinsicsDialect.h | 18 + accera/ir/include/intrinsics/CMakeLists.txt | 10 + accera/ir/include/value/ValueAttrs.td | 4 +- accera/ir/include/value/ValueDialect.h | 4 + accera/ir/include/value/ValueOps.td | 66 +- accera/ir/src/DialectRegistry.cpp | 2 + accera/ir/src/IRUtil.cpp | 126 +- .../intrinsics/AcceraIntrinsicsDialect.cpp | 32 + .../ir/src/nest/LoopNestAffineConstraints.cpp | 46 +- accera/ir/src/nest/LoopNestBuilder.cpp | 2 +- accera/python/accera/Debug.py | 2 - accera/python/accera/Package.py | 58 +- accera/python/accera/Targets.py | 1 + accera/python/accera/__init__.py | 4 +- 
accera/python/accera/lang/Array.py | 12 +- accera/python/accera/lang/Dimension.py | 17 + accera/python/accera/lang/Function.py | 24 +- accera/python/accera/lang/Nest.py | 4 +- accera/python/accera/lang/Plan.py | 16 +- accera/python/accera/lang/__init__.py | 2 +- accera/python/accera/test/dsl_tests.py | 886 ++++++++- accera/python/accera/test/smoke_tests.py | 583 +++++- accera/python/lib/src/ContainerTypes.cpp | 7 +- accera/python/lib/src/ExecutionPlanTypes.cpp | 3 +- accera/python/lib/src/PackagingTypes.cpp | 15 +- accera/transforms/include/AcceraPasses.h | 1 + accera/transforms/include/AcceraPasses.td | 1 + .../include/affine/AffineSimplifications.h | 3 +- .../exec/ExecutionPlanToAffineLoweringPass.h | 1 + .../include/util/RangeValueUtilities.h | 2 + .../include/util/VectorizationUtil.h | 6 +- .../include/value/RangeValueOptimizePass.h | 3 + accera/transforms/src/AcceraPasses.cpp | 1 + .../src/affine/AffineSimplifications.cpp | 147 +- .../ExecutionPlanToAffineLoweringPass.cpp | 34 +- .../transforms/src/nest/LoopNestToValue.cpp | 14 +- .../src/nest/LoopNestToValueFunc.cpp | 17 +- .../src/util/RangeValueUtilities.cpp | 148 +- .../transforms/src/util/VectorizationUtil.cpp | 1624 ++++++++++++++--- .../src/value/RangeValueOptimizePass.cpp | 299 ++- .../src/value/ValueFuncToTargetPass.cpp | 15 +- .../src/value/ValueSimplifyPass.cpp | 2 +- .../src/value/ValueToLLVMLoweringPass.cpp | 112 +- .../src/value/ValueToStandardLoweringPass.cpp | 147 +- accera/value/include/EmitterContext.h | 6 + accera/value/include/FunctionDeclaration.h | 8 + accera/value/include/MLIREmitterContext.h | 2 + accera/value/include/Plan.h | 2 + accera/value/include/ScalarOperations.h | 3 +- accera/value/include/ValueType.h | 8 +- accera/value/src/EmitterContext.cpp | 5 + accera/value/src/FunctionDeclaration.cpp | 14 + accera/value/src/MLIREmitterContext.cpp | 79 +- accera/value/src/Plan.cpp | 13 + accera/value/src/ScalarOperations.cpp | 28 +- docs/.bumpversion.cfg | 2 +- docs/Case Studies/CONTRIBUTING.md | 2 +- docs/Case Studies/README.md | 2 +- docs/Install/Building_on_MacOS.md | 2 +- docs/Install/Building_on_Ubuntu.md | 2 +- docs/Install/Building_on_Windows.md | 2 +- docs/Install/Installing_Accera_on_MacOS.md | 2 +- docs/Install/Installing_Accera_on_Ubuntu.md | 2 +- docs/Install/Installing_Accera_on_Windows.md | 2 +- docs/Install/README.md | 2 +- docs/Manual/00 Introduction.md | 2 +- docs/Manual/01 Arrays and Scalars.md | 2 +- docs/Manual/02 Simple Affine Loop Nests.md | 2 +- docs/Manual/03 Schedules.md | 2 +- docs/Manual/04 Fusing.md | 2 +- docs/Manual/05 Targets.md | 2 +- docs/Manual/06 Plans - Caching.md | 2 +- ...07 Plans - Operations and Optimizations.md | 2 +- .../08 Deferred Layout of Constant Arrays.md | 2 +- docs/Manual/09 Parameters.md | 2 +- docs/Manual/10 Packages.md | 2 +- docs/Manual/README.md | 2 +- docs/Reference/accera.md | 4 +- docs/Reference/classes/Array/Array.md | 4 +- docs/Reference/classes/Array/Layout.md | 4 +- docs/Reference/classes/Array/Role.md | 4 +- .../classes/Array/deferred_layout.md | 4 +- docs/Reference/classes/Array/sub_array.md | 4 +- docs/Reference/classes/Dimension/Dimension.md | 4 +- docs/Reference/classes/Dimension/Role.md | 4 +- docs/Reference/classes/Nest/Nest.md | 4 +- docs/Reference/classes/Nest/create_plan.md | 4 +- .../Reference/classes/Nest/create_schedule.md | 4 +- docs/Reference/classes/Nest/get_indices.md | 4 +- .../Reference/classes/Nest/iteration_logic.md | 4 +- docs/Reference/classes/Package/Format.md | 4 +- docs/Reference/classes/Package/Mode.md | 4 +- 
docs/Reference/classes/Package/Package.md | 4 +- docs/Reference/classes/Package/Platform.md | 4 +- docs/Reference/classes/Package/add.md | 4 +- .../classes/Package/add_description.md | 4 +- docs/Reference/classes/Package/build.md | 4 +- docs/Reference/classes/Plan/bind.md | 4 +- docs/Reference/classes/Plan/cache.md | 4 +- docs/Reference/classes/Plan/kernelize.md | 4 +- docs/Reference/classes/Plan/parallelize.md | 4 +- docs/Reference/classes/Plan/tensorize.md | 4 +- docs/Reference/classes/Plan/unroll.md | 4 +- docs/Reference/classes/Plan/vectorize.md | 4 +- docs/Reference/classes/Scalar/Scalar.md | 4 +- .../Reference/classes/Schedule/create_plan.md | 4 +- .../classes/Schedule/is_valid_loop_order.md | 4 +- docs/Reference/classes/Schedule/pad.md | 4 +- docs/Reference/classes/Schedule/reorder.md | 4 +- docs/Reference/classes/Schedule/skew.md | 4 +- docs/Reference/classes/Schedule/split.md | 4 +- docs/Reference/classes/Schedule/tile.md | 4 +- docs/Reference/classes/Target/Architecture.md | 4 +- docs/Reference/classes/Target/Category.md | 4 +- docs/Reference/classes/Target/Model.md | 4 +- docs/Reference/classes/Target/Runtime.md | 4 +- docs/Reference/classes/Target/Target.md | 4 +- docs/Reference/enumerations/CacheStrategy.md | 4 +- .../enumerations/MMASchedulingPolicy.md | 4 +- docs/Reference/enumerations/MMAShape.md | 4 +- docs/Reference/enumerations/ScalarType.md | 4 +- docs/Reference/functions/cast.md | 4 +- docs/Reference/functions/create_dimensions.md | 4 +- .../functions/create_parameter_grid.md | 4 +- docs/Reference/functions/create_parameters.md | 4 +- docs/Reference/functions/fuse.md | 4 +- docs/Reference/safety_analysis.md | 4 +- docs/Tutorials/Hello_MatMul.md | 2 +- docs/Tutorials/Hello_MatMul_GPU.md | 2 +- docs/Tutorials/Optimized_MatMul.md | 2 +- docs/Tutorials/Pi3_Cross_Compilation.md | 2 +- docs/Tutorials/README.md | 2 +- 161 files changed, 4582 insertions(+), 756 deletions(-) create mode 100644 accera/acc-translate/src/CMakeLists.txt create mode 100644 accera/acc-translate/src/Target/CMakeLists.txt create mode 100644 accera/acc-translate/src/Target/LLVMIR/CMakeLists.txt create mode 100644 accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp create mode 100644 accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h create mode 100644 accera/ir/include/intrinsics/AcceraIntrinsics.td create mode 100644 accera/ir/include/intrinsics/AcceraIntrinsicsDialect.h create mode 100644 accera/ir/include/intrinsics/CMakeLists.txt create mode 100644 accera/ir/src/intrinsics/AcceraIntrinsicsDialect.cpp diff --git a/.azure/cuda/cuda-benchmark-fp16-bert.yml b/.azure/cuda/cuda-benchmark-fp16-bert.yml index f7fe35fe..a6ff9236 100644 --- a/.azure/cuda/cuda-benchmark-fp16-bert.yml +++ b/.azure/cuda/cuda-benchmark-fp16-bert.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "CUDA_Benchmarking_FP16_BERT" - timeoutInMinutes: 480 + timeoutInMinutes: 600 pool: name: LinuxNVGPUPool diff --git a/.azure/linux-pr.yml b/.azure/linux-pr.yml index eadbb565..8031b316 100644 --- a/.azure/linux-pr.yml +++ b/.azure/linux-pr.yml @@ -89,7 +89,7 @@ steps: displayName: Run all ctest targets workingDirectory: "$(Build.SourcesDirectory)/build" - - bash: python -m unittest discover accera/test *.py + - bash: python -m unittest discover accera/test dsl_tests.py displayName: Run tests in DEV_MODE workingDirectory: "$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.9" diff --git a/.azure/rocm/rocm-benchmark-fp16-bert.yml b/.azure/rocm/rocm-benchmark-fp16-bert.yml index 69ce40dd..f091b042 100644 
--- a/.azure/rocm/rocm-benchmark-fp16-bert.yml +++ b/.azure/rocm/rocm-benchmark-fp16-bert.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP16_BERT" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/.azure/rocm/rocm-benchmark-fp16-big.yml b/.azure/rocm/rocm-benchmark-fp16-big.yml index e74faa92..94713bcb 100644 --- a/.azure/rocm/rocm-benchmark-fp16-big.yml +++ b/.azure/rocm/rocm-benchmark-fp16-big.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP16_Big" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/.azure/rocm/rocm-benchmark-fp16.yml b/.azure/rocm/rocm-benchmark-fp16.yml index c92c6d9b..0177f35e 100644 --- a/.azure/rocm/rocm-benchmark-fp16.yml +++ b/.azure/rocm/rocm-benchmark-fp16.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP16" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/.azure/rocm/rocm-benchmark-fp32-bert.yml b/.azure/rocm/rocm-benchmark-fp32-bert.yml index 6b46c7bd..2f620e82 100644 --- a/.azure/rocm/rocm-benchmark-fp32-bert.yml +++ b/.azure/rocm/rocm-benchmark-fp32-bert.yml @@ -47,7 +47,7 @@ jobs: export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8 export LD_LIBRARY_PATH=${ROCM_PATH}/lib echo "LD_LIBRARY_PATH" ${LD_LIBRARY_PATH} - python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --category bert --type s --target 'AMD MI100' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --check + python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --category bert --type s --target 'AMD MI100' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose displayName: Run fp32 benchmarks BERT workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers" env: diff --git a/.azure/rocm/rocm-benchmark-fp32-big.yml b/.azure/rocm/rocm-benchmark-fp32-big.yml index 0e138c36..2218c889 100644 --- a/.azure/rocm/rocm-benchmark-fp32-big.yml +++ b/.azure/rocm/rocm-benchmark-fp32-big.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP32_Big" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/.azure/rocm/rocm-benchmark-fp32.yml b/.azure/rocm/rocm-benchmark-fp32.yml index e6d27aed..3052884f 100644 --- a/.azure/rocm/rocm-benchmark-fp32.yml +++ b/.azure/rocm/rocm-benchmark-fp32.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP32" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/CMake/AddPyBind11.cmake b/CMake/AddPyBind11.cmake index b25ab36b..7622bb82 100644 --- a/CMake/AddPyBind11.cmake +++ b/CMake/AddPyBind11.cmake @@ -5,7 +5,7 @@ include(FetchContent) -set(PYBIND_VERSION "2.6.2" CACHE STRING "Version string to use for pybind11") +set(PYBIND_VERSION "2.10.1" CACHE STRING "Version string to use for pybind11") set(FETCHCONTENT_QUIET FALSE) @@ -16,6 +16,9 @@ FetchContent_Declare( FetchContent_GetProperties(pybind11) +set(Python3_FIND_REGISTRY LAST) +find_package(Python3 COMPONENTS Interpreter Development) + if(NOT pybind11_POPULATED) FetchContent_Populate(pybind11) add_subdirectory(${pybind11_SOURCE_DIR} ${pybind11_BINARY_DIR}) diff --git a/CMakeLists.txt b/CMakeLists.txt index b5f95b7f..0b20f53a 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -123,7 +123,7 @@ 
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON) set(CMAKE_PLATFORM_NO_VERSIONED_SONAME ON) if(MSVC) # Set Visual Studio-specific options - add_definitions(-DUNICODE -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS) + add_definitions(-DUNICODE -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -D_SILENCE_NONFLOATING_COMPLEX_DEPRECATION_WARNING) add_compile_options(/utf-8) add_compile_options(/MP) add_compile_options(/bigobj) diff --git a/accera/CMakeLists.txt b/accera/CMakeLists.txt index 8cc3b4a3..f72de227 100644 --- a/accera/CMakeLists.txt +++ b/accera/CMakeLists.txt @@ -4,6 +4,7 @@ #################################################################################################### set(ACCERA_LIBRARIES_DIR ${CMAKE_CURRENT_LIST_DIR}) +set(ACCERA_BIN_DIR ${CMAKE_CURRENT_BINARY_DIR}) include_directories(${ACCERA_LIBRARIES_DIR}) add_subdirectory(acc-opt) diff --git a/accera/acc-opt/test/commandline.mlir b/accera/acc-opt/test/commandline.mlir index 34a1c0c3..d2b4d2f2 100644 --- a/accera/acc-opt/test/commandline.mlir +++ b/accera/acc-opt/test/commandline.mlir @@ -1,6 +1,7 @@ // RUN: acc-opt --show-dialects | FileCheck %s // CHECK: Registered Dialects: // CHECK: accera +// CHECK-NEXT: accintr // CHECK-NEXT: accln // CHECK-NEXT: accv // CHECK-NEXT: accxp diff --git a/accera/acc-opt/test/thrifty_caching.mlir b/accera/acc-opt/test/thrifty_caching.mlir index c8fb4650..7fe325b0 100644 --- a/accera/acc-opt/test/thrifty_caching.mlir +++ b/accera/acc-opt/test/thrifty_caching.mlir @@ -69,8 +69,8 @@ module @test_thrifty_caching_simple_input_cache attributes {llvm.data_layout = " // CHECK: affine.for %arg6 = 0 to 16 { // CHECK: %1 = affine.load %arg1[%arg5, %arg4 + %arg6] : memref<32x32xf32, #map0> // CHECK: affine.store %1, %0[%arg5, %arg6] : memref<32x16xf32, 3> -// CHECK: } {accxp.access_bounds_check, beginMap = #map1, domain = #xdomain, endMap = #map2, index = #accln<"index{j,7}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{j,7}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]} -// CHECK: } {accxp.access_bounds_check, beginMap = #map1, domain = #xdomain, endMap = #map3, index = #accln<"index{i,6}">, operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{i,6}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]} +// CHECK: } {accxp.access_bounds_check, beginMap = #map1, endMap = #map2, index = #accln<"index{j,7}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{j,7}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]} +// CHECK: } {accxp.access_bounds_check, beginMap = #map1, endMap = #map3, index = #accln<"index{i,6}">, operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{i,6}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]} // CHECK: affine.for %arg5 = 0 to 4 { // CHECK: affine.for %arg6 = 0 to 16 { // CHECK: affine.for %arg7 = 0 to 32 { diff --git a/accera/acc-opt/test/value_mlir_test.cpp b/accera/acc-opt/test/value_mlir_test.cpp index 7ce33ed5..d1ceb028 100644 --- a/accera/acc-opt/test/value_mlir_test.cpp +++ b/accera/acc-opt/test/value_mlir_test.cpp @@ -115,7 +115,7 @@ TEST_CASE("function_decl1") .Parameters(Value{ ValueType::Float, MemoryLayout{ { 10 } } }) 
.Define([](Value) {}); CHECK(f3); - // CHECK: accv.func nested @f4_{{[0-9]+}}(%arg0: memref<3x4xf64, #map{{[0-9]*}}>) + // CHECK: accv.func nested @f4_{{[0-9]+}}(%arg0: memref<3x4xf64>) // COM: CHECK: accv.func @f4_{{[0-9]+}}(%arg0: memref<3x4xf64>) // CHECK-NEXT: return // CHECK-NEXT: } @@ -311,9 +311,9 @@ TEST_CASE("mlir_test3") // COM: Doesn't result in emitted code CHECK_NOTHROW(MakeScalar()); - // CHECK-NEXT: [[v0:%[a-z0-9_]+]] = "accv.alloc"() : () -> memref<100xf32, 3> + // CHECK-NEXT: [[v0:%[a-z0-9_]+]] = "accv.alloc"() {allocType = 0 : i64} : () -> memref<100xf32, 3> CHECK_NOTHROW(MakeVector(100)); - // CHECK-NEXT: [[v1:%[a-z0-9_]+]] = "accv.alloc"() : () -> memref<2x3xi16 + // CHECK-NEXT: [[v1:%[a-z0-9_]+]] = "accv.alloc"() {allocType = 0 : i64} : () -> memref<2x3xi16 CHECK_NOTHROW(MakeMatrix(2, 3)); // CHECK-NEXT: return // CHECK-NEXT: } @@ -325,7 +325,7 @@ TEST_CASE("mlir_test3") // CHECK-LABEL: module @mlir_test4 { // CHECK-NEXT: accv.module "mlir_test4" { -// CHECK-NEXT: accv.func nested @foo_{{[0-9]+}}(%arg0: memref<10x10xi32, [[MAP:#map[0-9]*]]>) +// CHECK-NEXT: accv.func nested @foo_{{[0-9]+}}(%arg0: memref<10x10xi32>) // COM: CHECK-NEXT: accv.func @foo_{{[0-9]+}}(%arg0: memref<10x10xi32>) attributes {args_symbol = ["{{[a-z0-9_]+}}"], exec_target = 0 : i64, sym_visibility = "nested"} { // CHECK-NEXT: [[c0:%c[0-9]+]] = arith.constant 0 : index // CHECK-NEXT: [[c10_1:%c[0-9_]+]] = arith.constant 10 : index @@ -372,7 +372,7 @@ TEST_CASE("mlir_test5") .Define([](Scalar i) { CHECK_NOTHROW(StaticAllocate("foo", std::vector{ 1, 2, 3, 4 })); - // CHECK-NEXT: "accv.alloc"() : () -> memref<100xf32, 3> + // CHECK-NEXT: "accv.alloc"() {allocType = 0 : i64} : () -> memref<100xf32, 3> CHECK_NOTHROW(MakeVector(100)); // CHECK-NEXT: return @@ -473,7 +473,7 @@ TEST_CASE("mlir_test11") // CHECK-NEXT: [[c0_0:%c[0-9a-z_]+]] = arith.constant 0 : i32 // CHECK-NEXT: [[c4_0:%c[0-9a-z_]+]] = arith.constant 4 // CHECK-NEXT: [[c4_1:%c[0-9a-z_]+]] = arith.constant 4 - // CHECK-NEXT: [[v0:%[a-z0-9_]+]] = "accv.alloc"() {sym_name = "a"} : () -> memref<1xi32, 3> + // CHECK-NEXT: [[v0:%[a-z0-9_]+]] = "accv.alloc"() {allocType = 0 : i64, sym_name = "a"} : () -> memref<1xi32, 3> Scalar a = MakeVector(1, "a")[0]; Scalar c = 4; // CHECK-NEXT: %[[v1:[a-z0-9_]+]] = arith.index_cast [[c0_0]] : i32 to index @@ -844,10 +844,10 @@ TEST_CASE("mlir_schedule_test_4") // COM: CHECK: memref.subview %arg0[0, %{{[a-z0-9_]+}}] [10, 1] [10, 1] : memref<10x10xf32, #map0> to memref<10xf32, #map3> // COM: CHECK: memref.subview %arg0[%{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}] [3, 4] [1, 1] : memref<10x10xf32, #map0> to memref<3x4xf32, #map4> // COM: CHECK-NEXT: accv.func @MatrixView_{{[0-9]+}}(%arg0: memref<10x10xf32 -// CHECK: "accv.slice"(%arg0, %{{[0-9]+}}, %{{[0-9]+}}) {sliceDimensions = [0, 1]} : (memref<10x10xf32, #map0>, index, index) -> memref -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [0]} : (memref<10x10xf32, #map0>, index) -> memref<10xf32, #map1> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [1]} : (memref<10x10xf32, #map0>, index) -> memref<10xf32, #map2> -// CHECK: "accv.view"(%arg0, %{{[0-9]+}}, %{{[0-9]+}}) : (memref<10x10xf32, #map0>, !accv.range, !accv.range) -> memref<3x4xf32, #map3> +// CHECK: "accv.slice"(%arg0, %{{[0-9]+}}, %{{[0-9]+}}) {sliceDimensions = [0, 1]} : (memref<10x10xf32>, index, index) -> memref +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [0]} : (memref<10x10xf32>, index) -> memref<10xf32, #map0> +// CHECK: "accv.slice"(%arg0, 
%{{[a-z0-9_]+}}) {sliceDimensions = [1]} : (memref<10x10xf32>, index) -> memref<10xf32, #map1> +// CHECK: "accv.view"(%arg0, %{{[0-9]+}}, %{{[0-9]+}}) : (memref<10x10xf32>, !accv.range, !accv.range) -> memref<3x4xf32, #map2> TEST_CASE("mlir_matrix_view_test") { DeclareFunction("MatrixView") @@ -874,13 +874,13 @@ TEST_CASE("mlir_matrix_view_test") // COM: CHECK: memref.subview %arg0[0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}] [5, 1, 1] [150, 15, 1] : memref<5x10x15xf32, #map0> to memref<5xf32, #map7> // COM: CHECK: memref.subview %arg0[%{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}] [3, 2, 1] [1, 1, 1] : memref<5x10x15xf32, #map0> to memref<3x2x1xf32, #map8> // COM: CHECK-NEXT: accv.func @TensorView_{{[0-9]+}}(%arg0: memref<5x10x15xf32 -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [0]} : (memref<5x10x15xf32, #map0>, index) -> memref<10x15xf32, #map1> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [1]} : (memref<5x10x15xf32, #map0>, index) -> memref<5x15xf32, #map2> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [2]} : (memref<5x10x15xf32, #map0>, index) -> memref<5x10xf32, #map3> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [0, 1]} : (memref<5x10x15xf32, #map0>, index, index) -> memref<15xf32, #map4> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [0, 2]} : (memref<5x10x15xf32, #map0>, index, index) -> memref<10xf32, #map5> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [1, 2]} : (memref<5x10x15xf32, #map0>, index, index) -> memref<5xf32, #map6> -// CHECK: "accv.view"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) : (memref<5x10x15xf32, #map0>, !accv.range, !accv.range, !accv.range) -> memref<3x2x1xf32, #map7> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [0]} : (memref<5x10x15xf32>, index) -> memref<10x15xf32, #map0> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [1]} : (memref<5x10x15xf32>, index) -> memref<5x15xf32, #map1> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [2]} : (memref<5x10x15xf32>, index) -> memref<5x10xf32, #map2> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [0, 1]} : (memref<5x10x15xf32>, index, index) -> memref<15xf32, #map3> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [0, 2]} : (memref<5x10x15xf32>, index, index) -> memref<10xf32, #map4> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [1, 2]} : (memref<5x10x15xf32>, index, index) -> memref<5xf32, #map5> +// CHECK: "accv.view"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) : (memref<5x10x15xf32>, !accv.range, !accv.range, !accv.range) -> memref<3x2x1xf32, #map6> TEST_CASE("mlir_tensor_view_test") { DeclareFunction("TensorView") @@ -957,8 +957,8 @@ TEST_CASE("mlir_intrinsic_test") // COM: CHECK-NEXT: %[[v8:[0-9]+]] = "accv.get_element"(%[[v4]]) : (memref) -> f32 // COM: CHECK-NEXT: "accv.copy"(%[[v8]], %[[v6]]) : (f32, memref) -> () // CHECK-NEXT: [[v2:%[0-9]+]] = "accv.bin_op"([[v0]], %[[v1]]) {predicate = 0 : i64} : (index, index) -> index -// CHECK-NEXT: [[v3:%[0-9]+]] = "accv.slice"(%arg0, [[v0]], [[v2]]) {sliceDimensions = [0, 1]} : (memref<8x18xf32, #map0>, index, index) -> memref -// CHECK-NEXT: [[v4:%[0-9]+]] = "accv.slice"(%arg1, [[v0]], %[[v1]]) {sliceDimensions = [0, 1]} : (memref<8x10xf32, #map1>, index, index) -> memref +// CHECK-NEXT: 
[[v3:%[0-9]+]] = "accv.slice"(%arg0, [[v0]], [[v2]]) {sliceDimensions = [0, 1]} : (memref<8x18xf32>, index, index) -> memref +// CHECK-NEXT: [[v4:%[0-9]+]] = "accv.slice"(%arg1, [[v0]], %[[v1]]) {sliceDimensions = [0, 1]} : (memref<8x10xf32>, index, index) -> memref // CHECK-NEXT: [[v5:%[0-9]+]] = "accv.get_element"([[v3]]) : (memref) -> f32 // CHECK-NEXT: "accv.copy"([[v5]], [[v4]]) : (f32, memref) -> () TEST_CASE("mlir_index_arithmetic_test") @@ -1024,7 +1024,7 @@ TEST_CASE("mlir_scalar_float_test") // COM: CHECK-NEXT: scf.if %[[v4]] { // CHECK-NEXT: [[v0:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v1:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK-NEXT: [[v2:%[0-9]+]] = "accv.slice"(%[[C]], [[v0]], [[v1]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32, #map0>, index, index) -> memref + // CHECK-NEXT: [[v2:%[0-9]+]] = "accv.slice"(%[[C]], [[v0]], [[v1]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32>, index, index) -> memref // CHECK-NEXT: [[v3:%[0-9]+]] = "accv.get_element"([[v2]]) : (memref) -> f32 // CHECK-NEXT: [[v4:%[0-9]+]] = "accv.cmp"([[v3]], %[[A]]) {predicate = 1 : i64} : (f32, f32) -> i1 // CHECK-NEXT: scf.if [[v4]] { @@ -1045,7 +1045,7 @@ TEST_CASE("mlir_scalar_float_test") // CHECK: [[v3:%[0-9]+]] = "accv.bin_op"([[v2]], [[CST0]]) {predicate = 0 : i64} : (f32, f32) -> f32 // CHECK-NEXT: [[v0:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v1:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK-NEXT: [[Cslice:%[0-9]+]] = "accv.slice"(%[[C]], [[v0]], [[v1]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32, #map0>, index, index) -> memref + // CHECK-NEXT: [[Cslice:%[0-9]+]] = "accv.slice"(%[[C]], [[v0]], [[v1]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32>, index, index) -> memref // CHECK-NEXT: "accv.copy"([[v3]], [[Cslice]]) : (f32, memref) -> () C(idx, idx) = B[idx] + Cast(c, A.GetType()); @@ -1059,7 +1059,7 @@ TEST_CASE("mlir_scalar_float_test") // CHECK-NEXT: [[v0:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v1:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v2:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK-NEXT: [[Dslice:%[0-9]+]] = "accv.slice"(%[[D]], [[v0]], [[v1]], [[v2]]) {sliceDimensions = [0, 1, 2]} : (memref<1000x1000x1000xf32, #map1>, index, index, index) -> memref + // CHECK-NEXT: [[Dslice:%[0-9]+]] = "accv.slice"(%[[D]], [[v0]], [[v1]], [[v2]]) {sliceDimensions = [0, 1, 2]} : (memref<1000x1000x1000xf32>, index, index, index) -> memref auto dVal = D(idx, idx, idx); // CHECK-NEXT: %[[v3:[0-9]+]] = "accv.get_element"([[Dslice]]) : (memref) -> f32 @@ -1077,7 +1077,7 @@ TEST_CASE("mlir_scalar_float_test") // CHECK-NEXT: [[v0:%[0-9]+]] = "accv.bin_op"(%[[IDX]], [[c2_0]]) {predicate = 0 : i64} : (i32, i32) -> i32 // CHECK-DAG: [[v1:%[0-9]+]] = arith.index_cast [[v0]] : i32 to index // CHECK-DAG: [[v2:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK: [[v3:%[0-9]+]] = "accv.slice"(%[[C]], [[v1]], [[v2]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32, #map0>, index, index) -> memref + // CHECK: [[v3:%[0-9]+]] = "accv.slice"(%[[C]], [[v1]], [[v2]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32>, index, index) -> memref // CHECK-NEXT: [[v4:%[0-9]+]] = "accv.get_element"([[v3]]) : (memref) -> f32 // CHECK-NEXT: "accv.copy"([[v4]], [[Dslice]]) : (f32, memref) -> () dVal = C(idx + c, idx); @@ -1100,7 +1100,7 @@ TEST_CASE("mlir_scalar_float_test") // CHECK-NEXT: [[v3:%[0-9]+]] = arith.index_cast %[[IDX]] : 
i32 to index // CHECK-NEXT: [[v4:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v5:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK-NEXT: [[Eslice:%[0-9]+]] = "accv.slice"(%[[E]], [[v2]], [[v3]], [[v4]], [[v5]]) {sliceDimensions = [0, 1, 2, 3]} : (memref<10000x10000x10000x10000xf32, #map2>, index, index, index, index) -> memref + // CHECK-NEXT: [[Eslice:%[0-9]+]] = "accv.slice"(%[[E]], [[v2]], [[v3]], [[v4]], [[v5]]) {sliceDimensions = [0, 1, 2, 3]} : (memref<10000x10000x10000x10000xf32>, index, index, index, index) -> memref auto eVal = E(idx, idx, idx, idx); // CHECK-NEXT: %[[v7:[0-9]+]] = "accv.get_element"([[Eslice]]) : (memref) -> f32 @@ -2287,11 +2287,11 @@ TEST_CASE("jit_float_cached_matrix_multiply_test") // JIT-LABEL: A*B: Print("A*B:\n"s); - // JIT-NEXT: 20832.000000 21328.000000 21824.000000 22320.000000 22816.000000 23312.000000 23808.000000 24304.000000 - // JIT-NEXT: 21824.000000 22352.000000 22880.000000 23408.000000 23936.000000 24464.000000 24992.000000 25520.000000 - // JIT-NEXT: 22816.000000 23376.000000 23936.000000 24496.000000 25056.000000 25616.000000 26176.000000 26736.000000 - // JIT-NEXT: 23808.000000 24400.000000 24992.000000 25584.000000 26176.000000 26768.000000 27360.000000 27952.000000 - // JIT-NEXT: 24800.000000 25424.000000 26048.000000 26672.000000 27296.000000 27920.000000 28544.000000 29168.000000 + // JIT: 20832.000000 21328.000000 21824.000000 22320.000000 22816.000000 23312.000000 23808.000000 24304.000000 + // JIT: 21824.000000 22352.000000 22880.000000 23408.000000 23936.000000 24464.000000 24992.000000 25520.000000 + // JIT: 22816.000000 23376.000000 23936.000000 24496.000000 25056.000000 25616.000000 26176.000000 26736.000000 + // JIT: 23808.000000 24400.000000 24992.000000 25584.000000 26176.000000 26768.000000 27360.000000 27952.000000 + // JIT: 24800.000000 25424.000000 26048.000000 26672.000000 27296.000000 27920.000000 28544.000000 29168.000000 Print(C); }); SUCCEED(); @@ -2404,7 +2404,7 @@ TEST_CASE("jit_matrix_transpose_test") .Public(true) .Decorated(false) .Define([=]() { - // COM: CHECK: [[m:%[0-9]+]] = "accv.alloc"() : () -> memref<3x4xf32, #map0, 3> + // COM: CHECK: [[m:%[0-9]+]] = "accv.alloc"() {allocType = 0 : i64} : () -> memref<3x4xf32, #map0, 3> Matrix m = MakeMatrix(M, N); CHECK(m.GetMatrixLayout() == Matrix::MatrixLayout::rowMajor); @@ -2990,7 +2990,7 @@ TEST_CASE("jit_array_reorder_test1") // COM: CHECK: [[map1:#map[0-9]+]] = affine_map<(d0, d1, d2) -> // COM: CHECK-LABEL: module @jit_array_reorder_test2 { // COM: CHECK-NEXT: accv.module "jit_array_reorder_test2" { -// COM: CHECK: %0 = "accv.alloc"() +// COM: CHECK: %0 = "accv.alloc"() {allocType = 0 : i64} // COM: CHECK-SAME: () -> memref<2x3x4xi32, [[map0]], 3> // COM: CHECK: %1 = memref.transpose %0 (d0, d1, d2) -> (d1, d2, d0) // COM: JIT-LABEL: @jit_array_reorder_test2 diff --git a/accera/acc-translate/CMakeLists.txt b/accera/acc-translate/CMakeLists.txt index 9fb7df89..10d90976 100644 --- a/accera/acc-translate/CMakeLists.txt +++ b/accera/acc-translate/CMakeLists.txt @@ -3,6 +3,15 @@ # Licensed under the MIT License. See LICENSE in the project root for license information. 
#################################################################################################### +# setup for using LLVM and MLIR +list(APPEND CMAKE_MODULE_PATH "${LLVM_DIR}") +list(APPEND CMAKE_MODULE_PATH "${MLIR_CMAKE_DIR}") +include(TableGen) +include(AddLLVM) +include(AddMLIR) + +add_subdirectory(src) + set(util_name acc-translate) set(target_src @@ -18,6 +27,7 @@ set(target_src src/Target/Cpp/AMDGPU.cpp src/Target/Cpp/VectorDialectCppPrinter.cpp src/Target/Cpp/LLVMDialectCppPrinter.cpp + src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp ) set(target_include @@ -33,6 +43,7 @@ set(target_include src/Target/Cpp/AMDGPU.h src/Target/Cpp/VectorDialectCppPrinter.h src/Target/Cpp/LLVMDialectCppPrinter.h + src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h ) @@ -45,6 +56,9 @@ source_group("include" FILES ${util_include}) add_executable(${util_name} ${util_src} ${util_include}) target_include_directories(${util_name} PRIVATE ${ACCERA_ROOT}/accera) +get_property(dialect_libs GLOBAL PROPERTY MLIR_DIALECT_LIBS) +get_property(translation_libs GLOBAL PROPERTY MLIR_TRANSLATION_LIBS) + target_link_libraries( ${util_name} PRIVATE MLIROptLib @@ -53,6 +67,8 @@ target_link_libraries( transforms value mlirHelpers + ${translation_libs} + ${dialect_libs} ) copy_shared_libraries(${util_name}) diff --git a/accera/acc-translate/src/AcceraTranslateMain.cpp b/accera/acc-translate/src/AcceraTranslateMain.cpp index 8360b341..4fc427f4 100644 --- a/accera/acc-translate/src/AcceraTranslateMain.cpp +++ b/accera/acc-translate/src/AcceraTranslateMain.cpp @@ -5,9 +5,13 @@ //////////////////////////////////////////////////////////////////////////////////////////////////// #include +#include #include #include +#include +#include + #include #include #include @@ -20,6 +24,8 @@ #include "Target/Cpp/TranslateToCpp.h" +#include "Target/LLVMIR/IntrinsicToLLVMIRTranslation.h" + using namespace mlir; @@ -50,11 +56,33 @@ inline void registerArgoTranslations() return true; }(); } + +void registerAcceraToLLVMIRTranslation() { + TranslateFromMLIRRegistration registration( + "acc-to-llvmir", + [](ModuleOp module, llvm::raw_ostream &output) { + llvm::LLVMContext llvmContext; + auto llvmModule = translateModuleToLLVMIR(module, llvmContext); + if (!llvmModule) + return failure(); + + llvmModule->print(output, nullptr); + return success(); + }, + [](DialectRegistry ®istry) { + registerAllDialects(registry); + accera::ir::GetDialectRegistry().appendTo(registry); + accera::transforms::intrinsics::registerIntrinsicsDialectTranslation(registry); + registerAllToLLVMIRTranslations(registry); + }); +} } // namespace int main(int argc, char** argv) { registerArgoTranslations(); + registerAcceraToLLVMIRTranslation(); + mlir::registerAllTranslations(); return failed(mlirTranslateMain(argc, argv, "acc-translate")); } diff --git a/accera/acc-translate/src/CMakeLists.txt b/accera/acc-translate/src/CMakeLists.txt new file mode 100644 index 00000000..a7edbd82 --- /dev/null +++ b/accera/acc-translate/src/CMakeLists.txt @@ -0,0 +1,7 @@ +#################################################################################################### +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. See LICENSE in the project root for license information. 
+#################################################################################################### + +add_subdirectory(Target) + diff --git a/accera/acc-translate/src/Target/CMakeLists.txt b/accera/acc-translate/src/Target/CMakeLists.txt new file mode 100644 index 00000000..6ce7b2ba --- /dev/null +++ b/accera/acc-translate/src/Target/CMakeLists.txt @@ -0,0 +1,6 @@ +#################################################################################################### +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. See LICENSE in the project root for license information. +#################################################################################################### + +add_subdirectory(LLVMIR) diff --git a/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp index 5505582b..d1ec2a10 100644 --- a/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp @@ -203,7 +203,7 @@ namespace cpp_printer const auto srcMemSpace = srcMemrefType.getMemorySpaceAsInt(); auto elementType = srcMemrefType.getElementType(); AffineDialectCppPrinter* affineDialectPrinter = dynamic_cast(printer->getDialectPrinter("Affine")); - auto srcMap = srcMemrefType.getLayout().getAffineMap(); + auto srcMap = mlir::getStridedLinearLayoutMap(srcMemrefType); const auto srcRowMajor = mlir::canonicalizeStridedLayout(srcMemrefType).getLayout().isIdentity(); auto dstMemrefType = blockLoadOp.dest().getType().cast(); diff --git a/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp index b70c2731..9c02dcdf 100644 --- a/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp @@ -47,7 +47,12 @@ void AffineMapVisitor::visit(Type type) } else if (auto memRefType = type.dyn_cast()) { - visit(AffineMapAttr::get(memRefType.getLayout().getAffineMap())); + // Flatten the memref layout map to a N-D -> 1-D map + // This will convert the map for an identity mapped layout like memref<16x16xf32> + // from (d0, d1) -> (d0, d1) + // to (d0, d1) -> (d0 * 16 + d1) + auto stridedLinearLayoutMap = mlir::getStridedLinearLayoutMap(memRefType); + visit(AffineMapAttr::get(stridedLinearLayoutMap)); } else if (auto shapedType = type.dyn_cast()) { diff --git a/accera/acc-translate/src/Target/LLVMIR/CMakeLists.txt b/accera/acc-translate/src/Target/LLVMIR/CMakeLists.txt new file mode 100644 index 00000000..58994a74 --- /dev/null +++ b/accera/acc-translate/src/Target/LLVMIR/CMakeLists.txt @@ -0,0 +1,24 @@ +add_mlir_translation_library(IntrinsicToLLVMIRTranslation + IntrinsicToLLVMIRTranslation.cpp + + ADDITIONAL_HEADER_DIRS + ${ACCERA_BIN_DIR}/accera/ir/include + + DEPENDS + MLIRAcceraIntrinsics + AcceraIntrinsicsConversionsIncGen + + LINK_COMPONENTS + Core + + LINK_LIBS PUBLIC + MLIRIR + MLIRAcceraIntrinsics + MLIRLLVMIR + MLIRSupport + MLIRTargetLLVMIRExport + ) + +target_include_directories(IntrinsicToLLVMIRTranslation PUBLIC + ${ACCERA_BIN_DIR}/ir/include +) diff --git a/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp b/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp new file mode 100644 index 00000000..e1647492 --- /dev/null +++ b/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp @@ -0,0 +1,50 @@ 
+//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +#include "IntrinsicToLLVMIRTranslation.h" + +#include + +#include "mlir/IR/Operation.h" +#include "mlir/Target/LLVMIR/ModuleTranslation.h" + +#include "llvm/IR/IRBuilder.h" +#include "llvm/IR/IntrinsicsX86.h" + +using namespace mlir; +using namespace mlir::LLVM; +using namespace accera::transforms::intrinsics; + +namespace { +class IntrinsicsDialectLLVMIRTranslationInterface + : public LLVMTranslationDialectInterface { +public: + using LLVMTranslationDialectInterface::LLVMTranslationDialectInterface; + + /// Translates the given operation to LLVM IR using the provided IR builder + /// and saving the state in `moduleTranslation`. + LogicalResult + convertOperation(Operation *op, llvm::IRBuilderBase &builder, + LLVM::ModuleTranslation &moduleTranslation) const final { + Operation &opInst = *op; +#include "intrinsics/AcceraIntrinsicsConversions.inc" + + return failure(); + } +}; +} // namespace + +void accera::transforms::intrinsics::registerIntrinsicsDialectTranslation(DialectRegistry ®istry) { + registry.insert(); + registry.addDialectInterface(); +} + +void accera::transforms::intrinsics::registerIntrinsicsDialectTranslation(MLIRContext &context) { + DialectRegistry registry; + registerIntrinsicsDialectTranslation(registry); + context.appendDialectRegistry(registry); +} diff --git a/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h b/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h new file mode 100644 index 00000000..5a27797b --- /dev/null +++ b/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h @@ -0,0 +1,27 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. +//////////////////////////////////////////////////////////////////////////////////////////////////// + +#pragma once + +namespace mlir +{ + +class DialectRegistry; +class MLIRContext; + +} // namespace mlir + +namespace accera::transforms::intrinsics +{ + +/// Register the Intrinsic dialect and the translation from it to the LLVM IR +/// in the given registry; +void registerIntrinsicsDialectTranslation(mlir::DialectRegistry& registry); + +/// Register the Intrinsic dialect and the translation from it in the registry +/// associated with the given context. 
+void registerIntrinsicsDialectTranslation(mlir::MLIRContext& context); + +} // namespace accera::transforms::intrinsics diff --git a/accera/accc/accc.py b/accera/accc/accc.py index 08788acd..5871498b 100644 --- a/accera/accc/accc.py +++ b/accera/accc/accc.py @@ -98,7 +98,7 @@ def bstr(val): DEFAULT_ACC_TRANSLATE_ARGS = [] -DEFAULT_MLIR_TRANSLATE_ARGS = ["--mlir-print-op-on-diagnostic", "--mlir-to-llvmir"] +DEFAULT_MLIR_TRANSLATE_ARGS = ["--mlir-print-op-on-diagnostic", "--acc-to-llvmir"] LLVM_TOOLING_OPTS = { SystemTarget.HOST.value: ["-O3", "-fp-contract=fast", "-mcpu=native"], @@ -120,9 +120,17 @@ def bstr(val): ], } -DEFAULT_OPT_ARGS = [] +DEFAULT_LLVM_TOOLING_OPTS = [ + '--enable-unsafe-fp-math', + '--enable-no-infs-fp-math', + '--enable-no-nans-fp-math', + '--enable-no-signed-zeros-fp-math', + '--enable-no-trapping-fp-math' +] -DEFAULT_LLC_ARGS = ["-relocation-model=pic"] +DEFAULT_OPT_ARGS = DEFAULT_LLVM_TOOLING_OPTS + [] + +DEFAULT_LLC_ARGS = DEFAULT_LLVM_TOOLING_OPTS + ["-relocation-model=pic"] def get_default_deploy_shared_libraries(target=CPU_TARGET): @@ -818,7 +826,7 @@ def translate_mlir_with_mlir_translate( stdout = None stderr = None for module_file_set in self.module_file_sets: - mlir_translate_exe = os.path.abspath(ACCCConfig.mlir_translate) + mlir_translate_exe = os.path.abspath(ACCCConfig.acc_translate) full_mlir_translate_args = [] # empty list every iteration full_mlir_translate_args += mlir_translate_args or DEFAULT_MLIR_TRANSLATE_ARGS full_mlir_translate_args += [f'-o="{module_file_set.translated_ll_filepath}"'] diff --git a/accera/ir/CMakeLists.txt b/accera/ir/CMakeLists.txt index f0d6607a..6881b8fe 100644 --- a/accera/ir/CMakeLists.txt +++ b/accera/ir/CMakeLists.txt @@ -32,6 +32,12 @@ set(include include/TranslateToHeader.h ) +set(intrinsics_src + src/intrinsics/AcceraIntrinsicsDialect.cpp + ) +set(intrinsics_include + include/intrinsics/AcceraIntrinsicsDialect.h) + set(accvalue_src src/value/ValueDialect.cpp src/value/ValueCanonicalization.cpp @@ -113,6 +119,21 @@ set(argo_include include/argo/Utils.h ) +add_mlir_dialect_library(MLIRAcceraIntrinsics # This is an accera dialect, but the add_mlir_dialect() cmake function prepends "MLIR" + ${intrinsics_src} + + ADDITIONAL_HEADER_DIRS + ${CMAKE_CURRENT_SOURCE_DIR}/include + + DEPENDS + MLIRAcceraIntrinsicsIncGen + + LINK_LIBS PUBLIC + MLIRIR + ) + +InstallAcceraLibrary(MLIRAcceraIntrinsics) + # This is supposed to be overriden on the command line As of LLVM 8.0.1, the # possible values within the list are: AArch64 AMDGPU ARM BPF Hexagon Lanai Mips # MSP430 NVPTX PowerPC Sparc SystemZ WebAssembly X86 XCore @@ -160,6 +181,7 @@ set(src ${accexec_src} ${accera_src} ${argo_src} + ${intrinsics_src} ) set(include @@ -169,6 +191,7 @@ set(include ${accexec_include} ${accera_include} ${argo_include} + ${intrinsics_include} build/LLVMEmitterTargets.h ) @@ -182,6 +205,15 @@ target_include_directories( $ ) +target_include_directories( + MLIRAcceraIntrinsics PRIVATE ${CMAKE_CURRENT_BINARY_DIR} include + PUBLIC + $ + $ + $ + $ +) + target_include_directories(${library_name} SYSTEM PUBLIC ${LLVM_INCLUDE_DIRS}) target_link_libraries( ${library_name} @@ -207,6 +239,8 @@ add_dependencies( AcceraOpsIncGen ValueAttrsIncGen ValueOpsIncGen + MLIRAcceraIntrinsicsIncGen + MLIRAcceraIntrinsics ArgoOpsIncGen ArgoStructuredOpsIncGen diff --git a/accera/ir/include/CMakeLists.txt b/accera/ir/include/CMakeLists.txt index 16667e3f..a1adb102 100644 --- a/accera/ir/include/CMakeLists.txt +++ b/accera/ir/include/CMakeLists.txt @@ -9,5 +9,6 @@ 
add_subdirectory(nest) add_subdirectory(exec) add_subdirectory(accera) add_subdirectory(value) +add_subdirectory(intrinsics) add_subdirectory(argo) diff --git a/accera/ir/include/Common.td b/accera/ir/include/Common.td index 2ece7891..bb04addb 100644 --- a/accera/ir/include/Common.td +++ b/accera/ir/include/Common.td @@ -84,6 +84,12 @@ def acc_NumericType : def acc_ScalarOrVectorNumericType : AnyTypeOf<[acc_NumericType, VectorOf<[acc_NumericType]>]>; +def acc_IntegerOrIntegerVectorNumericType : + AnyTypeOf<[AnyInteger, VectorOf<[AnyInteger]>]>; + +def acc_FloatOrFloatVectorNumericType : + AnyTypeOf<[AnyFloat, VectorOf<[AnyFloat]>]>; + class acc_Scalarlike : AnyTypeOf<[type, acc_ContainerOfTypeWithNumElements<[type], 1>]>; diff --git a/accera/ir/include/IRUtil.h b/accera/ir/include/IRUtil.h index 94bacdbc..8dfc3555 100644 --- a/accera/ir/include/IRUtil.h +++ b/accera/ir/include/IRUtil.h @@ -259,6 +259,7 @@ namespace util std::vector AffineValueMapToAffineApplyOps(mlir::OpBuilder& builder, mlir::Location loc, mlir::AffineValueMap affineValueMap); mlir::AffineValueMap SimplifyAffineValueMap(mlir::AffineValueMap affineValueMap); + mlir::Type CloneTypeWithNewElementType(mlir::Type type, mlir::Type newElementType); mlir::Type GetElementType(mlir::Type type); int64_t GetUniqueId(mlir::Operation* where); @@ -358,6 +359,8 @@ namespace util mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::StoreOp op); mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::LoadOp op); + void EraseOps(std::stack& opStack, mlir::PatternRewriter& rewriter); + struct TempOpCleanupGuard { TempOpCleanupGuard(std::stack* opStack, mlir::PatternRewriter& rewriter); @@ -400,11 +403,11 @@ namespace util mlir::Value GetGPUIndex(value::Processor idxType, mlir::OpBuilder& builder, mlir::Location& loc, ir::value::ExecutionRuntime execRuntime = ir::value::ExecutionRuntime::CUDA); - int64_t GetBlockDimSize(mlir::gpu::BlockDimOp op); - int64_t GetGridDimSize(mlir::gpu::GridDimOp op); + std::optional GetBlockDimSize(mlir::gpu::BlockDimOp op); + std::optional GetGridDimSize(mlir::gpu::GridDimOp op); - int64_t GetBlockDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId); - int64_t GetGridDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId); + std::optional GetBlockDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId); + std::optional GetGridDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId); // Gets the flattened thread ID of the current GPU thread within the context of the current block mlir::Value GetCurrentGPUBlockThreadID(mlir::OpBuilder& builder, mlir::Location loc); diff --git a/accera/ir/include/intrinsics/AcceraIntrinsics.td b/accera/ir/include/intrinsics/AcceraIntrinsics.td new file mode 100644 index 00000000..9173db9f --- /dev/null +++ b/accera/ir/include/intrinsics/AcceraIntrinsics.td @@ -0,0 +1,69 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +#ifndef ACCERA_intrinsic_OPS +#define ACCERA_intrinsic_OPS + +include "mlir/Dialect/LLVMIR/LLVMOpBase.td" +include "mlir/Interfaces/InferTypeOpInterface.td" + +def AcceraIntrinsics_Dialect : Dialect { + let name = "accintr"; + let cppNamespace = "::accera::ir::intrinsics"; +} + +// Implements the LLVM_IntrOpBase interface (from mlir/Dialect/LLVMIR/LLVMOpBase.td) +// rather than LLVM_OneResultIntrOp because we don't want to put this op in the llvm dialect. +// Otherwise it will screw up how the conversion is handled later in acc-translate. +// However, we still want the other args to be like those in LLVM_OneResultIntrOp and LLVM_IntrOp +def accintr_VpmaddwdOp : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [], // overloadedOperands + [NoSideEffect], // traits + 1>, // num results + Arguments<(ins LLVM_Type, LLVM_Type)>; + + +// TODO : this may not be needed when we have multi-dimensional reductions supporting max/min +def accintr_VmaxpsOp : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [], // overloadedOperands + [NoSideEffect], // traits + 1>, // num results + Arguments<(ins LLVM_Type, LLVM_Type)>; + +def accintr_VminpsOp : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [], // overloadedOperands + [NoSideEffect], // traits + 1>, // num results + Arguments<(ins LLVM_Type, LLVM_Type)>; + +// TODO : remove after the next llvm update. There is a new math::roundeven op that we can use +def accintr_RoundEvenOp : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [0], // overloadedOperands + [NoSideEffect, SameOperandsAndResultType], // traits + 1>, // num results + Arguments<(ins LLVM_Type)>; + +def accintr_RoundF32VecAVX2 : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [], // overloadedOperands + [NoSideEffect], // traits + 1>, // num results + Arguments<(ins LLVM_Type)>; + +#endif // ACCERA_intrinsic_OPS diff --git a/accera/ir/include/intrinsics/AcceraIntrinsicsDialect.h b/accera/ir/include/intrinsics/AcceraIntrinsicsDialect.h new file mode 100644 index 00000000..833f4e6b --- /dev/null +++ b/accera/ir/include/intrinsics/AcceraIntrinsicsDialect.h @@ -0,0 +1,18 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. +//////////////////////////////////////////////////////////////////////////////////////////////////// + +#pragma once + +#include "mlir/IR/BuiltinTypes.h" +#include "mlir/IR/Dialect.h" +#include "mlir/IR/OpDefinition.h" +#include "mlir/IR/OpImplementation.h" +#include "mlir/Interfaces/InferTypeOpInterface.h" +#include "mlir/Interfaces/SideEffectInterfaces.h" + +#include "intrinsics/AcceraIntrinsicsDialect.h.inc" + +#define GET_OP_CLASSES +#include "intrinsics/AcceraIntrinsics.h.inc" diff --git a/accera/ir/include/intrinsics/CMakeLists.txt b/accera/ir/include/intrinsics/CMakeLists.txt new file mode 100644 index 00000000..9487f83b --- /dev/null +++ b/accera/ir/include/intrinsics/CMakeLists.txt @@ -0,0 +1,10 @@ +#################################################################################################### +# Copyright (c) Microsoft Corporation. All rights reserved. 
+# Licensed under the MIT License. See LICENSE in the project root for license information. +#################################################################################################### + +add_mlir_dialect(AcceraIntrinsics accintr) + +set(LLVM_TARGET_DEFINITIONS AcceraIntrinsics.td) +mlir_tablegen(AcceraIntrinsicsConversions.inc -gen-llvmir-conversions) +add_public_tablegen_target(AcceraIntrinsicsConversionsIncGen) diff --git a/accera/ir/include/value/ValueAttrs.td b/accera/ir/include/value/ValueAttrs.td index 7f2fec66..d3da5522 100644 --- a/accera/ir/include/value/ValueAttrs.td +++ b/accera/ir/include/value/ValueAttrs.td @@ -60,10 +60,12 @@ def ProcessorAttr : I64EnumAttr<"Processor", "processor for loop mapping", [ def MEMORY_ALLOC_GLOBAL : I64EnumAttrCase<"Global", 0>; def MEMORY_ALLOC_STACK : I64EnumAttrCase<"Stack", 1>; +def MEMORY_ALLOC_HEAP : I64EnumAttrCase<"Heap", 2>; +def MEMORY_ALLOC_THREAD_LOCAL : I64EnumAttrCase<"ThreadLocal", 3>; // TODO : include in enum below and plumb through to python DSL and add appropriate lowering rewrite def MemoryAllocTypeAttr : I64EnumAttr< "MemoryAllocType", "Describes the memory type in which an allocation resides.", - [ MEMORY_ALLOC_GLOBAL, MEMORY_ALLOC_STACK]> { + [ MEMORY_ALLOC_GLOBAL, MEMORY_ALLOC_STACK, MEMORY_ALLOC_HEAP]> { let cppNamespace = "::accera::ir::value"; } diff --git a/accera/ir/include/value/ValueDialect.h b/accera/ir/include/value/ValueDialect.h index b8a6eff4..1e5c2b41 100644 --- a/accera/ir/include/value/ValueDialect.h +++ b/accera/ir/include/value/ValueDialect.h @@ -7,6 +7,7 @@ #pragma once #include +#include #include #include #include @@ -15,6 +16,7 @@ #include #include #include +#include #include #include @@ -55,6 +57,7 @@ using mlir::FloatType; using mlir::FuncOp; using mlir::FunctionType; using mlir::IndexType; +using mlir::InferTypeOpInterface; using mlir::IntegerAttr; using mlir::Location; using mlir::LogicalResult; @@ -99,6 +102,7 @@ const mlir::StringRef RawPointerAPIAttrName = "accv.emit_raw_pointer_api"; const mlir::StringRef HeaderDeclAttrName = "accv.emit_header_decl"; const mlir::StringRef FunctionTagsAttrName = "accv.function_tags"; const mlir::StringRef NoInlineAttrName = "accv.no_inline"; +const mlir::StringRef NoInlineIntoAttrName = "accv.no_inline_into"; const mlir::StringRef BaseNameAttrName = "accv.base_name"; const mlir::StringRef DynamicArgSizeReferencesAttrName = "accv.dyn_arg_size_refs"; const mlir::StringRef UsagesAttrName = "accv.usages"; diff --git a/accera/ir/include/value/ValueOps.td b/accera/ir/include/value/ValueOps.td index 5f33cace..fa34447d 100644 --- a/accera/ir/include/value/ValueOps.td +++ b/accera/ir/include/value/ValueOps.td @@ -12,9 +12,12 @@ include "ir/include/value/ValueBase.td" include "ir/include/value/ValueAttrs.td" include "mlir/Interfaces/ControlFlowInterfaces.td" +include "mlir/Interfaces/InferTypeOpInterface.td" include "mlir/IR/FunctionInterfaces.td" include "mlir/Dialect/Affine/IR/AffineMemoryOpInterfaces.td" +include "mlir/Dialect/LLVMIR/LLVMOpBase.td" + def accv_ValueLambdaOp : accv_Op<"lambda", [ SymbolTable, Symbol, @@ -291,11 +294,15 @@ def accv_BINARY_OP_DIV : I64EnumAttrCase<"DIV", 3>; def accv_BINARY_OP_MOD : I64EnumAttrCase<"MOD", 4>; def accv_BINARY_OP_AND : I64EnumAttrCase<"LOGICAL_AND", 5>; def accv_BINARY_OP_OR : I64EnumAttrCase<"LOGICAL_OR", 6>; +def accv_BINARY_OP_MAX : I64EnumAttrCase<"MAX", 7>; +def accv_BINARY_OP_MIN : I64EnumAttrCase<"MIN", 8>; def accv_BinaryOpPredicateAttr : I64EnumAttr< "BinaryOpPredicate", "", - [accv_BINARY_OP_ADD, 
accv_BINARY_OP_SUB, accv_BINARY_OP_MUL, accv_BINARY_OP_DIV, accv_BINARY_OP_MOD, - accv_BINARY_OP_AND, accv_BINARY_OP_OR]> { + [accv_BINARY_OP_ADD, accv_BINARY_OP_SUB, + accv_BINARY_OP_MUL, accv_BINARY_OP_DIV, accv_BINARY_OP_MOD, + accv_BINARY_OP_AND, accv_BINARY_OP_OR, + accv_BINARY_OP_MAX, accv_BINARY_OP_MIN]> { let cppNamespace = "::accera::ir::value"; } @@ -374,6 +381,24 @@ def accv_CmpOp : accv_Op<"cmp", }]; } +// TODO : remove after the next llvm update. There is a new math::roundeven op that we can use +// TODO : add more control for rounding modes other than "roundeven" +def accv_RoundOp : accv_Op<"round", [NoSideEffect]> { + let description = [{ + Rounds a given floating point value to an integer of the same bitwidth according to the currently set rounding mode. + }]; + + let arguments = (ins acc_FloatOrFloatVectorNumericType:$val); + let results = (outs acc_IntegerOrIntegerVectorNumericType:$result); + + let extraClassDeclaration = [{ + static bool SupportsVectorization(int count) { + // TODO : generalize this for more target types than AVX-2 + return count == 8; + } + }]; +} + def accv_CopyOp : accv_Op<"copy"> { let description = [{ Copies the data in the input view into the output view. @@ -671,7 +696,7 @@ def accv_MemRefCastOp : accv_Op<"memref_cast", [SameOperandsAndResultShape]> { }]; } -def accv_CastOp : accv_Op<"cast"> { +def accv_CastOp : accv_Op<"cast", [NoSideEffect]> { let summary = "casting operation"; let description = [{ The `accv.cast` operation converts an element to an element of another type. @@ -1493,4 +1518,39 @@ def accv_MMAStoreSyncOp : accv_Op<"wmma_store_sync", [ let verifier = [{ return ::verify(*this); }]; } +// TODO : move to new dialect? +def accv_vpmaddwd : accv_Op<"vpmaddwd", [NoSideEffect]>{ + let summary = "vpmaddwd intrinsic operation"; + + let description = [{ + The `accv.vpmaddwd` operation lowers to the vpmaddwd LLVM intrinsic. + }]; + + let arguments = (ins AnyVector:$lhs, AnyVector:$rhs); // TODO : shape verification + let results = (outs AnyVector:$result); +} + +def accv_vmaxps : accv_Op<"vmaxps", [NoSideEffect]>{ + let summary = "vmaxps intrinsic operation"; + + let description = [{ + The `accv.vmaxps` operation lowers to the vmaxps LLVM intrinsic. + }]; + + let arguments = (ins AnyVector:$lhs, AnyVector:$rhs); // TODO : shape verification + let results = (outs AnyVector:$result); +} + +def accv_vminps : accv_Op<"vminps", [NoSideEffect]>{ + let summary = "vminps intrinsic operation"; + + let description = [{ + The `accv.vminps` operation lowers to the vminps LLVM intrinsic. 
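For context, a minimal sketch of the DSL-level pattern that exercises these max/min intrinsic ops, modeled on the vectorized max/min test added later in this change; the shapes, the 8-wide split, and the package/function names are illustrative choices, not requirements:
```python
from accera import Array, Nest, Package, ScalarType, max, min

M, N = 128, 256
A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N))
B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N))
C_max = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))
C_min = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))

nest = Nest((M, N))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    # Element-wise max/min; an 8-wide vectorized split gives these a chance to lower to vmaxps/vminps
    C_max[i, j] = max(A[i, j], B[i, j])
    C_min[i, j] = min(A[i, j], B[i, j])

sched = nest.create_schedule()
ii, jj = sched.tile({i: 4, j: 8})
sched.reorder(i, j, ii, jj)
plan = sched.create_plan()
plan.vectorize(ii)

package = Package()
package.add(plan, args=(A, B, C_max, C_min), base_name="vectorized_max_min")
package.build("vectorized_max_min", format=Package.Format.HAT_DYNAMIC, mode=Package.Mode.RELEASE, output_dir="build")
```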
+ }]; + + let arguments = (ins AnyVector:$lhs, AnyVector:$rhs); // TODO : shape verification + let results = (outs AnyVector:$result); +} + + #endif // ACCERA_accv_OPS diff --git a/accera/ir/src/DialectRegistry.cpp b/accera/ir/src/DialectRegistry.cpp index 6e6b63f8..e5d96fca 100644 --- a/accera/ir/src/DialectRegistry.cpp +++ b/accera/ir/src/DialectRegistry.cpp @@ -9,6 +9,7 @@ #include "nest/LoopNestOps.h" #include "accera/AcceraOps.h" #include "value/ValueDialect.h" +#include "intrinsics/AcceraIntrinsicsDialect.h" #include #include @@ -38,6 +39,7 @@ mlir::DialectRegistry& GetDialectRegistry() registry.insert SimplifyAffineValueMapToConstant(mlir::AffineValueMap affineValueMap) + { + auto simplified = SimplifyAffineValueMap(affineValueMap); + auto map = simplified.getAffineMap(); + if (map.isSingleConstant()) + { + return map.getSingleConstantResult(); + } + return std::nullopt; + } + + template + mlir::Type CloneTypeWithNewElementType(ShapedTy type, mlir::Type newElementType) + { + typename ShapedTy::Builder builder(type); + builder.setElementType(newElementType); + + return builder; + } + + mlir::Type CloneTypeWithNewElementType(mlir::Type type, mlir::Type newElementType) + { + auto result = + mlir::TypeSwitch(type) + .Case([&](mlir::MemRefType memrefType) { + return CloneTypeWithNewElementType(memrefType, newElementType); + }) + .Case([&](mlir::VectorType vectorType) { + return CloneTypeWithNewElementType(vectorType, newElementType); + }) + .Default([&](mlir::Type) { + return newElementType; + }); + return result; + } + mlir::Type GetElementType(mlir::Type type) { auto result = @@ -734,42 +770,42 @@ namespace util if (forOp.getLowerBoundMap().getNumResults() != 1) return mlir::failure(); + mlir::OpBuilder::InsertionGuard insertGuard(rewriter); + rewriter.setInsertionPoint(forOp); // Replaces all IV uses to its single iteration value. auto iv = forOp.getInductionVar(); - auto* parentBlock = forOp->getBlock(); + mlir::Value ivValueReplacement; if (!iv.use_empty()) { if (forOp.hasConstantLowerBound()) { - mlir::OpBuilder topBuilder(forOp->getParentOfType().getBody()); - auto constOp = topBuilder.create( + ivValueReplacement = rewriter.create( forOp.getLoc(), forOp.getConstantLowerBound()); - iv.replaceAllUsesWith(constOp); } else { auto lbOperands = forOp.getLowerBoundOperands(); auto lbMap = forOp.getLowerBoundMap(); - mlir::OpBuilder builder(parentBlock, mlir::Block::iterator(forOp)); - if (lbMap == builder.getDimIdentityMap()) + if (lbMap == rewriter.getDimIdentityMap()) { // No need of generating an affine.apply. - iv.replaceAllUsesWith(lbOperands[0]); + ivValueReplacement = lbOperands[0]; } else { - auto affineApplyOp = - builder.create(forOp.getLoc(), lbMap, lbOperands); - iv.replaceAllUsesWith(affineApplyOp); + ivValueReplacement = + rewriter.create(forOp.getLoc(), lbMap, lbOperands); } } + iv.replaceAllUsesWith(ivValueReplacement); } + // Move the loop body operations, except for its terminator, to the loop's // containing block. 
- rewriter.eraseOp(forOp.getBody()->getTerminator()); - parentBlock->getOperations().splice(mlir::Block::iterator(forOp), - forOp.getBody()->getOperations()); + // Erase the terminator so we don't merge it into the parent block + rewriter.eraseOp(forOp.getBody()->getTerminator()); + rewriter.mergeBlockBefore(forOp.getBody(), forOp, mlir::ValueRange{ ivValueReplacement }); rewriter.eraseOp(forOp); return mlir::success(); @@ -900,6 +936,17 @@ namespace util return GetMemRefIndexToMemoryLocationMap(context, op); } + void EraseOps(std::stack& opStack, mlir::PatternRewriter& rewriter) + { + while (!opStack.empty()) + { + auto eraseOp = opStack.top(); + assert(eraseOp->use_empty()); + rewriter.eraseOp(eraseOp); + opStack.pop(); + } + } + TempOpCleanupGuard::TempOpCleanupGuard(std::stack* opStack, mlir::PatternRewriter& rewriter) : _opStack(opStack), _rewriter(rewriter) @@ -907,13 +954,7 @@ namespace util TempOpCleanupGuard::~TempOpCleanupGuard() { - while (!_opStack->empty()) - { - auto eraseOp = _opStack->top(); - assert(eraseOp->use_empty()); - _rewriter.eraseOp(eraseOp); - _opStack->pop(); - } + EraseOps(*_opStack, _rewriter); } mlir::Attribute MemorySpaceToAttribute(const value::MemorySpace& memorySpace, mlir::MLIRContext* context) @@ -944,14 +985,25 @@ namespace util mlir::Type ToSignlessMLIRType(mlir::OpBuilder& builder, mlir::Type type) { - if (type.isIntOrFloat()) - { - if (auto width = type.getIntOrFloatBitWidth(); type.isInteger(width)) - { - return builder.getIntegerType(width); - } - } - return type; // pass-through, no signless change + auto result = + mlir::TypeSwitch(type) + .Case([&](mlir::MemRefType memrefType) -> mlir::Type { + return CloneTypeWithNewElementType(memrefType, ToSignlessMLIRType(builder, memrefType.getElementType())); + }) + .Case([&](mlir::VectorType vectorType) -> mlir::Type { + return CloneTypeWithNewElementType(vectorType, ToSignlessMLIRType(builder, vectorType.getElementType())); + }) + .Default([&](mlir::Type t) -> mlir::Type { + if (t.isIntOrFloat()) + { + if (auto width = t.getIntOrFloatBitWidth(); t.isInteger(width)) + { + return builder.getIntegerType(width); + } + } + return t; // pass-through, no signless change + }); + return result; } mlir::Value ToSignlessMLIRValue(mlir::OpBuilder& builder, mlir::Value value) @@ -1067,7 +1119,7 @@ namespace util }); } - int64_t GetBlockDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId) + std::optional GetBlockDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId) { if (auto gpuFunc = where->getParentOfType()) { @@ -1082,8 +1134,7 @@ namespace util mlir::Operation* vLambdaOp = where->getParentOfType(); if (vFuncOp == nullptr && vLambdaOp == nullptr) { - assert(false && "Can only resolve block dim size inside of a gpu::GPUFuncOp, ir::value::ValueFuncOp, or ir::value::ValueLambdaOp"); - return -1; + return std::nullopt; } // Prefer using the ValueLambdaOp as inner loopnests will be a ValueLambdaOp nested inside of a ValueFuncOp auto op = vLambdaOp != nullptr ? 
vLambdaOp : vFuncOp; @@ -1094,7 +1145,7 @@ namespace util } } - int64_t GetGridDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId) + std::optional GetGridDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId) { if (auto gpuFunc = where->getParentOfType()) { @@ -1109,8 +1160,7 @@ namespace util mlir::Operation* vLambdaOp = where->getParentOfType(); if (vFuncOp == nullptr && vLambdaOp == nullptr) { - assert(false && "Can only resolve grid dim size inside of a gpu::GPUFuncOp, ir::value::ValueFuncOp, or ir::value::ValueLambdaOp"); - return -1; + return std::nullopt; } auto op = vLambdaOp != nullptr ? vLambdaOp : vFuncOp; auto gpuParams = GetGPUFuncLaunchInfo(op); @@ -1120,12 +1170,12 @@ namespace util } } - int64_t GetBlockDimSize(mlir::gpu::BlockDimOp op) + std::optional GetBlockDimSize(mlir::gpu::BlockDimOp op) { return GetBlockDimSize(op, op.dimension()); } - int64_t GetGridDimSize(mlir::gpu::GridDimOp op) + std::optional GetGridDimSize(mlir::gpu::GridDimOp op) { return GetGridDimSize(op, op.dimension()); } @@ -1147,9 +1197,9 @@ namespace util auto blockDimXOp = GetGPUIndex(vir::Processor::BlockDimX, builder, loc); auto blockDimYOp = GetGPUIndex(vir::Processor::BlockDimY, builder, loc); auto blockDimZOp = GetGPUIndex(vir::Processor::BlockDimZ, builder, loc); - if (GetBlockDimSize(blockDimZOp.getDefiningOp()) == 1) // 2D or 1D block + if (*(GetBlockDimSize(blockDimZOp.getDefiningOp())) == 1) // 2D or 1D block { - if (GetBlockDimSize(blockDimYOp.getDefiningOp()) == 1) + if (*(GetBlockDimSize(blockDimYOp.getDefiningOp())) == 1) { // 1D block auto flattenedTidMap = mlir::AffineMap::get(0, 1, threadXSym); diff --git a/accera/ir/src/intrinsics/AcceraIntrinsicsDialect.cpp b/accera/ir/src/intrinsics/AcceraIntrinsicsDialect.cpp new file mode 100644 index 00000000..8454ba30 --- /dev/null +++ b/accera/ir/src/intrinsics/AcceraIntrinsicsDialect.cpp @@ -0,0 +1,32 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +#include "ir/include/intrinsics/AcceraIntrinsicsDialect.h" + +#include "mlir/Dialect/LLVMIR/LLVMTypes.h" +#include "mlir/IR/Builders.h" +#include "mlir/IR/OpImplementation.h" +#include "mlir/IR/TypeUtilities.h" +#include "mlir/Interfaces/InferTypeOpInterface.h" + +using namespace mlir; + +#include "intrinsics/AcceraIntrinsicsDialect.cpp.inc" + +namespace accera::ir::intrinsics +{ + +void AcceraIntrinsicsDialect::initialize() +{ + addOperations< +#define GET_OP_LIST +#include "intrinsics/AcceraIntrinsics.cpp.inc" + >(); +} + +} // namespace accera::ir::intrinsics + +#define GET_OP_CLASSES +#include "intrinsics/AcceraIntrinsics.cpp.inc" diff --git a/accera/ir/src/nest/LoopNestAffineConstraints.cpp b/accera/ir/src/nest/LoopNestAffineConstraints.cpp index 2d0ddd21..3bf3ae44 100644 --- a/accera/ir/src/nest/LoopNestAffineConstraints.cpp +++ b/accera/ir/src/nest/LoopNestAffineConstraints.cpp @@ -95,11 +95,11 @@ struct SplitLoopInfo IdWrapper largestMainLoopIVId; }; -SplitLoopInfo AddSplitPartitionHelper(LoopNestAffineConstraints& cst, - const Index& loopIndex, - mlir::OpBuilder& builder, - mlir::Location loc, - int64_t stepSize) +std::optional AddSplitPartitionHelper(LoopNestAffineConstraints& cst, + const Index& loopIndex, + mlir::OpBuilder& builder, + mlir::Location loc, + int64_t stepSize) { // Get the [begin, end) range for this loop id LoopNestAffineConstraints resolveRangeCst = cst.Clone(); @@ -107,8 +107,20 @@ SplitLoopInfo AddSplitPartitionHelper(LoopNestAffineConstraints& cst, auto [beginValueMap, endValueMap] = resolveRangeCst.GetLowerAndUpperBound(loopIndex, builder, loc); // Produce a begin and end value using affine apply ops - mlir::Value beginVal = mlir::makeComposedAffineApply(builder, loc, beginValueMap.getAffineMap(), beginValueMap.getOperands()); - mlir::Value endVal = mlir::makeComposedAffineApply(builder, loc, endValueMap.getAffineMap(), endValueMap.getOperands()); + auto beginApplyOp = mlir::makeComposedAffineApply(builder, loc, beginValueMap.getAffineMap(), beginValueMap.getOperands()); + auto endApplyOp = mlir::makeComposedAffineApply(builder, loc, endValueMap.getAffineMap(), endValueMap.getOperands()); + + // If either the begin or end values are empty, then we've recursed into an empty part of the space and we should bail out without creating a loop + auto beginMap = beginApplyOp.getAffineMap(); + auto endMap = endApplyOp.getAffineMap(); + + if (beginMap.isEmpty() || endMap.isEmpty()) + { + return std::nullopt; + } + + mlir::Value beginVal = beginApplyOp.getResult(); + mlir::Value endVal = endApplyOp.getResult(); auto partitionInfo = MakeSplitPartition(builder, beginVal, endVal, stepSize); @@ -300,14 +312,19 @@ namespace loopnest auto levelScopedConstraints = Clone(); auto loopId = levelScopedConstraints.GetId(index); - auto partitionInfo = AddSplitPartitionHelper(levelScopedConstraints, - index, - builder, - loc, - splitSize); - + auto partitionInfoOpt = AddSplitPartitionHelper(levelScopedConstraints, + index, + builder, + loc, + splitSize); std::vector partitionedLoopConstraints; + if (!partitionInfoOpt.has_value()) + { + return partitionedLoopConstraints; + } + auto partitionInfo = *partitionInfoOpt; + // Main loop partition { // Fork the constraints for inside the main loop @@ -338,8 +355,7 @@ namespace loopnest // Set loop id equal to partition value inside the cleanup loop cleanupScopedConstraints.SetEqual(loopId, partitionInfo.partitionValueId); - // Bound 
loopId >= partition value. This is a looser constraint than we put on the mainScopedConstraints, but it is helpful - // for getting a simpler loop bound + // Bound loopId >= partition value. cleanupResolveConstraints.AddLowerBound(loopId, partitionInfo.partitionValueId); LoopPartitionConstraints cleanupPartitionConstraints(cleanupResolveConstraints, cleanupScopedConstraints); diff --git a/accera/ir/src/nest/LoopNestBuilder.cpp b/accera/ir/src/nest/LoopNestBuilder.cpp index e3ca7a96..3681d259 100644 --- a/accera/ir/src/nest/LoopNestBuilder.cpp +++ b/accera/ir/src/nest/LoopNestBuilder.cpp @@ -627,7 +627,7 @@ namespace loopnest // --> (0..1: S1), (0..N-1: S2), (N1-..N: S2, S3) // prefix of last partition matches entirety of second: move // --> (0..1: S1), (0..N: S2), (N1-..N: S3) - if (schedule.IsDone()) + if (schedule.IsDone() || loops.empty()) { return; } diff --git a/accera/python/accera/Debug.py b/accera/python/accera/Debug.py index f37ea762..0a5758a4 100644 --- a/accera/python/accera/Debug.py +++ b/accera/python/accera/Debug.py @@ -37,8 +37,6 @@ def add_check_allclose(package: Package, array: Array, atol: float = 1e-5, targe resolved_shape = [0 if isinstance(s, Dimension) else s for s in shape] shape_str = '_'.join(map(str, resolved_shape)) - shape = [Dimension(role=Dimension.Role.OUTPUT, value=x) if isinstance(x, Dimension) else x for x in shape] - # placeholders actual = Array(role=Array.Role.INPUT, element_type=element_type, shape=shape, layout=layout) desired = Array(role=Array.Role.INPUT, element_type=element_type, shape=shape, layout=layout) diff --git a/accera/python/accera/Package.py b/accera/python/accera/Package.py index 744951f4..a2c0ff4b 100644 --- a/accera/python/accera/Package.py +++ b/accera/python/accera/Package.py @@ -30,6 +30,11 @@ @singledispatch def _convert_arg(arg: _lang_python._lang._Valor): + if isinstance(arg, lang.Dimension): + arg._native_dim = _lang_python._lang.Scalar(arg) + return arg._native_dim + if isinstance(arg, _lang_python._lang.Scalar): + return _lang_python._lang.Scalar(arg) if arg.layout == _lang_python._MemoryLayout(): return _lang_python._lang.Scalar(arg) else: @@ -224,7 +229,7 @@ def add( base_name: str = "", parameters: Union[dict, List[dict]] = {}, function_opts: dict = {}, - auxiliary: dict = {}, + auxiliary: dict = {} ) -> Union["accera.Function", List["accera.Function"]]: """Adds a function to the package. If multiple parameters are provided, generates and adds them according to the parameter grid. @@ -242,6 +247,16 @@ def add( auxiliary: A dictionary of auxiliary metadata to include in the HAT package. 
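For illustration, a rough sketch of how `add` is typically called with the behavior this change introduces (passing `function_opts` such as `no_inline_into`/`public`, and the rejection of TEMP-role arrays in `args`); the array and function names below are placeholders:
```python
from accera import Array, Nest, Package, ScalarType

M, N = 16, 16
A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N))
B = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))
scratch = Array(role=Array.Role.TEMP, element_type=ScalarType.float32, shape=(M, N))  # internal-only storage

nest = Nest(shape=(M, N))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    B[i, j] += A[i, j]

package = Package()

# Internal helper: keep it out of the public API and do not inline other functions into it
package.add(nest, args=(A, B), base_name="accumulate_helper",
            function_opts={"no_inline_into": True, "public": False})

# TEMP arrays are defined inside a function, so listing one in args raises ValueError:
# package.add(nest, args=(A, scratch), base_name="bad_args")
```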
""" + # TEMP arrays in the args list are a programming error because they are meant to be internally defined in a function + # Note: this does not prevent TEMP arrays from being passed as an argument to a function, but they cannot be the + # api-defining arguments for the function + temp_array_pos = [] + for idx, arg in enumerate(args): + if isinstance(arg, lang.Array) and arg.role == lang.Array.Role.TEMP: + temp_array_pos.append(idx) + if len(temp_array_pos) > 0: + raise ValueError(f"Error in package.add() for function {base_name}: args includes TEMP array at positions {temp_array_pos}") + heuristic_parameters_dict = {} if isinstance(source, lang.Plan): heuristic_parameters_dict = self._create_mapping_of_heuristic_parameters_with_possible_values(source) @@ -274,7 +289,7 @@ def _add_function( base_name: str = "", parameters: dict = {}, function_opts: dict = {}, - auxiliary: dict = {}, + auxiliary: dict = {} ) -> "accera.Function": """Adds a function to the package. @@ -385,7 +400,7 @@ def compute_arg_size_references(args, SENTINEL_VALUE=-1): if isinstance(source, lang.Plan): self._dynamic_dependencies.update(source._dynamic_dependencies) source = source._create_function( - args, public=True, no_inline=function_opts.get("no_inline", False) + args, **function_opts ) # fall-through @@ -395,9 +410,8 @@ def compute_arg_size_references(args, SENTINEL_VALUE=-1): # due to the fall-through, we only need to validate here validate_target(source.target) - native_array_dim_args = [arg._get_native_array() if isinstance(arg, lang.Array) else arg._native_dim for arg in args ] + native_array_dim_args = [arg._get_native_array() if isinstance(arg, lang.Array) else arg._native_dim if isinstance(arg, lang.Dimension) else arg for arg in args ] - assert source.public source.name = get_function_name(source.target) source.base_name = base_name source.auxiliary = auxiliary_metadata @@ -422,15 +436,13 @@ def wrapper_fn(args): wrapped_func = lang.Function( name=name, base_name=base_name, - public=True, - decorated=function_opts.get("decorated", False), - no_inline=function_opts.get("no_inline", False), args=tuple(map(_convert_arg, args)), arg_size_references=compute_arg_size_references(args), requested_args=args, definition=wrapper_fn, auxiliary=auxiliary_metadata, target=Target.HOST, + **function_opts ) self._fns[name] = wrapped_func @@ -599,11 +611,9 @@ def build( if target.runtime in [Target.Runtime.CUDA, Target.Runtime.ROCM]: format |= Package.Format.HAT_SOURCE else: - format |= ( - Package.Format.HAT_STATIC - if cross_compile - else Package.Format.HAT_DYNAMIC - ) + format |= Package.Format.HAT_STATIC + if not cross_compile: + format |= Package.Format.HAT_DYNAMIC dynamic_link = bool(format & Package.Format.DYNAMIC_LIBRARY) if cross_compile and dynamic_link: @@ -805,14 +815,26 @@ def build( hat_file.Serialize(header_path) - if dynamic_link and (format & Package.Format.DYNAMIC_LIBRARY): - dyn_hat_path = f"{path_root}_dyn{extension}" - hat.create_dynamic_package(header_path, dyn_hat_path) - shutil.move(dyn_hat_path, header_path) - elif not cross_compile and (format & Package.Format.STATIC_LIBRARY): + if not cross_compile and (format & Package.Format.STATIC_LIBRARY): lib_hat_path = f"{path_root}_lib{extension}" hat.create_static_package(header_path, lib_hat_path) + + lib_hat_file = hat_file.Deserialize(lib_hat_path) + lib_hat_file.dependencies.auxiliary["static"] = lib_hat_file.dependencies.link_target + lib_hat_file.Serialize() + shutil.move(lib_hat_path, header_path) + + if dynamic_link: + dyn_hat_path = 
f"{path_root}_dyn{extension}" + hat.create_dynamic_package(header_path, dyn_hat_path) + + dyn_hat_file = hat_file.Deserialize(dyn_hat_path) + dyn_hat_file.dependencies.auxiliary["dynamic"] = dyn_hat_file.dependencies.link_target + dyn_hat_file.Serialize() + + shutil.move(dyn_hat_path, header_path) + # TODO: plumb cross-compilation of static libs return proj.module_file_sets diff --git a/accera/python/accera/Targets.py b/accera/python/accera/Targets.py index 746e6d96..63e4b72e 100644 --- a/accera/python/accera/Targets.py +++ b/accera/python/accera/Targets.py @@ -459,6 +459,7 @@ class Architecture(Enum): ["Intel E5-1650 v3", "Haswell", "Xeon E5", 3.5, 3.8, 6, 12, [48, 256, 15 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], ["Intel E5-1660 v3", "Haswell", "Xeon E5", 3.0, 3.5, 8, 16, [48, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], ["Intel E5-1680 v3", "Haswell", "Xeon E5", 3.2, 3.8, 8, 16, [48, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel E5-2620 v3", "Haswell", "Xeon E5", 2.4, 3.2, 6, 12, [48, 256, 15 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # AMD Zen # ref: https://en.wikipedia.org/wiki/Zen_(first_generation) diff --git a/accera/python/accera/__init__.py b/accera/python/accera/__init__.py index 38482454..2d1f167c 100644 --- a/accera/python/accera/__init__.py +++ b/accera/python/accera/__init__.py @@ -15,10 +15,10 @@ from .Package import Package from .lang import * -from ._lang_python import CompilerOptions, ScalarType, _GetTargetDeviceFromName +from ._lang_python import CompilerOptions, ScalarType, _GetTargetDeviceFromName, AllocateFlags from ._lang_python import ( abs, max, min, ceil, floor, sqrt, exp, log, log10, log2, sin, cos, tan, sinh, cosh, tanh, logical_and, logical_or, - logical_not, cast + logical_not, cast, round, remainderf ) # Global initialization diff --git a/accera/python/accera/lang/Array.py b/accera/python/accera/lang/Array.py index 3190e05c..b33bfc1c 100644 --- a/accera/python/accera/lang/Array.py +++ b/accera/python/accera/lang/Array.py @@ -8,7 +8,7 @@ from enum import Enum, auto from functools import partial -from .._lang_python import ScalarType, _MemoryLayout +from .._lang_python import ScalarType, _MemoryLayout, AllocateFlags from .._lang_python._lang import Array as NativeArray from .Layout import Layout, MemoryMapLayout from ..Parameter import DelayedParameter @@ -36,7 +36,8 @@ def __init__( element_type: Union["accera.ScalarType", type] = None, layout: Union["accera.Array.Layout", Tuple[int]] = Layout.FIRST_MAJOR, offset: int = 0, - shape: Tuple[Union[int, DelayedParameter, Dimension]] = None + shape: Tuple[Union[int, DelayedParameter, Dimension]] = None, + flags: "accera.AllocateFlags" = AllocateFlags.NONE ): """Creates an Array @@ -74,6 +75,7 @@ def __init__( self._shape = shape self._native_array = None self._delayed_calls = {} + self._flags = flags if self._role == Array.Role.CONST: if self._data is None: @@ -156,6 +158,10 @@ def role(self): def element_type(self): return self._element_type + @property + def flags(self): + return self._flags + @property def _value(self): if self._native_array: @@ -267,7 +273,7 @@ def _allocate(self): return # already contains data # Note: we are blowing away the original Value and replacing with a new allocated Value - self._native_array = NativeArray(Allocate(type=self._element_type, layout=self._layout)) + self._native_array = 
NativeArray(Allocate(type=self._element_type, layout=self._layout, flags=self._flags)) assert (not self._value.is_empty) diff --git a/accera/python/accera/lang/Dimension.py b/accera/python/accera/lang/Dimension.py index d4232098..ffaa90b4 100644 --- a/accera/python/accera/lang/Dimension.py +++ b/accera/python/accera/lang/Dimension.py @@ -17,6 +17,7 @@ class Dimension: class Role(Enum): "Defines the Dimension role" INPUT = (auto()) #: An input dimension (immutable and provided as an Accera function argument). + INPUT_OUTPUT = auto() #: An input/output dimension (mutable and updated by an Accera function). OUTPUT = auto() #: An output dimension (mutable and updated by an Accera function). def __init__( @@ -30,6 +31,7 @@ def __init__( self._role = role if value: + self._value = value if self._role != Dimension.Role.OUTPUT: raise ValueError("Only output dimension can accept the optional value to initialize itself") self._native_dim = value._native_dim if isinstance(value, Dimension) else Scalar(value) @@ -40,6 +42,21 @@ def __init__( def role(self): return self._role + @property + def value(self): + return self._value + + @value.setter + def value(self, val): + self._value = val + if self._role != Dimension.Role.OUTPUT: + raise ValueError("Only output dimension can accept the optional value to initialize itself") + self._native_dim = val._native_dim if isinstance(val, Dimension) else Scalar(val) + + @role.setter + def role(self, val): + self._role = val + def __eq__(self, other): return id(self) == id(other) diff --git a/accera/python/accera/lang/Function.py b/accera/python/accera/lang/Function.py index 51f7fb6a..9a86e300 100644 --- a/accera/python/accera/lang/Function.py +++ b/accera/python/accera/lang/Function.py @@ -44,21 +44,25 @@ def _(arg: Array): ) return arg._get_native_array() # unpack - -def role_to_usage(role): +def role_to_usage(arg): from .._lang_python import _FunctionParameterUsage - if role == Array.Role.INPUT or role == Dimension.Role.INPUT: - return _FunctionParameterUsage.INPUT + if isinstance(arg, Array) or isinstance(arg, Dimension): + role = arg.role + if role == Array.Role.INPUT or role == Dimension.Role.INPUT: + return _FunctionParameterUsage.INPUT + elif role == Dimension.Role.OUTPUT: + return _FunctionParameterUsage.OUTPUT + else: + return _FunctionParameterUsage.INPUT_OUTPUT else: - return _FunctionParameterUsage.INPUT_OUTPUT - + return _FunctionParameterUsage.INPUT @dataclass class Function: name: str = "" # base_name + _ + generated unique_id base_name: str = "" - public: bool = False + public: bool = True external: bool = False decorated: bool = True # do we want to expose this? 
requested_args: tuple = () # args as provided into Package.add @@ -66,7 +70,8 @@ class Function: arg_size_references: tuple = () # references from array args to dimension arg positions for dynamically sized arrays param_overrides: dict = field(default_factory=dict) # overrides for constants definition: Callable = None - no_inline: bool = False + no_inline: bool = False # no_inline == True means that this function cannot be inlined into other functions + no_inline_into: bool = False # no_inline_into == True means that this function cannot have other functions inlined into it auxiliary: dict = field(default_factory=dict) target: Target = Target.HOST output_verifiers: list = field(default_factory=list) @@ -87,13 +92,14 @@ def _emit(self): delayed_param.set_value(value) if self.args: - usages = [role_to_usage(arg.role) for arg in self.requested_args] + usages = [role_to_usage(arg) for arg in self.requested_args] self._native_fn.parameters(self.args, usages, self.arg_size_references) if self.output_verifiers: self._native_fn.outputVerifiers(self.output_verifiers) self._native_fn.inlinable(not self.no_inline) + self._native_fn.inlinable_into(not self.no_inline_into) sig = signature(self.definition) diff --git a/accera/python/accera/lang/Nest.py b/accera/python/accera/lang/Nest.py index 1c07dfa6..49988a45 100644 --- a/accera/python/accera/lang/Nest.py +++ b/accera/python/accera/lang/Nest.py @@ -152,7 +152,7 @@ def _get_captures_to_replace(self, logic_fn, context: NativeLoopNestContext): if v.role == Array.Role.TEMP: temp_array = NativeArray( - Allocate(type=v.element_type, layout=v.layout) + Allocate(type=v.element_type, layout=v.layout, flags=v.flags) ) captures_to_replace[k] = context.mapping[value_id] = temp_array elif v.role == Array.Role.CONST: @@ -208,6 +208,8 @@ def _build_native_context(self, context: NativeLoopNestContext): elif isinstance(x, Dimension): x._native_dim = Scalar(y) logic_args[id(x)] = x._native_dim + elif isinstance(x, Scalar): + logic_args[id(x)] = Scalar(y) else: logic_args[id(x)] = y diff --git a/accera/python/accera/lang/Plan.py b/accera/python/accera/lang/Plan.py index 40db40e2..afb81716 100644 --- a/accera/python/accera/lang/Plan.py +++ b/accera/python/accera/lang/Plan.py @@ -958,6 +958,18 @@ def _is_valid_block_size(self, block_dims) -> bool: block_size = block_dims[0] * block_dims[1] * block_dims[2] return block_size <= max_threads + def _erase_loops(self, indices: List[LoopIndex]): + for index in indices: + self._add_index_attr(index, "_erase") + + self._commands.append( + partial(self._erase_loops_delayed, indices) + ) + + def _erase_loops_delayed(self, indices: List[LoopIndex], context: NativeLoopNestContext): + for index in indices: + context.plan._erase_loop(context.mapping[id(index)]) + def _build_native_context(self, context: NativeLoopNestContext): target = self._target @@ -1067,7 +1079,7 @@ def nest_wrapper_fn(*args: List[List[_Valor]]): def _create_function( - plan: "Plan", args: List[Union[Array, Dimension]], public: bool = True, no_inline: bool = False + plan: "Plan", args: List[Union[Array, Dimension]], public: bool = True, **kwargs ) -> Function: from secrets import token_hex @@ -1078,8 +1090,8 @@ def _create_function( args=args, public=public, definition=_build_native_nest(plan, args), - no_inline=no_inline, target=plan._target, + **kwargs ) diff --git a/accera/python/accera/lang/__init__.py b/accera/python/accera/lang/__init__.py index e9fb73c3..3924507e 100644 --- a/accera/python/accera/lang/__init__.py +++ b/accera/python/accera/lang/__init__.py 
@@ -15,4 +15,4 @@ from .Function import Function from .LogicFunction import logic_function, LogicFunction from .LoopIndex import LoopIndex -from .Dimension import Dimension +from .Dimension import Dimension, create_dimensions \ No newline at end of file diff --git a/accera/python/accera/test/dsl_tests.py b/accera/python/accera/test/dsl_tests.py index aedc70d3..944713af 100644 --- a/accera/python/accera/test/dsl_tests.py +++ b/accera/python/accera/test/dsl_tests.py @@ -26,10 +26,12 @@ DEV_MODE = True sys.path.insert(1, os.getcwd()) -from accera import ScalarType, Array, Function, Nest, Target, Package, algorithms +from accera import ScalarType, Array, Function, Nest, Target, Package, algorithms, Dimension, cast, AllocateFlags from accera.test import verifiers from accera.test.test_utils import expectedFailure, FailedReason +INTERNAL_FUNCTION_OPTS = { "no_inline_into": True, "public": False } + TEST_MODE = Package.Mode.DEBUG if DEV_MODE else Package.Mode.RELEASE TEST_FORMAT = Package.Format.MLIR_DYNAMIC if DEV_MODE else Package.Format.HAT_DYNAMIC TEST_PACKAGE_DIR = "test_acccgen" @@ -451,6 +453,149 @@ def _(): correctness_check_values=correctness_check_values, ) + def test_array_vectorize_cast(self) -> None: + A = Array( + shape=(256, 32), + role=Array.Role.INPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.uint8, + ) + B = Array( + shape=(256, 32), + role=Array.Role.INPUT_OUTPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.int16, + ) + + nest = Nest(shape=(256, 32)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i, j] = A[i, j] + + sched = nest.create_schedule() + ii = sched.split(i, 4) + jj = sched.split(j, 16) + sched.reorder(i, j, ii, jj) + plan = sched.create_plan() + plan.vectorize(ii) # ii to in-place-unroll ii and vectorize jj + + A_test = np.random.random((256, 32)).astype(np.uint8) + B_test = np.random.random((256, 32)).astype(np.int16) + B_expected = np.ndarray((256, 32)).astype(np.int16) + B_expected[:,:] = A_test[:,:] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_expected), + } + self._verify_nest( + plan, + (A, B), + "test_array_vectorize_cast", + correctness_check_values=correctness_check_values + ) + + def test_interleaved_vectorize_cast(self) -> None: + shape = (64, 32, 8, 2) + A = Array( + shape=shape, + role=Array.Role.INPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.uint8, + ) + B = Array( + shape=shape, + role=Array.Role.INPUT_OUTPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.int16, + ) + + nest = Nest(shape=shape) + i, j, k, l = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i, j, k, l] = A[i, j, k, l] + + sched = nest.create_schedule() + plan = sched.create_plan() + plan.vectorize(k) + + A_test = np.random.random(shape).astype(np.uint8) + B_test = np.random.random(shape).astype(np.int16) + B_expected = np.ndarray(shape).astype(np.int16) + B_expected[:,:,:,:] = A_test[:,:,:,:] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_expected), + } + self._verify_nest( + plan, + (A, B), + "test_interleaved_vectorize_cast", + correctness_check_values=correctness_check_values + ) + + + def test_interleaved_vectorize_store(self) -> None: + M = 32 + N = 48 + M_tile = 2 + N_tile = 16 + input_shape = (M, N) + output_shape = (M // M_tile, N // N_tile, N_tile, M_tile) + A = Array( + shape=input_shape, + role=Array.Role.INPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.uint8, + ) + B = Array( + 
shape=output_shape, + role=Array.Role.INPUT_OUTPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.uint8, + ) + + nest = Nest(shape=output_shape) + i_outer, j_outer, j_inner, i_inner = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i_outer, j_outer, j_inner, i_inner] = A[i_outer*M_tile + i_inner, j_outer*N_tile + j_inner] + + sched = nest.create_schedule() + plan = sched.create_plan() + plan.vectorize(j_inner) + + A_test = np.random.random(input_shape).astype(np.uint8) + B_test = np.random.random(output_shape).astype(np.uint8) + B_expected = np.ndarray(output_shape).astype(np.uint8) + for i_outer in range(0, M, M_tile): + i_outer_idx = i_outer // M_tile + for j_outer in range(0, N, N_tile): + j_outer_idx = j_outer // N_tile + for j_inner in range(0, N_tile): + full_j = j_outer + j_inner + for i_inner in range(0, M_tile): + full_i = i_outer + i_inner + B_expected[i_outer_idx, j_outer_idx, j_inner, i_inner] = A_test[full_i, full_j] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_expected), + } + self._verify_nest( + plan, + (A, B), + "test_interleaved_vectorize_store", + correctness_check_values=correctness_check_values + ) + + def test_subarray(self) -> None: package = Package() @@ -1087,7 +1232,102 @@ def _(): self._verify_helper(package, test_name, function.name, correctness_check_values) + + def test_output_array_range_node1(self) -> None: + from accera import Dimension, create_dimensions, floor, cast + from accera._lang_python._lang import Scalar + + Start = Scalar(ScalarType.float32) + Limit = Scalar(ScalarType.float32) + Delta = Scalar(ScalarType.float32) + + InputDim = create_dimensions() + InputDim.role = Dimension.Role.INPUT + OutputDims = Array(shape=(1,), element_type=ScalarType.int64, role=Array.Role.INPUT_OUTPUT) + Output = Array(shape=(InputDim, ), role=Array.Role.INPUT_OUTPUT) + Output_Start = Array(shape=(1,), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT) + + nest1 = Nest((1, )) + @nest1.iteration_logic + def _(): + OutputDims[0] = cast(floor((Limit - Start) / Delta), ScalarType.int64) + + nest2 = Nest([InputDim]) + i = nest2.get_indices() + @nest2.iteration_logic + def _(): + Output[i] = Output_Start[0] + Output_Start[0] += Delta + + # Generate a function like: + # range_get_size(float start, float limit, float delta, int64_t* output_dim); + # range_get_result(int64_t input_dim, float* output, float* start, float delta); + + package = Package() + # BUGBUG: dim args ordered first due to issue with Debug mode + package.add(nest1, args=(Start, Limit, Delta, OutputDims), base_name=f"range_get_size") + package.add(nest2, args=(InputDim, Output, Output_Start, Delta), base_name=f"range_get_result") + + package.build("test_output_array_range_node1", format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR) + + + def test_output_array_range_node2(self) -> None: + from accera import Dimension, create_dimensions, floor, cast + from accera._lang_python._lang import Scalar + + Start = Scalar(ScalarType.float32) + Limit = Scalar(ScalarType.float32) + Delta = Scalar(ScalarType.float32) + + InputDim = create_dimensions() + InputDim.role = Dimension.Role.INPUT + + OutputDims = Array(shape=(1,), element_type=ScalarType.int64, role=Array.Role.INPUT_OUTPUT) + Output = Array(shape=(InputDim, ), role=Array.Role.INPUT_OUTPUT) + Output_Start = Array(shape=(1,), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT) + Output_Start_Tmp = Array(shape=(1,), 
element_type=ScalarType.float32, role=Array.Role.TEMP) + + nest1 = Nest((1, )) + @nest1.iteration_logic + def _(): + OutputDims[0] = cast(floor((Limit - Start) / Delta), ScalarType.int64) + + nest2 = Nest((1, )) + @nest2.iteration_logic + def _(): + Output_Start[0] = Start + + nest3 = Nest([InputDim]) + i = nest3.get_indices() + @nest3.iteration_logic + def _(): + Output[i] = Output_Start[0] + Output_Start[0] += Delta + + # Generate a function like: + # range_get_size(float start, float limit, float delta, int64_t* output_dim); + # ini_start(float* output_Start, float start); + # get_result(int64_t input_dim, float* output, float* start, float delta); + # range_get_output_array(int64_t input_dim, float* output, float start, float delta); + + package = Package() + # BUGBUG: dim args ordered first due to issue with Debug mode + package.add(nest1, args=(Start, Limit, Delta, OutputDims), base_name=f"range_get_size") + ini_start_fn = package.add(nest2, args=(Output_Start, Start), base_name=f"ini_start") + get_result_fn = package.add(nest3, args=(InputDim, Output, Output_Start, Delta), base_name=f"get_result") + + nest4 = Nest((1, )) + @nest4.iteration_logic + def _(): + ini_start_fn(Output_Start_Tmp, Start) + get_result_fn(InputDim, Output, Output_Start_Tmp, Delta) + + # BUGBUG: dim args ordered first due to issue with Debug mode + package.add(nest4, args=(InputDim, Output, Start, Delta), base_name=f"range_get_output_array") + + package.build("test_output_array_range_node2", format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR) + class DSLTest_02SimpleAffineLoopNests(unittest.TestCase): def _create_nest(self, shape: Tuple[int], type=ScalarType.float32) -> Tuple: @@ -1100,19 +1340,21 @@ def _create_nest(self, shape: Tuple[int], type=ScalarType.float32) -> Tuple: return Nest(shape=(M, N, S)), A, B, C - def _build_nest(self, nest, args: Tuple[Array], package_name, correctness_check_values=None) -> None: + def _build_nest(self, nest, args: Tuple[Array], package_name, correctness_check_values=None, quiet=True) -> None: # helper function to build a nest so that we can focus on the logic function # create a HAT package and add the nest to it package = Package() function = package.add(nest, args, base_name=package_name) # build the HAT package - with verifiers.VerifyPackage(self, package_name, TEST_PACKAGE_DIR) as v: + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + with verifiers.VerifyPackage(self, package_name, output_dir) as v: package.build( package_name, format=TEST_FORMAT, mode=TEST_MODE, - output_dir=TEST_PACKAGE_DIR, + output_dir=output_dir, + _quiet=quiet ) if correctness_check_values: v.check_correctness( @@ -1317,6 +1559,324 @@ def _(): self._build_nest(nest, [A, B, C], f"test_intrinsics_{t.name}") + + def test_round_intrinsic(self) -> None: + from accera import round as accround + + M = 16 + N = 8 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N)) + B = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N)) + + nest = Nest((M, N)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i, j] = accround(A[i, j]) + + A_test = np.random.uniform(low=-1000.0, high=1000.0, size=A.shape).astype(np.float32) + # Ensure there's at least one element which tests the roundeven behavior in both directions + A_test[0, 0] = 1.5 # Should round up to 2 + A_test[0, 1] = 2.5 # Should round down to 2 + B_test = np.zeros(B.shape).astype(np.int32) + + B_ref = 
A_test.round().astype(np.int32) + self.assertEqual(B_ref[0, 0], 2) + self.assertEqual(B_ref[0, 1], 2) + + correctness_check_values = { + "pre": [A_test, B_test], + "post": [A_test, B_ref] + } + + self._build_nest(nest, [A, B], "test_round_intrinsic", correctness_check_values=correctness_check_values) + + + def test_round_intrinsic_vectorized(self) -> None: + from accera import round as accround + + M = 256 + N = 128 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N)) + B = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N)) + + nest = Nest((M, N)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i, j] = accround(A[i, j]) + + sched = nest.create_schedule() + ii, jj = sched.tile({i: 4, j: 8}) + sched.reorder(i, j, ii, jj) + plan = sched.create_plan() + plan.vectorize(ii) + + A_test = np.random.uniform(low=-1000.0, high=1000.0, size=A.shape).astype(np.float32) + # Ensure there's at least one element which tests the roundeven behavior in both directions + A_test[0, 0] = 1.5 # Should round up to 2 + A_test[0, 1] = 2.5 # Should round down to 2 + B_test = np.zeros(B.shape).astype(np.int32) + + B_ref = A_test.round().astype(np.int32) + self.assertEqual(B_ref[0, 0], 2) + self.assertEqual(B_ref[0, 1], 2) + + correctness_check_values = { + "pre": [A_test, B_test], + "post": [A_test, B_ref] + } + + self._build_nest(plan, [A, B], "test_round_intrinsic_vectorized", correctness_check_values=correctness_check_values) + + + # TODO : fix this test - it appears to abort on just the linux buddy build machine + # def test_remainderf_intrinsic_rounding(self) -> None: + # from accera import remainderf, cast + + # M = 16 + # N = 8 + + # A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N)) + # B = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N)) + + # nest = Nest((M, N)) + # i, j = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # B[i, j] = cast(A[i, j] - remainderf(A[i, j], 1.0), ScalarType.int32) + + # A_test = np.random.uniform(low=-1000.0, high=1000.0, size=A.shape).astype(np.float32) + # # Ensure there's at least one element which tests the roundeven behavior in both directions + # A_test[0, 0] = 1.5 # Should round up to 2 + # A_test[0, 1] = 2.5 # Should round down to 2 + # B_test = np.zeros(B.shape).astype(np.int32) + + # B_ref = A_test.round().astype(np.int32) + # self.assertEqual(B_ref[0, 0], 2) + # self.assertEqual(B_ref[0, 1], 2) + + # correctness_check_values = { + # "pre": [A_test, B_test], + # "post": [A_test, B_ref] + # } + + # self._build_nest(nest, [A, B], "test_remainderf_intrinsic_rounding", correctness_check_values=correctness_check_values) + + + def test_vectorized_max_min(self) -> None: + from accera import max, min + + M = 128 + N = 256 + + package = Package() + func_names = [] + package_name = "test_vectorized_max_min" + correctness_check_values = {} + for t in [ScalarType.float32]: + fn_name = f"test_vectorized_max_min_{t.name}" + func_names.append(fn_name) + + nest = Nest((M, N)) + A = Array(role=Array.Role.INPUT, element_type=t, shape=(M, N)) + B = Array(role=Array.Role.INPUT, element_type=t, shape=(M, N)) + C_max = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=(M, N)) + C_min = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=(M, N)) + + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + C_max[i, j] = max(A[i, j], B[i, j]) + C_min[i, j] = min(A[i, j], B[i, j]) + + sched = nest.create_schedule() + ii, 
jj = sched.tile({i: 4, j: 8}) + sched.reorder(i, j, ii, jj) + plan = sched.create_plan() + plan.vectorize(ii) + function = package.add(plan, args=(A, B, C_max, C_min), base_name=fn_name) + + A_test = np.random.random(A.shape).astype(np.float32) + B_test = np.random.random(B.shape).astype(np.float32) + C_max_test = np.random.random(C_max.shape).astype(np.float32) + C_min_test = np.random.random(C_min.shape).astype(np.float32) + + C_max_ref = np.maximum(A_test, B_test) + C_min_ref = np.minimum(A_test, B_test) + + correctness_check_values[fn_name] = { + "pre": [A_test, B_test, C_max_test, C_min_test], + "post": [A_test, B_test, C_max_ref, C_min_ref] + } + + # build the HAT package + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build( + package_name, + format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, + mode=Package.Mode.RELEASE, + output_dir=output_dir + ) + for fn_name in func_names: + if fn_name in correctness_check_values: + v.check_correctness( + function.name, + before=correctness_check_values[fn_name]["pre"], + after=correctness_check_values[fn_name]["post"], + ) + + + def test_vectorized_single_max_min_block(self) -> None: + # In this test we're trying to find the single max and single min value of a 2-D array. + # To vectorize this, we'll want to compute several maxs and mins in paralle and then reduce them + # Note: This type of reduction can't be achieved with caching, so we manually construct a pattern similar to caching + from accera import max, min + + M = 128 + N = 256 + + M_outer_tile = 8 + M_tile = 4 + N_tile = 8 + + package = Package() + func_names = [] + package_name = "test_vectorized_single_max_min_block" + correctness_check_values = {} + for t in [ScalarType.float32]: + fn_name = f"{package_name}_{t.name}" + func_names.append(fn_name) + + A = Array(role=Array.Role.INPUT, element_type=t, shape=(M, N)) + A_max = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=(1, )) + A_min = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=(1, )) + + A_max_cache = Array(role=Array.Role.TEMP, element_type=t, shape=(M_tile, N_tile), flags=AllocateFlags.STACK) + A_min_cache = Array(role=Array.Role.TEMP, element_type=t, shape=(M_tile, N_tile), flags=AllocateFlags.STACK) + + io_A_max_cache = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=A_max_cache.shape) + io_A_min_cache = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=A_min_cache.shape) + + outer_i_dim = Dimension() + outer_j_dim = Dimension() + + # inner compute nest + + inner_nest = Nest((M_tile, N_tile)) + inner_i, inner_j = inner_nest.get_indices() + @inner_nest.iteration_logic + def _(): + i = outer_i_dim + inner_i + j = outer_j_dim + inner_j + io_A_max_cache[inner_i, inner_j] = max(io_A_max_cache[inner_i, inner_j], A[i, j]) + io_A_min_cache[inner_i, inner_j] = min(io_A_min_cache[inner_i, inner_j], A[i, j]) + + inner_sched = inner_nest.create_schedule() + inner_plan = inner_sched.create_plan() + inner_plan.vectorize(inner_i) + inner_fn = package.add(inner_plan, args=(A, io_A_max_cache, io_A_min_cache, outer_i_dim, outer_j_dim), base_name=f"{fn_name}_inner", function_opts=INTERNAL_FUNCTION_OPTS) + + # Outer nest + outer_nest = Nest((M, N)) + outer_i, outer_j = outer_nest.get_indices() + @outer_nest.iteration_logic + def _(): + inner_fn(A, io_A_max_cache, io_A_min_cache, outer_i, outer_j) + + outer_sched = outer_nest.create_schedule() + outer_ii = outer_sched.split(outer_i, M_outer_tile) + outer_iii, 
outer_jj = outer_sched.tile({outer_ii: M_tile, outer_j: N_tile}) + outer_sched.reorder(outer_i, outer_j, outer_ii, outer_iii, outer_jj) + outer_plan = outer_sched.create_plan() + outer_plan._erase_loops([outer_iii, outer_jj]) + outer_fn = package.add(outer_plan, args=(A, io_A_max_cache, io_A_min_cache), base_name=f"{fn_name}_outer", function_opts=INTERNAL_FUNCTION_OPTS) + + + # Cache zeroing nests + + def _make_init_fn(package: Package, outer_arr: Array, arr: Array, base_name: str): + zero_nest = Nest(arr.shape) + indices = zero_nest.get_indices() + @zero_nest.iteration_logic + def _(): + arr[indices] = outer_arr[indices] + + return package.add(zero_nest, args=(outer_arr, arr), base_name=base_name, function_opts=INTERNAL_FUNCTION_OPTS) + + zero_max_cache_fn = _make_init_fn(package, A, io_A_max_cache, "max_cache_zeroing") + zero_min_cache_fn = _make_init_fn(package, A, io_A_min_cache, "min_cache_zeroing") + + # Cache reducing nests + + def _make_cache_reduce_fn(package: Package, cache: Array, outer_arr: Array, base_name: str, use_max): + reduce_nest = Nest(cache.shape) + indices = reduce_nest.get_indices() + if use_max: + @reduce_nest.iteration_logic + def _(): + outer_arr[0] = max(outer_arr[0], cache[indices]) + else: + @reduce_nest.iteration_logic + def _(): + outer_arr[0] = min(outer_arr[0], cache[indices]) + + return package.add(reduce_nest, args=(cache, outer_arr), base_name=base_name, function_opts=INTERNAL_FUNCTION_OPTS) + + reduce_max_cache_fn = _make_cache_reduce_fn(package, io_A_max_cache, A_max, "max_cache_reduce", True) + reduce_min_cache_fn = _make_cache_reduce_fn(package, io_A_min_cache, A_min, "min_cache_reduce", False) + + # outer nest + + top_nest = Nest((1,)) + + @top_nest.iteration_logic + def _(): + zero_max_cache_fn(A, A_max_cache) + zero_min_cache_fn(A, A_min_cache) + outer_fn(A, A_max_cache, A_min_cache) + reduce_max_cache_fn(A_max_cache, A_max) + reduce_min_cache_fn(A_min_cache, A_min) + + function = package.add(top_nest, args=(A, A_max, A_min), base_name=fn_name) + + A_test = np.random.random(A.shape).astype(np.float32) + A_max_test = np.random.random(A_max.shape).astype(np.float32) + A_min_test = np.random.random(A_min.shape).astype(np.float32) + + A_max_ref = np.max(A_test).reshape((1,)) + A_min_ref = np.min(A_test).reshape((1,)) + + correctness_check_values[fn_name] = { + "pre": [A_test, A_max_test, A_min_test], + "post": [A_test, A_max_ref, A_min_ref] + } + + # build the HAT package + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build( + package_name, + format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, + mode=Package.Mode.RELEASE, + output_dir=output_dir + ) + for fn_name in func_names: + if fn_name in correctness_check_values: + v.check_correctness( + function.name, + before=correctness_check_values[fn_name]["pre"], + after=correctness_check_values[fn_name]["post"], + ) + + def test_intrinsics_float(self) -> None: from accera import ( abs, @@ -1461,11 +2021,11 @@ def _(): schedule = nest.create_schedule() ii = schedule.split(i, 4) - iii = schedule.split(i, 2) - iiii = schedule.split(ii, 2) + iii = schedule.split(ii, 2) + iiii = schedule.split(iii, 2) for index in [ii, iii, iiii]: self.assertIsNotNone(index) - self.assertEqual(schedule._indices, [i, iii, ii, iiii, j, k]) + self.assertEqual(schedule._indices, [i, ii, iii, iiii, j, k]) self._verify_schedule(schedule, [A, B, C], "test_schedule_split1") # split size does not divide the dimension size @@ -1966,17 +2526,14 @@ 
def _(): class DSLTest_04Fusing(unittest.TestCase): - def _verify_schedule( - self, schedule, args: Tuple[Array], package_name, correctness_check_values, quiet=True + def _verify_func( + self, package, function, package_name, correctness_check_values, quiet=True, mode=TEST_MODE ) -> None: - # create a HAT package and add the function to it - package = Package() - function = package.add(schedule, args, base_name="fusing_test") output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name # build the HAT package with verifiers.VerifyPackage(self, package_name, output_dir) as v: - package.build(package_name, format=TEST_FORMAT, mode=TEST_MODE, output_dir=output_dir, _quiet=quiet) + package.build(package_name, format=TEST_FORMAT, mode=mode, output_dir=output_dir, _quiet=quiet) if correctness_check_values: v.check_correctness( function.name, @@ -1984,6 +2541,15 @@ def _verify_schedule( after=correctness_check_values["post"], ) + def _verify_schedule( + self, schedule, args: Tuple[Array], package_name, correctness_check_values, quiet=True + ) -> None: + # create a HAT package and add the function to it + package = Package() + function = package.add(schedule, args, base_name="fusing_test") + self._verify_func(package, function, package_name, correctness_check_values, quiet) + + def test_full_iteration_space_fusing(self) -> None: from accera import fuse, Nest @@ -2763,7 +3329,7 @@ def _(): @nest1.iteration_logic def _(): - C[i1, j1] = C[i1, j1] * 0.2 + C[i1, j1] = C[i1, j1] * 0.1 schedule1 = nest1.create_schedule() ii1, jj1 = schedule1.tile({ i1: M_tile, j1: N_tile }) @@ -2816,6 +3382,298 @@ def _(): self._verify_schedule(plan, (A, B, C), "test_hierarchical_partial_fuse", None) + def test_nested_nests_matmul(self): + test_name = "test_nested_nests_matmul" + + M = 20 + N = 32 + K = 12 + M_tile = 4 + N_tile = 16 + K_tile = 3 + + package = Package() + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + B_temp = Array(role=Array.Role.TEMP, element_type=ScalarType.float32, shape=(K_tile, N_tile)) + io_B_temp = Array(role=Array.Role.INPUT_OUTPUT, element_type=B_temp.element_type, shape=B_temp.shape) + + i_tile_idx = Dimension() + j_tile_idx = Dimension() + k_tile_idx = Dimension() + + pack_b_nest = Nest([K_tile, N_tile]) + pb_k, pb_j = pack_b_nest.get_indices() + + @pack_b_nest.iteration_logic + def _pack_b(): + full_k = pb_k + k_tile_idx + full_j = pb_j + j_tile_idx + io_B_temp[pb_k, pb_j] = B[full_k, full_j] + + pack_b_fn = package.add(pack_b_nest, args=(B, io_B_temp, j_tile_idx, k_tile_idx), base_name="pack_b_tile_fn") + + matmul_nest = Nest([M_tile, N_tile, K_tile]) + mm_i, mm_j, mm_k = matmul_nest.get_indices() + + @matmul_nest.iteration_logic + def _matmul(): + full_i = mm_i + i_tile_idx + full_j = mm_j + j_tile_idx + full_k = mm_k + k_tile_idx + C[full_i, full_j] += A[full_i, full_k] * io_B_temp[mm_k, mm_j] + + matmul_sched = matmul_nest.create_schedule() + mm_jj = matmul_sched.split(mm_j, 8) + matmul_sched.reorder(mm_k, mm_i, mm_j, mm_jj) + matmul_plan = matmul_sched.create_plan() + matmul_plan.vectorize(mm_jj) + matmul_fn = package.add(matmul_plan, args=(A, B, C, io_B_temp, i_tile_idx, j_tile_idx, k_tile_idx), base_name="matmul_tile_fn") + + tile_nest = Nest([M, N, K]) + i, j, k = tile_nest.get_indices() + + @tile_nest.iteration_logic + def _tile_logic(): + pack_b_fn(B, B_temp, j, k) + 
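# Each (i, j, k) reaching this body is a tile origin: pack_b_fn fills B_temp for the current tile and matmul_fn consumes it over the M_tile x N_tile x K_tile interior, which is why the inner ii/jj/kk loops are erased from the outer schedule below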
matmul_fn(A, B, C, B_temp, i, j, k) + + tile_sched = tile_nest.create_schedule() + ii, jj, kk = tile_sched.tile(dict(zip([i, j, k], [M_tile, N_tile, K_tile]))) + tile_sched.reorder(i, j, k, ii, jj, kk) + tile_plan = tile_sched.create_plan() + tile_plan._erase_loops([ii, jj, kk]) + full_fn = package.add(tile_plan, args=(A, B, C), base_name="full_matmul_fn") + + A_test = np.random.random(A.shape).astype(np.float32) + B_test = np.random.random(B.shape).astype(np.float32) + C_test = np.random.random(C.shape).astype(np.float32) + + A_ref = A_test + B_ref = B_test + C_ref = A_test @ B_test + C_test + + correctness_check_values = { + "pre": [A_test, B_test, C_test], + "post": [A_ref, B_ref, C_ref], + } + self._verify_func(package, full_fn, test_name, correctness_check_values, quiet=False, mode=Package.Mode.RELEASE) + + + def test_nested_nests_matmul_boundary(self): + test_name = "test_nested_nests_matmul_boundary" + from accera import min, Dimension + + M = 20 + N = 32 + K = 12 + M_tile = 4 + N_tile = 12 # 32 doesn't divide 12 so we should have an 8 element boundary in the N dimension + K_tile = 3 + + package = Package() + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + B_temp = Array(role=Array.Role.TEMP, element_type=ScalarType.float32, shape=(K_tile, N_tile)) + io_B_temp = Array(role=Array.Role.INPUT_OUTPUT, element_type=B_temp.element_type, shape=B_temp.shape) + + i_tile_idx = Dimension() + j_tile_idx = Dimension() + k_tile_idx = Dimension() + + n_tile_dim = Dimension() + + pack_b_nest = Nest([K_tile, n_tile_dim]) + pb_k, pb_j = pack_b_nest.get_indices() + + @pack_b_nest.iteration_logic + def _pack_b(): + full_k = pb_k + k_tile_idx + full_j = pb_j + j_tile_idx + io_B_temp[pb_k, pb_j] = B[full_k, full_j] + + pack_b_fn = package.add(pack_b_nest, args=(n_tile_dim, B, io_B_temp, j_tile_idx, k_tile_idx), base_name="pack_b_tile_fn") + + matmul_nest = Nest([M_tile, n_tile_dim, K_tile]) + mm_i, mm_j, mm_k = matmul_nest.get_indices() + + @matmul_nest.iteration_logic + def _matmul(): + full_i = mm_i + i_tile_idx + full_j = mm_j + j_tile_idx + full_k = mm_k + k_tile_idx + C[full_i, full_j] += A[full_i, full_k] * io_B_temp[mm_k, mm_j] + + matmul_sched = matmul_nest.create_schedule() + mm_jj = matmul_sched.split(mm_j, 8) + matmul_sched.reorder(mm_k, mm_i, mm_j, mm_jj) + matmul_plan = matmul_sched.create_plan() + matmul_fn = package.add(matmul_plan, args=(n_tile_dim, A, B, C, io_B_temp, i_tile_idx, j_tile_idx, k_tile_idx), base_name="matmul_tile_fn") + + tile_nest = Nest([M, N, K]) + i, j, k = tile_nest.get_indices() + + @tile_nest.iteration_logic + def _tile_logic(): + n_tile_extent = min(cast(N_tile, ScalarType.index), cast(N, ScalarType.index) - j) + pack_b_fn(n_tile_extent, B, B_temp, j, k) + matmul_fn(n_tile_extent, A, B, C, B_temp, i, j, k) + + tile_sched = tile_nest.create_schedule() + ii, jj, kk = tile_sched.tile(dict(zip([i, j, k], [M_tile, N_tile, K_tile]))) + tile_sched.reorder(i, j, k, ii, jj, kk) + tile_plan = tile_sched.create_plan() + tile_plan._erase_loops([ii, jj, kk]) + full_fn = package.add(tile_plan, args=(A, B, C), base_name="full_matmul_fn") + + A_test = np.random.random(A.shape).astype(np.float32) + B_test = np.random.random(B.shape).astype(np.float32) + C_test = np.random.random(C.shape).astype(np.float32) + + A_ref = A_test + B_ref = B_test + C_ref = A_test @ B_test + C_test + 
+ correctness_check_values = { + "pre": [A_test, B_test, C_test], + "post": [A_ref, B_ref, C_ref], + } + self._verify_func(package, full_fn, test_name, correctness_check_values, quiet=False, mode=Package.Mode.RELEASE) + + + def test_double_nested_nests_matmul_boundary(self): + test_name = "test_double_nested_nests_matmul_boundary" + from accera import min, Dimension + + M = 20 + N = 32 + K = 12 + M_tile = 4 + N_tile = 12 # 32 doesn't divide 12 so we should have an 8 element boundary in the N dimension + N_kernel_tile = 8 # Doesn't divide N_tile so we should have a 4 element boundary in the N dimension in the outer main loop and no inner boundary in the outer boundary loop + K_tile = 3 + + package = Package() + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + B_temp = Array(role=Array.Role.TEMP, element_type=ScalarType.float32, shape=(K_tile, N_tile)) + io_B_temp = Array(role=Array.Role.INPUT_OUTPUT, element_type=B_temp.element_type, shape=B_temp.shape) + + n_tile_dim = Dimension() + n_kernel_dim = Dimension() + + i_tile_idx = Dimension() + j_tile_idx = Dimension() + k_tile_idx = Dimension() + + i_kernel_idx = Dimension() + j_kernel_idx = Dimension() + k_kernel_idx = Dimension() + + pack_b_nest = Nest([K_tile, n_tile_dim]) + pb_k, pb_j = pack_b_nest.get_indices() + + @pack_b_nest.iteration_logic + def _pack_b(): + full_k = pb_k + k_tile_idx + full_j = pb_j + i_tile_idx + io_B_temp[pb_k, pb_j] = B[full_k, full_j] + + pack_b_fn = package.add( + pack_b_nest, + args=(n_tile_dim, B, io_B_temp, i_tile_idx, k_tile_idx), + base_name="pack_b_tile_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + matmul_kernel_nest = Nest((n_kernel_dim,)) + mmk_j = matmul_kernel_nest.get_indices() + + @matmul_kernel_nest.iteration_logic + def _matmul(): + tile_j = mmk_j + j_kernel_idx + + full_i = i_kernel_idx + i_tile_idx + full_j = tile_j + j_tile_idx + full_k = k_kernel_idx + k_tile_idx + C[full_i, full_j] += A[full_i, full_k] * io_B_temp[k_kernel_idx, tile_j] + + matmul_kernel_sched = matmul_kernel_nest.create_schedule() + mmk_jj = matmul_kernel_sched.split(mmk_j, N_kernel_tile) + matmul_kernel_sched.reorder(mmk_j, mmk_jj) + matmul_kernel_plan = matmul_kernel_sched.create_plan() + matmul_kernel_fn = package.add(matmul_kernel_plan, + args=(n_kernel_dim, + A, B, C, io_B_temp, + i_tile_idx, j_tile_idx, k_tile_idx, + i_kernel_idx, j_kernel_idx, k_kernel_idx), + base_name="matmul_kernel_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + + matmul_tile_nest = Nest([M_tile, n_tile_dim, K_tile]) + mm_i, mm_j, mm_k = matmul_tile_nest.get_indices() + + @matmul_tile_nest.iteration_logic + def _matmul(): + n_kernel_extent = min(cast(N_kernel_tile, ScalarType.index), n_tile_dim - mm_j) + matmul_kernel_fn(n_kernel_extent, + A, B, C, io_B_temp, + i_tile_idx, j_tile_idx, k_tile_idx, + mm_i, mm_j, mm_k) + + matmul_tile_sched = matmul_tile_nest.create_schedule() + mm_jj = matmul_tile_sched.split(mm_j, N_tile) + mm_jjj = matmul_tile_sched.split(mm_jj, N_kernel_tile) + matmul_tile_sched.reorder(mm_k, mm_i, mm_j, mm_jj, mm_jjj) + matmul_tile_plan = matmul_tile_sched.create_plan() + matmul_tile_plan._erase_loops([mm_jjj]) + matmul_tile_fn = package.add( + matmul_tile_plan, + args=(n_tile_dim, A, B, C, io_B_temp, i_tile_idx, j_tile_idx, k_tile_idx), + base_name="matmul_tile_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + + tile_nest = 
Nest([M, N, K]) + i, j, k = tile_nest.get_indices() + + @tile_nest.iteration_logic + def _tile_logic(): + n_tile_extent = min(cast(N_tile, ScalarType.index), cast(N, ScalarType.index) - j) + pack_b_fn(n_tile_extent, B, B_temp, j, k) + matmul_tile_fn(n_tile_extent, A, B, C, B_temp, i, j, k) + + tile_sched = tile_nest.create_schedule() + ii, jj, kk = tile_sched.tile(dict(zip([i, j, k], [M_tile, N_tile, K_tile]))) + tile_sched.reorder(i, j, k, ii, jj, kk) + tile_plan = tile_sched.create_plan() + tile_plan._erase_loops([ii, jj, kk]) + full_fn = package.add(tile_plan, args=(A, B, C), base_name="full_matmul_fn") + + A_test = np.random.random(A.shape).astype(np.float32) + B_test = np.random.random(B.shape).astype(np.float32) + C_test = np.random.random(C.shape).astype(np.float32) + + A_ref = A_test + B_ref = B_test + C_ref = A_test @ B_test + C_test + + correctness_check_values = { + "pre": [A_test, B_test, C_test], + "post": [A_ref, B_ref, C_ref], + } + self._verify_func(package, full_fn, test_name, correctness_check_values, quiet=False, mode=Package.Mode.RELEASE) + + class DSLTest_05Targets(unittest.TestCase): def test_known_targets(self) -> None: intel_name = "Intel 6400" diff --git a/accera/python/accera/test/smoke_tests.py b/accera/python/accera/test/smoke_tests.py index 514cd437..22784704 100644 --- a/accera/python/accera/test/smoke_tests.py +++ b/accera/python/accera/test/smoke_tests.py @@ -42,8 +42,11 @@ DEV_MODE = True sys.path.insert(1, os.getcwd()) -from accera import Package, ScalarType, Nest, Array, Constants, Scalar, fuse, create_parameters +INTERNAL_FUNCTION_OPTS = { "no_inline_into": True, "public": False } + +from accera import Package, ScalarType, Nest, Array, Constants, Scalar, fuse, create_parameters, Dimension, cast from accera._lang_python._lang import _MemorySpace, _MMASchedulingPolicy, _MMAShape +from accera import min as accmin from accera.samples import MatrixMultiplication from accera.test import verifiers from accera.test.test_utils import expectedFailure, FailedReason @@ -2843,6 +2846,475 @@ def _(): self._verify_matrix_multiplication_function(function, package, test_name, check_correctness=check_correctness) + # TODO : move vpmaddwd tests to a different test file + def test_signextend_int16_matmul_vpmaddwd(self): + from accera import AllocateFlags + test_name = "test_signextend_int16_matmul_vpmaddwd" + + def inout_array(arr: Array): + # Copy the array info but change it to input-output role for use in an inner function declaration + return Array(role=Array.Role.INPUT_OUTPUT, element_type=arr.element_type, shape=arr.shape) + + M = 240 + N = 256 + K = 256 + + M_tile = 24 + N_tile = 128 + K_tile = 128 + + M_kernel_tile = 6 + N_kernel_tile = 16 + + N_vector_tile = 8 + K_vector_tile = 2 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.int16, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.uint8, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N), layout=Array.Layout.FIRST_MAJOR) + + A_cache = Array(role=Array.Role.TEMP, + element_type=ScalarType.int16, + shape=(M_tile, K_tile), + layout=Array.Layout.FIRST_MAJOR, + flags=AllocateFlags.HEAP) + B_cache = Array(role=Array.Role.TEMP, + element_type=ScalarType.uint8, + shape=(N_tile // N_kernel_tile, K_tile // K_vector_tile, N_kernel_tile, K_vector_tile), + layout=Array.Layout.FIRST_MAJOR, + flags=AllocateFlags.HEAP) + + C_cache = Array(role=Array.Role.TEMP, + 
element_type=ScalarType.int32, + shape=(M_kernel_tile, N_kernel_tile), + layout=Array.Layout.FIRST_MAJOR, + flags=AllocateFlags.STACK) # Stack allocate the small accumulation cache + + io_A_cache = inout_array(A_cache) + io_B_cache = inout_array(B_cache) + io_C_cache = inout_array(C_cache) + + B_ext = Array(role=Array.Role.TEMP, + element_type=ScalarType.int16, + shape=(N_kernel_tile, K_vector_tile), + layout=Array.Layout.FIRST_MAJOR, + flags=AllocateFlags.STACK) + + io_B_ext = inout_array(B_ext) + + m_tile_dim = Dimension() + n_tile_dim = Dimension() + k_tile_dim = Dimension() + m_kernel_dim = Dimension() + n_kernel_dim = Dimension() + k_kernel_dim = Dimension() + m_vector_dim = Dimension() + + i_tile_idx = Dimension() + j_tile_idx = Dimension() + k_tile_idx = Dimension() + i_kernel_idx = Dimension() + j_kernel_idx = Dimension() + k_kernel_idx = Dimension() + i_vector_idx = Dimension() + + package = Package() + + ### Matmul inner kernel tile + mmi_nest = Nest(shape=(n_kernel_dim, k_kernel_dim)) + mmi_j, mmi_k = mmi_nest.get_indices() + @mmi_nest.iteration_logic + def _matmul_inner(): + mmi_i = i_vector_idx + tile_i = i_kernel_idx + mmi_i + tile_j = j_kernel_idx + mmi_j + tile_k = k_kernel_idx + mmi_k + io_C_cache[mmi_i, mmi_j] += io_A_cache[tile_i, tile_k] * io_B_ext[mmi_j, mmi_k] + + mmi_sched = mmi_nest.create_schedule() + mmi_jj, mmi_kk = mmi_sched.tile(dict(zip([mmi_j, mmi_k], [N_kernel_tile, K_vector_tile]))) + mmi_jjj = mmi_sched.split(mmi_jj, N_vector_tile) + mmi_sched.reorder(mmi_j, mmi_k, mmi_jj, mmi_jjj, mmi_kk) + mmi_plan = mmi_sched.create_plan() + mmi_plan.vectorize(mmi_jjj) + mmi_fn = package.add(mmi_plan, + args=(n_kernel_dim, k_kernel_dim, + io_A_cache, io_B_ext, io_C_cache, + i_kernel_idx, j_kernel_idx, k_kernel_idx, i_vector_idx), + base_name="matmul_kernel", + function_opts=INTERNAL_FUNCTION_OPTS) + + ### B element zero extend + bext_nest = Nest((n_kernel_dim, k_kernel_dim)) + bext_j, bext_k = bext_nest.get_indices() + @bext_nest.iteration_logic + def _bext(): + tile_j = j_kernel_idx + tile_k = k_kernel_idx + io_B_ext[bext_j, bext_k] = io_B_cache[tile_j / N_kernel_tile, tile_k / K_vector_tile, bext_j, bext_k] + + bext_sched = bext_nest.create_schedule() + bext_jj, bext_kk = bext_sched.tile(dict(zip([bext_j, bext_k], [N_kernel_tile, K_vector_tile]))) + bext_jjj = bext_sched.split(bext_jj, N_vector_tile) + bext_sched.reorder(bext_j, bext_k, bext_jj, bext_jjj, bext_kk) + bext_plan = bext_sched.create_plan() + bext_plan.vectorize(bext_jjj) + bext_fn = package.add(bext_plan, + args=(n_kernel_dim, k_kernel_dim, + io_B_cache, io_B_ext, + j_kernel_idx, k_kernel_idx), + base_name="b_ext_kernel", + function_opts=INTERNAL_FUNCTION_OPTS) + + + ### Matmul outer kernel tile + mmo_nest = Nest(shape=(m_kernel_dim, k_tile_dim)) + mmo_i, mmo_k = mmo_nest.get_indices() + @mmo_nest.iteration_logic + def _matmul(): + + ### NOTE: The order of operands in this accmin is important + # When we vectorize a min statement that is either always true or always false, we can simplify it better. 
+ # accmin internally uses "less-than" as the min operator, so here we order (k_tile_dim - mmo_k, K_vector_tile) because: + # k_tile_dim - mmo_k < K_vector_tile + # Is false for k_tile_dim - mmo_k >= K_vector_tile + # And importantly for vectorization it is therefore false for the entire K_vector_tile inner split and gets simplified + k_kernel_extent = accmin(k_tile_dim - mmo_k, cast(K_vector_tile, ScalarType.index)) + + bext_fn(n_kernel_dim, k_kernel_extent, io_B_cache, B_ext, j_kernel_idx, mmo_k) + mmi_fn(n_kernel_dim, k_kernel_extent, io_A_cache, B_ext, io_C_cache, i_kernel_idx, j_kernel_idx, mmo_k, mmo_i) + + mmo_sched = mmo_nest.create_schedule() + mmo_ii, mmo_kk = mmo_sched.tile(dict(zip([mmo_i, mmo_k], [M_kernel_tile, K_tile]))) + mmo_kkk = mmo_sched.split(mmo_kk, K_vector_tile) + mmo_sched.reorder(mmo_k, mmo_i, mmo_kk, mmo_ii, mmo_kkk) + mmo_plan = mmo_sched.create_plan() + mmo_plan._erase_loops([mmo_kkk]) + mmo_fn = package.add(mmo_plan, + args=(m_kernel_dim, n_kernel_dim, k_tile_dim, + io_A_cache, io_B_cache, io_C_cache, + i_kernel_idx, j_kernel_idx), + base_name="matmul_kernel", + function_opts=INTERNAL_FUNCTION_OPTS) + + + ### C cache init + cci_nest = Nest(shape=(M_kernel_tile, N_kernel_tile)) + cci_i, cci_j = cci_nest.get_indices() + @cci_nest.iteration_logic + def _cci(): + io_C_cache[cci_i, cci_j] = 0 + + cci_sched = cci_nest.create_schedule() + cci_plan = cci_sched.create_plan() + cci_fn = package.add(cci_plan, args=(io_C_cache,), base_name="c_cache_init_kernel", function_opts=INTERNAL_FUNCTION_OPTS) + + ### C cache reduce + ccr_nest = Nest(shape=(m_kernel_dim, n_kernel_dim)) + ccr_i, ccr_j = ccr_nest.get_indices() + @ccr_nest.iteration_logic + def _ccr(): + global_i = i_tile_idx + i_kernel_idx + ccr_i + global_j = j_tile_idx + j_kernel_idx + ccr_j + C[global_i, global_j] += io_C_cache[ccr_i, ccr_j] + + ccr_sched = ccr_nest.create_schedule() + ccr_ii, ccr_jj = ccr_sched.tile(dict(zip([ccr_i, ccr_j], [M_kernel_tile, N_kernel_tile]))) + ccr_sched.reorder(ccr_i, ccr_j, ccr_ii, ccr_jj) + ccr_plan = ccr_sched.create_plan() + ccr_plan.vectorize(ccr_ii) + ccr_fn = package.add(ccr_plan, + args=(m_kernel_dim, n_kernel_dim, + C, io_C_cache, + i_tile_idx, j_tile_idx, + i_kernel_idx, j_kernel_idx), + base_name="c_cache_reduce_kernel", + function_opts=INTERNAL_FUNCTION_OPTS) + + ### A cache pack + pa_nest = Nest(shape=(m_tile_dim, k_tile_dim)) + pa_i, pa_k = pa_nest.get_indices() + @pa_nest.iteration_logic + def _pack_a(): + global_i = i_tile_idx + pa_i + global_k = k_tile_idx + pa_k + io_A_cache[pa_i, pa_k] = A[global_i, global_k] + + pa_sched = pa_nest.create_schedule() + pa_ii, pa_kk = pa_sched.tile(dict(zip([pa_i, pa_k], [M_tile, K_tile]))) + pa_sched.reorder(pa_i, pa_k, pa_ii, pa_kk) + pa_plan = pa_sched.create_plan() + pa_fn = package.add(pa_plan, + args=(m_tile_dim, k_tile_dim, + A, io_A_cache, + i_tile_idx, k_tile_idx), + base_name="pack_a", + function_opts=INTERNAL_FUNCTION_OPTS) + + + ### B cache pack + pb_nest = Nest(shape=(n_tile_dim, k_tile_dim)) + pb_j, pb_k = pb_nest.get_indices() + @pb_nest.iteration_logic + def _pack_b(): + global_j = j_tile_idx + pb_j + global_k = k_tile_idx + pb_k + io_B_cache[pb_j / N_kernel_tile, pb_k / K_vector_tile, pb_j % N_kernel_tile, pb_k % K_vector_tile] = B[global_k, global_j] + + pb_sched = pb_nest.create_schedule() + pb_jj, pb_kk = pb_sched.tile(dict(zip([pb_j, pb_k], [N_tile, K_tile]))) + pb_jjj, pb_kkk = pb_sched.tile(dict(zip([pb_jj, pb_kk], [N_vector_tile, K_vector_tile]))) + pb_sched.reorder(pb_j, pb_k, pb_jj, pb_kk, pb_jjj, pb_kkk) 
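# Illustrative sketch (plain NumPy, hypothetical `pack_b_tile_reference`): the pack_b nest above
# stores each (N_kernel_tile x K_vector_tile) block of a B tile contiguously, which is the layout the
# zero-extend and matmul kernels read back via [j / N_kernel_tile, k / K_vector_tile, j % N_kernel_tile, k % K_vector_tile].
import numpy as np

def pack_b_tile_reference(B_full, j0, k0, N_tile, K_tile, N_kernel_tile, K_vector_tile):
    # Equivalent index mapping for one (K_tile x N_tile) tile of B whose top-left corner is (k0, j0)
    packed = np.empty((N_tile // N_kernel_tile, K_tile // K_vector_tile, N_kernel_tile, K_vector_tile),
                      dtype=B_full.dtype)
    for j in range(N_tile):
        for k in range(K_tile):
            packed[j // N_kernel_tile, k // K_vector_tile, j % N_kernel_tile, k % K_vector_tile] = B_full[k0 + k, j0 + j]
    return packed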
+ pb_plan = pb_sched.create_plan() + pb_plan.vectorize(pb_jjj) + pb_fn = package.add(pb_plan, + args=(n_tile_dim, k_tile_dim, + B, io_B_cache, + j_tile_idx, k_tile_idx), + base_name="pack_b", + function_opts=INTERNAL_FUNCTION_OPTS) + + + compute_kernel_nest = Nest(shape=(1,)) + @compute_kernel_nest.iteration_logic + def _hack(): + cci_fn(C_cache) # Don't need to range-clamp this, we can just zero out the full buffer every time + mmo_fn(m_kernel_dim, n_kernel_dim, k_tile_dim, io_A_cache, io_B_cache, C_cache, i_kernel_idx, j_kernel_idx) + ccr_fn(m_kernel_dim, n_kernel_dim, C, C_cache, i_tile_idx, j_tile_idx, i_kernel_idx, j_kernel_idx) + + compute_kernel_sched = compute_kernel_nest.create_schedule() + compute_kernel_plan = compute_kernel_sched.create_plan() + compute_kernel_fn = package.add(compute_kernel_plan, + args=( + m_kernel_dim, n_kernel_dim, k_tile_dim, + io_A_cache, io_B_cache, C, + i_tile_idx, j_tile_idx, k_tile_idx, + i_kernel_idx, j_kernel_idx), + base_name="compute_kernel_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + tile_nest = Nest(shape=(m_tile_dim, n_tile_dim)) + tile_i, tile_j = tile_nest.get_indices() + + @tile_nest.iteration_logic + def _tile(): + m_kernel_extent = accmin(m_tile_dim - tile_i, cast(M_kernel_tile, ScalarType.index)) + n_kernel_extent = accmin(n_tile_dim - tile_j, cast(N_kernel_tile, ScalarType.index)) + compute_kernel_fn(m_kernel_extent, n_kernel_extent, k_tile_dim, + io_A_cache, io_B_cache, C, + i_tile_idx, j_tile_idx, k_tile_idx, + tile_i, tile_j) + + tile_sched = tile_nest.create_schedule() + tile_ii, tile_jj = tile_sched.tile({ tile_i: M_tile, tile_j: N_tile }) + tile_iii, tile_jjj = tile_sched.tile({ tile_ii: M_kernel_tile, tile_jj: N_kernel_tile }) + tile_sched.reorder(tile_i, tile_j, tile_ii, tile_jj, tile_iii, tile_jjj) + tile_plan = tile_sched.create_plan() + tile_plan._erase_loops([tile_iii, tile_jjj]) + tile_fn = package.add(tile_plan, + args=(m_tile_dim, n_tile_dim, k_tile_dim, + io_A_cache, io_B_cache, C, + i_tile_idx, j_tile_idx, k_tile_idx), + base_name="tile_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + + global_nest = Nest(shape=(M, N, K)) + global_i, global_j, global_k = global_nest.get_indices() + + @global_nest.iteration_logic + def _tile(): + m_tile_extent = accmin(M - global_i, cast(M_tile, ScalarType.index)) + n_tile_extent = accmin(N - global_j, cast(N_tile, ScalarType.index)) + k_tile_extent = accmin(K - global_k, cast(K_tile, ScalarType.index)) + + pa_fn(m_tile_extent, k_tile_extent, A, A_cache, global_i, global_k) + pb_fn(n_tile_extent, k_tile_extent, B, B_cache, global_j, global_k) + tile_fn(m_tile_extent, n_tile_extent, k_tile_extent, A_cache, B_cache, C, global_i, global_j, global_k) + + global_sched = global_nest.create_schedule() + global_ii, global_jj, global_kk = global_sched.tile({ global_i: M_tile, global_j: N_tile, global_k: K_tile }) + global_sched.reorder(global_i, global_j, global_k, global_ii, global_jj, global_kk) + global_plan = global_sched.create_plan() + global_plan._erase_loops([global_ii, global_jj, global_kk]) + + function = package.add(global_plan, args=(A, B, C), base_name=test_name) + + A_test = np.random.random((M, K)).astype(np.int16) + B_test = np.random.random((K, N)).astype(np.uint8) + C_test = np.random.random((M, N)).astype(np.int32) + + correctness_check_values = { + "pre": (A_test, B_test, C_test), + "post": (A_test, B_test, C_test + A_test @ B_test), + } + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + + # build the HAT package + with verifiers.VerifyPackage(self, test_name, 
output_dir) as v: + package.build(test_name, format=Package.Format.DEFAULT | Package.Format.MLIR, mode=Package.Mode.RELEASE, output_dir=output_dir, _quiet=False) + v.check_correctness( + function.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], + ) + + + def test_int16_matmul_vpmaddwd(self): + test_name = "test_int16_matmul_vpmaddwd" + M = 240 + N = 256 + K = 256 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.int16, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.int16, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N), layout=Array.Layout.FIRST_MAJOR) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + ii, jj, kk = schedule.tile({ i: 24, j: 128, k: 128 }) + iii, jjj, kkk = schedule.tile({ ii: 6, jj: 16, kk: 4 }) + jjjj, kkkk = schedule.tile({ jjj: 8, kkk: 2 }) + + schedule.reorder(i, j, k, + ii, jj, kk, + kkk, iii, jjj, + jjjj, kkkk) + + plan = schedule.create_plan() + plan.cache(A, index = ii, element_type = ScalarType.int16, vectorize=False) + plan.cache(B, index = jjjj, trigger_index = jj, layout = Array.Layout.LAST_MAJOR, vectorize=False) + plan.cache(C, iii) + plan.vectorize(jjjj) + + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + A_test = np.random.random((M, K)).astype(np.int16) + B_test = np.random.random((K, N)).astype(np.int16) + C_test = np.random.random((M, N)).astype(np.int32) + + correctness_check_values = { + "pre": (A_test, B_test, C_test), + "post": (A_test, B_test, C_test + A_test @ B_test), + } + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + + # build the HAT package + with verifiers.VerifyPackage(self, test_name, output_dir) as v: + package.build(test_name, format=Package.Format.DEFAULT, mode=Package.Mode.RELEASE, output_dir=output_dir, _quiet=False) + v.check_correctness( + function.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], + ) + + + + def test_int32_horizontal_vector_add(self): + test_name = "test_int32_horizontal_vector_add" + M = 256 + N = 16 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.int32, shape=(M, N), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.int32, shape=(M,), layout=Array.Layout.FIRST_MAJOR) + + nest = Nest(shape=(M, N)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i] += A[i, j] + + schedule = nest.create_schedule() + + plan = schedule.create_plan() + plan.vectorize(j) + + package = Package() + function = package.add(plan, args=(A, B), base_name=test_name) + + A_test = np.random.random((M, N)).astype(np.int32) + B_test = np.random.random((M,)).astype(np.int32) + + B_ref = np.zeros((M,)).astype(np.int32) + B_ref[:] = B_test[:] + for j in range(N): + B_ref[:] += A_test[:, j] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_ref), + } + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + + # build the HAT package + with verifiers.VerifyPackage(self, test_name, output_dir) as v: + package.build(test_name, format=Package.Format.DEFAULT, mode=Package.Mode.RELEASE, output_dir=output_dir, _quiet=False) + v.check_correctness( + function.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], 
+ ) + + def test_int16_to_int32_horizontal_vector_add_simple(self): + test_name = "test_int16_to_int32_horizontal_vector_add_simple" + M = 256 + N = 16 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.int16, shape=(M, N), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.int32, shape=(M,), layout=Array.Layout.FIRST_MAJOR) + + nest = Nest(shape=(M, N)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i] += A[i, j] + + schedule = nest.create_schedule() + ii = schedule.split(i, 4) + schedule.reorder(i, ii, j) + plan = schedule.create_plan() + plan.vectorize(ii) + + package = Package() + function = package.add(plan, args=(A, B), base_name=test_name) + + A_test = np.random.random((M, N)).astype(np.int16) + B_test = np.random.random((M,)).astype(np.int32) + + B_ref = np.zeros((M,)).astype(np.int32) + B_ref[:] = B_test[:] + for j in range(N): + B_ref[:] += A_test[:, j] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_ref), + } + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + + # build the HAT package + with verifiers.VerifyPackage(self, test_name, output_dir) as v: + package.build(test_name, format=Package.Format.DEFAULT, mode=Package.Mode.RELEASE, output_dir=output_dir, _quiet=False) + v.check_correctness( + function.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], + ) + + # Cache widening the type def test_matmul_input_cache_element_type_widen(self) -> None: test_name = "test_matmul_input_cache_element_type_widen" @@ -4165,13 +4637,13 @@ def file_check_fn(verifier): # Function decl checker.check_label('accv.func nested @test_gpu_cache_different_input_layouts_') checker.check_same( - '%[[Array_A:[a-z0-9_]+]]: memref<4x2560x2048xf32, affine_map<(d0, d1, d2) -> (d0 * 5242880 + d1 * 2048 + d2)>>' + '%[[Array_A:[a-z0-9_]+]]: memref<4x2560x2048xf32>' ) checker.check_same( '%[[Array_B:[a-z0-9_]+]]: memref<4x2048x1536xf32, affine_map<(d0, d1, d2) -> (d0 + d1 * 4 + d2 * 8192)>>' ) checker.check_same( - '%[[Array_C:[a-z0-9_]+]]: memref<4x2560x1536xf32, affine_map<(d0, d1, d2) -> (d0 * 3932160 + d1 * 1536 + d2)>>' + '%[[Array_C:[a-z0-9_]+]]: memref<4x2560x1536xf32>' ) # Block X/Y @@ -4184,8 +4656,6 @@ def file_check_fn(verifier): # Loops outside of cache regions checker.check('affine.for %[[b_iv:[a-z0-9_]+]] = 0 to 4 {') - checker.check('affine.for %[[Block_X_iv:[a-z0-9_]+]] = 0 to 1 {') - checker.check('affine.for %[[Block_Y_iv:[a-z0-9_]+]] = 0 to 1 {') checker.check('affine.for %[[k_iv:[a-z0-9_]+]] = 0 to 2048 step 512 {') checker.check('affine.for %[[kk_iv:[a-z0-9_]+]] = 0 to 512 step 32 {') @@ -4194,10 +4664,8 @@ def file_check_fn(verifier): checker.check('%[[Thread_X:[0-9_]+]] = gpu.thread_id x') checker.check('%[[Thread_Y:[0-9_]+]] = gpu.thread_id y') checker.check('affine.for %[[lpt_iv:[a-z0-9_]+]] = 0 to 2 {') - checker.check('affine.for %[[Thread_X_iv:[a-z0-9_]+]] = 0 to 1 {') - checker.check('affine.for %[[Thread_Y_iv:[a-z0-9_]+]] = 0 to 1 {') checker.check( - '%[[Loaded_A_Val:[0-9_]+]] = affine.load %[[Array_A]][%[[b_iv]], symbol(%[[Block_X]]) * 16 + symbol(%[[Thread_X]]) - (symbol(%[[Block_X]]) floordiv 160) * 2560, %[[lpt_iv]] * 16 + %[[k_iv]] + %[[kk_iv]] + symbol(%[[Thread_Y]])] : memref<4x2560x2048xf32, affine_map<(d0, d1, d2) -> (d0 * 5242880 + d1 * 2048 + d2)>>' + '%[[Loaded_A_Val:[0-9_]+]] = affine.load %[[Array_A]][%[[b_iv]], symbol(%[[Block_X]]) * 16 + symbol(%[[Thread_X]]) - (symbol(%[[Block_X]]) floordiv 160) * 2560, 
%[[lpt_iv]] * 16 + %[[k_iv]] + %[[kk_iv]] + symbol(%[[Thread_Y]])] : memref<4x2560x2048xf32>' ) checker.check( 'affine.store %[[Loaded_A_Val]], %[[Cache_A]][0, symbol(%[[Thread_X]]), %[[lpt_iv]] * 16 + symbol(%[[Thread_Y]])] : memref<1x16x32xf32, 3>' @@ -4208,8 +4676,6 @@ def file_check_fn(verifier): checker.check('%[[Thread_X:[0-9_]+]] = gpu.thread_id x') checker.check('%[[Thread_Y:[0-9_]+]] = gpu.thread_id y') checker.check('affine.for %[[lpt_iv:[a-z0-9_]+]] = 0 to 2 {') - checker.check('affine.for %[[Thread_X_iv:[a-z0-9_]+]] = 0 to 1 {') - checker.check('affine.for %[[Thread_Y_iv:[a-z0-9_]+]] = 0 to 1 {') checker.check( '%[[Loaded_B_Val:[0-9_]+]] = affine.load %[[Array_B]][%[[b_iv]], %[[k_iv]] + %[[kk_iv]] + symbol(%[[Thread_Y]]) * 16 + symbol(%[[Thread_X]]) - (symbol(%[[Thread_Y]]) floordiv 2) * 32, %[[lpt_iv]] * 8 + symbol(%[[Block_Y]]) * 16 - (symbol(%[[Block_Y]]) floordiv 96) * 1536 + symbol(%[[Thread_Y]]) floordiv 2 - ((%[[lpt_iv]] * 8 + symbol(%[[Thread_Y]]) floordiv 2) floordiv 16) * 16] : memref<4x2048x1536xf32, affine_map<(d0, d1, d2) -> (d0 + d1 * 4 + d2 * 8192)>>' ) @@ -5058,5 +5524,102 @@ def _(): ) + def test_loop_erase_hack(self) -> None: + # We want to fuse two nests along at least one dimension that only one of them should actually have, but for positioning reasons + # it must exist in both. We therefore fuse along all the dimensions and erase the inner unfused loops that we don't actually need + + M = 256 + N = 128 + K = 512 + M_tile = 32 + N_tile = 16 + K_tile = 8 + A = Array(role=Array.Role.INPUT, shape=(M, K)) + B = Array(role=Array.Role.INPUT, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, shape=(M, N)) + + # Create nest0 and schedule + nest0 = Nest(shape=(M, N, K)) + i0, j0, k0 = nest0.get_indices() + + @nest0.iteration_logic + def _(): + C[i0, j0] += A[i0, k0] * B[k0, j0] + + schedule0 = nest0.create_schedule() + ii0, jj0, kk0 = schedule0.tile({ i0: M_tile, j0: N_tile, k0: K_tile }) + schedule0.reorder(i0, j0, k0, ii0, jj0, kk0) + + # Create nest1 and schedule1 + nest1 = Nest(shape=(M, N, K)) + i1, j1, k1 = nest1.get_indices() + + @nest1.iteration_logic + def _(): + C[i1, j1] = C[i1, j1] * Scalar(0.2) + + schedule1 = nest1.create_schedule() + ii1, jj1, kk1 = schedule1.tile({ i1: M_tile, j1: N_tile, k1: K_tile }) + schedule1.reorder(i1, j1, k1, ii1, jj1, kk1) + + schedule = fuse((schedule0, schedule1), partial=3) + f, i, j, k, ii0, jj0, kk0, ii1, jj1, kk1 = schedule.get_indices() + schedule.reorder(i, j, k, f, ii0, jj0, kk0, ii1, jj1, kk1) + plan = schedule.create_plan() + plan._erase_loops([kk1]) + + # Create a package and add our function definition to it + package_name = "test_loop_erase_hack" + package = Package() + package.add(plan, args=(A, B, C), base_name="test_loop_erase_hack") + + # Build the HAT package + with verifiers.VerifyPackage(self, package_name, TEST_PACKAGE_DIR): + package.build(package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=TEST_PACKAGE_DIR) + + def test_dynamic_size_redundant_split(self) -> None: + package_name = "test_dynamic_size_redundant_split" + split_size = 32 + + m_extent = Dimension() + input_arr = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(m_extent,)) + output_arr = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(m_extent,)) + + nest = Nest((m_extent,)) + i = nest.get_indices() + @nest.iteration_logic + def _(): + output_arr[i] += input_arr[i] + + sched = nest.create_schedule() + ii = sched.split(i, split_size) + iii = sched.split(ii, 
split_size) + sched.reorder(i, ii, iii) + plan = sched.create_plan() + + # Create a package and add our function definition to it + package = Package() + + fn = package.add(plan, args=(m_extent, input_arr, output_arr), base_name=package_name) + + M_test = np.int64(67) + input_test = np.random.random((M_test,)).astype(np.float32) + output_test = np.random.random((M_test,)).astype(np.float32) + correctness_check_values = { + "pre": [M_test, input_test, output_test], + "post": [M_test, input_test, output_test + input_test], + } + + # Build the HAT package + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build(package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=output_dir, _quiet=False) + + v.check_correctness( + fn.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], + ) + if __name__ == '__main__': unittest.main(verbosity=10) diff --git a/accera/python/lib/src/ContainerTypes.cpp b/accera/python/lib/src/ContainerTypes.cpp index b4157f53..67c82e77 100644 --- a/accera/python/lib/src/ContainerTypes.cpp +++ b/accera/python/lib/src/ContainerTypes.cpp @@ -38,8 +38,11 @@ namespace .value("float32", value::ValueType::Float, "4 byte floating point") .value("float64", value::ValueType::Double, "8 byte floating point"); - py::enum_(subModule, "AllocateFlags", "An enumeration of allocation flags") + py::enum_(module, "AllocateFlags", "An enumeration of allocation flags") .value("NONE", value::AllocateFlags::None) + .value("GLOBAL", value::AllocateFlags::Global) + .value("STACK", value::AllocateFlags::Stack) + .value("HEAP", value::AllocateFlags::Heap) .value("THREAD_LOCAL", value::AllocateFlags::ThreadLocal); } @@ -154,6 +157,8 @@ General constructor. 
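# Illustrative note (Python side, sketch only): the GLOBAL/STACK/HEAP values bound here are what the
# tests above pass when declaring TEMP arrays to control where the temporary buffer is allocated, e.g.:
#     from accera import Array, ScalarType, AllocateFlags
#     scratch = Array(role=Array.Role.TEMP, element_type=ScalarType.float32,
#                     shape=(16, 16), flags=AllocateFlags.HEAP)    # or AllocateFlags.STACK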
module.def("cast", [](value::Scalar s, value::ValueType type) { return value::Cast(s, type); }); + module.def("round", &value::Round); + module.def("remainderf", &value::Remainderf); } void DefineArrayClass(py::module& module) diff --git a/accera/python/lib/src/ExecutionPlanTypes.cpp b/accera/python/lib/src/ExecutionPlanTypes.cpp index cadf661b..60f303d5 100644 --- a/accera/python/lib/src/ExecutionPlanTypes.cpp +++ b/accera/python/lib/src/ExecutionPlanTypes.cpp @@ -203,7 +203,8 @@ namespace .def("emit_runtime_init_packing", py::overload_cast(&value::Plan::EmitRuntimeInitPacking), "target"_a, "packing_func_name"_a, "packed_buf_size_func_name"_a, "indexing"_a = value::CacheIndexing::GlobalToPhysical) .def("pack_and_embed_buffer", py::overload_cast(&value::Plan::PackAndEmbedBuffer), "target"_a, "constant_data_buffer"_a, "wrapper_fn_name"_a, "packed_buffer_name"_a, "indexing"_a = value::CacheIndexing::GlobalToPhysical) .def("vectorize", &value::Plan::Vectorize, "i"_a, "vectorization_info"_a) - .def("parallelize", &value::Plan::Parallelize, "indices"_a, "num_threads"_a, "policy"_a); + .def("parallelize", &value::Plan::Parallelize, "indices"_a, "num_threads"_a, "policy"_a) + .def("_erase_loop", &value::Plan::_EraseLoop, "index"_a); py::class_(module, "_GPUExecutionPlan") .def(py::init([](value::GPUPlan& plan) { diff --git a/accera/python/lib/src/PackagingTypes.cpp b/accera/python/lib/src/PackagingTypes.cpp index 929dfb91..7548949e 100644 --- a/accera/python/lib/src/PackagingTypes.cpp +++ b/accera/python/lib/src/PackagingTypes.cpp @@ -106,12 +106,13 @@ ARM: fp16, neon, vfp3, d16, vfp4, hwdiv-arm, hwdiv .def(py::init(), "name"_a, "options"_a = value::CompilerOptions{}) .def( "Allocate", - [](value::MLIRContext& c, value::ValueType type, const util::MemoryLayout& layout, size_t alignment) { - return c.Allocate(type, layout, alignment); + [](value::MLIRContext& c, value::ValueType type, const util::MemoryLayout& layout, size_t alignment, value::AllocateFlags flags) { + return c.Allocate(type, layout, alignment, flags); }, "type"_a, "layout"_a, - "alignment"_a = 0) + "alignment"_a = 0, + "_flags"_a = value::AllocateFlags::None) .def("Print", &value::MLIRContext::print, "Prints the module") .def("Save", &value::MLIRContext::save, "filename"_a) .def("Verify", &value::MLIRContext::verify) @@ -160,6 +161,14 @@ Sets whether this function should be decorated (mangled) "inlinable"_a, py::return_value_policy::reference_internal, "Sets whether the function is allowed to be inlined.") + .def( + "inlinable_into", [](value::FunctionDeclaration& fn, bool inlinable_into) { + (void)fn.InlineInto(inlinable_into ? 
value::FunctionInlining::always : value::FunctionInlining::never); + return fn; + }, + "inlinable_into"_a, + py::return_value_policy::reference_internal, + "Sets whether other functions are allowed to be inlined into this function.") .def("addTag", &value::FunctionDeclaration::AddTag, "addTag"_a, py::return_value_policy::reference_internal, "A tag to add to a function as an attribute.") .def("baseName", &value::FunctionDeclaration::BaseName, "baseName"_a, py::return_value_policy::reference_internal, "Sets the base name for this function to use as an alias in the generated header file.") .def("outputVerifiers", &value::FunctionDeclaration::OutputVerifiers, "outputVerifiers"_a, py::return_value_policy::reference_internal, "Sets the verification functions for output checking, one per output argument.") diff --git a/accera/transforms/include/AcceraPasses.h b/accera/transforms/include/AcceraPasses.h index 6d9d913f..265aaa94 100644 --- a/accera/transforms/include/AcceraPasses.h +++ b/accera/transforms/include/AcceraPasses.h @@ -22,6 +22,7 @@ #include "value/ValueToLLVMLoweringPass.h" #include "value/ValueToStandardLoweringPass.h" +#include #include #include #include diff --git a/accera/transforms/include/AcceraPasses.td b/accera/transforms/include/AcceraPasses.td index 54e266d3..64ffbafd 100644 --- a/accera/transforms/include/AcceraPasses.td +++ b/accera/transforms/include/AcceraPasses.td @@ -257,6 +257,7 @@ def ConvertValueToLLVM : accModulePass<"value-to-llvm"> { let constructor = "accera::transforms::value::createValueToLLVMPass()"; let dependentDialects = [ "mlir::StandardOpsDialect", + "accera::ir::intrinsics::AcceraIntrinsicsDialect", "mlir::LLVM::LLVMDialect" ]; // Match std-to-llvm options so we can pass through arguments diff --git a/accera/transforms/include/affine/AffineSimplifications.h b/accera/transforms/include/affine/AffineSimplifications.h index 6705c2f4..f50e3098 100644 --- a/accera/transforms/include/affine/AffineSimplifications.h +++ b/accera/transforms/include/affine/AffineSimplifications.h @@ -16,6 +16,7 @@ using OwningRewritePatternList = RewritePatternSet; namespace accera::transforms::affine { -void populateAcceraAffineSimplificationPatterns(mlir::OwningRewritePatternList& patterns); +void populateAcceraAffineExprSimplificationPatterns(mlir::OwningRewritePatternList& patterns); +void populateAcceraAffineLoopSimplificationPatterns(mlir::OwningRewritePatternList& patterns); std::unique_ptr createAffineSimplificationPass(); } // namespace accera::transforms::affine diff --git a/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h b/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h index 12eb804b..dfcdebe9 100644 --- a/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h +++ b/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h @@ -28,6 +28,7 @@ void populateExecutionPlanAdjustHierarchicalCacheRegionPositionPatterns(mlir::Re void populateExecutionPlanAdjustCacheMappingPositionPatterns(mlir::RewritePatternSet& patterns); void populateExecutionPlanMaxElementCacheRegionPatterns(mlir::RewritePatternSet& patterns); void populateExecutionPlanVectorizePatterns(bool printVectorizationDetails, mlir::RewritePatternSet& patterns); +void populateExecutionPlanVectorizeUnrollPatterns(bool printVectorizationDetails, mlir::RewritePatternSet& patterns); void populateExecutionPlanTensorizePatterns(mlir::RewritePatternSet& patterns); void populateExecutionPlanParallelizePatterns(mlir::RewritePatternSet& patterns); void 
populateExecutionPlanScaleHoistingPatterns(mlir::RewritePatternSet& patterns); diff --git a/accera/transforms/include/util/RangeValueUtilities.h b/accera/transforms/include/util/RangeValueUtilities.h index 3a8caccf..cc06fd1a 100644 --- a/accera/transforms/include/util/RangeValueUtilities.h +++ b/accera/transforms/include/util/RangeValueUtilities.h @@ -74,7 +74,9 @@ class RangeValueAnalysis RangeValue resolveRangeValue(mlir::gpu::GridDimOp op); RangeValue resolveRangeValue(accera::ir::value::WarpIdOp op); RangeValue resolveRangeValue(llvm::Instruction::BinaryOps binOp, mlir::Operation* op); + RangeValue resolveRangeValue(llvm::Instruction::BinaryOps binOp, const llvm::SmallVectorImpl& operandRanges); RangeValue resolveRangeValue(mlir::AffineForOp op); + RangeValue resolveRangeValue(mlir::AffineApplyOp op); RangeValue resolveRangeValue(mlir::scf::ForOp op); RangeValue resolveRangeValue(mlir::Operation* op); }; diff --git a/accera/transforms/include/util/VectorizationUtil.h b/accera/transforms/include/util/VectorizationUtil.h index b6dcb812..f78ce303 100644 --- a/accera/transforms/include/util/VectorizationUtil.h +++ b/accera/transforms/include/util/VectorizationUtil.h @@ -42,8 +42,10 @@ class VectorizedOpMap std::map _vectorizedOps; }; -mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, - mlir::PatternRewriter& rewriter); + +mlir::LogicalResult TryVectorizeKnownSubgraph(mlir::AffineForOp affineForOp, + mlir::PatternRewriter& rewriter); + std::optional VectorizeOp(mlir::PatternRewriter& rewriter, mlir::Operation* op, diff --git a/accera/transforms/include/value/RangeValueOptimizePass.h b/accera/transforms/include/value/RangeValueOptimizePass.h index 6557b681..7a201fb1 100644 --- a/accera/transforms/include/value/RangeValueOptimizePass.h +++ b/accera/transforms/include/value/RangeValueOptimizePass.h @@ -12,9 +12,12 @@ namespace mlir { class Pass; +class RewritePatternSet; } // namespace mlir namespace accera::transforms::value { +void populateRangeValueOptimizePatterns(mlir::RewritePatternSet& patterns); + std::unique_ptr createRangeValueOptimizePass(); } // namespace accera::transforms::value diff --git a/accera/transforms/src/AcceraPasses.cpp b/accera/transforms/src/AcceraPasses.cpp index 920de35a..fbb729dc 100644 --- a/accera/transforms/src/AcceraPasses.cpp +++ b/accera/transforms/src/AcceraPasses.cpp @@ -151,6 +151,7 @@ void addAcceraToLLVMPassPipeline(OpPassManager& pm, const AcceraPassPipelineOpti pmAdaptor.addPass(value::createValueFuncToTargetPass()); pmAdaptor.addPass(createSymbolDCEPass()); + pmAdaptor.addPass(affine::createAffineSimplificationPass()); auto funcOpPM = pmAdaptor.nestPassManager([&]() -> OpPassManager& { return pm.nest().nest(); }); funcOpPM.addPass(createConvertLinalgToAffineLoopsPass()); diff --git a/accera/transforms/src/affine/AffineSimplifications.cpp b/accera/transforms/src/affine/AffineSimplifications.cpp index 363add2b..64d249bd 100644 --- a/accera/transforms/src/affine/AffineSimplifications.cpp +++ b/accera/transforms/src/affine/AffineSimplifications.cpp @@ -7,11 +7,14 @@ #include "affine/AffineSimplifications.h" #include "util/RangeValueUtilities.h" +#include "nest/LoopNestToValue.h" +#include "value/RangeValueOptimizePass.h" #include #include #include +#include #include #include #include @@ -242,17 +245,32 @@ mlir::AffineExpr RunOnBinaryOpSubExpr(mlir::AffineExprKind exprKind, mlir::Affin mlir::AffineValueMap GetAffineValueMap(mlir::AffineStoreOp& storeOp) { - return mlir::AffineValueMap(storeOp.getAffineMap(), storeOp.getOperands()); + 
return mlir::AffineValueMap(storeOp.getAffineMap(), storeOp.getMapOperands()); } mlir::AffineValueMap GetAffineValueMap(mlir::AffineLoadOp& loadOp) { - return mlir::AffineValueMap(loadOp.getAffineMap(), loadOp.getOperands()); + return mlir::AffineValueMap(loadOp.getAffineMap(), loadOp.getMapOperands()); } mlir::AffineValueMap GetAffineValueMap(mlir::AffineApplyOp& applyOp) { return applyOp.getAffineValueMap(); } +template +bool AllOperandDefsAreInScope(AffineOpTy op) +{ + auto operands = op.getMapOperands(); + for (auto operand : operands) + { + mlir::Operation* defOp = GetDefiningOpOrForLoop(operand); + if (defOp == nullptr) + { + return false; + } + } + return true; +} + void ReplaceOpUsingNewValueMap(PatternRewriter& rewriter, mlir::AffineLoadOp loadOp, mlir::AffineValueMap newAffineValueMap) { rewriter.replaceOpWithNewOp(loadOp, loadOp.memref(), newAffineValueMap.getAffineMap(), newAffineValueMap.getOperands()); @@ -275,6 +293,11 @@ struct SmallNumeratorTermFloorDivSimplification : public OpRewritePattern { // See docs/Reference/gpu_caching_mod.md for a proof of the equivalence this simplification leverages + if (!AllOperandDefsAreInScope(affineOp)) + { + return failure(); + } + AffineSimplifyHelper helper(affineOp); auto loc = affineOp.getLoc(); @@ -487,6 +515,11 @@ struct PropagateGPUConstants : public OpRewritePattern LogicalResult matchAndRewrite(AffineOpTy affineOp, PatternRewriter& rewriter) const final { + if (!AllOperandDefsAreInScope(affineOp)) + { + return failure(); + } + auto loc = affineOp.getLoc(); std::vector opsToErase; @@ -499,15 +532,21 @@ struct PropagateGPUConstants : public OpRewritePattern { auto handleBlockDimOp = [&](gpu::BlockDimOp blockDimOp) { auto dimSize = GetBlockDimSize(blockDimOp); - mlir::Value dimSizeConstantOp = rewriter.create(loc, dimSize, rewriter.getI64Type()); - affineOp->replaceUsesOfWith(operand, dimSizeConstantOp); - replaced = true; + if (dimSize.has_value()) + { + mlir::Value dimSizeConstantOp = rewriter.create(loc, *dimSize, rewriter.getI64Type()); + affineOp->replaceUsesOfWith(operand, dimSizeConstantOp); + replaced = true; + } }; auto handleGridDimOp = [&](gpu::GridDimOp gridDimOp) { auto dimSize = GetGridDimSize(gridDimOp); - mlir::Value dimSizeConstantOp = rewriter.create(loc, dimSize, rewriter.getI64Type()); - affineOp->replaceUsesOfWith(operand, dimSizeConstantOp); - replaced = true; + if (dimSize.has_value()) + { + mlir::Value dimSizeConstantOp = rewriter.create(loc, *dimSize, rewriter.getI64Type()); + affineOp->replaceUsesOfWith(operand, dimSizeConstantOp); + replaced = true; + } }; mlir::TypeSwitch(definingOp) .Case([&](gpu::BlockDimOp blockDimOp) { @@ -542,6 +581,67 @@ struct PropagateGPUConstants : public OpRewritePattern } }; +struct AffineForOpSimplifyBounds : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + LogicalResult matchAndRewrite(AffineForOp affineForOp, PatternRewriter& rewriter) const final + { + RangeValueAnalysis rangeValue; + + auto lowerBound = affineForOp.getLowerBound(); + auto initLowerBoundMap = lowerBound.getMap(); + std::vector lowerBoundOperands(lowerBound.operandBegin(), lowerBound.operandEnd()); + mlir::AffineValueMap lowerBoundValueMap(initLowerBoundMap, lowerBoundOperands); + lowerBoundValueMap = SimplifyAffineValueMap(lowerBoundValueMap); + auto simplifiedLowerBoundMap = lowerBoundValueMap.getAffineMap(); + + auto upperBound = affineForOp.getUpperBound(); + auto initUpperBoundMap = upperBound.getMap(); + std::vector upperBoundOperands(upperBound.operandBegin(), 
upperBound.operandEnd()); + mlir::AffineValueMap upperBoundValueMap(initUpperBoundMap, upperBoundOperands); + upperBoundValueMap = SimplifyAffineValueMap(upperBoundValueMap); + auto simplifiedUpperBoundMap = upperBoundValueMap.getAffineMap(); + + rewriter.updateRootInPlace(affineForOp, [&]() + { + if (simplifiedLowerBoundMap.isSingleConstant()) + { + auto lowerBoundConst = simplifiedLowerBoundMap.getSingleConstantResult(); + affineForOp.setConstantLowerBound(lowerBoundConst); + } + else + { + affineForOp.setUpperBound(upperBoundValueMap.getOperands(), upperBoundValueMap.getAffineMap()); + } + + if (simplifiedUpperBoundMap.isSingleConstant()) + { + auto upperBoundConst = simplifiedUpperBoundMap.getSingleConstantResult(); + affineForOp.setConstantUpperBound(upperBoundConst); + } + else + { + affineForOp.setLowerBound(lowerBoundValueMap.getOperands(), lowerBoundValueMap.getAffineMap()); + } + }); + + if (affineForOp.hasConstantBounds()) + { + auto constantTripCountOpt = mlir::getConstantTripCount(affineForOp); + if (constantTripCountOpt.getValue() == 0) + { + rewriter.eraseOp(affineForOp); + return success(); + } + return PromoteIfSingleIteration(rewriter, affineForOp); + } + + // Didn't remove the loop, but possibly modified it. Let another rewrite try to simplify it + return failure(); + } +}; + struct AffineSimplificationPass : public accera::transforms::AcceraAffineSimplificationBase { void runOnOperation() final @@ -549,12 +649,24 @@ struct AffineSimplificationPass : public accera::transforms::AcceraAffineSimplif auto* context = &getContext(); auto op = getOperation(); - mlir::GreedyRewriteConfig singleIterationConfig; - singleIterationConfig.maxIterations = 1; + { + mlir::GreedyRewriteConfig singleIterationConfig; + singleIterationConfig.maxIterations = 1; + + OwningRewritePatternList patterns(context); + accera::transforms::affine::populateAcceraAffineExprSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(op, std::move(patterns), singleIterationConfig); + } - OwningRewritePatternList patterns(context); - accera::transforms::affine::populateAcceraAffineSimplificationPatterns(patterns); - (void)applyPatternsAndFoldGreedily(op, std::move(patterns), singleIterationConfig); + // Apply RangeValueOptimize and affine value map simplification to try to simplify possibly-dynamic loop bounds + { + mlir::GreedyRewriteConfig topDownConfig; + topDownConfig.useTopDownTraversal = true; + + OwningRewritePatternList patterns(context); + accera::transforms::affine::populateAcceraAffineLoopSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(op, std::move(patterns), topDownConfig); + } } }; @@ -562,7 +674,8 @@ struct AffineSimplificationPass : public accera::transforms::AcceraAffineSimplif namespace accera::transforms::affine { -void populateAcceraAffineSimplificationPatterns(mlir::OwningRewritePatternList& patterns) + +void populateAcceraAffineExprSimplificationPatterns(mlir::OwningRewritePatternList& patterns) { patterns.insert>(patterns.getContext()); patterns.insert>(patterns.getContext()); @@ -575,6 +688,12 @@ void populateAcceraAffineSimplificationPatterns(mlir::OwningRewritePatternList& patterns.insert>(patterns.getContext()); } +void populateAcceraAffineLoopSimplificationPatterns(mlir::OwningRewritePatternList& patterns) +{ + patterns.insert(patterns.getContext()); + accera::transforms::value::populateRangeValueOptimizePatterns(patterns); +} + std::unique_ptr createAffineSimplificationPass() { return std::make_unique(); diff --git 
a/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp b/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp index 6ee9dd80..4806c6a4 100644 --- a/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp +++ b/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp @@ -5910,12 +5910,12 @@ LogicalResult VectorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine return failure(); } - // First, match and rewrite the special case for vectorizing int16 matmul - auto result = vectorizeInt16MatMul(affineForOp, rewriter); - if (succeeded(result)) + // First, check if we have a custom match and rewrite pattern for this exact subgraph + auto knownSubgraphResult = TryVectorizeKnownSubgraph(affineForOp, rewriter); + if (succeeded(knownSubgraphResult)) { RemoveVectorizationInfo(affineForOp); - return result; + return knownSubgraphResult; } auto vectorInfo = GetVectorizationInfo(affineForOp); @@ -5935,6 +5935,23 @@ LogicalResult VectorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine return success(); } + // If this isn't the innermost loop in the nest and we don't have custom handling for this pattern, + // then in-place unroll the loops between this loop and the innermost loop and vectorize the innermost loop + SmallVector nestedLoops; + mlir::getPerfectlyNestedLoops(nestedLoops, affineForOp); + if (nestedLoops.size() > 1) + { + RemoveVectorizationInfo(affineForOp); + for (unsigned loopIdx = 0; loopIdx < nestedLoops.size() - 1; loopIdx++) + { + InPlaceUnrollInfo inPlaceUnrollInfo{ 0 }; // 0 for full unroll + SetInPlaceUnrollInfo(nestedLoops[loopIdx], inPlaceUnrollInfo); + } + auto vecInfoAttr = VectorizationInfoAttr::get(vectorInfo, rewriter.getContext()); + nestedLoops[nestedLoops.size() - 1]->setAttr(VectorizationInfoAttr::getKeyName(), vecInfoAttr); + return failure(); + } + auto affineForOpIV = affineForOp.getInductionVar(); if (affineForOpIV.use_empty()) @@ -7850,6 +7867,7 @@ void ExecutionPlanVectorizationPass::runOnOperation() RewritePatternSet patterns(&getContext()); accera::transforms::executionPlan::populateExecutionPlanVectorizePatterns(printVecOpDetails, patterns); + accera::transforms::executionPlan::populateExecutionPlanVectorizeUnrollPatterns(printVecOpDetails, patterns); (void)applyPatternsAndFoldGreedily(operation, std::move(patterns)); } @@ -8073,8 +8091,12 @@ void populateExecutionPlanAdjustCacheMappingPositionPatterns(mlir::RewritePatter void populateExecutionPlanVectorizePatterns(bool printVectorizationDetails, mlir::RewritePatternSet& patterns) { - patterns.insert(patterns.getContext(), printVectorizationDetails); + patterns.insert(patterns.getContext(), printVectorizationDetails); +} + +void populateExecutionPlanVectorizeUnrollPatterns(bool printVectorizationDetails, mlir::RewritePatternSet& patterns) +{ + patterns.insert(patterns.getContext(), printVectorizationDetails); } void populateExecutionPlanTensorizePatterns(mlir::RewritePatternSet& patterns) diff --git a/accera/transforms/src/nest/LoopNestToValue.cpp b/accera/transforms/src/nest/LoopNestToValue.cpp index 275750ad..088e2a63 100644 --- a/accera/transforms/src/nest/LoopNestToValue.cpp +++ b/accera/transforms/src/nest/LoopNestToValue.cpp @@ -814,7 +814,19 @@ LogicalResult ScheduledLoopOpRewrite::matchAndRewrite(ScheduledLoopOp op, Patter auto scheduledLoopOpAttrs = op->getAttrs(); for (auto& attr : scheduledLoopOpAttrs) { - bodyLoop->setAttr(attr.getName(), attr.getValue()); + // HACK: Don't copy the domain attribute in case we later inline a dynamically-sized 
domain into a statically-sized region and the domain doesn't adjust correctly for serialization + // (we also no longer need the domain after building out the loopnest) + if (attr.getName() != "domain") + { + bodyLoop->setAttr(attr.getName(), attr.getValue()); + } + } + // Hack for erasing loops + if (bodyLoop->hasAttr("_erase")) + { + bodyLoop.setConstantLowerBound(0); + bodyLoop.setConstantUpperBound(1); + bodyLoop.setStep(1); } auto bodyLoopRegion = &bodyLoop.region(); diff --git a/accera/transforms/src/nest/LoopNestToValueFunc.cpp b/accera/transforms/src/nest/LoopNestToValueFunc.cpp index be5f7840..0a76b3cc 100644 --- a/accera/transforms/src/nest/LoopNestToValueFunc.cpp +++ b/accera/transforms/src/nest/LoopNestToValueFunc.cpp @@ -281,7 +281,7 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB { RewritePatternSet patterns(context); - affinetr::populateAcceraAffineSimplificationPatterns(patterns); + affinetr::populateAcceraAffineExprSimplificationPatterns(patterns); (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns), singleIterationConfig); snapshotter.Snapshot("AcceraAffineSimplification", vFuncOp); } @@ -308,6 +308,21 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB snapshotter.Snapshot("ExecutionPlanVectorize_Canonicalize", vFuncOp); } + { + RewritePatternSet patterns(context); + tr::populateLoopSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); + snapshotter.Snapshot("LoopSimplification", vFuncOp); + } + + { + RewritePatternSet patterns(context); + xptr::populateExecutionPlanVectorizeUnrollPatterns(printVecOpDetails, patterns); + utilir::FillCanonicalPatternsRecursively(vFuncOp, patterns); + (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); + snapshotter.Snapshot("ExecutionPlanVectorizeUnroll_Canonicalize", vFuncOp); + } + { RewritePatternSet patterns(context); tr::populateLoopOptimizationPatterns(patterns); diff --git a/accera/transforms/src/util/RangeValueUtilities.cpp b/accera/transforms/src/util/RangeValueUtilities.cpp index 50e7fc53..b328e1f8 100644 --- a/accera/transforms/src/util/RangeValueUtilities.cpp +++ b/accera/transforms/src/util/RangeValueUtilities.cpp @@ -41,19 +41,31 @@ namespace RangeValue resolveThreadIdRange(Operation* op, gpu::Dimension dimId) { auto upperBound = GetBlockDimSize(op, dimId); - return RangeValue(0, upperBound - 1); // -1 because RangeValue will add 1 to the upper bound and the thread id never takes on the upperBound value + if (upperBound.has_value()) + { + return RangeValue(0, *upperBound - 1); // -1 because RangeValue will add 1 to the upper bound and the thread id never takes on the upperBound value + } + return RangeValue(); } RangeValue resolveBlockIdRange(Operation* op, gpu::Dimension dimId) { auto upperBound = GetGridDimSize(op, dimId); - return RangeValue(0, upperBound - 1); // -1 because RangeValue will add 1 to the upper bound and the block id never takes on the upperBound value + if (upperBound.has_value()) + { + return RangeValue(0, *upperBound - 1); // -1 because RangeValue will add 1 to the upper bound and the block id never takes on the upperBound value + } + return RangeValue(); } RangeValue resolveGridDimRange(Operation* op, gpu::Dimension dimId) { auto upperBound = GetGridDimSize(op, dimId); - return RangeValue(upperBound, upperBound); + if (upperBound.has_value()) + { + return RangeValue(*upperBound, *upperBound); + } + return RangeValue(); } } // namespace @@ -285,7 +297,12 @@ 
RangeValue RangeValueAnalysis::resolveRangeValue(mlir::gpu::GridDimOp op) RangeValue RangeValueAnalysis::resolveRangeValue(WarpIdOp op) { const mlir::gpu::Dimension dim{ op.dimension() }; - auto upperBound = GetBlockDimSize(op, dim); + auto upperBoundOpt = GetBlockDimSize(op, dim); + if (!upperBoundOpt.has_value()) + { + return RangeValue(); + } + auto upperBound = *upperBoundOpt; if (dim == mlir::gpu::Dimension::x) { auto [warpSizeX, warpSizeY] = ResolveWarpSize(ResolveExecutionRuntime(op)).value(); @@ -298,11 +315,114 @@ RangeValue RangeValueAnalysis::resolveRangeValue(WarpIdOp op) RangeValue RangeValueAnalysis::resolveRangeValue(Instruction::BinaryOps binOp, mlir::Operation* op) { auto operands = resolveOperands(op); + return resolveRangeValue(binOp, operands); +} + +RangeValue RangeValueAnalysis::resolveRangeValue(Instruction::BinaryOps binOp, const llvm::SmallVectorImpl& operands) +{ return operands[0].binaryOp(binOp, operands[1]); } + +RangeValue RangeValueAnalysis::resolveRangeValue(AffineApplyOp op) +{ + auto affineValueMap = util::AffineApplyToAffineValueMap(op); + auto simplified = util::SimplifyAffineValueMap(affineValueMap); + auto map = simplified.getAffineMap(); + assert(map.getNumResults() == 1 && "Affine apply can't have multiple expressions"); + auto expr = map.getResult(0); + auto operands = simplified.getOperands(); + for (auto operand : operands) + { + if (!hasRange(operand)) + { + if (auto defOp = GetDefiningOpOrForLoop(operand)) + { + addOperation(defOp); + } + } + } + std::vector dimOperands(operands.begin(), operands.begin() + map.getNumDims()); + std::vector symbolOperands(operands.begin() + map.getNumDims(), operands.end()); + mlir::DenseMap subExprRanges; + // Post-order traversal of the expression tree + expr.walk([&](mlir::AffineExpr subExpr) { + if (auto dimExpr = subExpr.dyn_cast()) + { + auto idx = dimExpr.getPosition(); + auto rv = getRange(dimOperands[idx]); + subExprRanges.insert({ subExpr, rv }); + } + if (auto symExpr = subExpr.dyn_cast()) + { + auto idx = symExpr.getPosition(); + auto rv = getRange(symbolOperands[idx]); + subExprRanges.insert({ subExpr, rv }); + } + if (auto constExpr = subExpr.dyn_cast()) + { + RangeValue rv(constExpr.getValue(), constExpr.getValue()); + subExprRanges.insert({ subExpr, rv }); + } + if (auto binOpExpr = subExpr.dyn_cast()) + { + auto lhs = binOpExpr.getLHS(); + auto rhs = binOpExpr.getRHS(); + auto lhsIt = subExprRanges.find(lhs); + assert(lhsIt != subExprRanges.end()); + auto lhsRv = lhsIt->second; + auto rhsIt = subExprRanges.find(rhs); + assert(rhsIt != subExprRanges.end()); + auto rhsRv = rhsIt->second; + + Instruction::BinaryOps llvmBinOp; + switch (binOpExpr.getKind()) + { + case mlir::AffineExprKind::Add: + llvmBinOp = Instruction::BinaryOps::Add; + break; + case mlir::AffineExprKind::Mul: + llvmBinOp = Instruction::BinaryOps::Mul; + break; + case mlir::AffineExprKind::Mod: + llvmBinOp = Instruction::BinaryOps::SRem; + break; + case mlir::AffineExprKind::FloorDiv: + llvmBinOp = Instruction::BinaryOps::SDiv; + break; + case mlir::AffineExprKind::CeilDiv: + assert(false); // Unsupported currently - no matching llvm bin op + break; + default: + assert(false); + break; + } + llvm::SmallVector operandRanges{ lhsRv, rhsRv }; + auto rv = resolveRangeValue(llvmBinOp, operandRanges); + subExprRanges.insert({ subExpr, rv }); + } + }); + + // Find the root expr in the map and return its computed RangeValue + auto it = subExprRanges.find(expr); + assert(it != subExprRanges.end()); + return it->second; +} + RangeValue 
RangeValueAnalysis::resolveRangeValue(AffineForOp op) { - return op.hasConstantBounds() ? RangeValue(op.getConstantLowerBound(), op.getConstantUpperBound() - op.getStep()) : RangeValue(); + if (op.hasConstantBounds()) + { + auto lb = op.getConstantLowerBound(); + auto ub = op.getConstantUpperBound(); + auto step = op.getStep(); + + auto range = ub - lb; + auto remainder = range % step; + auto largestInductionVarValue = (remainder > 0) ? (ub - remainder) : (ub - step); + + return RangeValue(lb, largestInductionVarValue); + } + return RangeValue(); } RangeValue RangeValueAnalysis::resolveRangeValue(scf::ForOp op) { @@ -314,7 +434,22 @@ RangeValue RangeValueAnalysis::resolveRangeValue(scf::ForOp op) RangeValue lowerBound = resolveRangeValue(op.getLowerBound().getDefiningOp()); RangeValue upperBound = resolveRangeValue(op.getUpperBound().getDefiningOp()); - return lowerBound.isConstant() && upperBound.isConstant() ? RangeValue(lowerBound.range.getLower(), upperBound.range.getUpper() - 1) : RangeValue(); + RangeValue stepSize = resolveRangeValue(op.getStep().getDefiningOp()); + + bool isConstantRangeStep = lowerBound.isConstant() && upperBound.isConstant() && stepSize.isConstant(); + if (isConstantRangeStep) + { + auto lb = lowerBound.range.getLower(); + auto ub = upperBound.range.getUpper(); + auto step = stepSize.range.getLower(); + + auto range = ub - lb; + auto remainder = range.srem(step); + auto largestInductionVarValue = (remainder.sgt(0)) ? (ub - remainder) : (ub - step); + + return RangeValue(lb, largestInductionVarValue); + } + return RangeValue(); } RangeValue RangeValueAnalysis::resolveRangeValue(mlir::Operation* op) { @@ -335,6 +470,7 @@ RangeValue RangeValueAnalysis::resolveRangeValue(mlir::Operation* op) .Case([&](arith::DivUIOp op) { return resolveRangeValue(Instruction::BinaryOps::UDiv, op); }) .Case([&](scf::ForOp op) { return resolveRangeValue(op); }) .Case([&](AffineForOp op) { return resolveRangeValue(op); }) + .Case([&](AffineApplyOp op) { return resolveRangeValue(op); }) .Default([&](mlir::Operation*) { return RangeValue(); }); } diff --git a/accera/transforms/src/util/VectorizationUtil.cpp b/accera/transforms/src/util/VectorizationUtil.cpp index 9558ce24..84c91987 100644 --- a/accera/transforms/src/util/VectorizationUtil.cpp +++ b/accera/transforms/src/util/VectorizationUtil.cpp @@ -38,6 +38,9 @@ namespace v = accera::ir::value; #define DEBUG_TYPE "vectorization-util" +// TODO : plumb through a sufficient target enum / bitmap so we can dynamically enable/disable vpmaddwd and other pattern matchers +#define MATCH_VPMADDWD_INTRINSIC 1 + namespace accera::transforms { @@ -123,6 +126,8 @@ bool CanVectorizeOp(mlir::Operation* op, .Case([](mlir::math::AbsOp) { return true; }) // .Case([&](mlir::AffineApplyOp) { return true; }) // TODO: either enable or remove this .Case([](mlir::math::ExpOp) { return true; }) + .Case([](v::CastOp) { return true; }) + .Case([vectorSize](v::RoundOp) { return v::RoundOp::SupportsVectorization(vectorSize); }) .Case([](v::BitcastOp) { return true; }) .Case([](v::BinOp) { return true; }) .Case([](v::CmpOp) { return true; }) @@ -263,19 +268,101 @@ std::optional VectorizeConstantOp(mlir::PatternRewriter& rewri return constVec; } +// TODO de-dupe some internals with GetConstantStrideBetweenUnrolledAccesses +template +std::optional GetConstantStrideBetweenAccesses(mlir::PatternRewriter& rewriter, + LhsOpType lhsAccessOp, + RhsOpType rhsAccessOp) +{ + std::stack tempOps; + ir::util::TempOpCleanupGuard tempOpGuard(&tempOps, rewriter); + + auto 
lhsAccessMapComposition = ir::util::GetIndexToMemoryLocationMap(rewriter.getContext(), lhsAccessOp); + auto rhsAccessMapComposition = ir::util::GetIndexToMemoryLocationMap(rewriter.getContext(), rhsAccessOp); + + // For dynamically shaped memrefs, currently we only handle identity-mapped memrefs, + // general dynamic memref support will come later. + auto lhsMemRefType = lhsAccessOp.memref().getType().template cast(); + if (!lhsMemRefType.hasStaticShape()) + { + if (!ir::util::HasIdentityLayout(lhsAccessOp.memref())) + { + return std::nullopt; + } + } + + auto rhsMemRefType = rhsAccessOp.memref().getType().template cast(); + if (!rhsMemRefType.hasStaticShape()) + { + if (!ir::util::HasIdentityLayout(rhsAccessOp.memref())) + { + return std::nullopt; + } + } + + // Re-check if there is no static shape and collect the symbols now that we know we won't be returning std::nullopt + // because ir::util::GetIdentityMemrefStrideSymbols() does a non-trivial amount of work that me may as well not waste + std::vector lhsStrideSymbols; + std::vector rhsStrideSymbols; + if (!lhsMemRefType.hasStaticShape()) + { + lhsStrideSymbols = ir::util::GetIdentityMemrefStrideSymbols(rewriter, lhsAccessOp.getLoc(), lhsAccessOp.memref()); + } + if (!rhsMemRefType.hasStaticShape()) + { + rhsStrideSymbols = ir::util::GetIdentityMemrefStrideSymbols(rewriter, rhsAccessOp.getLoc(), rhsAccessOp.memref()); + } + + std::vector lhsIndicesVec(lhsAccessOp.indices().begin(), lhsAccessOp.indices().end()); + std::vector rhsIndicesVec(rhsAccessOp.indices().begin(), rhsAccessOp.indices().end()); + + // Append any dynamic stride symbols since we're dealing with a flattened layout map + lhsIndicesVec.insert(lhsIndicesVec.end(), lhsStrideSymbols.begin(), lhsStrideSymbols.end()); + rhsIndicesVec.insert(rhsIndicesVec.end(), rhsStrideSymbols.begin(), rhsStrideSymbols.end()); + + auto lhsAccess = ir::util::MultiDimAffineApply(rewriter, lhsAccessOp.getLoc(), lhsAccessMapComposition, lhsIndicesVec); + auto rhsAccess = ir::util::MultiDimAffineApply(rewriter, rhsAccessOp.getLoc(), rhsAccessMapComposition, rhsIndicesVec); + assert(lhsAccess.size() == 1); + assert(rhsAccess.size() == 1); + tempOps.push(lhsAccess[0].getDefiningOp()); + tempOps.push(rhsAccess[0].getDefiningOp()); + + mlir::AffineExpr diffExpr = rewriter.getAffineDimExpr(1) - rewriter.getAffineDimExpr(0); + auto diffMap = mlir::AffineMap::get(2, 0, diffExpr); + + mlir::SmallVector compareAccesses{ lhsAccess[0], rhsAccess[0] }; + mlir::fullyComposeAffineMapAndOperands(&diffMap, &compareAccesses); + + assert(diffMap.getNumResults() == 1); + auto resultExpr = diffMap.getResult(0); + if (resultExpr.isa()) + { + auto constExpr = resultExpr.dyn_cast(); + return constExpr.getValue(); + } + + // There isn't a constant difference between memory accesses + return std::nullopt; +} + template -bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, - OpType op, - std::vector& laneMappings, - int64_t vectorSize) +std::optional GetConstantStrideBetweenUnrolledAccesses(mlir::PatternRewriter& rewriter, + OpType op, + std::vector& laneMappings, + int64_t vectorSize) { // Create some unrolled clones in-memory and see whether they are accessing memory-sequential elements in the MemRef + std::stack tempOps; + ir::util::TempOpCleanupGuard tempOpGuard(&tempOps, rewriter); + auto loc = op.getLoc(); std::vector temporaryClones; temporaryClones.reserve(vectorSize); for (int64_t i = 0; i < vectorSize; ++i) { - temporaryClones.push_back(mlir::dyn_cast(rewriter.clone(*op.getOperation(), 
laneMappings[i]))); + auto newTempOp = mlir::dyn_cast(rewriter.clone(*op.getOperation(), laneMappings[i])); + tempOps.push(newTempOp); // Useful for automatic cleanup + temporaryClones.push_back(newTempOp); // Needed for ordered comparison } // Check if the temporary clones are all accessing sequential memory @@ -289,12 +376,12 @@ bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, { if (!ir::util::HasIdentityLayout(op.memref())) { - return false; + return std::nullopt; } strideSymbols = ir::util::GetIdentityMemrefStrideSymbols(rewriter, loc, op.memref()); } - bool sequential = true; + std::optional stride = std::nullopt; for (int64_t unrollIdx = 1; unrollIdx < vectorSize; ++unrollIdx) { std::vector prevIndicesVec(temporaryClones[unrollIdx - 1].indices().begin(), temporaryClones[unrollIdx - 1].indices().end()); @@ -308,6 +395,8 @@ bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, auto currentAccess = ir::util::MultiDimAffineApply(rewriter, loc, accessMapComposition, currentIndicesVec); assert(prevAccess.size() == 1); assert(currentAccess.size() == 1); + tempOps.push(prevAccess[0].getDefiningOp()); + tempOps.push(currentAccess[0].getDefiningOp()); mlir::AffineExpr diffExpr = rewriter.getAffineDimExpr(1) - rewriter.getAffineDimExpr(0); auto diffMap = mlir::AffineMap::get(2, 0, diffExpr); @@ -320,31 +409,53 @@ bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, if (resultExpr.isa()) { auto constExpr = resultExpr.dyn_cast(); - if (constExpr.getValue() != 1) + if (!stride.has_value()) + { + stride = constExpr.getValue(); + } + else if (constExpr.getValue() != *stride) { - // There is a constant difference between sequential op memory accesses - // but the stride is not 1, so the memory isn't contiguous and therefore - // it's not safe to replace all of the memory ops with a single vector op - sequential = false; - break; + // The strides aren't consistent + return std::nullopt; } } else { // There isn't a constant difference between sequential op memory accesses - // so it's not necessarily safe to convert all of the memory ops into a single - // vector op - sequential = false; - break; + return std::nullopt; } } - // Clean up the temporary clones - for (auto& clone : temporaryClones) - { - rewriter.eraseOp(clone); - } - return sequential; + return stride; +} + +template +bool DoesUnrolledAccessHaveStride(mlir::PatternRewriter& rewriter, + OpType op, + std::vector& laneMappings, + int64_t vectorSize, + int64_t stride) +{ + auto strideOpt = GetConstantStrideBetweenUnrolledAccesses(rewriter, op, laneMappings, vectorSize); + return strideOpt.has_value() && *strideOpt == stride; +} + +template +bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, + OpType op, + std::vector& laneMappings, + int64_t vectorSize) +{ + return DoesUnrolledAccessHaveStride(rewriter, op, laneMappings, vectorSize, 1 /* stride */); +} + +template +bool IsUnrolledAccessConstant(mlir::PatternRewriter& rewriter, + OpType op, + std::vector& laneMappings, + int64_t vectorSize) +{ + return DoesUnrolledAccessHaveStride(rewriter, op, laneMappings, vectorSize, 0 /* stride */); } mlir::Value FlattenMemRefCast(mlir::OpBuilder& builder, mlir::Location loc, mlir::Value memref) @@ -488,6 +599,42 @@ std::optional VectorizeStoreOp(mlir::PatternRewriter& rewriter, } } +mlir::vector::LoadOp VectorizeAffineLoadOpHelper(mlir::PatternRewriter& rewriter, + mlir::AffineLoadOp op, + int64_t vectorSize) +{ + auto memRefType = op.getMemRefType(); + auto elementType = memRefType.getElementType(); + 
auto vectorType = mlir::VectorType::get({ vectorSize }, elementType); + mlir::AffineLoadOpAdaptor adaptor{ op }; + std::vector indices(adaptor.indices().begin(), adaptor.indices().end()); + + auto [flatCastMemRef, flattenedPos] = FlattenAccess(rewriter, op, indices); + return rewriter.create(op.getLoc(), vectorType, flatCastMemRef, mlir::ValueRange{ flattenedPos }); +} + +mlir::vector::StoreOp VectorizeAffineStoreOpHelper(mlir::PatternRewriter& rewriter, + mlir::AffineStoreOp op, + mlir::Value vecValToStore, + int64_t vectorSize) +{ + mlir::AffineStoreOpAdaptor adaptor{ op }; + std::vector indices(adaptor.indices().begin(), adaptor.indices().end()); + + auto [flatCastMemRef, flattenedPos] = FlattenAccess(rewriter, op, indices); + return rewriter.create(op.getLoc(), vecValToStore, flatCastMemRef, mlir::ValueRange{ flattenedPos }); +} + +mlir::vector::StoreOp VectorizeAffineStoreOpHelper(mlir::PatternRewriter& rewriter, + mlir::AffineStoreOp op, + mlir::BlockAndValueMapping valueMapping, + int64_t vectorSize) +{ + auto scalarStoreVal = op.getValueToStore(); + assert(valueMapping.contains(scalarStoreVal)); + return VectorizeAffineStoreOpHelper(rewriter, op, valueMapping.lookup(scalarStoreVal), vectorSize); +} + std::optional VectorizeAffineLoadOp(mlir::PatternRewriter& rewriter, mlir::AffineLoadOp op, const VectorizedOpMap& vectorizedOps, @@ -505,24 +652,34 @@ std::optional VectorizeAffineLoadOp(mlir::PatternRewriter& rewrite std::vector baseIndices(adaptor.indices().begin(), adaptor.indices().end()); mlir::Value result; - if (IsUnrolledAccessSequential(rewriter, op, laneMappings, vectorSize)) - { - // We know these reads are sequential, but mlir::vector::LoadOp only operates on memrefs where the minor - // dimension has unit stride, so cast the memref to a flat buffer and load from that shape - auto [flatCastMemref, flattenedPosition] = FlattenAccess(rewriter, op, baseIndices); - result = rewriter.create(op.getLoc(), vectorType, flatCastMemref, mlir::ValueRange{ flattenedPosition }); - } - else + auto strideOpt = GetConstantStrideBetweenUnrolledAccesses(rewriter, op, laneMappings, vectorSize); + if (strideOpt.has_value()) { - // Fall back to many loads and stores into a vector - auto zero = rewriter.create(loc, elementType, rewriter.getZeroAttr(elementType)); - result = rewriter.create(loc, vectorType, zero); - for (int64_t i = 0; i < vectorSize; ++i) + int64_t stride = *strideOpt; + if (stride == 1) { - auto elementLoad = rewriter.clone(*op.getOperation(), laneMappings[i]); - result = rewriter.create(loc, elementLoad->getResult(0), result, rewriter.create(loc, i)); + // We know these reads are sequential, but mlir::vector::LoadOp only operates on memrefs where the minor + // dimension has unit stride, so cast the memref to a flat buffer and load from that shape + auto [flatCastMemref, flattenedPosition] = FlattenAccess(rewriter, op, baseIndices); + result = rewriter.create(op.getLoc(), vectorType, flatCastMemref, mlir::ValueRange{ flattenedPosition }); + return result; + } + else if (stride == 0) + { + // Broadcast a single loaded element + auto clonedLoadOp = mlir::dyn_cast(rewriter.clone(*op.getOperation())); // The original op will likely get discarded as part of successful vectorization + result = rewriter.create(loc, vectorType, clonedLoadOp.getResult()); + return result; } } + // Fall back to many loads and stores into a vector + auto zero = rewriter.create(loc, elementType, rewriter.getZeroAttr(elementType)); + result = rewriter.create(loc, vectorType, zero); + for (int64_t i = 0; i 
< vectorSize; ++i) + { + auto elementLoad = rewriter.clone(*op.getOperation(), laneMappings[i]); + result = rewriter.create(loc, elementLoad->getResult(0), result, rewriter.create(loc, i)); + } return result; } @@ -534,16 +691,28 @@ std::optional VectorizeAffineStoreOp(mlir::PatternRewriter& rewrit int64_t step, int64_t vectorSize) { + [[maybe_unused]] auto loc = op.getLoc(); + // Get (vector) value to store from map mlir::AffineStoreOpAdaptor adaptor{ op }; auto scalarValue = op.getValueToStore(); - auto vecOp = vectorizedOps.Lookup(scalarValue.getDefiningOp()); + auto scalarValueDefOp = scalarValue.getDefiningOp(); + auto vecOp = vectorizedOps.Lookup(scalarValueDefOp); if (!vecOp) { - return std::nullopt; + if (mlir::isa(scalarValueDefOp)) + { + // If it's a constant being stored, just broadcast it to a vector and store that + auto vectorType = mlir::VectorType::get({ vectorSize }, scalarValue.getType()); + mlir::Value broadcastVal = rewriter.create(loc, vectorType, scalarValue); + vecOp = VectorizedOp(broadcastVal); + } + else + { + return std::nullopt; + } } - [[maybe_unused]] auto loc = op.getLoc(); auto memRefType = op.getMemRefType(); [[maybe_unused]] auto elementType = memRefType.getElementType(); @@ -647,6 +816,53 @@ std::optional VectorizeShiftLeftOp(mlir::PatternRewriter& rewr return result; } +// TODO : de-dupe with cast and other simple vectorizable ops +std::optional VectorizeAccRoundOp(mlir::PatternRewriter& rewriter, + v::RoundOp op, + const VectorizedOpMap& vectorizedOps, + std::vector& laneMappings, + mlir::Value inductionVar, + int64_t step, + int64_t vectorSize) +{ + // Get (vector) arguments from map + auto inputOp = op.val(); + auto input = GetVectorizedPredecessor(rewriter, inputOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); + if (!input) + { + return std::nullopt; + } + + auto loc = op.getLoc(); + auto scalarResultType = op.getResult().getType(); + auto resultType = mlir::VectorType::get({ vectorSize }, scalarResultType); + auto result = rewriter.create(loc, resultType, input->GetVectorResult()); + return result; +} + +std::optional VectorizeAccCastOp(mlir::PatternRewriter& rewriter, + v::CastOp op, + const VectorizedOpMap& vectorizedOps, + std::vector& laneMappings, + mlir::Value inductionVar, + int64_t step, + int64_t vectorSize) +{ + // Get (vector) arguments from map + auto inputOp = op.source(); + auto input = GetVectorizedPredecessor(rewriter, inputOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); + if (!input) + { + return std::nullopt; + } + + auto loc = op.getLoc(); + auto scalarResultType = op.getResult().getType(); + auto resultType = mlir::VectorType::get({ vectorSize }, scalarResultType); + auto result = rewriter.create(loc, resultType, input->GetVectorResult()); + return result; +} + std::optional VectorizeFPToSIOp(mlir::PatternRewriter& rewriter, mlir::arith::FPToSIOp op, const VectorizedOpMap& vectorizedOps, @@ -757,7 +973,23 @@ std::optional VectorizeBinOp(mlir::PatternRewriter& rewriter, assert(lhs->HasVectorType() == rhs->HasVectorType()); // TODO : do we need to support the case where one operand is a vector and the other is a series of unrolled values? 
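// Note on the branch below: when the operands were vectorized, MAX and MIN at the 8-lane
// vector width are emitted as dedicated vector max/min ops instead of going through the
// generic BinOp lowering, presumably so the backend can select a single packed max/min
// instruction; all other predicates (and other vector widths) fall through to the generic
// vectorized BinOp path.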
if (lhs->HasVectorType()) { - mlir::Value result = rewriter.create(loc, predicate, lhs->GetVectorResult(), rhs->GetVectorResult()); + mlir::Value result; + auto vectorTy = lhs->GetVectorResult().getType(); + if (vectorSize == 8) + { + // Special-case max and min for better codegen + if (predicate == v::BinaryOpPredicate::MAX) + { + result = rewriter.create(loc, vectorTy, lhs->GetVectorResult(), rhs->GetVectorResult()); + return result; + } + else if (predicate == v::BinaryOpPredicate::MIN) + { + result = rewriter.create(loc, vectorTy, lhs->GetVectorResult(), rhs->GetVectorResult()); + return result; + } + } + result = rewriter.create(loc, predicate, lhs->GetVectorResult(), rhs->GetVectorResult()); return result; } else @@ -905,9 +1137,15 @@ std::optional VectorizeOp(mlir::PatternRewriter& rewriter, .Case([&](v::CmpOp cmpOp) { return VectorizeCmpOp(rewriter, cmpOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); }) + .Case([&](v::CastOp castOp) { + return VectorizeAccCastOp(rewriter, castOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); + }) .Case([&](v::BitcastOp bitcastOp) { return VectorizeBitcastOp(rewriter, bitcastOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); }) + .Case([&](v::RoundOp roundOp) { + return VectorizeAccRoundOp(rewriter, roundOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); + }) .Case([&](v::ReferenceGlobalOp refGlobalOp) { return VectorizeReferenceGlobalOp(rewriter, refGlobalOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); }) @@ -928,161 +1166,1039 @@ std::optional VectorizeOp(mlir::PatternRewriter& rewriter, return resultOp; } -mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, - mlir::PatternRewriter& rewriter) +// TODO : support multi-dim vector reductions +mlir::LogicalResult vectorizeHorizontalReduction(mlir::AffineForOp affineForOp, mlir::PatternRewriter& rewriter) { + // Try to match a pattern like: + // for indices + // for i: + // x = load(input[..., i]) : memref -> T1 + // y = load(output[...]) : memref (doesn't depend on i) -> T1 + // z = x + y + // store(z, output[...]) : (same position as load) + + // And replace it with: + // flat_input = reinterpret_cast input to flat + // flat_output = reinterpret_cast output to flat + // x = vector_load(flat_input, flatten_input_pos(..., i)) : vector + // y = affine_load(output[...]) : T1 + // z = vector.reduction "add" + // affine_store(z, output[...]) + + // Note: the 'add' operation above can also be many other ops + // See enum values from /mlir/include/mlir/Dialect/Vector/IR/VectorOps.td + // e.g. add, mul, minui, minsi, minf, maxui, maxsi, maxf, and, or, xor + + // Also allow for the loaded values to be cast before the sum + + // So we need to check for the: + // - this affine for op is the innermost loop + // - the loop has constant bounds (TODO: relax this check) + // And the ops in the loop are: + // - loop-sequential load + // - loop-constant load from location Y + // - BinOp of the loaded values + // - store BinOp result to location Y + // Implement the matcher auto reportMatchFailure = [&](mlir::Operation* op, std::string message) -> LogicalResult { llvm::dbgs() << "While processing " << *op << ". 
" << message << "\n"; return rewriter.notifyMatchFailure(op, message); }; - std::stack matchedOps; + std::stack matchedOps; std::stack tempOps; + ir::util::TempOpCleanupGuard(&tempOps, rewriter); - // Match jj and kk loop in int16 matmul for vectorization rewrite rules SmallVector loops; mlir::getPerfectlyNestedLoops(loops, affineForOp); - if (loops.size() != 2) // there should be exactly 2 loops in the nest + if (loops.size() != 1) // there should be exactly 1 loops in the nest being vectorized { return failure(); } - for (auto& loop : loops) + // TODO : support dynamic loops that operate over contiguous memory + if (!affineForOp.hasConstantBounds() || affineForOp.getConstantLowerBound() != 0) { - if (!loop.hasConstantBounds() || loop.getConstantLowerBound() != 0) - { - return failure(); - } + return failure(); } - // order of nested loops we are looking for is - // jj {0 to 8} followed by kk {0 to 2} - auto outerLoop = loops.front(); // jj loop - int64_t jj_begin = outerLoop.getConstantLowerBound(); - int64_t jj_end = outerLoop.getConstantUpperBound(); - int64_t jj_step = outerLoop.getStep(); - int64_t jj_numIters = (jj_end - jj_begin) / jj_step; - if (jj_numIters != 8) - return failure(); - auto jj_inductionVar = outerLoop.getInductionVar(); + int64_t begin = affineForOp.getConstantLowerBound(); + int64_t end = affineForOp.getConstantUpperBound(); + int64_t step = affineForOp.getStep(); + int64_t numIters = (end - begin) / step; + auto inductionVar = affineForOp.getInductionVar(); - auto innerLoop = loops.back(); // the innermost loop, kk - int64_t kk_begin = innerLoop.getConstantLowerBound(); - int64_t kk_end = innerLoop.getConstantUpperBound(); - int64_t kk_step = innerLoop.getStep(); - int64_t kk_numIters = (kk_end - kk_begin) / kk_step; - if (kk_numIters != 2) - return failure(); - auto kk_inductionVar = innerLoop.getInductionVar(); + int64_t unrollMax = std::min(numIters, (end - begin)); + auto vectorSize = unrollMax; // iterate on loop body from begin to end to match the ops list - auto innerLoopBodyIter = innerLoop.getBody()->begin(); - auto innerLoopBodyEnd = innerLoop.getBody()->end(); - - // TODO: deal with case where we load B before A (allow C[i,j] += B[k,j] * A[i,k]) - // TODO: ensure we're storing the updated C value back into the same location (disallow C[m,n] = C[i,j] + A[i,k] * B[k,j]) + auto loopBodyIter = affineForOp.getBody()->begin(); + auto loopBodyEnd = affineForOp.getBody()->end(); - // 1. load from A matrix - if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + // 1. 
load from lhs array + if (loopBodyIter == loopBodyEnd || !isa(*loopBodyIter)) { - return reportMatchFailure(affineForOp, "Failed to match the load from A Op"); + return reportMatchFailure(affineForOp, "Failed to match the lhs load op"); } - auto loadAOp = cast(*innerLoopBodyIter); - auto elementBitWidthA = loadAOp.getMemRefType().getElementTypeBitWidth(); - if (elementBitWidthA != 16) + + auto lhsLoadOp = cast(*loopBodyIter++); + auto lhsLoadVal = lhsLoadOp.getResult(); // Keep the laoded val separate from the current lhs val for mapping later + auto lhsVal = lhsLoadVal; + matchedOps.push(lhsLoadOp); + + // Set up sequential mappings for the loop + std::vector laneMappings(unrollMax); + for (int64_t idx = begin; idx < end; idx += step) { - return failure(); + auto offsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (idx * step)); + auto offsetInductionVar = rewriter.create(lhsLoadOp.getLoc(), offsetMap, ValueRange{ inductionVar }); + tempOps.push(offsetInductionVar); + laneMappings[idx].map(inductionVar, offsetInductionVar); } - matchedOps.push(loadAOp); - // verify load from A looks like A[*,kk] or A[kk,*] - int loadA_kIndex = -1; - for (auto en : llvm::enumerate(loadAOp.indices())) + bool lhsLoadIsLoopSequential = IsUnrolledAccessSequential(rewriter, lhsLoadOp, laneMappings, unrollMax); + bool lhsLoadIsLoopConstant = IsUnrolledAccessConstant(rewriter, lhsLoadOp, laneMappings, unrollMax); + + // 1a. (optional) cast + v::CastOp lhsLoadCastOp; + mlir::Type lhsCastType; + if (isa(*loopBodyIter)) { - auto i = en.value(); - if (i == kk_inductionVar) + lhsLoadCastOp = cast(*loopBodyIter++); + if (lhsLoadCastOp.source() != lhsVal) { - if (loadA_kIndex != -1) - { - return reportMatchFailure(affineForOp, "Failed to match the load from A Op (too many 'k' indicies)"); - } - loadA_kIndex = en.index(); + return reportMatchFailure(affineForOp, "Cast after lhs load isn't casting the loaded value"); } + auto castedValue = lhsLoadCastOp.result(); + lhsCastType = castedValue.getType(); + lhsVal = castedValue; + matchedOps.push(lhsLoadCastOp); } - if (loadA_kIndex == -1) + // 2. load from rhs array + if (loopBodyIter == loopBodyEnd || !isa(*loopBodyIter)) { - return reportMatchFailure(affineForOp, "Failed to match the load from A Op (no 'k' index)"); + return reportMatchFailure(affineForOp, "Failed to match the rhs load op"); } - // 2. load from B matrix - innerLoopBodyIter++; - if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + auto rhsLoadOp = cast(*loopBodyIter++); + auto rhsLoadVal = rhsLoadOp.getResult(); + auto rhsVal = rhsLoadVal; + matchedOps.push(rhsLoadOp); + + bool rhsLoadIsLoopSequential = IsUnrolledAccessSequential(rewriter, rhsLoadOp, laneMappings, unrollMax); + bool rhsLoadIsLoopConstant = IsUnrolledAccessConstant(rewriter, rhsLoadOp, laneMappings, unrollMax); + + // 2a. (optional) cast + v::CastOp rhsLoadCastOp(nullptr); + mlir::Type rhsCastType; + if (isa(*loopBodyIter)) + { + rhsLoadCastOp = cast(*loopBodyIter++); + if (rhsLoadCastOp.source() != rhsVal) + { + return reportMatchFailure(affineForOp, "Cast after rhs load isn't casting the loaded value"); + } + auto castedValue = rhsLoadCastOp.result(); + rhsCastType = castedValue.getType(); + rhsVal = castedValue; + matchedOps.push(rhsLoadCastOp); + } + + // 3. 
bin op + if (loopBodyIter == loopBodyEnd || !isa(*loopBodyIter)) { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op"); + return reportMatchFailure(affineForOp, "Failed to match the bin op"); } - auto loadBOp = cast(innerLoopBodyIter); - auto elementBitWidthB = loadBOp.getMemRefType().getElementTypeBitWidth(); - if (elementBitWidthB != 16) + auto binOp = cast(*loopBodyIter++); + auto binOpVal = binOp.getResult(); + bool lhsRhsLineUp = (binOp.lhs() == lhsVal) && (binOp.rhs() == rhsVal); + bool lhsRhsSwap = (binOp.lhs() == rhsVal) && (binOp.rhs() == lhsVal); + if (!lhsRhsLineUp && !lhsRhsSwap) { - return failure(); + return reportMatchFailure(affineForOp, "Bin op isn't using loaded lhs and rhs values"); } - matchedOps.push(loadBOp); + matchedOps.push(binOp); - // verify load from B looks like B[kk,jj] or B[jj,kk] - int loadB_kIndex = -1; - int loadB_jIndex = -1; - for (auto en : llvm::enumerate(loadBOp.indices())) + auto elementType = binOpVal.getType(); + + // Get the bin op combining kind and verify that it has a vector reduction counterpart + mlir::vector::CombiningKind reductionKind; + // TODO : support AND, OR, MIN, MAX, and XOR as accera bin ops (accera has LOGICAL_AND and LOGICAL_OR, can those be used here?) + switch (binOp.getPredicate()) { - auto i = en.value(); - if (i == kk_inductionVar) + case v::BinaryOpPredicate::ADD: + reductionKind = mlir::vector::CombiningKind::ADD; + break; + case v::BinaryOpPredicate::MUL: + reductionKind = mlir::vector::CombiningKind::MUL; + break; + case v::BinaryOpPredicate::MAX: + if (elementType.isIntOrFloat()) { - if (loadB_kIndex != -1) + if (elementType.isIntOrIndex()) + { + if (elementType.isUnsignedInteger()) + { + reductionKind = mlir::vector::CombiningKind::MAXUI; + } + else + { + reductionKind = mlir::vector::CombiningKind::MAXSI; + } + } + else { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op (too many 'k' indicies)"); + reductionKind = mlir::vector::CombiningKind::MAXF; } - loadB_kIndex = en.index(); } - else if (i == jj_inductionVar) + else + { + return reportMatchFailure(binOp, "'Max' bin op with the given element type cannot be turned into a vector reduction"); + } + break; + case v::BinaryOpPredicate::MIN: + if (elementType.isIntOrFloat()) { - if (loadB_jIndex != -1) + if (elementType.isIntOrIndex()) + { + if (elementType.isUnsignedInteger()) + { + reductionKind = mlir::vector::CombiningKind::MINUI; + } + else + { + reductionKind = mlir::vector::CombiningKind::MINSI; + } + } + else { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op (too many 'j' indicies)"); + reductionKind = mlir::vector::CombiningKind::MINF; } - loadB_jIndex = en.index(); } + else + { + return reportMatchFailure(binOp, "'Min' bin op with the given element type cannot be turned into a vector reduction"); + } + break; + default: + return reportMatchFailure(binOp, "Bin op predicate type cannot be turned into a vector reduction"); } - if (loadB_kIndex == -1) + // 4. 
store to output array + if (loopBodyIter == loopBodyEnd || !isa(*loopBodyIter)) { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op (no 'k' index)"); + return reportMatchFailure(affineForOp, "Failed to match the store op"); } - if (loadB_jIndex == -1) + auto storeOp = cast(*loopBodyIter++); + auto storeMemRefType = storeOp.getMemRefType(); + auto storeElementType = storeMemRefType.getElementType(); + auto storedVal = storeOp.value(); + matchedOps.push(storeOp); + + // Check that the value being stored is the result of the BinOp + if (storedVal != binOpVal) { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op (no 'j' index)"); + return reportMatchFailure(storeOp, "Store op isn't storing the result of the bin op"); } - // 3. muliply A * B - innerLoopBodyIter++; - if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + // Check that store is constant wrt to the loop + bool storeIsLoopConstant = IsUnrolledAccessConstant(rewriter, storeOp, laneMappings, unrollMax); + if (!storeIsLoopConstant) { - return reportMatchFailure(affineForOp, "Failed to match the binary A*B multiplication op"); + return reportMatchFailure(storeOp, "Store op isn't constant wrt the loop being vectorized"); } - auto mulAB = cast(*innerLoopBodyIter); - if (mulAB.predicate() != v::BinaryOpPredicate::MUL) + + // Check which load is sequential wrt the loop and which is constant and which one is being stored to + + mlir::AffineLoadOp outputLoadOp; + if (storeOp.getMemRef() == lhsLoadOp.getMemRef()) { - return reportMatchFailure(mulAB, "Failed to match the multiplication op"); + if (!lhsLoadIsLoopConstant) + { + return reportMatchFailure(lhsLoadOp, "LHS load op isn't constant wrt the loop being vectorized but is the same memref being stored to"); + } + if (!rhsLoadIsLoopSequential) + { + return reportMatchFailure(rhsLoadOp, "RHS load op isn't sequential when LHS load is constant"); + } + outputLoadOp = lhsLoadOp; } - // Check that the operands for the multiply op are in fact the loads from A and B - if (!((mulAB.lhs() == loadAOp && mulAB.rhs() == loadBOp) || (mulAB.rhs() == loadAOp && mulAB.lhs() == loadBOp))) + else if (storeOp.getMemRef() == rhsLoadOp.getMemRef()) { - return reportMatchFailure(mulAB, "Failed to match the multiplication operands"); + if (!rhsLoadIsLoopConstant) + { + return reportMatchFailure(rhsLoadOp, "RHS load op isn't constant wrt the loop being vectorized but is the same memref being stored to"); + } + if (!lhsLoadIsLoopSequential) + { + return reportMatchFailure(lhsLoadOp, "LHS load op isn't sequential when RHS load is constant"); + } + outputLoadOp = rhsLoadOp; + } + else + { + return reportMatchFailure(storeOp, "Store op isn't storing to the same memref as either load"); } - matchedOps.push(mulAB); - // 4. 
sign-extend / cast result of A * B + // Check that the output load and store are at the same position + + auto strideOpt = GetConstantStrideBetweenAccesses(rewriter, outputLoadOp, storeOp); + if (!strideOpt.has_value() || *strideOpt != 0) + { + return reportMatchFailure(storeOp, "Output load and store ops aren't at the same location"); + } + + // At this point we've verified: + // - this affine for op is the innermost loop + // - the loop has constant bounds + // And the ops in the loop are: + // - loop-sequential load + // - loop-constant load from location Y + // - BinOp of the loaded values + // - store BinOp result to location Y + + // Check that all that remains are optionally redundant load-stores and the yield op + + // match the final pair of redundant load and store ops + if (loopBodyIter != loopBodyEnd && isa(*loopBodyIter)) + { + auto loadOp = cast(*loopBodyIter++); + matchedOps.push(loadOp); + if (loopBodyIter != loopBodyEnd && isa(*loopBodyIter)) + { + auto storeOp = cast(*loopBodyIter++); + if (storeOp.getMemRef() != loadOp.getMemRef()) + { + return reportMatchFailure(storeOp, "Extraneous load/store aren't to the same memref"); + } + + auto strideOpt = GetConstantStrideBetweenAccesses(rewriter, loadOp, storeOp); + if (!strideOpt.has_value() || *strideOpt != 0) + { + return reportMatchFailure(storeOp, "Extraneous load/store aren't to the same location"); + } + + matchedOps.push(storeOp); + } + else + { + return reportMatchFailure(loadOp, "Failed to match extraneous store"); + } + } + + // Ignore the yield op at the end + if (loopBodyIter != loopBodyEnd && isa(*loopBodyIter)) + { + (void)loopBodyIter++; + } + + if (loopBodyIter != loopBodyEnd) + { + LLVM_DEBUG(llvm::dbgs() << "Found additional instructions after the store"); + return failure(); + } + + // Set the insertion point to the end of the loop (just before the terminator) + mlir::OpBuilder::InsertionGuard guard(rewriter); + rewriter.setInsertionPoint(affineForOp.getBody(), affineForOp.getBody()->getTerminator()->getIterator()); + + // Now replace the matched ops with the vector load and reduction sequence + mlir::BlockAndValueMapping mappings; + + // LHS Load + mlir::Value vecLhsVal; + if (lhsLoadIsLoopSequential) + { + auto lhsLoadVecOp = VectorizeAffineLoadOpHelper(rewriter, lhsLoadOp, vectorSize); + vecLhsVal = lhsLoadVecOp.getResult(); + mappings.map(lhsLoadVal, vecLhsVal); + } + else + { + vecLhsVal = mlir::cast(rewriter.clone(*lhsLoadOp.getOperation(), mappings)); + } + mappings.map(lhsLoadVal, vecLhsVal); + + // Optional cast + if (lhsLoadCastOp) + { + // Create a vector cast + auto castVecType = mlir::VectorType::get({ vectorSize }, lhsCastType); + vecLhsVal = rewriter.create(lhsLoadCastOp.getLoc(), vecLhsVal, castVecType); + } + mappings.map(lhsVal, vecLhsVal); + + // RHS Load + mlir::Value vecRhsVal; + if (rhsLoadIsLoopSequential) + { + auto rhsLoadVecOp = VectorizeAffineLoadOpHelper(rewriter, rhsLoadOp, vectorSize); + vecRhsVal = rhsLoadVecOp.getResult(); + mappings.map(rhsLoadVal, vecRhsVal); + } + else + { + vecRhsVal = mlir::cast(rewriter.clone(*rhsLoadOp.getOperation(), mappings)); + } + mappings.map(rhsLoadVal, vecRhsVal); + + // Optional cast + if (rhsLoadCastOp) + { + // Create a vector cast + auto castVecType = mlir::VectorType::get({ vectorSize }, rhsCastType); + vecRhsVal = rewriter.create(rhsLoadCastOp.getLoc(), vecRhsVal, castVecType); + } + mappings.map(rhsVal, vecRhsVal); + + // Now create the appropriate vector reduce given the bin op type and apply it to either the LHS vector val or RHS vector 
val, whichever is the loaded vector + auto vectorValToReduce = lhsLoadIsLoopSequential ? vecLhsVal : vecRhsVal; + auto reduceOp = rewriter.create(binOp.getLoc(), storeElementType, mlir::vector::stringifyEnum(reductionKind), vectorValToReduce, mlir::ValueRange{} /* optional accumulate values */); + + mlir::Value reducedVal = reduceOp.getResult(); + auto scalarValThatWasReduced = lhsLoadIsLoopSequential ? lhsVal : rhsVal; + mappings.map(scalarValThatWasReduced, reducedVal); + + // Now we're left with two scalars, since we've reduced one vector to a scalar and the other value was a scalar to begin with. + // Clone the original bin op now that we've vector reduced either the LHS or RHS side and are left with 2 vectors + // At this point, in our mappings we've replaces the original lhsVal and rhsVal with either their cloned scalar versions, + // or the result of the vector reduce + auto finalBinOp = mlir::cast(rewriter.clone(*binOp.getOperation(), mappings)); + mappings.map(binOp, finalBinOp); + + // Clone the final store op + rewriter.clone(*storeOp.getOperation(), mappings); + + // Set the step size for the vectorized loops such that they each have a single iteration and will later get simplified away while replacing any IV usage with their begin value + affineForOp.setStep(step * numIters); + + // Erase the original non-vectorized ops + ir::util::EraseOps(matchedOps, rewriter); + + return mlir::success(); +} + +// TODO : de-dupe with part of vectorizeInt16Matmul matcher +mlir::LogicalResult vectorizeSequentialCast(mlir::AffineForOp affineForOp, mlir::PatternRewriter& rewriter) +{ + // Try to match a pattern like: + // for jj: + // for kk: + // x = load(input[..., jj, kk]) : memref<...x M x N, T1> + // y = cast(x, T2) : T2 + // store(y, output[..., jj, kk]) : memref<...x M x N, T2> + + // And replace it with: + // flat_input = reinterpret_cast input to flat + // flat_output = reinterpret_cast output to flat + // x = vector_load(flat_input, flatten_input_pos(..., jj, kk)) : vector<(M*N)xT1> + // y = cast(x, T2) : vector<(M*N)xT2> + // vector_store(y, flat_output, flatten_output_pos(..., jj, kk)) + + // So we need to check: + // - there are 2 nested loops (TODO : generalize this) + // - the loops have constant bounds (TODO: relax this check) + // - the innermost loop contains a sequential load + // - the innermost loop contains a cast of the loaded value + // - the innermost loop contains a sequential store of the cast value + // - there are no other ops in the innermost loop (other than a loop terminator op) + + // Implement the matcher + auto reportMatchFailure = [&](mlir::Operation* op, std::string message) -> LogicalResult { + llvm::dbgs() << "While processing " << *op << ". 
" << message << "\n"; + return rewriter.notifyMatchFailure(op, message); + }; + + std::stack matchedOps; + std::stack tempOps; + ir::util::TempOpCleanupGuard(&tempOps, rewriter); + + // Match j and k loop + SmallVector loops; + mlir::getPerfectlyNestedLoops(loops, affineForOp); + if (loops.size() != 2) // there should be exactly 2 loops in the nest + { + return failure(); + } + + // TODO : support dynamic loops that operate over contiguous memory + for (auto& loop : loops) + { + if (!loop.hasConstantBounds() || loop.getConstantLowerBound() != 0) + { + return failure(); + } + } + + auto outerLoop = loops.front(); // jj loop + int64_t jj_begin = outerLoop.getConstantLowerBound(); + int64_t jj_end = outerLoop.getConstantUpperBound(); + int64_t jj_step = outerLoop.getStep(); + int64_t jj_numIters = (jj_end - jj_begin) / jj_step; + auto jj_inductionVar = outerLoop.getInductionVar(); + + auto innerLoop = loops.back(); // the innermost loop, kk + int64_t kk_begin = innerLoop.getConstantLowerBound(); + int64_t kk_end = innerLoop.getConstantUpperBound(); + int64_t kk_step = innerLoop.getStep(); + int64_t kk_numIters = (kk_end - kk_begin) / kk_step; + auto kk_inductionVar = innerLoop.getInductionVar(); + + // iterate on loop body from begin to end to match the ops list + auto innerLoopBodyIter = innerLoop.getBody()->begin(); + auto innerLoopBodyEnd = innerLoop.getBody()->end(); + + // 1. load from input array + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the input load op"); + } + + auto loadOp = cast(*innerLoopBodyIter); + auto loadedVal = loadOp.getResult(); + matchedOps.push(loadOp); + + // 2. cast loaded input value + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the cast op"); + } + + auto castOp = cast(*innerLoopBodyIter); + auto castedValue = castOp.result(); + auto castResultType = castedValue.getType(); + matchedOps.push(castOp); + + if (castOp.source() != loadedVal) + { + return reportMatchFailure(affineForOp, "Cast op isn't casting the loaded value"); + } + + // 3. 
store cast value + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the store op"); + } + + auto storeOp = cast(*innerLoopBodyIter); + matchedOps.push(storeOp); + + if (storeOp.value() != castedValue) + { + return reportMatchFailure(affineForOp, "Store op isn't storing the cast value"); + } + + // Ignore the yield op at the end + innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + (void)innerLoopBodyIter++; + } + + if (innerLoopBodyIter != innerLoopBodyEnd) + { + LLVM_DEBUG(llvm::dbgs() << "Found additional instructions after the store"); + return failure(); + } + + // Check if the input loads and output writes are sequential + int64_t unrollMax_jj = std::min(jj_numIters, (jj_end - jj_begin)); + int64_t unrollMax_kk = std::min(kk_numIters, (kk_end - kk_begin)); + + // create lanemappings for jj * kk + std::vector laneMappings(unrollMax_kk * unrollMax_jj); + auto loadLoc = loadOp.getLoc(); + + for (int64_t jj_idx = jj_begin; jj_idx < jj_end; jj_idx += jj_step) + { + auto jjOffsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (jj_idx * jj_step)); + auto offsetInductionVar_jj = rewriter.create(loadLoc, jjOffsetMap, ValueRange{ jj_inductionVar }); + tempOps.push(offsetInductionVar_jj); + for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) + { + auto kkOffsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (kk_idx * kk_step)); + auto offsetInductionVar_kk = rewriter.create(loadLoc, kkOffsetMap, ValueRange{ kk_inductionVar }); + tempOps.push(offsetInductionVar_kk); + BlockAndValueMapping& operandMap = laneMappings[jj_idx * unrollMax_kk + kk_idx]; + operandMap.map(kk_inductionVar, offsetInductionVar_kk); + operandMap.map(jj_inductionVar, offsetInductionVar_jj); + } + } + + int64_t vectorSize = unrollMax_jj * unrollMax_kk; + + if (!IsUnrolledAccessSequential(rewriter, loadOp, laneMappings, vectorSize)) + { + return reportMatchFailure(loadOp, "Failed: isUnrolledAcessSequential for load op"); + } + if (!IsUnrolledAccessSequential(rewriter, storeOp, laneMappings, vectorSize)) + { + return reportMatchFailure(storeOp, "Failed: isUnrolledAcessSequential for store op"); + } + + // At this point we know: + // - there are 2 nested loops + // - the loops have constant bounds + // - the innermost loop contains a load that is sequential wrt the 2 loops + // - the innermost loop contains a cast of the loaded value + // - the innermost loop contains a store of the cast value that is sequential wrt the 2 loops + // - there are no other ops in the innermost loop (other than a loop terminator op) + + // So now we can create the new vectorized version of the loops + + // Set the insertion point to the end of the inner loop (just before the terminator) + mlir::OpBuilder::InsertionGuard guard(rewriter); + rewriter.setInsertionPoint(innerLoop.getBody(), innerLoop.getBody()->getTerminator()->getIterator()); + + // 1. 
create vector load of the input + auto inputMemRefType = loadOp.getMemRefType(); + auto inputElementType = inputMemRefType.getElementType(); + auto inputVectorType = mlir::VectorType::get({ vectorSize }, inputElementType); + mlir::AffineLoadOpAdaptor loadAdaptor{ loadOp }; + std::vector loadIndices(loadAdaptor.indices().begin(), loadAdaptor.indices().end()); + + auto [flatCastInputMemRef, flattenedInputPos] = FlattenAccess(rewriter, loadOp, loadIndices); + auto loadVecOp = rewriter.create(loadOp.getLoc(), inputVectorType, flatCastInputMemRef, mlir::ValueRange{ flattenedInputPos }); + + // 2. create a cast op of the loaded vector + auto castResultVecType = mlir::VectorType::get({ vectorSize }, castResultType); + mlir::Value castVecVal = rewriter.create(castOp.getLoc(), loadVecOp, castResultVecType); + + // 3. create a vector store op of the casted value + mlir::AffineStoreOpAdaptor storeAdaptor{ storeOp }; + std::vector storeIndices(storeAdaptor.indices().begin(), storeAdaptor.indices().end()); + + auto [flatCastOutputMemRef, flattenedOutputPos] = FlattenAccess(rewriter, storeOp, storeIndices); + rewriter.create(storeOp.getLoc(), castVecVal, flatCastOutputMemRef, mlir::ValueRange{ flattenedOutputPos }); + + // Set the step size for the vectorized loops such that they each have a single iteration and will later get simplified away while replacing any IV usage with their begin value + outerLoop.setStep(jj_step * jj_numIters); + innerLoop.setStep(kk_step * kk_numIters); + + // Erase the original non-vectorized ops + ir::util::EraseOps(matchedOps, rewriter); + + return mlir::success(); +} + +mlir::LogicalResult vectorizeTwoRowInterleavedPack(mlir::AffineForOp affineForOp, + mlir::PatternRewriter& rewriter) +{ + // TODO : generalize this beyond 2 rows + + // Try to match a pattern like: + // for jj: + // for kk = 0 ... 2: + // x = load(input[..., kk, jj]) : memref<...x N x M> + // store(x, output[..., jj, kk]) : memref<...x M x N> + + // And replace it with: + // flat_input = reinterpret_cast input to flat + // loaded_vec_0 = vector_load(flat_input, flatten_input_pos(..., 0, i)) // vector + // loaded_vec_1 = vector_load(flat_input, flatten_input_pos(..., 1, i)) // vector + // interleaved = vector.shuffle loaded_vec_0, loaded_vec_1 [0, M, 1, M+1, 2, M+2, ...] + // flat_output = reinterpret_cast output to flat + // vector_store(interleaved, flat_output, flatten_output_pos(..., 0, 0)) + + // So we need to check: + // - there are 2 nested loops (TODO : generalize this) + // - the loops have constant bounds (TODO: relax this check) + // - the innermost loop contains a load that is sequential wrt the outer loop + // - the innermost loop contains a store that is sequential wrt both loops + // - there are no other ops in the innermost loop (other than a loop terminator op) + + // Implement the matcher + auto reportMatchFailure = [&](mlir::Operation* op, std::string message) -> LogicalResult { + llvm::dbgs() << "While processing " << *op << ". 
" << message << "\n"; + return rewriter.notifyMatchFailure(op, message); + }; + + std::stack matchedOps; + std::stack tempOps; + ir::util::TempOpCleanupGuard(&tempOps, rewriter); + + // Match j and k loop + SmallVector loops; + mlir::getPerfectlyNestedLoops(loops, affineForOp); + if (loops.size() != 2) // there should be exactly 2 loops in the nest + { + return failure(); + } + + // TODO : support dynamic loops that operate over contiguous memory + for (auto& loop : loops) + { + if (!loop.hasConstantBounds() || loop.getConstantLowerBound() != 0) + { + return failure(); + } + } + + auto outerLoop = loops.front(); // jj loop + int64_t jj_begin = outerLoop.getConstantLowerBound(); + int64_t jj_end = outerLoop.getConstantUpperBound(); + int64_t jj_step = outerLoop.getStep(); + int64_t jj_numIters = (jj_end - jj_begin) / jj_step; + auto jj_inductionVar = outerLoop.getInductionVar(); + + auto innerLoop = loops.back(); // the innermost loop, kk + int64_t kk_begin = innerLoop.getConstantLowerBound(); + int64_t kk_end = innerLoop.getConstantUpperBound(); + int64_t kk_step = innerLoop.getStep(); + int64_t kk_numIters = (kk_end - kk_begin) / kk_step; + if (kk_numIters != 2) + return failure(); + auto kk_inductionVar = innerLoop.getInductionVar(); + + int64_t unrollMax_jj = std::min(jj_numIters, (jj_end - jj_begin)); + int64_t unrollMax_kk = std::min(kk_numIters, (kk_end - kk_begin)); + + // iterate on loop body from begin to end to match the ops list + auto innerLoopBodyIter = innerLoop.getBody()->begin(); + auto innerLoopBodyEnd = innerLoop.getBody()->end(); + + // 1. load from input array + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the input load op"); + } + + auto loadOp = cast(*innerLoopBodyIter); + auto loadLoc = loadOp.getLoc(); + auto loadedVal = loadOp.getResult(); + matchedOps.push(loadOp); + + // 2. 
store value + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the store op"); + } + + auto storeOp = cast(*innerLoopBodyIter); + matchedOps.push(storeOp); + + if (storeOp.value() != loadedVal) + { + return reportMatchFailure(affineForOp, "Store op isn't storing the loaded value"); + } + + // Ignore the yield op at the end + innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + (void)innerLoopBodyIter++; + } + + if (innerLoopBodyIter != innerLoopBodyEnd) + { + LLVM_DEBUG(llvm::dbgs() << "Found additional instructions after the store"); + return failure(); + } + + // Create two sets of lane mappings: one just for jj and one for jj and kk together + + // create lanemappings for jj + std::vector jj_laneMappings(unrollMax_jj); + + // create lanemappings for jj and kk + std::vector jj_kk_laneMappings(unrollMax_kk * unrollMax_jj); + + for (int64_t jj_idx = jj_begin; jj_idx < jj_end; jj_idx += jj_step) + { + auto jjOffsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (jj_idx * jj_step)); + auto offsetInductionVar_jj = rewriter.create(loadLoc, jjOffsetMap, ValueRange{ jj_inductionVar }); + tempOps.push(offsetInductionVar_jj); + BlockAndValueMapping& jj_operandMap = jj_laneMappings[jj_idx]; + jj_operandMap.map(jj_inductionVar, offsetInductionVar_jj); + for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) + { + auto kkOffsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (kk_idx * kk_step)); + auto offsetInductionVar_kk = rewriter.create(loadLoc, kkOffsetMap, ValueRange{ kk_inductionVar }); + tempOps.push(offsetInductionVar_kk); + BlockAndValueMapping& jj_kk_operandMap = jj_kk_laneMappings[jj_idx * unrollMax_kk + kk_idx]; + jj_kk_operandMap.map(kk_inductionVar, offsetInductionVar_kk); + jj_kk_operandMap.map(jj_inductionVar, offsetInductionVar_jj); + } + } + + // Check if the input load is sequential wrt the jj loop + int64_t inputVectorSize = unrollMax_jj; + if (!IsUnrolledAccessSequential(rewriter, loadOp, jj_laneMappings, inputVectorSize)) + { + return reportMatchFailure(loadOp, "Failed: isUnrolledAcessSequential for load op"); + } + + // Check if the output store is sequential wrt the jj and kk loops + int64_t outputVectorSize = unrollMax_jj * unrollMax_kk; + if (!IsUnrolledAccessSequential(rewriter, storeOp, jj_kk_laneMappings, outputVectorSize)) + { + return reportMatchFailure(storeOp, "Failed: isUnrolledAcessSequential for store op"); + } + + // At this point we know: + // - there are 2 nested loops, the inner of which has 2 iterations + // - the loops have constant bounds + // - the innermost loop contains a load that is sequential wrt the outer loop + // - the innermost loop contains a store of the loaded value that is sequential wrt the 2 loops + // - there are no other ops in the innermost loop (other than a loop terminator op) + + // So now we can create the new vectorized version of the loops + + // Set the insertion point to the end of the inner loop (just before the terminator) + mlir::OpBuilder::InsertionGuard guard(rewriter); + rewriter.setInsertionPoint(innerLoop.getBody(), innerLoop.getBody()->getTerminator()->getIterator()); + + // 1. 
create vector load of the input rows + auto inputMemRefType = loadOp.getMemRefType(); + auto inputElementType = inputMemRefType.getElementType(); + auto inputVectorType = mlir::VectorType::get({ inputVectorSize }, inputElementType); + + std::vector loadedVecs; + // Clone the load op for each iteration of the kk loop and vectorize each of those loads wrt the jj loop + for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) + { + auto unrolledInductionVar_kk = rewriter.create(loadLoc, kk_idx); + tempOps.push(unrolledInductionVar_kk); + mlir::BlockAndValueMapping kIterMapping; + kIterMapping.map(kk_inductionVar, unrolledInductionVar_kk); + auto clonedLoadOp = mlir::cast(rewriter.clone(*(loadOp.getOperation()), kIterMapping)); + tempOps.push(clonedLoadOp); + + mlir::AffineLoadOpAdaptor loadAdaptor{ clonedLoadOp }; + std::vector loadIndices(loadAdaptor.indices().begin(), loadAdaptor.indices().end()); + + auto [flatCastInputMemRef, flattenedInputPos] = FlattenAccess(rewriter, clonedLoadOp, loadIndices); + mlir::Value loadedVec = rewriter.create(loadOp.getLoc(), inputVectorType, flatCastInputMemRef, mlir::ValueRange{ flattenedInputPos }); + loadedVecs.push_back(loadedVec); + } + assert(loadedVecs.size() == 2); // Eventually we could relax this, but vector.shuffle ops require precisely 2 vectors, so if we relax this we need to create a sequence of shuffles + + // 2. create a vector.shuffle op to interleave the input rows + std::vector interleaveMask; + interleaveMask.reserve(outputVectorSize); + for (unsigned colIdx = 0; colIdx < unrollMax_jj; ++colIdx) + { + // The vector.shuffle mask should be like { 0, N, 1, N+1, 2, N+2, ... } where the jj loop has N iterations + interleaveMask.push_back(colIdx); + interleaveMask.push_back(colIdx + unrollMax_jj); + } + + auto outputMemRefType = storeOp.getMemRefType(); + auto outputElementType = outputMemRefType.getElementType(); + auto outputVectorType = mlir::VectorType::get({ outputVectorSize }, outputElementType); + auto shuffledRowsOp = rewriter.create(loadLoc, outputVectorType, loadedVecs[0], loadedVecs[1], rewriter.getI64ArrayAttr(interleaveMask)); + + // 3. create a vector store op of the interleaved rows + mlir::AffineStoreOpAdaptor storeAdaptor{ storeOp }; + std::vector storeIndices(storeAdaptor.indices().begin(), storeAdaptor.indices().end()); + + auto [flatCastOutputMemRef, flattenedOutputPos] = FlattenAccess(rewriter, storeOp, storeIndices); + rewriter.create(storeOp.getLoc(), shuffledRowsOp, flatCastOutputMemRef, mlir::ValueRange{ flattenedOutputPos }); + + // Set the step size for the vectorized loops such that they each have a single iteration and will later get simplified away while replacing any IV usage with their begin value + outerLoop.setStep(jj_step * jj_numIters); + innerLoop.setStep(kk_step * kk_numIters); + + // Erase the original non-vectorized ops + ir::util::EraseOps(matchedOps, rewriter); + + return mlir::success(); +} + +mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, + mlir::PatternRewriter& rewriter) +{ + // Implement the matcher + auto reportMatchFailure = [&](mlir::Operation* op, std::string message) -> LogicalResult { + llvm::dbgs() << "While processing " << *op << ". 
" << message << "\n"; + return rewriter.notifyMatchFailure(op, message); + }; + + std::stack matchedOps; + std::stack tempOps; + + // Match jj and kk loop in int16 matmul for vectorization rewrite rules + SmallVector loops; + mlir::getPerfectlyNestedLoops(loops, affineForOp); + if (loops.size() != 2) // there should be exactly 2 loops in the nest + { + return failure(); + } + + for (auto& loop : loops) + { + if (!loop.hasConstantBounds() || loop.getConstantLowerBound() != 0) + { + return failure(); + } + } + + // order of nested loops we are looking for is + // jj {0 to 8} followed by kk {0 to 2} + auto outerLoop = loops.front(); // jj loop + int64_t jj_begin = outerLoop.getConstantLowerBound(); + int64_t jj_end = outerLoop.getConstantUpperBound(); + int64_t jj_step = outerLoop.getStep(); + int64_t jj_numIters = (jj_end - jj_begin) / jj_step; + if (jj_numIters != 8) + return failure(); + auto jj_inductionVar = outerLoop.getInductionVar(); + + auto innerLoop = loops.back(); // the innermost loop, kk + int64_t kk_begin = innerLoop.getConstantLowerBound(); + int64_t kk_end = innerLoop.getConstantUpperBound(); + int64_t kk_step = innerLoop.getStep(); + int64_t kk_numIters = (kk_end - kk_begin) / kk_step; + if (kk_numIters != 2) + return failure(); + auto kk_inductionVar = innerLoop.getInductionVar(); + + // get unroll max for jj and kk + int64_t unrollMax_jj = std::min(jj_numIters, (jj_end - jj_begin)); + int64_t unrollMax_kk = std::min(kk_numIters, (kk_end - kk_begin)); + int64_t vectorSize = unrollMax_jj * unrollMax_kk; + + // create IV map for jj and kk + auto inductionVarMap_jj = AffineMap::get(1, 1, rewriter.getAffineDimExpr(0) + jj_step * rewriter.getAffineSymbolExpr(0)); + auto inductionVarMap_kk = AffineMap::get(1, 1, rewriter.getAffineDimExpr(0) + kk_step * rewriter.getAffineSymbolExpr(0)); + + // create lanemappings for jj, kk, and jj * kk + std::vector laneMappings_jj(unrollMax_jj); + std::vector laneMappings_kk(unrollMax_kk); + std::vector laneMappings_jj_kk(unrollMax_kk * unrollMax_jj); + + for (int64_t jj_idx = jj_begin; jj_idx < jj_end; jj_idx += jj_step) + { + auto offset_jj = rewriter.create(outerLoop.getLoc(), jj_idx); + auto offsetInductionVar_jj = rewriter.create(outerLoop.getLoc(), inductionVarMap_jj, ValueRange{ jj_inductionVar, offset_jj }); + tempOps.push(offset_jj); + tempOps.push(offsetInductionVar_jj); + laneMappings_jj[jj_idx].map(jj_inductionVar, offsetInductionVar_jj); + for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) + { + auto offset_kk = rewriter.create(innerLoop.getLoc(), kk_idx); + auto offsetInductionVar_kk = rewriter.create(innerLoop.getLoc(), inductionVarMap_kk, ValueRange{ kk_inductionVar, offset_kk }); + tempOps.push(offset_kk); + tempOps.push(offsetInductionVar_kk); + laneMappings_jj_kk[jj_idx * unrollMax_kk + kk_idx].map(kk_inductionVar, offsetInductionVar_kk); + laneMappings_jj_kk[jj_idx * unrollMax_kk + kk_idx].map(jj_inductionVar, offsetInductionVar_jj); + if (jj_idx == jj_begin) + { + // Only map for the first iter of jj + laneMappings_kk[kk_idx].map(kk_inductionVar, offsetInductionVar_kk); + } + } + } + + // iterate on loop body from begin to end to match the ops list + auto innerLoopBodyIter = innerLoop.getBody()->begin(); + auto innerLoopBodyEnd = innerLoop.getBody()->end(); + + // TODO: ensure we're storing the updated C value back into the same location (disallow C[m,n] = C[i,j] + A[i,k] * B[k,j]) + + // TODO : de-dupe between first and second cases + + // 1. 
load from first matrix + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the load from the first array"); + } + auto firstLoad = cast(*innerLoopBodyIter); + auto firstElementType = firstLoad.getMemRefType().getElementType(); + matchedOps.push(firstLoad); + + // 1a. Optionally allow casting the A value to an int16 if it is not an int16 already + bool castFirstLoad = false; + mlir::Value firstLoadVal = firstLoad.getResult(); + if (firstElementType != rewriter.getIntegerType(16)) + { + innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + castFirstLoad = true; + auto castOp = cast(*innerLoopBodyIter); + firstLoadVal = castOp.result(); + auto castResultType = firstLoadVal.getType(); + matchedOps.push(castOp); + if (castResultType != rewriter.getIntegerType(16)) + { + return reportMatchFailure(affineForOp, "First load element is not an int16 or cast to an int16"); + } + } + else + { + return reportMatchFailure(affineForOp, "First load is not from an int16 array"); + } + } + + // 2. load from second matrix + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the load from the second array"); + } + auto secondLoad = cast(innerLoopBodyIter); + auto secondElementType = secondLoad.getMemRefType().getElementType(); + matchedOps.push(secondLoad); + + // 2a. Optionally allow casting the B value to an int16 if it is not an int16 already + bool castSecondLoad = false; + mlir::Value secondLoadVal = secondLoad.getResult(); + if (secondElementType != rewriter.getIntegerType(16)) + { + innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + castSecondLoad = true; + auto castOp = cast(*innerLoopBodyIter); + secondLoadVal = castOp.result(); + auto castResultType = secondLoadVal.getType(); + matchedOps.push(castOp); + if (castResultType != rewriter.getIntegerType(16)) + { + return reportMatchFailure(affineForOp, "Second load element is not an int16 or cast to an int16"); + } + } + else + { + return reportMatchFailure(affineForOp, "Second load is not from an int16 array"); + } + } + + // If a load is sequential wrt the inner loop and constant wrt the outer loop, then we want to load the elements and broadcast them to fill a 16-element buffer + // If a load is sequential wrt both loops, then we simply want to load the data + + bool broadcastFirstLoad = IsUnrolledAccessSequential(rewriter, firstLoad, laneMappings_kk, unrollMax_kk) && IsUnrolledAccessConstant(rewriter, firstLoad, laneMappings_jj, unrollMax_jj); + bool broadcastSecondLoad = IsUnrolledAccessSequential(rewriter, secondLoad, laneMappings_kk, unrollMax_kk) && IsUnrolledAccessConstant(rewriter, secondLoad, laneMappings_jj, unrollMax_jj); + + int64_t firstLoadVecSize = vectorSize; + int64_t secondLoadVecSize = vectorSize; + + // 3. 
multiply A * B + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the binary A*B multiplication op"); + } + auto mulAB = cast(*innerLoopBodyIter); + if (mulAB.predicate() != v::BinaryOpPredicate::MUL) + { + return reportMatchFailure(mulAB, "Failed to match the multiplication op"); + } + // Check that the operands for the multiply op are in fact the loads from A and B + if (!((mulAB.lhs() == firstLoadVal && mulAB.rhs() == secondLoadVal) || (mulAB.rhs() == firstLoadVal && mulAB.lhs() == secondLoadVal))) + { + return reportMatchFailure(mulAB, "Failed to match the multiplication operands"); + } + matchedOps.push(mulAB); + + // 4. sign-extend / cast result of A * B innerLoopBodyIter++; if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) { @@ -1104,6 +2220,11 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, { return failure(); } + if (!IsUnrolledAccessSequential(rewriter, loadCOp, laneMappings_jj, vectorSize / 2)) + { + return reportMatchFailure(loadCOp, "Failed: IsUnrolledAccessSequential for C load"); + } + matchedOps.push(loadCOp); // 6. add C + (A * B) @@ -1136,6 +2257,10 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, { return reportMatchFailure(storeCOp, "Failed to match the store into C"); } + if (!IsUnrolledAccessSequential(rewriter, storeCOp, laneMappings_jj, vectorSize / 2)) + { + return reportMatchFailure(storeCOp, "Failed: IsUnrolledAccessSequential for C store"); + } matchedOps.push(storeCOp); // 8. match the final pair of redundant load and store ops @@ -1172,68 +2297,6 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, return failure(); } - // Instantiate a TempOpCleanupGuard so that all the matched ops will get cleaned up - ir::util::TempOpCleanupGuard matchedOpsGuard(&matchedOps, rewriter); - //ir::util::TempOpCleanupGuard tempOpsGuard(&tempOps, rewriter); - - // Check if elements of B are sequential - // get unroll max for jj and kk - int64_t unrollMax_jj = std::min(jj_numIters, (jj_end - jj_begin)); - int64_t unrollMax_kk = std::min(kk_numIters, (kk_end - kk_begin)); - - // create IV map for jj and kk - auto inductionVarMap_jj = AffineMap::get(1, 1, rewriter.getAffineDimExpr(0) + jj_step * rewriter.getAffineSymbolExpr(0)); - auto inductionVarMap_kk = AffineMap::get(1, 1, rewriter.getAffineDimExpr(0) + kk_step * rewriter.getAffineSymbolExpr(0)); - - // create lanemappings for jj * kk - std::vector laneMappings(unrollMax_kk * unrollMax_jj); - auto locB = loadBOp.getLoc(); - - for (int64_t jj_idx = jj_begin; jj_idx < jj_end; jj_idx += jj_step) - { - auto offset_jj = rewriter.create(locB, jj_idx); - auto offsetInductionVar_jj = rewriter.create(locB, inductionVarMap_jj, ValueRange{ jj_inductionVar, offset_jj }); - tempOps.push(offset_jj); - tempOps.push(offsetInductionVar_jj); - for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) - { - auto offset_kk = rewriter.create(locB, kk_idx); - auto offsetInductionVar_kk = rewriter.create(locB, inductionVarMap_kk, ValueRange{ kk_inductionVar, offset_kk }); - tempOps.push(offset_kk); - tempOps.push(offsetInductionVar_kk); - BlockAndValueMapping& operandMap = laneMappings[jj_idx * unrollMax_kk + kk_idx]; - operandMap.map(kk_inductionVar, offsetInductionVar_kk); - operandMap.map(jj_inductionVar, offsetInductionVar_jj); - } - } - - int64_t vectorSize = 16; - auto memRefTypeB = loadBOp.getMemRefType(); - auto elementTypeB = 
memRefTypeB.getElementType(); - auto vectorTypeB = mlir::VectorType::get({ vectorSize }, elementTypeB); - mlir::AffineLoadOpAdaptor adaptorB{ loadBOp }; - std::vector baseIndicesB(adaptorB.indices().begin(), adaptorB.indices().end()); - - mlir::Value loadBVecOp; - if (!IsUnrolledAccessSequential(rewriter, loadBOp, laneMappings, vectorSize)) - { - return reportMatchFailure(loadBOp, "Failed: isUnrolledAcessSequential for B"); - } - - // Check if elements of output array, Y are sequential - // create lanemappings for jj - std::vector laneMappingsC(unrollMax_jj); - auto loc_loadCOp = loadCOp.getLoc(); - for (int64_t jj_idx = 0; jj_idx < unrollMax_jj; ++jj_idx) - { - auto offset_jj = rewriter.create(loc_loadCOp, jj_idx); - auto offsetInductionVar_jj = rewriter.create(loc_loadCOp, inductionVarMap_jj, ValueRange{ jj_inductionVar, offset_jj }); - tempOps.push(offset_jj); - tempOps.push(offsetInductionVar_jj); - BlockAndValueMapping& operandMapC = laneMappingsC[jj_idx]; - operandMapC.map(jj_inductionVar, offsetInductionVar_jj); - } - auto memRefTypeC = loadCOp.getMemRefType(); auto elementTypeC = memRefTypeC.getElementType(); auto vectorTypeC = mlir::VectorType::get({ vectorSize / 2 }, elementTypeC); @@ -1241,12 +2304,6 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, std::vector baseIndicesC(adaptorC.indices().begin(), adaptorC.indices().end()); mlir::Value loadCVecOp; - if (!IsUnrolledAccessSequential(rewriter, loadCOp, laneMappingsC, vectorSize / 2)) - { - return reportMatchFailure(loadCOp, "Failed: isUnrolledAcessSequential for C"); - } - - // Set the insertion point to the end of the inner loop (just before the terminator) mlir::OpBuilder::InsertionGuard guard(rewriter); rewriter.setInsertionPoint(innerLoop.getBody(), innerLoop.getBody()->getTerminator()->getIterator()); @@ -1279,112 +2336,135 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, // Implement the rewriter by stiching together a list of vector instructions, vector of 16 elements in this case // 1. create vector.load A - auto memRefType = loadAOp.getMemRefType(); - auto elementType = memRefType.getElementType(); - auto vectorType = mlir::VectorType::get({ vectorSize }, elementType); - mlir::AffineLoadOpAdaptor adaptorA{ loadAOp }; - std::vector baseIndicesA(adaptorA.indices().begin(), adaptorA.indices().end()); - // Ignoring the sequential access check for elements of A because that's not required. - - auto [flatCastMemRef, flattenedPos] = FlattenAccess(rewriter, loadAOp, baseIndicesA); - auto loadAVecOp = rewriter.create(loadAOp.getLoc(), vectorType, flatCastMemRef, mlir::ValueRange{ flattenedPos }); - - // 2. 
create vector.shuffle op for A: alternate between A[0,0] and A[0,1] - auto locA = loadAOp.getLoc(); auto i16Type = rewriter.getIntegerType(16); - auto vecType = mlir::VectorType::get({ vectorSize }, i16Type); + auto i32Type = rewriter.getIntegerType(32); + auto fullVecType = mlir::VectorType::get({ vectorSize }, i16Type); auto altElemsMask = rewriter.getI64ArrayAttr({ 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1 }); auto halfVecType = mlir::VectorType::get({ vectorSize / 2 }, i16Type); auto oddMask = rewriter.getI64ArrayAttr({ 1, 3, 5, 7, 9, 11, 13, 15 }); auto evenMask = rewriter.getI64ArrayAttr({ 0, 2, 4, 6, 8, 10, 12, 14 }); - auto shuffledAOp = rewriter.create(locA, vecType, loadAVecOp, loadAVecOp, altElemsMask); + auto loadCastBroadcastExtractVec = [&](mlir::AffineLoadOp loadOp, int64_t loadVecSize, mlir::Type loadElementType, bool cast, bool broadcast) -> std::tuple { + auto loadOpVectorType = mlir::VectorType::get({ loadVecSize }, loadElementType); + mlir::AffineLoadOpAdaptor loadOpAdaptor{ loadOp }; + std::vector loadOpIndices(loadOpAdaptor.indices().begin(), loadOpAdaptor.indices().end()); + auto [flatCastMemRef, flattenedPos] = FlattenAccess(rewriter, loadOp, loadOpIndices); + mlir::Value loadVecVal = rewriter.create(loadOp.getLoc(), loadOpVectorType, flatCastMemRef, mlir::ValueRange{ flattenedPos }); + if (cast) + { + // 1a. sign-extend loaded vector values + auto castLoadVecType = mlir::VectorType::get({ loadVecSize }, i16Type); + loadVecVal = rewriter.create(loadOp.getLoc(), loadVecVal, castLoadVecType); + } + if (broadcast) + { + // 1b. create vector.shuffle op for first load: alternate between A[0,0] and A[0,1] + loadVecVal = rewriter.create(loadOp.getLoc(), fullVecType, loadVecVal, loadVecVal, altElemsMask); + } - // 3. create vector shuffle op for A to pick odd and even elements separately - auto vecLoadA_oddShuffleOp = rewriter.create(locA, halfVecType, shuffledAOp, shuffledAOp, oddMask); - auto vecLoadA_evenShuffleOp = rewriter.create(locA, halfVecType, shuffledAOp, shuffledAOp, evenMask); + // 2. Now extract the odds and evens + mlir::Value oddShuffleVal = rewriter.create(loadOp.getLoc(), halfVecType, loadVecVal, loadVecVal, oddMask); + mlir::Value evenShuffleVal = rewriter.create(loadOp.getLoc(), halfVecType, loadVecVal, loadVecVal, evenMask); - // 4. 
create vector load op for B - if (IsUnrolledAccessSequential(rewriter, loadBOp, laneMappings, vectorSize)) + return { loadVecVal, oddShuffleVal, evenShuffleVal }; + }; + + + // If there's only one broadcasted load, make sure it happens first for better vpmaddwd matching + mlir::Value firstLoadVec; + mlir::Value firstLoadOdds; + mlir::Value firstLoadEvens; + mlir::Value secondLoadVec; + mlir::Value secondLoadOdds; + mlir::Value secondLoadEvens; + + if (broadcastFirstLoad == broadcastSecondLoad || broadcastFirstLoad) { - auto [flatCastMemRefB, flattenedPosB] = FlattenAccess(rewriter, loadBOp, baseIndicesB); - loadBVecOp = rewriter.create(loadBOp.getLoc(), vectorTypeB, flatCastMemRefB, mlir::ValueRange{ flattenedPosB }); + auto [firstLoadVecVal, firstLoadOddVal, firstLoadEvenVal] = loadCastBroadcastExtractVec(firstLoad, firstLoadVecSize, firstElementType, castFirstLoad, broadcastFirstLoad); + auto [secondLoadVecVal, secondLoadOddVal, secondLoadEvenVal] = loadCastBroadcastExtractVec(secondLoad, secondLoadVecSize, secondElementType, castSecondLoad, broadcastSecondLoad); + firstLoadVec = firstLoadVecVal; + firstLoadOdds = firstLoadOddVal; + firstLoadEvens = firstLoadEvenVal; + secondLoadVec = secondLoadVecVal; + secondLoadOdds = secondLoadOddVal; + secondLoadEvens = secondLoadEvenVal; } else { - return failure(); + // broadcastFirstLoad == false and broadcastSecondLoad == true + auto [firstLoadVecVal, firstLoadOddVal, firstLoadEvenVal] = loadCastBroadcastExtractVec(secondLoad, secondLoadVecSize, secondElementType, castSecondLoad, broadcastSecondLoad); + auto [secondLoadVecVal, secondLoadOddVal, secondLoadEvenVal] = loadCastBroadcastExtractVec(firstLoad, firstLoadVecSize, firstElementType, castFirstLoad, broadcastFirstLoad); + firstLoadVec = firstLoadVecVal; + firstLoadOdds = firstLoadOddVal; + firstLoadEvens = firstLoadEvenVal; + secondLoadVec = secondLoadVecVal; + secondLoadOdds = secondLoadOddVal; + secondLoadEvens = secondLoadEvenVal; } - // 5. create shuffled ops (odd and even) for loadBVecOp - auto vecLoadB_oddShuffleOp = rewriter.create(locB, halfVecType, loadBVecOp, loadBVecOp, oddMask); - auto vecLoadB_evenShuffleOp = rewriter.create(locB, halfVecType, loadBVecOp, loadBVecOp, evenMask); - - // 6. Sign-extend all ops for further arithmetic operations - auto i32Type = rewriter.getIntegerType(32); auto bigVecType = mlir::VectorType::get({ vectorSize / 2 }, i32Type); - auto sextA_oddOp = rewriter.create(rewriter.getUnknownLoc(), vecLoadA_oddShuffleOp, bigVecType); - auto sextA_evenOp = rewriter.create(rewriter.getUnknownLoc(), vecLoadA_evenShuffleOp, bigVecType); - auto sextB_oddOp = rewriter.create(rewriter.getUnknownLoc(), vecLoadB_oddShuffleOp, bigVecType); - auto sextB_evenOp = rewriter.create(rewriter.getUnknownLoc(), vecLoadB_evenShuffleOp, bigVecType); - // 7. binOp.mul for sign-extended even shuffled elements of A and B + // TODO : plumb this from the DSL +#if MATCH_VPMADDWD_INTRINSIC + // (3-5). Create results using vpmaddwd intrinsic + auto accumOp = rewriter.create(outerLoop.getLoc(), bigVecType, firstLoadVec, secondLoadVec); +#else + // 3. 
Sign-extend all ops for further arithmetic operations + // auto i32Type = rewriter.getIntegerType(32); + auto sextA_oddOp = rewriter.create(rewriter.getUnknownLoc(), firstLoadOdds, bigVecType); + auto sextA_evenOp = rewriter.create(rewriter.getUnknownLoc(), firstLoadEvens, bigVecType); + auto sextB_oddOp = rewriter.create(rewriter.getUnknownLoc(), secondLoadOdds, bigVecType); + auto sextB_evenOp = rewriter.create(rewriter.getUnknownLoc(), secondLoadEvens, bigVecType); + + // 4. binOp.mul for sign-extended even shuffled elements of A and B // A[00] * B[0], A[00] * B[2], A[00] * B[4] ... auto vecMulAB_even = rewriter.create(mulAB.getLoc(), sextA_evenOp, sextB_evenOp); // A[01] * B[1], A[01] * B[3], A[01] * B[5] ... auto vecMulAB_odd = rewriter.create(mulAB.getLoc(), sextA_oddOp, sextB_oddOp); - // 8. Add odd/even sign-extended results - auto accABOp = rewriter.create(rewriter.getUnknownLoc(), vecMulAB_even, vecMulAB_odd); + // 5. Add odd/even sign-extended results + auto accumOp = rewriter.create(rewriter.getUnknownLoc(), vecMulAB_even, vecMulAB_odd); +#endif - // 9. Vectorize affine.load of C - if (IsUnrolledAccessSequential(rewriter, loadCOp, laneMappingsC, vectorSize / 2)) - { - // TODO: substitute 0 for jj here - auto [flatCastMemRefC, flattenedPosC] = FlattenAccess(rewriter, loadCOp, baseIndicesC); - loadCVecOp = rewriter.create(loadCOp.getLoc(), vectorTypeC, flatCastMemRefC, mlir::ValueRange{ flattenedPosC }); - } - else - { - return failure(); - } + // 6. Vectorize affine.load of C + auto [flatCastMemRefC, flattenedPosC] = FlattenAccess(rewriter, loadCOp, baseIndicesC); + loadCVecOp = rewriter.create(loadCOp.getLoc(), vectorTypeC, flatCastMemRefC, mlir::ValueRange{ flattenedPosC }); - // 10. Add accABOp to vecLoadC - auto finalAccOp = rewriter.create(accOp.getLoc(), loadCVecOp, accABOp); - - // 11. store final accumulated result to vectorized C - // Verify again if the memory access is sequential and then vectorize the store op - std::vector laneMappingsStoreC(unrollMax_jj); - auto loc_storeCOp = storeCOp.getLoc(); - for (int64_t jj_idx = 0; jj_idx < unrollMax_jj; ++jj_idx) - { - auto offset_jj = rewriter.create(loc_storeCOp, jj_idx); - auto offsetInductionVar_jj = rewriter.create(loc_storeCOp, inductionVarMap_jj, ValueRange{ jj_inductionVar, offset_jj }); - tempOps.push(offset_jj); - tempOps.push(offsetInductionVar_jj); - BlockAndValueMapping& operandMapStoreC = laneMappingsStoreC[jj_idx]; - operandMapStoreC.map(jj_inductionVar, offsetInductionVar_jj); - } + // 7. Add accumOp to vecLoadC + auto finalAccOp = rewriter.create(accOp.getLoc(), loadCVecOp, accumOp); + // 8. 
store final accumulated result to vectorized C mlir::AffineStoreOpAdaptor adaptorStoreC{ storeCOp }; std::vector baseIndicesStoreC(adaptorStoreC.indices().begin(), adaptorStoreC.indices().end()); mlir::vector::StoreOp storeCVecOp; - if (IsUnrolledAccessSequential(rewriter, storeCOp, laneMappingsStoreC, vectorSize / 2)) - { - auto [flatCastMemRefStoreC, flattenedPosStoreC] = FlattenAccess(rewriter, storeCOp, baseIndicesStoreC); - storeCVecOp = rewriter.create(storeCOp.getLoc(), finalAccOp.getResult(), flatCastMemRefStoreC, mlir::ValueRange{ flattenedPosStoreC }); - } - else - { - return failure(); - } + auto [flatCastMemRefStoreC, flattenedPosStoreC] = FlattenAccess(rewriter, storeCOp, baseIndicesStoreC); + + rewriter.create(storeCOp.getLoc(), finalAccOp.getResult(), flatCastMemRefStoreC, mlir::ValueRange{ flattenedPosStoreC }); // Set the step size for the vectorized loops to be the vector size in that dimension outerLoop.setStep(jj_step * jj_numIters); innerLoop.setStep(kk_step * kk_numIters); - + + ir::util::EraseOps(matchedOps, rewriter); + return mlir::success(); } +mlir::LogicalResult TryVectorizeKnownSubgraph(mlir::AffineForOp affineForOp, + mlir::PatternRewriter& rewriter) +{ + // TODO : convert these to rewrite pattern structs with benefit weights + if (succeeded(vectorizeHorizontalReduction(affineForOp, rewriter))) + return success(); + if (succeeded(vectorizeSequentialCast(affineForOp, rewriter))) + return success(); + if (succeeded(vectorizeTwoRowInterleavedPack(affineForOp, rewriter))) + return success(); + if (succeeded(vectorizeInt16MatMul(affineForOp, rewriter))) + return success(); + return failure(); +} + } // namespace accera::transforms diff --git a/accera/transforms/src/value/RangeValueOptimizePass.cpp b/accera/transforms/src/value/RangeValueOptimizePass.cpp index e6190cb9..4ede182f 100644 --- a/accera/transforms/src/value/RangeValueOptimizePass.cpp +++ b/accera/transforms/src/value/RangeValueOptimizePass.cpp @@ -1,7 +1,7 @@ //////////////////////////////////////////////////////////////////////////////////////////////////// // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. See LICENSE in the project root for license information. 
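For reference, the two-row interleaved pack rewrite above builds its vector.shuffle mask as { 0, N, 1, N+1, ... } for an N-iteration jj loop. A minimal standalone sketch of that mask construction (plain C++ for illustration only; `InterleaveMask` is a hypothetical helper, not an Accera API):
```
// Interleave two N-element rows: result = { row0[0], row1[0], row0[1], row1[1], ... }
// The shuffle mask indexes the concatenation of the two loaded vectors, so entry
// 2*col picks row0[col] (index col) and entry 2*col+1 picks row1[col] (index col + N).
#include <cstdint>
#include <vector>

std::vector<int64_t> InterleaveMask(int64_t n)
{
    std::vector<int64_t> mask;
    mask.reserve(2 * n);
    for (int64_t col = 0; col < n; ++col)
    {
        mask.push_back(col);     // element col of the first loaded row
        mask.push_back(col + n); // element col of the second loaded row
    }
    return mask;
}
```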
-// Authors: Abdul Dakkak +// Authors: Abdul Dakkak, Mason Remy //////////////////////////////////////////////////////////////////////////////////////////////////// #include "AcceraPasses.h" @@ -12,7 +12,9 @@ #include #include +#include #include +#include #include #include #include @@ -39,6 +41,7 @@ #include #include +#include #define DEBUG_TYPE "value-optimize" @@ -55,101 +58,248 @@ using llvm::Instruction; namespace { -struct RangeValueOptimizePass : public ConvertRangeValueOptimizeBase + +enum class CmpIOpClassification : int { - void runOnOperation() final + Unknown, + AlwaysFalse, + AlwaysTrue +}; + +// TODO : de-dupe with value-to-std +static arith::CmpIPredicate CmpOpPredicateToCmpIPredicate(accera::ir::value::CmpOpPredicate pred) +{ +#define MAP_PREDICATE(v1, v2) \ + case accera::ir::value::CmpOpPredicate::v1: \ + return arith::CmpIPredicate::v2 + + switch (pred) { - rangeValue = &getAnalysis(); - - // now we use them to classify the comparison operation - auto ctx = &getContext(); - OpBuilder builder(ctx); - Type i1Ty = builder.getI1Type(); - getOperation()->walk([&](arith::CmpIOp op) { - auto classification = classifyCmpIOp(op); - if (classification != CmpIOpClassification::Unknown) - { - builder.setInsertionPoint(op); - Value val = builder.create(op->getLoc(), i1Ty, builder.getBoolAttr(classification == CmpIOpClassification::AlwaysTrue)); - op.replaceAllUsesWith(val); - op.erase(); - } - }); + MAP_PREDICATE(EQ, eq); + MAP_PREDICATE(GE, sge); + MAP_PREDICATE(GT, sgt); + MAP_PREDICATE(LE, sle); + MAP_PREDICATE(LT, slt); + MAP_PREDICATE(NE, ne); + default: + assert(false); + } + +#undef MAP_PREDICATE +} + +CmpIOpClassification classifyCmpIOp(RangeValueAnalysis& rangeValue, arith::CmpIOp op) +{ + auto predicate = op.getPredicate(); + auto lhs = op.getLhs(); + auto rhs = op.getRhs(); + if (!rangeValue.hasRange(lhs) || !rangeValue.hasRange(rhs)) + { + return CmpIOpClassification::Unknown; + } + auto lhsRange = rangeValue.getRange(lhs); + auto rhsRange = rangeValue.getRange(rhs); + if (lhsRange.isFullSet() || rhsRange.isFullSet()) + { + return CmpIOpClassification::Unknown; + } + + switch (predicate) + { + case arith::CmpIPredicate::slt: + if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLT, rhsRange)) + { + return CmpIOpClassification::AlwaysTrue; + } + else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGE, rhsRange)) + { + return CmpIOpClassification::AlwaysFalse; + } + break; + case arith::CmpIPredicate::sle: + if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLE, rhsRange)) + { + return CmpIOpClassification::AlwaysTrue; + } + else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGT, rhsRange)) + { + return CmpIOpClassification::AlwaysFalse; + } + break; + case arith::CmpIPredicate::sgt: + if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGT, rhsRange)) + { + return CmpIOpClassification::AlwaysTrue; + } + else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLE, rhsRange)) + { + return CmpIOpClassification::AlwaysFalse; + } + break; + case arith::CmpIPredicate::sge: + if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGE, rhsRange)) + { + return CmpIOpClassification::AlwaysTrue; + } + else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLT, rhsRange)) + { + return CmpIOpClassification::AlwaysFalse; + } + break; + default: + break; + } + + return CmpIOpClassification::Unknown; +} + +std::optional GetConstantCmpIOpResult(arith::CmpIOp cmpIOp) +{ + RangeValueAnalysis rangeValueAnalysis(cmpIOp); + auto classification = classifyCmpIOp(rangeValueAnalysis, cmpIOp); + if (classification != CmpIOpClassification::Unknown) + { + 
return classification == CmpIOpClassification::AlwaysTrue; + } + return std::nullopt; +} + +LogicalResult RewriteConstantCmpIOpCommon(PatternRewriter& rewriter, arith::CmpIOp cmpIOp, mlir::Operation* opToReplace = nullptr) +{ + if (!opToReplace) + { + opToReplace = cmpIOp; + } + + auto constantCmpIOpResultOpt = GetConstantCmpIOpResult(cmpIOp); + + if (constantCmpIOpResultOpt.has_value()) + { + Type i1Ty = rewriter.getI1Type(); + rewriter.replaceOpWithNewOp(opToReplace, i1Ty, rewriter.getBoolAttr(*constantCmpIOpResultOpt)); + return mlir::success(); + } + return mlir::failure(); +} + +struct ConstantCmpIOpRewrite : public mlir::OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + LogicalResult matchAndRewrite(arith::CmpIOp op, PatternRewriter& rewriter) const final + { + return RewriteConstantCmpIOpCommon(rewriter, op); } +}; -private: - enum CmpIOpClassification : int +struct ConstantAcceraCmpOpRewrite : public mlir::OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + LogicalResult matchAndRewrite(accera::ir::value::CmpOp op, PatternRewriter& rewriter) const final { - Unknown, - AlwaysFalse, - AlwaysTrue - }; + std::stack tempOps; + TempOpCleanupGuard guard(&tempOps, rewriter); - CmpIOpClassification classifyCmpIOp(arith::CmpIOp op) + // TODO : de-dupe with value-to-std conversion + auto lhs = op.lhs(); + auto rhs = op.rhs(); + + auto pred = op.getPredicate(); + if (util::GetElementType(lhs.getType()).isa()) + { + // Doesn't support CmpFOp classification currently + return failure(); + } + auto stdCmpIOp = rewriter.create(op.getLoc(), CmpOpPredicateToCmpIPredicate(pred), lhs, rhs); + tempOps.push(stdCmpIOp.getOperation()); + + return RewriteConstantCmpIOpCommon(rewriter, stdCmpIOp, op); + } +}; + +struct ConstantAcceraMaxMinOpRewrite : public mlir::OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + LogicalResult matchAndRewrite(BinOp op, PatternRewriter& rewriter) const final { + // If the Bin op is a max or a min, then check if it is always equal to one of its operands + // i.e. 
if we have z = max(x, y), and x <= y always, then replace max(x, y) with y + // To do this, check: + // (x <= y), and + // (x >= y) + // If the former is always true, then replace max(x, y) with y, min(x, y) with x + // If the latter is always true, then replace max(x, y) with x, min(x, y) with y + // If neither are always true, then don't replace the max or min op + // We have to check both to handle the case where a '<' or '>' check doesn't capture that the point where they are equal doesn't change which operand is the replacement value of the max/min and to avoid an operand ordering bias + auto predicate = op.getPredicate(); - auto lhs = op.getLhs(); - auto rhs = op.getRhs(); - if (!rangeValue->hasRange(lhs) || !rangeValue->hasRange(rhs)) + if (predicate != BinaryOpPredicate::MAX && predicate != BinaryOpPredicate::MIN) { - return CmpIOpClassification::Unknown; + return failure(); } - auto lhsRange = rangeValue->getRange(lhs); - auto rhsRange = rangeValue->getRange(rhs); - if (lhsRange.isFullSet() || rhsRange.isFullSet()) + std::stack tempOps; + TempOpCleanupGuard guard(&tempOps, rewriter); + + auto lhs = op.lhs(); + auto rhs = op.rhs(); + + if (util::GetElementType(lhs.getType()).isa()) { - return CmpIOpClassification::Unknown; + // Doesn't support CmpFOp classification currently + return failure(); } + auto LEQCmpIOp = rewriter.create(op.getLoc(), arith::CmpIPredicate::sle, lhs, rhs); + tempOps.push(LEQCmpIOp.getOperation()); + auto LEQconstantResultOpt = GetConstantCmpIOpResult(LEQCmpIOp); - switch (predicate) + if (LEQconstantResultOpt.has_value() && *LEQconstantResultOpt) { - case arith::CmpIPredicate::slt: - if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLT, rhsRange)) - { - return CmpIOpClassification::AlwaysTrue; - } - else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGE, rhsRange)) - { - return CmpIOpClassification::AlwaysFalse; - } - break; - case arith::CmpIPredicate::sle: - if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLE, rhsRange)) - { - return CmpIOpClassification::AlwaysTrue; - } - else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGT, rhsRange)) - { - return CmpIOpClassification::AlwaysFalse; - } - break; - case arith::CmpIPredicate::sgt: - if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGT, rhsRange)) + if (predicate == BinaryOpPredicate::MAX) { - return CmpIOpClassification::AlwaysTrue; + rewriter.replaceOp(op, mlir::ValueRange{ rhs }); } - else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLE, rhsRange)) + else { - return CmpIOpClassification::AlwaysFalse; + rewriter.replaceOp(op, mlir::ValueRange{ lhs }); } - break; - case arith::CmpIPredicate::sge: - if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGE, rhsRange)) + return success(); + } + + auto GEQCmpIOp = rewriter.create(op.getLoc(), arith::CmpIPredicate::sge, lhs, rhs); + tempOps.push(GEQCmpIOp.getOperation()); + auto GEQconstantResultOpt = GetConstantCmpIOpResult(GEQCmpIOp); + + if (GEQconstantResultOpt.has_value() && *GEQconstantResultOpt) + { + if (predicate == BinaryOpPredicate::MAX) { - return CmpIOpClassification::AlwaysTrue; + rewriter.replaceOp(op, mlir::ValueRange{ lhs }); } - else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLT, rhsRange)) + else { - return CmpIOpClassification::AlwaysFalse; + rewriter.replaceOp(op, mlir::ValueRange{ rhs }); } - break; - default: - break; + return success(); } + return failure(); + } +}; - return CmpIOpClassification::Unknown; + +struct RangeValueOptimizePass : public ConvertRangeValueOptimizeBase +{ + void runOnOperation() final + { + auto context = &getContext(); + auto operation = 
getOperation(); + + mlir::GreedyRewriteConfig topDownConfig; // Handle outer simplifications first as they will resolve to constants need for inner simplifications + topDownConfig.useTopDownTraversal = true; + + mlir::RewritePatternSet patterns(context); + accera::transforms::value::populateRangeValueOptimizePatterns(patterns); + util::FillCanonicalPatternsRecursively(operation, patterns); + (void)applyPatternsAndFoldGreedily(operation, std::move(patterns), topDownConfig); } - RangeValueAnalysis* rangeValue = nullptr; }; } // namespace @@ -157,6 +307,13 @@ struct RangeValueOptimizePass : public ConvertRangeValueOptimizeBase(patterns.getContext()); +} + std::unique_ptr createRangeValueOptimizePass() { return std::make_unique(); diff --git a/accera/transforms/src/value/ValueFuncToTargetPass.cpp b/accera/transforms/src/value/ValueFuncToTargetPass.cpp index 530f57e8..b7e3953b 100644 --- a/accera/transforms/src/value/ValueFuncToTargetPass.cpp +++ b/accera/transforms/src/value/ValueFuncToTargetPass.cpp @@ -220,7 +220,8 @@ struct ValueLambdaRewritePattern : mlir::OpRewritePattern // gpu functions fail since hiprtc does not call the host launcher function // but instead calls the kernel directly. llvm::SetVector capturedValuesSet; - for (auto&& v : op->getParentOfType().getArguments()) + auto parentFuncOp = op->getParentOfType(); + for (auto&& v : parentFuncOp.getArguments()) { capturedValuesSet.insert(v); } @@ -306,6 +307,11 @@ struct ValueLambdaRewritePattern : mlir::OpRewritePattern mapValueTypeAttr(vFuncOp, valueMapper); + if (parentFuncOp->hasAttr(ir::NoInlineIntoAttrName)) + { + vFuncOp->setAttr(ir::NoInlineIntoAttrName, rewriter.getUnitAttr()); + } + rewriter.eraseOp(op); } }; @@ -324,6 +330,13 @@ struct ValueLaunchFuncOpInlinerPattern : OpRewritePattern // Don't inline calls from RawPointerAPI functions return failure(); } + if (parentFnOp->getAttr(ir::NoInlineIntoAttrName)) + { + // If this launch op is inside of a function that is not inlinable-into, then don't inline the function we're calling + // By doing this, only the outer publically-visible function will have its internal calls inlined and we won't + // wind up bloating our module with function contents that will never be invoked + return failure(); + } if (auto attr = parentFnOp->getAttrOfType(vir::ValueFuncOp::getExecTargetAttrName()); attr && target == attr) diff --git a/accera/transforms/src/value/ValueSimplifyPass.cpp b/accera/transforms/src/value/ValueSimplifyPass.cpp index d9ef2dfc..b72d80a8 100644 --- a/accera/transforms/src/value/ValueSimplifyPass.cpp +++ b/accera/transforms/src/value/ValueSimplifyPass.cpp @@ -448,7 +448,7 @@ struct IndexCombinationBinOpLowering : public OpRewritePattern combinationExpr = lhsExpr % rhsExpr; break; default: - assert(false); + return failure(); } auto map = mlir::AffineMap::get(nextDimIdx, 0, combinationExpr); rewriter.replaceOpWithNewOp(op, map, exprInputs); diff --git a/accera/transforms/src/value/ValueToLLVMLoweringPass.cpp b/accera/transforms/src/value/ValueToLLVMLoweringPass.cpp index ec3e28f8..00df5544 100644 --- a/accera/transforms/src/value/ValueToLLVMLoweringPass.cpp +++ b/accera/transforms/src/value/ValueToLLVMLoweringPass.cpp @@ -8,6 +8,7 @@ #include #include +#include #include #include #include @@ -555,6 +556,89 @@ struct MemrefAllocOpLowering : public ConvertOpToLLVMPattern } }; +// TODO : de-dupe these lowerings, all 2-arg-1-result vector intrinsics appear to have the same lowering +struct VpmaddwdOpLowering : public ValueLLVMOpConversionPattern +{ + using 
ValueLLVMOpConversionPattern::ValueLLVMOpConversionPattern; + + LogicalResult matchAndRewrite( + vpmaddwd op, + OpAdaptor adaptor, + ConversionPatternRewriter& rewriter) const override + { + LLVMTypeConverter llvmTypeConverter(rewriter.getContext()); + auto outputVecType = op.getType().cast(); + auto outputVecLLVMType = llvmTypeConverter.convertType(outputVecType); + rewriter.replaceOpWithNewOp(op, outputVecLLVMType, op.lhs(), op.rhs()); + return success(); + } +}; + +struct VmaxpsOpLowering : public ValueLLVMOpConversionPattern +{ + using ValueLLVMOpConversionPattern::ValueLLVMOpConversionPattern; + + LogicalResult matchAndRewrite( + vmaxps op, + OpAdaptor adaptor, + ConversionPatternRewriter& rewriter) const override + { + LLVMTypeConverter llvmTypeConverter(rewriter.getContext()); + auto outputVecType = op.getType().cast(); + auto outputVecLLVMType = llvmTypeConverter.convertType(outputVecType); + rewriter.replaceOpWithNewOp(op, outputVecLLVMType, op.lhs(), op.rhs()); + return success(); + } +}; + +struct VminpsOpLowering : public ValueLLVMOpConversionPattern +{ + using ValueLLVMOpConversionPattern::ValueLLVMOpConversionPattern; + + LogicalResult matchAndRewrite( + vminps op, + OpAdaptor adaptor, + ConversionPatternRewriter& rewriter) const override + { + LLVMTypeConverter llvmTypeConverter(rewriter.getContext()); + auto outputVecType = op.getType().cast(); + auto outputVecLLVMType = llvmTypeConverter.convertType(outputVecType); + rewriter.replaceOpWithNewOp(op, outputVecLLVMType, op.lhs(), op.rhs()); + return success(); + } +}; + +struct RoundOpLowering : public ValueLLVMOpConversionPattern +{ + using ValueLLVMOpConversionPattern::ValueLLVMOpConversionPattern; + + LogicalResult matchAndRewrite( + RoundOp op, + OpAdaptor adaptor, + ConversionPatternRewriter& rewriter) const override + { + LLVMTypeConverter llvmTypeConverter(rewriter.getContext()); + auto outputType = llvmTypeConverter.convertType(op.getType()); + + auto inputType = op.val().getType(); + if (inputType.isa()) + { + rewriter.replaceOpWithNewOp(op, outputType, op.val()); + } + else + { + mlir::Value roundedFPVal = rewriter.create(op.getLoc(), op.val()); + + // Create arithmetic dialect cast ops with the expectation that other arithmetic dialect ops are getting lowered as part of this pass + auto signlessOutputType = util::ToSignlessMLIRType(rewriter, op.getType()); + mlir::Value roundedSIVal = rewriter.create(op.getLoc(), roundedFPVal, signlessOutputType); + rewriter.replaceOpWithNewOp(op, op.getType(), roundedSIVal); + } + return success(); + } +}; + + struct ValueToLLVMLoweringPass : public ConvertValueToLLVMBase { ValueToLLVMLoweringPass(bool useBarePtrCallConv, bool emitCWrappers, unsigned indexBitwidth, bool useAlignedAlloc, llvm::DataLayout dataLayout, const IntraPassSnapshotOptions& snapshotteroptions = {}) : @@ -1281,6 +1365,7 @@ void ValueToLLVMLoweringPass::runOnModule() snapshotter.Snapshot("Initial", moduleOp); target.addLegalOp(); + target.addLegalDialect(); // Set pass parameter values with command line options inherited from ConvertValueToLLVMBase mlir::LowerToLLVMOptions options(&getContext()); @@ -1328,16 +1413,28 @@ void ValueToLLVMLoweringPass::runOnModule() snapshotter.Snapshot("BarePtrConversion", moduleOp); { + auto intermediateTarget = target; + intermediateTarget.addLegalDialect(); + intermediateTarget.addLegalDialect(); + RewritePatternSet patterns(&getContext()); populateValueToLLVMPatterns(llvmTypeConverter, patterns); populateLinalgToLLVMConversionPatterns(llvmTypeConverter, patterns); 
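For reference, the `vpmaddwd` op lowered above (and the odd/even shuffle + sign-extend + multiply + add sequence emitted by the int16 matmul rewrite when the intrinsic path is not taken) produces one 32-bit lane from each adjacent pair of 16-bit lanes. A minimal scalar reference model for the 16-to-8 lane case used above (plain C++ for illustration only; `PmaddwdReference` is a hypothetical name, not an Accera or LLVM API):
```
// out[j] = (int32)a[2j] * (int32)b[2j] + (int32)a[2j+1] * (int32)b[2j+1]
#include <array>
#include <cstdint>

std::array<int32_t, 8> PmaddwdReference(const std::array<int16_t, 16>& a,
                                         const std::array<int16_t, 16>& b)
{
    std::array<int32_t, 8> out{};
    for (int j = 0; j < 8; ++j)
    {
        // Sign-extend each int16 operand to int32 before multiplying, then add the adjacent pair
        out[j] = static_cast<int32_t>(a[2 * j]) * static_cast<int32_t>(b[2 * j]) +
                 static_cast<int32_t>(a[2 * j + 1]) * static_cast<int32_t>(b[2 * j + 1]);
    }
    return out;
}
```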
populateVectorToLLVMConversionPatterns(llvmTypeConverter, patterns, /*reassociateFPReductions*/ true); + + // Subset of LowerVectorToLLVMPass patterns + vector::populateVectorToVectorCanonicalizationPatterns(patterns); + vector::populateVectorBroadcastLoweringPatterns(patterns); + vector::populateVectorMaskOpLoweringPatterns(patterns); + vector::populateVectorShapeCastLoweringPatterns(patterns); + vector::populateVectorTransposeLoweringPatterns(patterns); + vector::populateVectorTransferLoweringPatterns(patterns, /*maxTransferRank=*/1); vector::populateVectorContractLoweringPatterns(patterns, vector::VectorTransformsOptions{}.setVectorTransferSplit(mlir::vector::VectorTransferSplit::VectorTransfer)); vector::populateVectorMaskMaterializationPatterns(patterns, true); - if (failed(applyPartialConversion(moduleOp, target, std::move(patterns)))) + if (failed(applyPartialConversion(moduleOp, intermediateTarget, std::move(patterns)))) { signalPassFailure(); } @@ -1353,6 +1450,15 @@ void ValueToLLVMLoweringPass::runOnModule() populateMemRefToLLVMConversionPatterns(llvmTypeConverter, patterns); populateStdToLLVMConversionPatterns(llvmTypeConverter, patterns); arith::populateArithmeticToLLVMConversionPatterns(llvmTypeConverter, patterns); + arith::populateArithmeticExpandOpsPatterns(patterns); + + // Subset of LowerVectorToLLVMPass patterns + vector::populateVectorToVectorCanonicalizationPatterns(patterns); + vector::populateVectorBroadcastLoweringPatterns(patterns); + vector::populateVectorMaskOpLoweringPatterns(patterns); + vector::populateVectorShapeCastLoweringPatterns(patterns); + vector::populateVectorTransposeLoweringPatterns(patterns); + vector::populateVectorTransferLoweringPatterns(patterns, /*maxTransferRank=*/1); populateVectorToLLVMConversionPatterns(llvmTypeConverter, patterns, /*reassociateFPReductions*/ true); vector::populateVectorContractLoweringPatterns(patterns, vector::VectorTransformsOptions{}.setVectorTransferSplit(mlir::vector::VectorTransferSplit::VectorTransfer)); @@ -1413,6 +1519,10 @@ void populateLocalValueToLLVMPatterns(mlir::LLVMTypeConverter& typeConverter, ml PrintFOpLowering, GetTimeOpLowering, RangeOpLowering, + VpmaddwdOpLowering, + VmaxpsOpLowering, + VminpsOpLowering, + RoundOpLowering, MemrefAllocOpLowering>(typeConverter, context); } diff --git a/accera/transforms/src/value/ValueToStandardLoweringPass.cpp b/accera/transforms/src/value/ValueToStandardLoweringPass.cpp index 70ab243e..12da3410 100644 --- a/accera/transforms/src/value/ValueToStandardLoweringPass.cpp +++ b/accera/transforms/src/value/ValueToStandardLoweringPass.cpp @@ -472,23 +472,40 @@ struct AllocOpLowering : public OpRewritePattern auto memrefType = op.getType(); auto allocType = op.allocType().getValueOr(vir::MemoryAllocType::Global); + OpBuilder::InsertionGuard guard(rewriter); + auto parentFuncOp = op->getParentOfType(); + mlir::memref::AllocOp allocOp; + mlir::Block* parentBlock; + mlir::Value allocatedMemref; switch (allocType) { case vir::MemoryAllocType::Global: { - if (memrefType.getNumDynamicDims() == 0) - { - auto globalOp = irutil::CreateGlobalBufferOp(rewriter, op, MemRefType::Builder{ memrefType }.setLayout({}), kGlobalOpSymNameFormat); - rewriter.replaceOpWithNewOp(op, memrefType, globalOp.sym_name()); - } - else - { - rewriter.replaceOpWithNewOp(op, memrefType, op.getOperation()->getOperands(), op.alignmentAttr()); - } + if (memrefType.getNumDynamicDims() == 0) + { + auto globalOp = irutil::CreateGlobalBufferOp(rewriter, op, MemRefType::Builder{ memrefType }.setLayout({}), 
kGlobalOpSymNameFormat); + rewriter.replaceOpWithNewOp(op, memrefType, globalOp.sym_name()); } - break; + else + { + rewriter.replaceOpWithNewOp(op, memrefType, op.getOperation()->getOperands(), op.alignmentAttr()); + } + } + break; case vir::MemoryAllocType::Stack: + // Create the stack allocation at the beginning of the function + rewriter.setInsertionPointToStart(&parentFuncOp.front()); rewriter.replaceOpWithNewOp(op, MemRefType::Builder{ memrefType }.setLayout({}), mlir::ValueRange{}, op.alignmentAttr()); break; + case vir::MemoryAllocType::Heap: + allocOp = rewriter.replaceOpWithNewOp(op, memrefType, op.getOperation()->getOperands(), op.alignmentAttr()); + + // Create a dealloc op at the end of the block containing this alloc op + parentBlock = allocOp->getBlock(); + rewriter.setInsertionPoint(parentBlock->getTerminator()); + + allocatedMemref = allocOp.getResult(); + rewriter.create(allocOp.getLoc(), allocatedMemref); + break; default: llvm_unreachable("Unknown alloc type"); } @@ -506,19 +523,19 @@ struct AllocOpLowering : public OpRewritePattern using ValueCastOp = vir::CastOp; struct CastOpLowering : public OpRewritePattern { -#define CAST_FROM_TO_WITH_OP_IF(testFromType, testToType, castOp, conditional) \ - if (fromType && toType && fromType.isa() && toType.isa() && conditional) \ - { \ - mlir::Value castValue = rewriter.create(op.getLoc(), signlessFromValue, signlessToType); \ - if (toType.isIntOrIndex()) \ - { \ - rewriter.replaceOpWithNewOp(op, toType, castValue); \ - } \ - else \ - { \ - rewriter.replaceOp(op, { castValue } ); \ - } \ - return success(); \ +#define CAST_FROM_TO_WITH_OP_IF(testFromType, testToType, castOp, conditional) \ + if (fromType && toType && fromElementType.isa() && toElementType.isa() && conditional) \ + { \ + mlir::Value castValue = rewriter.create(op.getLoc(), signlessFromValue, signlessToType); \ + if (toType.isIntOrIndex()) \ + { \ + rewriter.replaceOpWithNewOp(op, toType, castValue); \ + } \ + else \ + { \ + rewriter.replaceOp(op, { castValue }); \ + } \ + return success(); \ } #define CAST_FROM_TO_WITH_OP(testFromType, testToType, castOp) CAST_FROM_TO_WITH_OP_IF(testFromType, testToType, castOp, true); @@ -532,10 +549,17 @@ struct CastOpLowering : public OpRewritePattern auto fromType = op.source().getType(); auto toType = op.result().getType(); - assert(fromType.isIntOrIndexOrFloat() && "Can only cast from an int, index, or float type"); - assert(toType.isIntOrIndexOrFloat() && "Can only cast to an int, index, or float type"); + auto isFromTypeVector = fromType.isa(); + auto isToTypeVector = toType.isa(); + assert(isFromTypeVector == isToTypeVector && "Can only cast vectors to vectors or scalars to scalars"); + + auto fromElementType = util::GetElementType(fromType); + auto toElementType = util::GetElementType(toType); + + assert(fromElementType.isIntOrIndexOrFloat() && "Can only cast from an int, index, or float type"); + assert(toElementType.isIntOrIndexOrFloat() && "Can only cast to an int, index, or float type"); - if (fromType == toType) + if (fromElementType == toElementType) { // No casting needed rewriter.replaceOp(op, { op.source() }); @@ -545,42 +569,43 @@ struct CastOpLowering : public OpRewritePattern auto signlessFromValue = accera::ir::util::ToSignlessMLIRValue(rewriter, op.source()); auto signlessToType = accera::ir::util::ToSignlessMLIRType(rewriter, toType); - auto unsignedFromType = fromType.isUnsignedInteger(); - auto unsignedToType = toType.isUnsignedInteger(); + auto unsignedFromElementType = 
fromElementType.isUnsignedInteger(); + auto unsignedToElementType = toElementType.isUnsignedInteger(); // Integer casts - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::TruncIOp, (fromType.getIntOrFloatBitWidth() > toType.getIntOrFloatBitWidth())); - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::ExtSIOp, (fromType.getIntOrFloatBitWidth() < toType.getIntOrFloatBitWidth() && !unsignedToType)); - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::ExtUIOp, (fromType.getIntOrFloatBitWidth() < toType.getIntOrFloatBitWidth() && unsignedToType)); - if (fromType.isa() && toType.isa() && (fromType.getIntOrFloatBitWidth() == toType.getIntOrFloatBitWidth())) + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::TruncIOp, (fromElementType.getIntOrFloatBitWidth() > toElementType.getIntOrFloatBitWidth())); + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::ExtSIOp, (fromElementType.getIntOrFloatBitWidth() < toElementType.getIntOrFloatBitWidth() && !unsignedFromElementType)); + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::ExtUIOp, (fromElementType.getIntOrFloatBitWidth() < toElementType.getIntOrFloatBitWidth() && unsignedFromElementType)); + if (fromElementType.isa() && toElementType.isa() && (fromElementType.getIntOrFloatBitWidth() == toElementType.getIntOrFloatBitWidth())) { - rewriter.replaceOpWithNewOp(op, toType, signlessFromValue); + rewriter.replaceOpWithNewOp(op, toElementType, signlessFromValue); return success(); } // Float casts - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::FloatType, mlir::arith::SIToFPOp, (!unsignedFromType)); - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::FloatType, mlir::arith::UIToFPOp, (unsignedFromType)); + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::FloatType, mlir::arith::SIToFPOp, (!unsignedFromElementType)); + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::FloatType, mlir::arith::UIToFPOp, (unsignedFromElementType)); - CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::IntegerType, mlir::arith::FPToSIOp, (!unsignedToType)); - CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::IntegerType, mlir::arith::FPToUIOp, (unsignedToType)); + CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::IntegerType, mlir::arith::FPToSIOp, (!unsignedToElementType)); + CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::IntegerType, mlir::arith::FPToUIOp, (unsignedToElementType)); - CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::FloatType, mlir::arith::TruncFOp, (fromType.getIntOrFloatBitWidth() > toType.getIntOrFloatBitWidth())); - CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::FloatType, mlir::arith::ExtFOp, (fromType.getIntOrFloatBitWidth() < toType.getIntOrFloatBitWidth())); + CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::FloatType, mlir::arith::TruncFOp, (fromElementType.getIntOrFloatBitWidth() > toElementType.getIntOrFloatBitWidth())); + CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::FloatType, mlir::arith::ExtFOp, (fromElementType.getIntOrFloatBitWidth() < toElementType.getIntOrFloatBitWidth())); // Index casts CAST_FROM_TO_WITH_OP(mlir::IntegerType, mlir::IndexType, mlir::arith::IndexCastOp); CAST_FROM_TO_WITH_OP(mlir::IndexType, mlir::IntegerType, mlir::arith::IndexCastOp); - if (fromType.isa() && toType.isa()) + auto i64IntermediateType = accera::ir::util::CloneTypeWithNewElementType(op.source().getType(), rewriter.getI64Type()); + if (fromElementType.isa() && toElementType.isa()) { - auto int64Value = rewriter.create(loc, 
op.source(), rewriter.getI64Type()); // index->int64 - rewriter.replaceOpWithNewOp(op, int64Value, toType); // int64->fp + auto int64Value = rewriter.create(loc, op.source(), i64IntermediateType); // index->int64 + rewriter.replaceOpWithNewOp(op, int64Value, toElementType); // int64->fp return success(); } - if (fromType.isa() && toType.isa()) + if (fromElementType.isa() && toElementType.isa()) { - auto int64Value = rewriter.create(loc, op.source(), rewriter.getI64Type()); // fp->int64 - rewriter.replaceOpWithNewOp(op, int64Value, toType); // int64->index + auto int64Value = rewriter.create(loc, op.source(), i64IntermediateType); // fp->int64 + rewriter.replaceOpWithNewOp(op, int64Value, toElementType); // int64->index return success(); } @@ -948,7 +973,7 @@ struct ValueLaunchFuncOpRewritePattern : OpRewritePattern switch (target) { case vir::ExecutionTarget::CPU: - rewriter.replaceOpWithNewOp(op, callee, ArrayRef{}, ValueRange{ op.operands() }); + rewriter.replaceOpWithNewOp(op, callee, op.getResultTypes(), ValueRange{ op.operands() }); return success(); case vir::ExecutionTarget::GPU: auto gpuSymRef = SymbolRefAttr::get(rewriter.getContext(), callee.str() + "_module", SymbolRefAttr::get(callee)); @@ -1034,6 +1059,10 @@ LogicalResult BinOpLowering::matchAndRewrite( return rewriter.create(loc, ValueRange{ lhs, rhs }, rewriter.getNamedAttr("RelaxedPrecision", rewriter.getUnitAttr())); case BinaryOpPredicate::SUB: return rewriter.create(loc, ValueRange{ lhs, rhs }, rewriter.getNamedAttr("RelaxedPrecision", rewriter.getUnitAttr())); + case BinaryOpPredicate::MAX: + return rewriter.create(loc, ValueRange{ lhs, rhs }, rewriter.getNamedAttr("RelaxedPrecision", rewriter.getUnitAttr())); + case BinaryOpPredicate::MIN: + return rewriter.create(loc, ValueRange{ lhs, rhs }, rewriter.getNamedAttr("RelaxedPrecision", rewriter.getUnitAttr())); default: assert(false); return {}; @@ -1067,6 +1096,32 @@ LogicalResult BinOpLowering::matchAndRewrite( return rewriter.create(loc, lhs, rhs); case BinaryOpPredicate::LOGICAL_OR: return rewriter.create(loc, lhs, rhs); + case BinaryOpPredicate::MAX: + if (lhs == rhs) + { + return lhs; + } + if (elementType.isUnsignedInteger()) + { + return rewriter.create(loc, ValueRange{ lhs, rhs }); + } + else + { + return rewriter.create(loc, ValueRange{ lhs, rhs }); + } + case BinaryOpPredicate::MIN: + if (lhs == rhs) + { + return lhs; + } + if (elementType.isUnsignedInteger()) + { + return rewriter.create(loc, ValueRange{ lhs, rhs }); + } + else + { + return rewriter.create(loc, ValueRange{ lhs, rhs }); + } default: assert(false); return {}; diff --git a/accera/value/include/EmitterContext.h b/accera/value/include/EmitterContext.h index bbd6fc79..1b0846c7 100644 --- a/accera/value/include/EmitterContext.h +++ b/accera/value/include/EmitterContext.h @@ -47,6 +47,8 @@ namespace value None = 0, ThreadLocal = 1 << 0, Stack = 1 << 1, + Heap = 1 << 2, + Global = 1 << 3, }; ACCERA_DEFINE_ENUM_FLAG_OPERATORS(AllocateFlags); @@ -361,6 +363,8 @@ namespace value Scalar Cast(Scalar value, ValueType type); + Scalar Round(Scalar value); + bool IsImplicitlyCastable(ValueType source, ValueType target) const; Scalar Bitcast(Scalar value, ValueType type); @@ -496,6 +500,8 @@ namespace value virtual Scalar CastImpl(Scalar value, ValueType type) = 0; + virtual Scalar RoundImpl(Scalar value) = 0; + virtual bool IsImplicitlyCastableImpl(ValueType source, ValueType target) const = 0; virtual Scalar BitcastImpl(Scalar value, ValueType type) = 0; diff --git a/accera/value/include/FunctionDeclaration.h 
b/accera/value/include/FunctionDeclaration.h index 28b61860..a085b944 100644 --- a/accera/value/include/FunctionDeclaration.h +++ b/accera/value/include/FunctionDeclaration.h @@ -72,6 +72,10 @@ namespace value /// A FunctionInlining value specifying whether this function should be inlined or not FunctionDeclaration& Inlined(FunctionInlining shouldInline = FunctionInlining::always); + /// Sets whether other functions should be inlined into this function + /// A FunctionInlining value specifying whether this function should be inlined or not + FunctionDeclaration& InlineInto(FunctionInlining shouldInlineInto = FunctionInlining::always); + /// Sets the execution target for this function /// A ExecutionTarget value specifying where this function should execute FunctionDeclaration& Target(ExecutionTarget target); @@ -186,6 +190,9 @@ namespace value /// Returns true if the instance is inlined [[nodiscard]] FunctionInlining InlineState() const; + /// Returns true if the instance can be inlined into + [[nodiscard]] FunctionInlining InlineIntoState() const; + [[nodiscard]] ExecutionTarget Target() const { return _execTarget; } [[nodiscard]] ExecutionRuntime Runtime() const { return _execRuntime; } @@ -240,6 +247,7 @@ namespace value ExecutionTarget _execTarget; ExecutionRuntime _execRuntime = ExecutionRuntime::DEFAULT; FunctionInlining _inlineState = FunctionInlining::defaultInline; + FunctionInlining _inlineIntoState = FunctionInlining::defaultInline; bool _isDecorated = true; bool _isPublic = false; bool _isEmpty = true; diff --git a/accera/value/include/MLIREmitterContext.h b/accera/value/include/MLIREmitterContext.h index fc739cb8..700f7723 100644 --- a/accera/value/include/MLIREmitterContext.h +++ b/accera/value/include/MLIREmitterContext.h @@ -176,6 +176,8 @@ namespace value Scalar CastImpl(Scalar value, ValueType type) override; + Scalar RoundImpl(Scalar value) override; + bool IsImplicitlyCastableImpl(ValueType source, ValueType target) const override; Scalar BitcastImpl(Scalar value, ValueType type) override; diff --git a/accera/value/include/Plan.h b/accera/value/include/Plan.h index a821ab35..93d82342 100644 --- a/accera/value/include/Plan.h +++ b/accera/value/include/Plan.h @@ -179,6 +179,8 @@ namespace value /// The policy used to schedule work across the threads. 
void Parallelize(std::vector indices, int64_t numThreads, ParallelizationPolicy policy); + void _EraseLoop(const value::ScalarIndex& index); + private: friend class Schedule; Plan(Schedule& sched, ExecutionRuntime execRuntime = ExecutionRuntime::DEFAULT); diff --git a/accera/value/include/ScalarOperations.h b/accera/value/include/ScalarOperations.h index ed1bfccd..e9607fa3 100644 --- a/accera/value/include/ScalarOperations.h +++ b/accera/value/include/ScalarOperations.h @@ -52,7 +52,8 @@ namespace value Scalar Tanh(Scalar s); Scalar Square(Scalar s); - Scalar Round(Scalar s); // Note: not implemented + Scalar Round(Scalar s); + Scalar Remainderf(Scalar numer, Scalar denom); Scalar Floor(Scalar s); Scalar Ceil(Scalar s); Scalar CopySign(Scalar s1, Scalar s2); // Note: not implemented diff --git a/accera/value/include/ValueType.h b/accera/value/include/ValueType.h index 247a4814..cd7eb0b3 100644 --- a/accera/value/include/ValueType.h +++ b/accera/value/include/ValueType.h @@ -87,8 +87,14 @@ namespace value divide, /// Remainder operation modulus, + /// Logical AND operation logicalAnd, - logicalOr + /// Logical OR operation + logicalOr, + /// Max operation + max, + /// Min operation + min }; enum class ValueLogicalOperation diff --git a/accera/value/src/EmitterContext.cpp b/accera/value/src/EmitterContext.cpp index 8e56ce2f..460b14d2 100644 --- a/accera/value/src/EmitterContext.cpp +++ b/accera/value/src/EmitterContext.cpp @@ -265,6 +265,11 @@ namespace value return CastImpl(value, type); } + Scalar EmitterContext::Round(Scalar value) + { + return RoundImpl(value); + } + bool EmitterContext::IsImplicitlyCastable(ValueType source, ValueType target) const { return IsImplicitlyCastableImpl(source, target); diff --git a/accera/value/src/FunctionDeclaration.cpp b/accera/value/src/FunctionDeclaration.cpp index d7c504e0..571545b8 100644 --- a/accera/value/src/FunctionDeclaration.cpp +++ b/accera/value/src/FunctionDeclaration.cpp @@ -114,6 +114,14 @@ namespace value return *this; } + FunctionDeclaration& FunctionDeclaration::InlineInto(FunctionInlining shouldInlineInto) + { + CheckNonEmpty(); + + _inlineIntoState = shouldInlineInto; + return *this; + } + FunctionDeclaration& FunctionDeclaration::Target(ExecutionTarget target) { CheckNonEmpty(); @@ -303,6 +311,12 @@ namespace value return _inlineState; } + FunctionInlining FunctionDeclaration::InlineIntoState() const + { + CheckNonEmpty(); + return _inlineIntoState; + } + void FunctionDeclaration::CheckNonEmpty() const { if (_isEmpty) diff --git a/accera/value/src/MLIREmitterContext.cpp b/accera/value/src/MLIREmitterContext.cpp index 9914ed06..893e104d 100644 --- a/accera/value/src/MLIREmitterContext.cpp +++ b/accera/value/src/MLIREmitterContext.cpp @@ -150,6 +150,9 @@ mlir::MemRefType MemoryLayoutToMemRefType(mlir::OpBuilder& builder, const Memory // strided maps and memory spaces are not supported for variable-sized layouts auto type = layout.IsVariableSized() ? mlir::MemRefType::get(size, mlirElemType) : mlir::MemRefType::get(size, mlirElemType, stridedMap, (unsigned)layout.GetMemorySpace()); + // Canonicalize and simplify the memref map + type = mlir::canonicalizeStridedLayout(type); + // represent pointers as memrefs of memrefs (memrefs start at pointer level 1) return (pointerLevel > 1) ? 
mlir::MemRefType::get(MemRefPointerShape, type) : type; } @@ -942,6 +945,27 @@ GPUIndex MLIRContext::GetGPUIndex() } } +static accera::ir::value::MemoryAllocType AllocateFlagToAllocateType(accera::value::AllocateFlags flags) +{ +#define MAP_FLAGS(fromFlag, toFlag) \ + case accera::value::AllocateFlags::fromFlag: \ + return accera::ir::value::MemoryAllocType::toFlag + + switch (flags) + { + MAP_FLAGS(None, Global); + MAP_FLAGS(Global, Global); + MAP_FLAGS(Stack, Stack); + MAP_FLAGS(Heap, Heap); + // MAP_FLAGS(ThreadLocal, ThreadLocal); // Not implemented + default: + assert(false); + } + +#undef MAP_PREDICATE +} + + Value MLIRContext::AllocateImpl(ValueType valueType, MemoryLayout layout, size_t alignment, AllocateFlags flags, const std::vector& runtimeSizes) { auto& b = _impl->builder; @@ -975,6 +999,7 @@ Value MLIRContext::AllocateImpl(ValueType valueType, MemoryLayout layout, size_t std::transform(runtimeSizes.cbegin(), runtimeSizes.cend(), std::back_inserter(sizes), [](ScalarDimension d) { return Unwrap(d); }); mlir::Value result; + if (layout.IsVariableSized()) { result = b.create(loc, @@ -982,9 +1007,7 @@ Value MLIRContext::AllocateImpl(ValueType valueType, MemoryLayout layout, size_t alignment ? llvm::Optional{ static_cast(alignment) } : llvm::None, - static_cast(flags & AllocateFlags::Stack) - ? llvm::Optional{ accera::ir::value::MemoryAllocType::Stack } - : llvm::None, + AllocateFlagToAllocateType(flags), mlir::ValueRange{ sizes}); } else @@ -994,9 +1017,7 @@ Value MLIRContext::AllocateImpl(ValueType valueType, MemoryLayout layout, size_t alignment ? llvm::Optional{ static_cast(alignment) } : llvm::None, - static_cast(flags & AllocateFlags::Stack) - ? llvm::Optional{ accera::ir::value::MemoryAllocType::Stack } - : llvm::None); + AllocateFlagToAllocateType(flags)); } EmittableInfo& emittableInfo = StoreLocalEmittable({ result.getAsOpaquePointer(), { valueType, 1 } }); @@ -1146,6 +1167,10 @@ EmitterContext::DefinedFunction MLIRContext::CreateFunctionImpl(FunctionDeclarat { fnOp->setAttr(ir::NoInlineAttrName, b.getUnitAttr()); } + if (decl.InlineIntoState() == FunctionInlining::never) + { + fnOp->setAttr(ir::NoInlineIntoAttrName, b.getUnitAttr()); + } // Set dynamic arg size references. This is a vector>, where each entry is either a reference to another // argument's position or is -1. 
The outer vector has one entry per function argument, and each inner vector has one @@ -2008,22 +2033,22 @@ namespace auto Convert(ValueBinaryOperation op) { using namespace accera::ir::value; + +#define MAP_BIN_OP(fromEnum, toEnum) \ + case ValueBinaryOperation::fromEnum: \ + return BinaryOpPredicate::toEnum + switch (op) { - case ValueBinaryOperation::add: - return BinaryOpPredicate::ADD; - case ValueBinaryOperation::divide: - return BinaryOpPredicate::DIV; - case ValueBinaryOperation::logicalAnd: - return BinaryOpPredicate::LOGICAL_AND; - case ValueBinaryOperation::logicalOr: - return BinaryOpPredicate::LOGICAL_OR; - case ValueBinaryOperation::modulus: - return BinaryOpPredicate::MOD; - case ValueBinaryOperation::multiply: - return BinaryOpPredicate::MUL; - case ValueBinaryOperation::subtract: - return BinaryOpPredicate::SUB; + MAP_BIN_OP(add, ADD); + MAP_BIN_OP(subtract, SUB); + MAP_BIN_OP(multiply, MUL); + MAP_BIN_OP(divide, DIV); + MAP_BIN_OP(modulus, MOD); + MAP_BIN_OP(logicalAnd, LOGICAL_AND); + MAP_BIN_OP(logicalOr, LOGICAL_OR); + MAP_BIN_OP(max, MAX); + MAP_BIN_OP(min, MIN); } llvm_unreachable("Unknown binary operation"); } @@ -2221,6 +2246,20 @@ Scalar MLIRContext::BitcastImpl(Scalar value, ValueType type) throw utilities::InputException(utilities::InputExceptionErrors::invalidArgument, "Can only bitcast between types of the same size"); } +Scalar MLIRContext::RoundImpl(Scalar value) +{ + auto& builder = _impl->builder; + mlir::Value mlirValue = ResolveMLIRScalar(builder, ToMLIRValue(builder, value)); + auto loc = mlirValue.getLoc(); + + auto floatType = mlirValue.getType(); + auto width = floatType.getIntOrFloatBitWidth(); + auto intType = builder.getIntegerType(width); + + mlir::Value roundedVal = builder.create(loc, intType, mlirValue); + return Scalar(Wrap(roundedVal)); +} + namespace { mlir::ValueRange CascadingConditionBuilder( diff --git a/accera/value/src/Plan.cpp b/accera/value/src/Plan.cpp index 2fac6557..f83f8012 100644 --- a/accera/value/src/Plan.cpp +++ b/accera/value/src/Plan.cpp @@ -278,6 +278,14 @@ namespace value } } + void _EraseLoop(const value::ScalarIndex& scalarIndex) + { + auto builder = GetBuilder(); + auto symbolicIndexOp = GetIndexOp(scalarIndex); + auto index = symbolicIndexOp.getValue(); + _scheduleOp.addLoopAttribute(index, builder.getStringAttr("_erase"), builder.getUnitAttr()); + } + private: mlir::OpBuilder& GetBuilder() { @@ -408,6 +416,11 @@ namespace value _impl->Parallelize(indices, numThreads, policy); } + void Plan::_EraseLoop(const value::ScalarIndex& index) + { + _impl->_EraseLoop(index); + } + // // GPUPlan impl // diff --git a/accera/value/src/ScalarOperations.cpp b/accera/value/src/ScalarOperations.cpp index ea4f141e..a3e21220 100644 --- a/accera/value/src/ScalarOperations.cpp +++ b/accera/value/src/ScalarOperations.cpp @@ -10,6 +10,8 @@ #include "Scalar.h" #include "ValueType.h" +#include "ir/include/value/ValueDialect.h" + #include #include #include @@ -158,6 +160,24 @@ namespace value } } + Scalar Round(Scalar s) + { + return GetContext().Round(s); + } + + Scalar Remainderf(Scalar numer, Scalar denom) + { + static auto remainderfFunction = [&]() { + FunctionDeclaration remainderfDecl("remainderf"); + remainderfDecl.External(true) + .Decorated(false) + .Parameters(Value(ValueType::Float, ScalarLayout), Value(ValueType::Float, ScalarLayout)) + .Returns(Value(ValueType::Float, ScalarLayout)); + return GetContext().DeclareExternalFunction(remainderfDecl); + }(); + return Scalar(*remainderfFunction(std::vector{Wrap(UnwrapScalar(numer)), 
Wrap(UnwrapScalar(denom))})); // TODO : fix this Wrap(Unwrap(...)) pattern... it's currently needed to invoke GetElement on a sliced array + } + Scalar Ceil(Scalar s) { return ScalarOpBuilder(s); @@ -200,16 +220,12 @@ namespace value Scalar Max(Scalar s1, Scalar s2) { - std::tie(s1, s2) = Scalar::MakeTypeCompatible(s1, s2); - - return Select(s1 > s2, s1, s2); + return GetContext().BinaryOperation(ValueBinaryOperation::max, s1.GetValue(), s2.GetValue()); } Scalar Min(Scalar s1, Scalar s2) { - std::tie(s1, s2) = Scalar::MakeTypeCompatible(s1, s2); - - return Select(s1 < s2, s1, s2); + return GetContext().BinaryOperation(ValueBinaryOperation::min, s1.GetValue(), s2.GetValue()); } Scalar Clamp(Scalar s, Scalar min, Scalar max) diff --git a/docs/.bumpversion.cfg b/docs/.bumpversion.cfg index b2e58ff6..6c7b9a65 100644 --- a/docs/.bumpversion.cfg +++ b/docs/.bumpversion.cfg @@ -1,5 +1,5 @@ [bumpversion] -current_version = 1.2.12 +current_version = 1.2.13 [bumpversion:glob:**/*.md] search = Version: v{current_version} diff --git a/docs/Case Studies/CONTRIBUTING.md b/docs/Case Studies/CONTRIBUTING.md index 97f42493..29cc7480 100644 --- a/docs/Case Studies/CONTRIBUTING.md +++ b/docs/Case Studies/CONTRIBUTING.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Contributing Guide diff --git a/docs/Case Studies/README.md b/docs/Case Studies/README.md index 3add9995..a2dcce62 100644 --- a/docs/Case Studies/README.md +++ b/docs/Case Studies/README.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Accera Case Studies diff --git a/docs/Install/Building_on_MacOS.md b/docs/Install/Building_on_MacOS.md index d2daaadc..48c5cd91 100644 --- a/docs/Install/Building_on_MacOS.md +++ b/docs/Install/Building_on_MacOS.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on MacOS diff --git a/docs/Install/Building_on_Ubuntu.md b/docs/Install/Building_on_Ubuntu.md index 27c89a13..6b90f3c5 100644 --- a/docs/Install/Building_on_Ubuntu.md +++ b/docs/Install/Building_on_Ubuntu.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on Ubuntu diff --git a/docs/Install/Building_on_Windows.md b/docs/Install/Building_on_Windows.md index 56684d3b..3586c4a4 100644 --- a/docs/Install/Building_on_Windows.md +++ b/docs/Install/Building_on_Windows.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on Windows diff --git a/docs/Install/Installing_Accera_on_MacOS.md b/docs/Install/Installing_Accera_on_MacOS.md index 2dff0957..4c2d700d 100644 --- a/docs/Install/Installing_Accera_on_MacOS.md +++ b/docs/Install/Installing_Accera_on_MacOS.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on MacOS diff --git a/docs/Install/Installing_Accera_on_Ubuntu.md b/docs/Install/Installing_Accera_on_Ubuntu.md index 77654ada..47b042ea 100644 --- a/docs/Install/Installing_Accera_on_Ubuntu.md +++ b/docs/Install/Installing_Accera_on_Ubuntu.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on Ubuntu diff --git a/docs/Install/Installing_Accera_on_Windows.md b/docs/Install/Installing_Accera_on_Windows.md index bcc86673..4e69af93 100644 --- a/docs/Install/Installing_Accera_on_Windows.md +++ b/docs/Install/Installing_Accera_on_Windows.md @@ -1,5 +1,5 
@@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on Windows diff --git a/docs/Install/README.md b/docs/Install/README.md index d4df0e82..ffa790cd 100644 --- a/docs/Install/README.md +++ b/docs/Install/README.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Install from PyPI The quickest way to get up and running is to install the pre-built Python packages: diff --git a/docs/Manual/00 Introduction.md b/docs/Manual/00 Introduction.md index 279b72af..d7826f8b 100644 --- a/docs/Manual/00 Introduction.md +++ b/docs/Manual/00 Introduction.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Introduction Accera is a framework with a Python-based Domain-specific Language (eDSL) that produces optimized compute-intensive code. Accera's primary focus is the optimization of affine and semi-affine nested for-loops for CPU and GPU targets. diff --git a/docs/Manual/01 Arrays and Scalars.md b/docs/Manual/01 Arrays and Scalars.md index b23d4ff6..2273acb7 100644 --- a/docs/Manual/01 Arrays and Scalars.md +++ b/docs/Manual/01 Arrays and Scalars.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 1: Arrays and Scalars diff --git a/docs/Manual/02 Simple Affine Loop Nests.md b/docs/Manual/02 Simple Affine Loop Nests.md index f07dff2b..8e8ae66e 100644 --- a/docs/Manual/02 Simple Affine Loop Nests.md +++ b/docs/Manual/02 Simple Affine Loop Nests.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 2: Simple affine loop nests This section introduces *loop nests* and their different types that are provided in Accera programming model. diff --git a/docs/Manual/03 Schedules.md b/docs/Manual/03 Schedules.md index dddbc7c7..3d34f204 100644 --- a/docs/Manual/03 Schedules.md +++ b/docs/Manual/03 Schedules.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 3: Schedules We begin with `nest` from [Section 2](<02%20Simple%20Affine%20Loop%20Nests.md>) which captures the logic of matrix-matrix multiplication. We use `nest` to create a `Schedule` that controls the execution order of the nest's iterations. Schedules are target-independent in the sense that the same schedule can be used to emit code for multiple target platforms. diff --git a/docs/Manual/04 Fusing.md b/docs/Manual/04 Fusing.md index 65fc9363..c193b768 100644 --- a/docs/Manual/04 Fusing.md +++ b/docs/Manual/04 Fusing.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 4: Fusing With `fuse` operation, multiple schedules can be combined into a single schedule representing the union of the work in the original schedules. These fused schedules can be transformed by any of the transformations presented in [Section 3](<03%20Schedules.md>). diff --git a/docs/Manual/05 Targets.md b/docs/Manual/05 Targets.md index 39bf5630..e61be97b 100644 --- a/docs/Manual/05 Targets.md +++ b/docs/Manual/05 Targets.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 5: Targets Accera is a cross compiler, which means that it can generate executable code for different target platforms. A target is described using the `Target` class. 
Accera already supports many different targets, for example: diff --git a/docs/Manual/06 Plans - Caching.md b/docs/Manual/06 Plans - Caching.md index 8d77d0b2..b0032159 100644 --- a/docs/Manual/06 Plans - Caching.md +++ b/docs/Manual/06 Plans - Caching.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 6: Plans - Caching In the previous sections, we defined the logic and then scheduled its iterations. Now, let's move on to completing the implementation with target-specific options. diff --git a/docs/Manual/07 Plans - Operations and Optimizations.md b/docs/Manual/07 Plans - Operations and Optimizations.md index 51eb82de..53f0f7a7 100644 --- a/docs/Manual/07 Plans - Operations and Optimizations.md +++ b/docs/Manual/07 Plans - Operations and Optimizations.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 7: Plans - Operations and Optimizations We can control target-specific operations and optimizations using a plan. Examples include instruction pipelining, applying SIMD vector instructions, and so on. diff --git a/docs/Manual/08 Deferred Layout of Constant Arrays.md b/docs/Manual/08 Deferred Layout of Constant Arrays.md index ce621bcb..1f050de0 100644 --- a/docs/Manual/08 Deferred Layout of Constant Arrays.md +++ b/docs/Manual/08 Deferred Layout of Constant Arrays.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 8: Deferred layout of constant arrays Let's revisit the memory layout of constant arrays. As explained in [Section 1](<01%20Arrays%20and%20Scalars.md>), the contents of constant arrays are known at compile-time, and these contents are immutable. Accera stores constant arrays in a non-standard memory layout optimized for a particular plan. In some cases, storing multiple copies of each array element may even prove advantageous (e.g., storing a matrix in row-major and column-major layouts). diff --git a/docs/Manual/09 Parameters.md b/docs/Manual/09 Parameters.md index 22325ea1..156eb632 100644 --- a/docs/Manual/09 Parameters.md +++ b/docs/Manual/09 Parameters.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 9: Parameters diff --git a/docs/Manual/10 Packages.md b/docs/Manual/10 Packages.md index be48c0c3..09cab47b 100644 --- a/docs/Manual/10 Packages.md +++ b/docs/Manual/10 Packages.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 10: Building Packages The `Package` class represents a collection of Accera-generated functions. Whenever a package is built, it creates a stand-alone function library that other pieces of software can use. Currently, Accera supports two package formats: HAT and MLIR. 
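For context on the `Remainderf` and `Round` scalar ops introduced in the C++ hunks above: `remainderf` follows the IEEE remainder convention (nearest-integer quotient) rather than `fmod`-style truncation, and round-to-nearest typically resolves ties to even. The patch itself does not pin down the rounding mode of the new round op, so the sketch below only illustrates the corresponding standard-library semantics in Python:

```python
import math

# IEEE remainder rounds the quotient to the nearest integer, so the result can
# be negative even for positive operands; fmod truncates the quotient instead.
print(math.remainder(5.5, 2.0))  # -0.5  (5.5 - 3 * 2.0)
print(math.fmod(5.5, 2.0))       #  1.5  (5.5 - 2 * 2.0)

# Round-to-nearest with ties-to-even, versus plain truncation toward zero.
print(round(2.5), round(3.5))    # 2 4
print(int(2.5))                  # 2
```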
diff --git a/docs/Manual/README.md b/docs/Manual/README.md index 8bbe69fc..88ba69a2 100644 --- a/docs/Manual/README.md +++ b/docs/Manual/README.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Accera v1.2.1 Manual diff --git a/docs/Reference/accera.md b/docs/Reference/accera.md index 42bc0c84..59673bea 100644 --- a/docs/Reference/accera.md +++ b/docs/Reference/accera.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference # Module functions * [`accera.cast`](functions/cast.md) `(value, type)` diff --git a/docs/Reference/classes/Array/Array.md b/docs/Reference/classes/Array/Array.md index de9fe5b1..b8eedeed 100644 --- a/docs/Reference/classes/Array/Array.md +++ b/docs/Reference/classes/Array/Array.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array(role[, data, element_type, layout, offset, shape])` Constructs an array. diff --git a/docs/Reference/classes/Array/Layout.md b/docs/Reference/classes/Array/Layout.md index adea1bea..47182bf9 100644 --- a/docs/Reference/classes/Array/Layout.md +++ b/docs/Reference/classes/Array/Layout.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array.Layout` type | description diff --git a/docs/Reference/classes/Array/Role.md b/docs/Reference/classes/Array/Role.md index 0bbe145d..0d654abb 100644 --- a/docs/Reference/classes/Array/Role.md +++ b/docs/Reference/classes/Array/Role.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array.Role` type | description diff --git a/docs/Reference/classes/Array/deferred_layout.md b/docs/Reference/classes/Array/deferred_layout.md index 20395107..e831d3d1 100644 --- a/docs/Reference/classes/Array/deferred_layout.md +++ b/docs/Reference/classes/Array/deferred_layout.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array.deferred_layout(cache)` Specifies the layout for a `Array.Role.CONST` array based on a `Cache`. For more details, see [Deferred layout of constant arrays](<../../../Manual/08%20Deferred%20Layout%20of%20Constant%20Arrays.md>) diff --git a/docs/Reference/classes/Array/sub_array.md b/docs/Reference/classes/Array/sub_array.md index 9b75a4d6..73f40fe3 100644 --- a/docs/Reference/classes/Array/sub_array.md +++ b/docs/Reference/classes/Array/sub_array.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array.sub_array(offsets, shape[, strides])` Creates a sub-array of a specific shape from an array. The sub-array is created from elements at specified offsets and strides into the original array. 
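A small illustration of the `accera.Array` roles referenced in the hunks above; the shapes, the NumPy initializer, and the variable names are illustrative only:

```python
import accera as acc
import numpy as np

# A compile-time constant array (contents fixed when the package is built)
# and a runtime input/output buffer of the same shape.
weights = acc.Array(role=acc.Array.Role.CONST,
                    data=np.ones((8, 8), dtype=np.float32))
buffer = acc.Array(role=acc.Array.Role.INPUT_OUTPUT,
                   element_type=acc.ScalarType.float32,
                   shape=(8, 8))
```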
diff --git a/docs/Reference/classes/Dimension/Dimension.md b/docs/Reference/classes/Dimension/Dimension.md index 28e878f8..1b98fbdd 100644 --- a/docs/Reference/classes/Dimension/Dimension.md +++ b/docs/Reference/classes/Dimension/Dimension.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Dimension([role, value])` Constructs a runtime dimension size with optional initialization. diff --git a/docs/Reference/classes/Dimension/Role.md b/docs/Reference/classes/Dimension/Role.md index 62a447b1..7f0bdc85 100644 --- a/docs/Reference/classes/Dimension/Role.md +++ b/docs/Reference/classes/Dimension/Role.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Dimension.Role` type | description diff --git a/docs/Reference/classes/Nest/Nest.md b/docs/Reference/classes/Nest/Nest.md index 3508a587..89c0ea1d 100644 --- a/docs/Reference/classes/Nest/Nest.md +++ b/docs/Reference/classes/Nest/Nest.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest(shape)` Creates an affine loop nest. diff --git a/docs/Reference/classes/Nest/create_plan.md b/docs/Reference/classes/Nest/create_plan.md index 87529ae5..a04c0931 100644 --- a/docs/Reference/classes/Nest/create_plan.md +++ b/docs/Reference/classes/Nest/create_plan.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest.create_plan([target])` Creates a plan using the default schedule for the nest. diff --git a/docs/Reference/classes/Nest/create_schedule.md b/docs/Reference/classes/Nest/create_schedule.md index e73cd2b8..0eb71a7c 100644 --- a/docs/Reference/classes/Nest/create_schedule.md +++ b/docs/Reference/classes/Nest/create_schedule.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest.create_schedule()` Create a default schedule for a nest. diff --git a/docs/Reference/classes/Nest/get_indices.md b/docs/Reference/classes/Nest/get_indices.md index bc884f72..0ef8b2dc 100644 --- a/docs/Reference/classes/Nest/get_indices.md +++ b/docs/Reference/classes/Nest/get_indices.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest.get_indices()` Gets the iteration space dimensions for a nest. diff --git a/docs/Reference/classes/Nest/iteration_logic.md b/docs/Reference/classes/Nest/iteration_logic.md index 32a9b184..b318e05c 100644 --- a/docs/Reference/classes/Nest/iteration_logic.md +++ b/docs/Reference/classes/Nest/iteration_logic.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.120) +[//]: # (Version: v1.2.130) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest.iteration_logic(logic)` Adds an iteration logic function to a `Nest`. 
diff --git a/docs/Reference/classes/Package/Format.md b/docs/Reference/classes/Package/Format.md index 008fd9cf..4ffd1ba7 100644 --- a/docs/Reference/classes/Package/Format.md +++ b/docs/Reference/classes/Package/Format.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.Format` type | description diff --git a/docs/Reference/classes/Package/Mode.md b/docs/Reference/classes/Package/Mode.md index 9845cab3..8c5aa194 100644 --- a/docs/Reference/classes/Package/Mode.md +++ b/docs/Reference/classes/Package/Mode.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.Mode` type | description diff --git a/docs/Reference/classes/Package/Package.md b/docs/Reference/classes/Package/Package.md index 2dd36a3a..cde07921 100644 --- a/docs/Reference/classes/Package/Package.md +++ b/docs/Reference/classes/Package/Package.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.Package()` A package of functions that can be built and linked with client code. diff --git a/docs/Reference/classes/Package/Platform.md b/docs/Reference/classes/Package/Platform.md index 0dd6664f..8680cb0f 100644 --- a/docs/Reference/classes/Package/Platform.md +++ b/docs/Reference/classes/Package/Platform.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.Platform` type | description diff --git a/docs/Reference/classes/Package/add.md b/docs/Reference/classes/Package/add.md index 1c65a6d1..557c9b7a 100644 --- a/docs/Reference/classes/Package/add.md +++ b/docs/Reference/classes/Package/add.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.add(source, args[, base_name, parameters])` Adds one or more functions to the package. diff --git a/docs/Reference/classes/Package/add_description.md b/docs/Reference/classes/Package/add_description.md index b252f41d..7af2aea8 100644 --- a/docs/Reference/classes/Package/add_description.md +++ b/docs/Reference/classes/Package/add_description.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.add_description([author, license, other, version])` Adds descriptive metadata to the HAT package. diff --git a/docs/Reference/classes/Package/build.md b/docs/Reference/classes/Package/build.md index 69d41b7f..441bfd7b 100644 --- a/docs/Reference/classes/Package/build.md +++ b/docs/Reference/classes/Package/build.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.build(name[, format, mode, platform, tolerance, output_dir])` Builds a HAT package. 
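Pulling the `Nest`, `Array`, and `Package` pieces documented above into one place, here is a minimal hedged sketch of defining a function and building a HAT package (names and shapes are made up; the default build format is used):

```python
import accera as acc

A = acc.Array(role=acc.Array.Role.INPUT,
              element_type=acc.ScalarType.float32, shape=(16, 16))
B = acc.Array(role=acc.Array.Role.INPUT_OUTPUT,
              element_type=acc.ScalarType.float32, shape=(16, 16))

# A 16x16 loop nest that accumulates A into B.
nest = acc.Nest(shape=(16, 16))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    B[i, j] += A[i, j]

package = acc.Package()
package.add(nest, args=(A, B), base_name="accumulate_16x16")
package.build(name="hello_accera")
```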
diff --git a/docs/Reference/classes/Plan/bind.md b/docs/Reference/classes/Plan/bind.md index dd78611f..b6198a27 100644 --- a/docs/Reference/classes/Plan/bind.md +++ b/docs/Reference/classes/Plan/bind.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.bind(mapping)` Only available for targets that can execute a grid of work (such as GPUs). The `bind` function binds dimensions of the iteration space to axes of the target-specific grid (such as `v100.GridUnit.BLOCK_X`, `v100.GridUnit.THREAD_X` or `v100.GridUnit.WARP_X` on an Nvidia GPU). diff --git a/docs/Reference/classes/Plan/cache.md b/docs/Reference/classes/Plan/cache.md index 4c64ebd3..a370814f 100644 --- a/docs/Reference/classes/Plan/cache.md +++ b/docs/Reference/classes/Plan/cache.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.cache(source[, index, trigger_index, layout, level, trigger_level, max_elements, element_type, strategy, thrifty, location, double_buffer, double_buffer_location, vectorize])` Adds a caching strategy to a plan. diff --git a/docs/Reference/classes/Plan/kernelize.md b/docs/Reference/classes/Plan/kernelize.md index 611677f6..9cff8032 100644 --- a/docs/Reference/classes/Plan/kernelize.md +++ b/docs/Reference/classes/Plan/kernelize.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.kernelize(unroll_indices[, vectorize_indices])` A convenience method for a sequence of `unroll` instructions followed by a possible sequence of `vectorize` instructions. diff --git a/docs/Reference/classes/Plan/parallelize.md b/docs/Reference/classes/Plan/parallelize.md index 38cf7b02..16f3d076 100644 --- a/docs/Reference/classes/Plan/parallelize.md +++ b/docs/Reference/classes/Plan/parallelize.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.parallelize(indices[, pin, policy, max_threads])` diff --git a/docs/Reference/classes/Plan/tensorize.md b/docs/Reference/classes/Plan/tensorize.md index 84abac99..3109ce1b 100644 --- a/docs/Reference/classes/Plan/tensorize.md +++ b/docs/Reference/classes/Plan/tensorize.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.tensorize(indices, mma_shape [, use_static_offsets, num_total_passes, num_fused_passes, scheduling_policy])` Only available for targets with native matrix multiplication instruction (tensor core) support. Marks the dimensions of the iteration-space for tensorization. Only perfectly nested loops of the following form can be tensorized: diff --git a/docs/Reference/classes/Plan/unroll.md b/docs/Reference/classes/Plan/unroll.md index 921d375f..880e2e16 100644 --- a/docs/Reference/classes/Plan/unroll.md +++ b/docs/Reference/classes/Plan/unroll.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.unroll(index)` Marks a dimension of the iteration-space for unrolling. 
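Continuing the hypothetical sketch above, the `cache` and `parallelize` plan operations documented in these hunks might be applied like this (the split size and the choice to cache `A` are arbitrary):

```python
# Assumes acc, A, nest, and the indices i, j from the previous sketch.
schedule = nest.create_schedule()
ii = schedule.split(i, 64)      # work on blocks of 64 rows

plan = schedule.create_plan()
plan.cache(A, index=ii)         # cache the portion of A used inside the ii loop
plan.parallelize(indices=i)     # distribute the outer row loop across threads
```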
diff --git a/docs/Reference/classes/Plan/vectorize.md b/docs/Reference/classes/Plan/vectorize.md index db9c3c25..50875e43 100644 --- a/docs/Reference/classes/Plan/vectorize.md +++ b/docs/Reference/classes/Plan/vectorize.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.vectorize(index)` Only available for targets that have SIMD registers and support vector instructions. Marks a dimension of the iteration-space for vectorization. diff --git a/docs/Reference/classes/Scalar/Scalar.md b/docs/Reference/classes/Scalar/Scalar.md index 757237d0..78c4e6e5 100644 --- a/docs/Reference/classes/Scalar/Scalar.md +++ b/docs/Reference/classes/Scalar/Scalar.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Scalar([element_type, value])` Constructs a scalar that holds a number. diff --git a/docs/Reference/classes/Schedule/create_plan.md b/docs/Reference/classes/Schedule/create_plan.md index 47fe7f20..3c6ec2f8 100644 --- a/docs/Reference/classes/Schedule/create_plan.md +++ b/docs/Reference/classes/Schedule/create_plan.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.create_plan([target])` Creates a plan for running this schedule. diff --git a/docs/Reference/classes/Schedule/is_valid_loop_order.md b/docs/Reference/classes/Schedule/is_valid_loop_order.md index 303188b6..dccd8dcb 100644 --- a/docs/Reference/classes/Schedule/is_valid_loop_order.md +++ b/docs/Reference/classes/Schedule/is_valid_loop_order.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.is_valid_loop_order(*order)` The `is_valid_loop_order` function determines if an order of indices is valid. For a description of valid schedule orders, refer to [reorder](reorder.md). diff --git a/docs/Reference/classes/Schedule/pad.md b/docs/Reference/classes/Schedule/pad.md index 642b40b0..cadbdb6e 100644 --- a/docs/Reference/classes/Schedule/pad.md +++ b/docs/Reference/classes/Schedule/pad.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.pad(index, size)` Pads the beginning of a specified dimension of the iteration-space with empty (no-op) elements. diff --git a/docs/Reference/classes/Schedule/reorder.md b/docs/Reference/classes/Schedule/reorder.md index ffec70a8..14682e79 100644 --- a/docs/Reference/classes/Schedule/reorder.md +++ b/docs/Reference/classes/Schedule/reorder.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.reorder(order, *args)` The `reorder` transformation sets the order of the indices in the schedule. 
diff --git a/docs/Reference/classes/Schedule/skew.md b/docs/Reference/classes/Schedule/skew.md index 0a8f2065..8916b7f0 100644 --- a/docs/Reference/classes/Schedule/skew.md +++ b/docs/Reference/classes/Schedule/skew.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.skew(index, reference_index [, unroll_loops_smaller_than])` Transforms a dimension with respect to a reference dimension into a parallelogram by padding with empty elements. diff --git a/docs/Reference/classes/Schedule/split.md b/docs/Reference/classes/Schedule/split.md index 6d2afd32..67db50cb 100644 --- a/docs/Reference/classes/Schedule/split.md +++ b/docs/Reference/classes/Schedule/split.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.split(index, size)` The `split` transformation takes a dimension `i` and a `size`, modifies `i`, and creates a new dimension `ii`. diff --git a/docs/Reference/classes/Schedule/tile.md b/docs/Reference/classes/Schedule/tile.md index 096ee0cf..ee5f30a6 100644 --- a/docs/Reference/classes/Schedule/tile.md +++ b/docs/Reference/classes/Schedule/tile.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.tile(shape)` The `tile` transformation is a convenience syntax that takes a tuple of indices and a tuple of sizes, and splits each index by the corresponding size. The indices involved in the split are then ordered such that all the outer indices precede all of their respective inner indices. diff --git a/docs/Reference/classes/Target/Architecture.md b/docs/Reference/classes/Target/Architecture.md index fd02faf4..d84ef329 100644 --- a/docs/Reference/classes/Target/Architecture.md +++ b/docs/Reference/classes/Target/Architecture.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target.Architecture` Defines the supported target architectures. diff --git a/docs/Reference/classes/Target/Category.md b/docs/Reference/classes/Target/Category.md index 4ca5e41e..6f828e53 100644 --- a/docs/Reference/classes/Target/Category.md +++ b/docs/Reference/classes/Target/Category.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target.Category` Defines the target processor category. diff --git a/docs/Reference/classes/Target/Model.md b/docs/Reference/classes/Target/Model.md index a67947a6..86e4a97e 100644 --- a/docs/Reference/classes/Target/Model.md +++ b/docs/Reference/classes/Target/Model.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target.Model` Defines constants for some well-known CPU models. 
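And a separate sketch of the `split`/`reorder` schedule transformations and the `unroll`/`vectorize` plan operations described above (again reusing the hypothetical `nest` and indices `i`, `j`; the 8x8 block size is arbitrary). The `kernelize` convenience call documented earlier bundles the same unroll-then-vectorize sequence:

```python
# Assumes acc, nest, and the indices i, j from the earlier sketch.
schedule = nest.create_schedule()
ii = schedule.split(i, 8)
jj = schedule.split(j, 8)
schedule.reorder(i, j, ii, jj)  # outer block indices first, 8x8 inner block last

plan = schedule.create_plan()
plan.unroll(ii)                 # fully unroll the inner row index
plan.vectorize(jj)              # vectorize the innermost column index
```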
diff --git a/docs/Reference/classes/Target/Runtime.md b/docs/Reference/classes/Target/Runtime.md index 9aa7c42b..8d75cd46 100644 --- a/docs/Reference/classes/Target/Runtime.md +++ b/docs/Reference/classes/Target/Runtime.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target.Runtime` The runtime for code generation and/or compilation. diff --git a/docs/Reference/classes/Target/Target.md b/docs/Reference/classes/Target/Target.md index a4cafcc4..8ea4d78d 100644 --- a/docs/Reference/classes/Target/Target.md +++ b/docs/Reference/classes/Target/Target.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.120) +[//]: # (Version: v1.2.130) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target([architecture, cache_lines, cache_sizes, category, extensions, family, frequency_GHz, known_name, model, name, num_cores, num_threads, runtime, tensor_core_info, turbo_frequency_GHz, vector_bytes, vector_registers)` diff --git a/docs/Reference/enumerations/CacheStrategy.md b/docs/Reference/enumerations/CacheStrategy.md index 930c569d..e66b8f6d 100644 --- a/docs/Reference/enumerations/CacheStrategy.md +++ b/docs/Reference/enumerations/CacheStrategy.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.CacheStrategy` type | description diff --git a/docs/Reference/enumerations/MMASchedulingPolicy.md b/docs/Reference/enumerations/MMASchedulingPolicy.md index 1e66d544..12dac678 100644 --- a/docs/Reference/enumerations/MMASchedulingPolicy.md +++ b/docs/Reference/enumerations/MMASchedulingPolicy.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.MMASchedulingPolicy` type | description diff --git a/docs/Reference/enumerations/MMAShape.md b/docs/Reference/enumerations/MMAShape.md index 474ac73d..f7debce4 100644 --- a/docs/Reference/enumerations/MMAShape.md +++ b/docs/Reference/enumerations/MMAShape.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.MMAShape` The following table shows the matrix multiplication parameters associated with the different enum values, for different data types for a single pass. So for example a single pass of the `M32xN32xK2_B1` operation would take input matrices of dimensions [32x2] (A) and [2x32] (B) to produce a matrix multiplication result of dimensions [32x32] (C). These operations can then be composed together to perform matrix multiplication of larger matrices. 
diff --git a/docs/Reference/enumerations/ScalarType.md b/docs/Reference/enumerations/ScalarType.md index 8ca323e0..c9abc606 100644 --- a/docs/Reference/enumerations/ScalarType.md +++ b/docs/Reference/enumerations/ScalarType.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.ScalarType` type | description diff --git a/docs/Reference/functions/cast.md b/docs/Reference/functions/cast.md index c103f4ae..f4969e82 100644 --- a/docs/Reference/functions/cast.md +++ b/docs/Reference/functions/cast.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.cast(value, type)` The `cast` operation converts a value from one `acc.ScalarType` to another. diff --git a/docs/Reference/functions/create_dimensions.md b/docs/Reference/functions/create_dimensions.md index 81cd263c..d74dbcdd 100644 --- a/docs/Reference/functions/create_dimensions.md +++ b/docs/Reference/functions/create_dimensions.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.create_dimensions([role])` Creates placeholder dimensions of the specified role. These represent runtime `Array` and `Nest` dimensions. diff --git a/docs/Reference/functions/create_parameter_grid.md b/docs/Reference/functions/create_parameter_grid.md index 912dbd63..eeafa7db 100644 --- a/docs/Reference/functions/create_parameter_grid.md +++ b/docs/Reference/functions/create_parameter_grid.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.create_parameter_grid(parameter_choices, [filter_func, sample, seed])` Create a parameter grid from a dictionary that maps each parameter to its possible values. diff --git a/docs/Reference/functions/create_parameters.md b/docs/Reference/functions/create_parameters.md index 30ec9bc8..2d191b95 100644 --- a/docs/Reference/functions/create_parameters.md +++ b/docs/Reference/functions/create_parameters.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.create_parameters()` Creates placeholder parameters. diff --git a/docs/Reference/functions/fuse.md b/docs/Reference/functions/fuse.md index 598d1dcf..419cad67 100644 --- a/docs/Reference/functions/fuse.md +++ b/docs/Reference/functions/fuse.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.fuse(schedules[, *args, partial])` The `fuse` operation combines multiple iteration spaces into a single "fused" iteration space. The fused iteration space represents the union of the work in the original spaces. 
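Finally, a hedged sketch of the runtime-dimension workflow behind `create_dimensions` documented above; the argument ordering in `package.add` (dimensions before arrays) is an assumption:

```python
import accera as acc

# Placeholder runtime dimensions: the concrete sizes become function arguments.
M, N = acc.create_dimensions()

A = acc.Array(role=acc.Array.Role.INPUT,
              element_type=acc.ScalarType.float32, shape=(M, N))
B = acc.Array(role=acc.Array.Role.INPUT_OUTPUT,
              element_type=acc.ScalarType.float32, shape=(M, N))

nest = acc.Nest(shape=(M, N))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    B[i, j] += A[i, j]

package = acc.Package()
package.add(nest, args=(M, N, A, B), base_name="dynamic_accumulate")
```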
diff --git a/docs/Reference/safety_analysis.md b/docs/Reference/safety_analysis.md index 87914010..b4d9b009 100644 --- a/docs/Reference/safety_analysis.md +++ b/docs/Reference/safety_analysis.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference # Safety Analysis diff --git a/docs/Tutorials/Hello_MatMul.md b/docs/Tutorials/Hello_MatMul.md index da8e1ce7..85f850a7 100644 --- a/docs/Tutorials/Hello_MatMul.md +++ b/docs/Tutorials/Hello_MatMul.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Hello MatMul diff --git a/docs/Tutorials/Hello_MatMul_GPU.md b/docs/Tutorials/Hello_MatMul_GPU.md index 3f1768c0..9cf9ecda 100644 --- a/docs/Tutorials/Hello_MatMul_GPU.md +++ b/docs/Tutorials/Hello_MatMul_GPU.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Hello MatMul GPU diff --git a/docs/Tutorials/Optimized_MatMul.md b/docs/Tutorials/Optimized_MatMul.md index e0737ac3..4f7f6789 100644 --- a/docs/Tutorials/Optimized_MatMul.md +++ b/docs/Tutorials/Optimized_MatMul.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Optimized MatMul diff --git a/docs/Tutorials/Pi3_Cross_Compilation.md b/docs/Tutorials/Pi3_Cross_Compilation.md index 59552aa1..e4dc1742 100644 --- a/docs/Tutorials/Pi3_Cross_Compilation.md +++ b/docs/Tutorials/Pi3_Cross_Compilation.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Cross Compiling for the Raspberry Pi 3 diff --git a/docs/Tutorials/README.md b/docs/Tutorials/README.md index 44d3fd6b..b163c151 100644 --- a/docs/Tutorials/README.md +++ b/docs/Tutorials/README.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Accera Tutorials