From 6c09b4a5b7b32df658aea5edf66766eb4f9da828 Mon Sep 17 00:00:00 2001
From: Lisa Ong
Date: Wed, 14 Dec 2022 17:36:48 +0800
Subject: [PATCH] Squashed commit of the following:

commit a272d35955fe3a05d2c52f54481af40869a74849
Author: Mason Remy
Date:   Wed Dec 14 06:51:40 2022 +0000

    Merged PR 2987: Add support for max/min/round ops and vectorizing those ops

    Add support for max/min/round ops and vectorizing those ops

commit 375be08681b88df01e2e3043d5094684c134d862
Author: Mason Remy
Date:   Tue Dec 13 23:30:28 2022 +0000

    Merged PR 2963: Control TEMP array allocation location

    Control TEMP array allocation location

commit 929eeafe8263f866bacc77b958953268f58d8b8e
Author: Mason Remy
Date:   Tue Dec 13 21:56:38 2022 +0000

    Merged PR 2962: Expand vpmaddwd matching and add intrinsic call

    Expand vpmaddwd matching and add intrinsic call

    Matches more vpmaddwd cases and creates a pathway to invoking the LLVM
    intrinsic directly.

commit e47a02ed4929e8ba9a085c7870cc5e4fe9f0db62
Author: Mason Remy
Date:   Sat Dec 10 00:40:42 2022 +0000

    Merged PR 2961: Match more vectorization patterns and support vectorized cast

    Match more vectorization patterns and support vectorized cast

    Tries to match and rewrite vectorization patterns:
    - 2-loop interleaving store -> vector shuffle and store
    - simple horizontal reductions (not always efficient currently)
    - vectorized casts

    Makes vectorization of non-innermost loops do a per-op "inplace" unroll
    and vectorize the innermost loop

    TODO : update documentation to describe this behavior better

commit 628983a1a3c5f9ea42dac0cdb7db3cebcb427f43
Author: Mason Remy
Date:   Fri Dec 9 05:54:01 2022 +0000

    Merged PR 2960: Enable marking functions as no-inline-into

    Enable marking functions as no-inline-into

    Functions marked no-inline-into won't inline calls to other functions
    within their body. This is a useful compiler performance (not emitted
    code performance) optimization when there are many nested function calls.

commit d4404ea31cccff456a28ef6998403d228e427507
Author: Denny Sun
Date:   Fri Dec 9 00:40:16 2022 +0000

    Merged PR 2986: [output array] Emit range function with input_output type arguments

    Instead of using the output type, we use input_output to generate two
    functions for the Range function. Now Accera can successfully generate
    code for the Range function.

commit 7d867a33afc36a1a2fa68b49f507b6ad202c14ce
Author: Mason Remy
Date:   Thu Dec 8 22:12:14 2022 +0000

    Merged PR 2959: Improved affine for op range simplification

    Improved affine for op range simplification

    Add range value / constant-cmp-result patterns and affine for op range
    simplifications to the affine simplification pass and run it after
    inlining functions.

    When inlining a dynamically-sized function into a statically-sized
    function, this change is useful for resolving the dynamic ranges to
    constants and pruning dynamic-range loops that are not needed given the
    specific constant value being used.

commit 511112c61b513c5d8d7ed4dba06ee266d5affbca
Author: Mason Remy
Date:   Thu Dec 8 17:14:00 2022 +0000

    Merged PR 2958: Hack to erase loops in a nest to support nest-of-nest or overfused scenarios

    Hack to erase loops in a nest to support nest-of-nest or overfused
    scenarios

    This change enables an action plan to erase loops. Typically this would
    be used when an outer nest traverses tiles and invokes an inner nest (or
    multiple nests) which operate within each tile.
The outer nest still needs to cover the full iteration space, however after splitting by the tile sizes a user will not want the outer nest to perform the inner loops commit 5dd35c423e3878a8f490de07ca21d3ac261c6224 Author: Lisa Ong Date: Wed Dec 7 01:59:14 2022 +0000 Merged PR 2985: [release] Rev docs to 1.2.13 commit b5697107f084bf910d4d77e75e67a90363855375 Author: Captain Jack Sparrow Date: Wed Dec 7 00:57:08 2022 +0000 Merged PR 2983: Increase timeouts of GPU benchmarks Increase timeouts of GPU benchmarks commit 05c096f116216fbc9505c7d9a6f1e88b7626411f Author: Mason Remy Date: Sat Dec 3 01:25:01 2022 +0000 Merged PR 2982: Work around bug with redundant splits of dynamic dimensions Work around bug with redundant splits of dynamic dimensions commit 4056d3177c5b14987e4c5fcd4aa91ddac67c4ed1 Author: Kern Handa Date: Wed Nov 30 07:55:06 2022 +0000 Merged PR 2972: Build both static and dynamic binaries by default, put both in aux dependencies commit b79602b9cf543b0852c7e0c85e548970d5ac7fbb Author: Kern Handa Date: Tue Nov 29 22:34:04 2022 +0000 Merged PR 2975: Updates llc/opt build flags to enable more optimizations by default Updates llc/opt build flags to enable more optimizations by default commit 8a856b8af10227538ebb72486bd0bfd52af98873 Author: Kern Handa Date: Tue Nov 29 21:49:40 2022 +0000 Merged PR 2977: Updates CMake to do FindPython before pybind11 config Updates CMake to do FindPython before pybind11 config commit 6d05fc0e8a6d1933d7507cfa8b6838c04606a798 Author: Lisa Ong Date: Tue Nov 22 22:34:50 2022 +0000 Merged PR 2955: Reduce Linux PR runtime to under 60mins Filter DEV_MODE reruns to dsl_tests.py, this is not comprehensive and is a best effort. --- .azure/cuda/cuda-benchmark-fp16-bert.yml | 2 +- .azure/linux-pr.yml | 2 +- .azure/rocm/rocm-benchmark-fp16-bert.yml | 2 +- .azure/rocm/rocm-benchmark-fp16-big.yml | 2 +- .azure/rocm/rocm-benchmark-fp16.yml | 2 +- .azure/rocm/rocm-benchmark-fp32-bert.yml | 2 +- .azure/rocm/rocm-benchmark-fp32-big.yml | 2 +- .azure/rocm/rocm-benchmark-fp32.yml | 2 +- CMake/AddPyBind11.cmake | 5 +- CMakeLists.txt | 2 +- accera/CMakeLists.txt | 1 + accera/acc-opt/test/commandline.mlir | 1 + accera/acc-opt/test/thrifty_caching.mlir | 4 +- accera/acc-opt/test/value_mlir_test.cpp | 62 +- accera/acc-translate/CMakeLists.txt | 16 + .../acc-translate/src/AcceraTranslateMain.cpp | 28 + accera/acc-translate/src/CMakeLists.txt | 7 + .../acc-translate/src/Target/CMakeLists.txt | 6 + .../Target/Cpp/AcceraDialectCppPrinter.cpp | 2 +- .../Target/Cpp/AffineDialectCppPrinter.cpp | 7 +- .../src/Target/LLVMIR/CMakeLists.txt | 24 + .../LLVMIR/IntrinsicToLLVMIRTranslation.cpp | 50 + .../LLVMIR/IntrinsicToLLVMIRTranslation.h | 27 + accera/accc/accc.py | 16 +- accera/ir/CMakeLists.txt | 34 + accera/ir/include/CMakeLists.txt | 1 + accera/ir/include/Common.td | 6 + accera/ir/include/IRUtil.h | 11 +- .../ir/include/intrinsics/AcceraIntrinsics.td | 69 + .../intrinsics/AcceraIntrinsicsDialect.h | 18 + accera/ir/include/intrinsics/CMakeLists.txt | 10 + accera/ir/include/value/ValueAttrs.td | 4 +- accera/ir/include/value/ValueDialect.h | 4 + accera/ir/include/value/ValueOps.td | 66 +- accera/ir/src/DialectRegistry.cpp | 2 + accera/ir/src/IRUtil.cpp | 126 +- .../intrinsics/AcceraIntrinsicsDialect.cpp | 32 + .../ir/src/nest/LoopNestAffineConstraints.cpp | 46 +- accera/ir/src/nest/LoopNestBuilder.cpp | 2 +- accera/python/accera/Debug.py | 2 - accera/python/accera/Package.py | 58 +- accera/python/accera/Targets.py | 1 + accera/python/accera/__init__.py | 4 +- 
accera/python/accera/lang/Array.py | 12 +- accera/python/accera/lang/Dimension.py | 17 + accera/python/accera/lang/Function.py | 24 +- accera/python/accera/lang/Nest.py | 4 +- accera/python/accera/lang/Plan.py | 16 +- accera/python/accera/lang/__init__.py | 2 +- accera/python/accera/test/dsl_tests.py | 886 ++++++++- accera/python/accera/test/smoke_tests.py | 583 +++++- accera/python/lib/src/ContainerTypes.cpp | 7 +- accera/python/lib/src/ExecutionPlanTypes.cpp | 3 +- accera/python/lib/src/PackagingTypes.cpp | 15 +- accera/transforms/include/AcceraPasses.h | 1 + accera/transforms/include/AcceraPasses.td | 1 + .../include/affine/AffineSimplifications.h | 3 +- .../exec/ExecutionPlanToAffineLoweringPass.h | 1 + .../include/util/RangeValueUtilities.h | 2 + .../include/util/VectorizationUtil.h | 6 +- .../include/value/RangeValueOptimizePass.h | 3 + accera/transforms/src/AcceraPasses.cpp | 1 + .../src/affine/AffineSimplifications.cpp | 147 +- .../ExecutionPlanToAffineLoweringPass.cpp | 34 +- .../transforms/src/nest/LoopNestToValue.cpp | 14 +- .../src/nest/LoopNestToValueFunc.cpp | 17 +- .../src/util/RangeValueUtilities.cpp | 148 +- .../transforms/src/util/VectorizationUtil.cpp | 1624 ++++++++++++++--- .../src/value/RangeValueOptimizePass.cpp | 299 ++- .../src/value/ValueFuncToTargetPass.cpp | 15 +- .../src/value/ValueSimplifyPass.cpp | 2 +- .../src/value/ValueToLLVMLoweringPass.cpp | 112 +- .../src/value/ValueToStandardLoweringPass.cpp | 147 +- accera/value/include/EmitterContext.h | 6 + accera/value/include/FunctionDeclaration.h | 8 + accera/value/include/MLIREmitterContext.h | 2 + accera/value/include/Plan.h | 2 + accera/value/include/ScalarOperations.h | 3 +- accera/value/include/ValueType.h | 8 +- accera/value/src/EmitterContext.cpp | 5 + accera/value/src/FunctionDeclaration.cpp | 14 + accera/value/src/MLIREmitterContext.cpp | 79 +- accera/value/src/Plan.cpp | 13 + accera/value/src/ScalarOperations.cpp | 28 +- docs/.bumpversion.cfg | 2 +- docs/Case Studies/CONTRIBUTING.md | 2 +- docs/Case Studies/README.md | 2 +- docs/Install/Building_on_MacOS.md | 2 +- docs/Install/Building_on_Ubuntu.md | 2 +- docs/Install/Building_on_Windows.md | 2 +- docs/Install/Installing_Accera_on_MacOS.md | 2 +- docs/Install/Installing_Accera_on_Ubuntu.md | 2 +- docs/Install/Installing_Accera_on_Windows.md | 2 +- docs/Install/README.md | 2 +- docs/Manual/00 Introduction.md | 2 +- docs/Manual/01 Arrays and Scalars.md | 2 +- docs/Manual/02 Simple Affine Loop Nests.md | 2 +- docs/Manual/03 Schedules.md | 2 +- docs/Manual/04 Fusing.md | 2 +- docs/Manual/05 Targets.md | 2 +- docs/Manual/06 Plans - Caching.md | 2 +- ...07 Plans - Operations and Optimizations.md | 2 +- .../08 Deferred Layout of Constant Arrays.md | 2 +- docs/Manual/09 Parameters.md | 2 +- docs/Manual/10 Packages.md | 2 +- docs/Manual/README.md | 2 +- docs/Reference/accera.md | 4 +- docs/Reference/classes/Array/Array.md | 4 +- docs/Reference/classes/Array/Layout.md | 4 +- docs/Reference/classes/Array/Role.md | 4 +- .../classes/Array/deferred_layout.md | 4 +- docs/Reference/classes/Array/sub_array.md | 4 +- docs/Reference/classes/Dimension/Dimension.md | 4 +- docs/Reference/classes/Dimension/Role.md | 4 +- docs/Reference/classes/Nest/Nest.md | 4 +- docs/Reference/classes/Nest/create_plan.md | 4 +- .../Reference/classes/Nest/create_schedule.md | 4 +- docs/Reference/classes/Nest/get_indices.md | 4 +- .../Reference/classes/Nest/iteration_logic.md | 4 +- docs/Reference/classes/Package/Format.md | 4 +- docs/Reference/classes/Package/Mode.md | 4 +- 
docs/Reference/classes/Package/Package.md | 4 +- docs/Reference/classes/Package/Platform.md | 4 +- docs/Reference/classes/Package/add.md | 4 +- .../classes/Package/add_description.md | 4 +- docs/Reference/classes/Package/build.md | 4 +- docs/Reference/classes/Plan/bind.md | 4 +- docs/Reference/classes/Plan/cache.md | 4 +- docs/Reference/classes/Plan/kernelize.md | 4 +- docs/Reference/classes/Plan/parallelize.md | 4 +- docs/Reference/classes/Plan/tensorize.md | 4 +- docs/Reference/classes/Plan/unroll.md | 4 +- docs/Reference/classes/Plan/vectorize.md | 4 +- docs/Reference/classes/Scalar/Scalar.md | 4 +- .../Reference/classes/Schedule/create_plan.md | 4 +- .../classes/Schedule/is_valid_loop_order.md | 4 +- docs/Reference/classes/Schedule/pad.md | 4 +- docs/Reference/classes/Schedule/reorder.md | 4 +- docs/Reference/classes/Schedule/skew.md | 4 +- docs/Reference/classes/Schedule/split.md | 4 +- docs/Reference/classes/Schedule/tile.md | 4 +- docs/Reference/classes/Target/Architecture.md | 4 +- docs/Reference/classes/Target/Category.md | 4 +- docs/Reference/classes/Target/Model.md | 4 +- docs/Reference/classes/Target/Runtime.md | 4 +- docs/Reference/classes/Target/Target.md | 4 +- docs/Reference/enumerations/CacheStrategy.md | 4 +- .../enumerations/MMASchedulingPolicy.md | 4 +- docs/Reference/enumerations/MMAShape.md | 4 +- docs/Reference/enumerations/ScalarType.md | 4 +- docs/Reference/functions/cast.md | 4 +- docs/Reference/functions/create_dimensions.md | 4 +- .../functions/create_parameter_grid.md | 4 +- docs/Reference/functions/create_parameters.md | 4 +- docs/Reference/functions/fuse.md | 4 +- docs/Reference/safety_analysis.md | 4 +- docs/Tutorials/Hello_MatMul.md | 2 +- docs/Tutorials/Hello_MatMul_GPU.md | 2 +- docs/Tutorials/Optimized_MatMul.md | 2 +- docs/Tutorials/Pi3_Cross_Compilation.md | 2 +- docs/Tutorials/README.md | 2 +- 161 files changed, 4582 insertions(+), 756 deletions(-) create mode 100644 accera/acc-translate/src/CMakeLists.txt create mode 100644 accera/acc-translate/src/Target/CMakeLists.txt create mode 100644 accera/acc-translate/src/Target/LLVMIR/CMakeLists.txt create mode 100644 accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp create mode 100644 accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h create mode 100644 accera/ir/include/intrinsics/AcceraIntrinsics.td create mode 100644 accera/ir/include/intrinsics/AcceraIntrinsicsDialect.h create mode 100644 accera/ir/include/intrinsics/CMakeLists.txt create mode 100644 accera/ir/src/intrinsics/AcceraIntrinsicsDialect.cpp diff --git a/.azure/cuda/cuda-benchmark-fp16-bert.yml b/.azure/cuda/cuda-benchmark-fp16-bert.yml index f7fe35fe..a6ff9236 100644 --- a/.azure/cuda/cuda-benchmark-fp16-bert.yml +++ b/.azure/cuda/cuda-benchmark-fp16-bert.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "CUDA_Benchmarking_FP16_BERT" - timeoutInMinutes: 480 + timeoutInMinutes: 600 pool: name: LinuxNVGPUPool diff --git a/.azure/linux-pr.yml b/.azure/linux-pr.yml index eadbb565..8031b316 100644 --- a/.azure/linux-pr.yml +++ b/.azure/linux-pr.yml @@ -89,7 +89,7 @@ steps: displayName: Run all ctest targets workingDirectory: "$(Build.SourcesDirectory)/build" - - bash: python -m unittest discover accera/test *.py + - bash: python -m unittest discover accera/test dsl_tests.py displayName: Run tests in DEV_MODE workingDirectory: "$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.9" diff --git a/.azure/rocm/rocm-benchmark-fp16-bert.yml b/.azure/rocm/rocm-benchmark-fp16-bert.yml index 69ce40dd..f091b042 100644 
--- a/.azure/rocm/rocm-benchmark-fp16-bert.yml +++ b/.azure/rocm/rocm-benchmark-fp16-bert.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP16_BERT" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/.azure/rocm/rocm-benchmark-fp16-big.yml b/.azure/rocm/rocm-benchmark-fp16-big.yml index e74faa92..94713bcb 100644 --- a/.azure/rocm/rocm-benchmark-fp16-big.yml +++ b/.azure/rocm/rocm-benchmark-fp16-big.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP16_Big" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/.azure/rocm/rocm-benchmark-fp16.yml b/.azure/rocm/rocm-benchmark-fp16.yml index c92c6d9b..0177f35e 100644 --- a/.azure/rocm/rocm-benchmark-fp16.yml +++ b/.azure/rocm/rocm-benchmark-fp16.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP16" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/.azure/rocm/rocm-benchmark-fp32-bert.yml b/.azure/rocm/rocm-benchmark-fp32-bert.yml index 6b46c7bd..2f620e82 100644 --- a/.azure/rocm/rocm-benchmark-fp32-bert.yml +++ b/.azure/rocm/rocm-benchmark-fp32-bert.yml @@ -47,7 +47,7 @@ jobs: export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8 export LD_LIBRARY_PATH=${ROCM_PATH}/lib echo "LD_LIBRARY_PATH" ${LD_LIBRARY_PATH} - python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --category bert --type s --target 'AMD MI100' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --check + python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --category bert --type s --target 'AMD MI100' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose displayName: Run fp32 benchmarks BERT workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers" env: diff --git a/.azure/rocm/rocm-benchmark-fp32-big.yml b/.azure/rocm/rocm-benchmark-fp32-big.yml index 0e138c36..2218c889 100644 --- a/.azure/rocm/rocm-benchmark-fp32-big.yml +++ b/.azure/rocm/rocm-benchmark-fp32-big.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP32_Big" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/.azure/rocm/rocm-benchmark-fp32.yml b/.azure/rocm/rocm-benchmark-fp32.yml index e6d27aed..3052884f 100644 --- a/.azure/rocm/rocm-benchmark-fp32.yml +++ b/.azure/rocm/rocm-benchmark-fp32.yml @@ -9,7 +9,7 @@ trigger: none jobs: - job: "ROCM_Benchmarking_FP32" - timeoutInMinutes: 540 + timeoutInMinutes: 600 pool: LinuxAMDGPUPool diff --git a/CMake/AddPyBind11.cmake b/CMake/AddPyBind11.cmake index b25ab36b..7622bb82 100644 --- a/CMake/AddPyBind11.cmake +++ b/CMake/AddPyBind11.cmake @@ -5,7 +5,7 @@ include(FetchContent) -set(PYBIND_VERSION "2.6.2" CACHE STRING "Version string to use for pybind11") +set(PYBIND_VERSION "2.10.1" CACHE STRING "Version string to use for pybind11") set(FETCHCONTENT_QUIET FALSE) @@ -16,6 +16,9 @@ FetchContent_Declare( FetchContent_GetProperties(pybind11) +set(Python3_FIND_REGISTRY LAST) +find_package(Python3 COMPONENTS Interpreter Development) + if(NOT pybind11_POPULATED) FetchContent_Populate(pybind11) add_subdirectory(${pybind11_SOURCE_DIR} ${pybind11_BINARY_DIR}) diff --git a/CMakeLists.txt b/CMakeLists.txt index b5f95b7f..0b20f53a 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -123,7 +123,7 @@ 
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON) set(CMAKE_PLATFORM_NO_VERSIONED_SONAME ON) if(MSVC) # Set Visual Studio-specific options - add_definitions(-DUNICODE -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS) + add_definitions(-DUNICODE -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -D_SILENCE_NONFLOATING_COMPLEX_DEPRECATION_WARNING) add_compile_options(/utf-8) add_compile_options(/MP) add_compile_options(/bigobj) diff --git a/accera/CMakeLists.txt b/accera/CMakeLists.txt index 8cc3b4a3..f72de227 100644 --- a/accera/CMakeLists.txt +++ b/accera/CMakeLists.txt @@ -4,6 +4,7 @@ #################################################################################################### set(ACCERA_LIBRARIES_DIR ${CMAKE_CURRENT_LIST_DIR}) +set(ACCERA_BIN_DIR ${CMAKE_CURRENT_BINARY_DIR}) include_directories(${ACCERA_LIBRARIES_DIR}) add_subdirectory(acc-opt) diff --git a/accera/acc-opt/test/commandline.mlir b/accera/acc-opt/test/commandline.mlir index 34a1c0c3..d2b4d2f2 100644 --- a/accera/acc-opt/test/commandline.mlir +++ b/accera/acc-opt/test/commandline.mlir @@ -1,6 +1,7 @@ // RUN: acc-opt --show-dialects | FileCheck %s // CHECK: Registered Dialects: // CHECK: accera +// CHECK-NEXT: accintr // CHECK-NEXT: accln // CHECK-NEXT: accv // CHECK-NEXT: accxp diff --git a/accera/acc-opt/test/thrifty_caching.mlir b/accera/acc-opt/test/thrifty_caching.mlir index c8fb4650..7fe325b0 100644 --- a/accera/acc-opt/test/thrifty_caching.mlir +++ b/accera/acc-opt/test/thrifty_caching.mlir @@ -69,8 +69,8 @@ module @test_thrifty_caching_simple_input_cache attributes {llvm.data_layout = " // CHECK: affine.for %arg6 = 0 to 16 { // CHECK: %1 = affine.load %arg1[%arg5, %arg4 + %arg6] : memref<32x32xf32, #map0> // CHECK: affine.store %1, %0[%arg5, %arg6] : memref<32x16xf32, 3> -// CHECK: } {accxp.access_bounds_check, beginMap = #map1, domain = #xdomain, endMap = #map2, index = #accln<"index{j,7}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{j,7}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]} -// CHECK: } {accxp.access_bounds_check, beginMap = #map1, domain = #xdomain, endMap = #map3, index = #accln<"index{i,6}">, operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{i,6}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]} +// CHECK: } {accxp.access_bounds_check, beginMap = #map1, endMap = #map2, index = #accln<"index{j,7}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{j,7}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]} +// CHECK: } {accxp.access_bounds_check, beginMap = #map1, endMap = #map3, index = #accln<"index{i,6}">, operand_segment_sizes = dense<[0, 0, 1]> : vector<3xi32>, scheduledIndex = #accln<"index{i,6}">, subdomainIndexOrder = [#accln<"index{i,6}">, #accln<"index{j,7}">], subdomainSize = [32, 16]} // CHECK: affine.for %arg5 = 0 to 4 { // CHECK: affine.for %arg6 = 0 to 16 { // CHECK: affine.for %arg7 = 0 to 32 { diff --git a/accera/acc-opt/test/value_mlir_test.cpp b/accera/acc-opt/test/value_mlir_test.cpp index 7ce33ed5..d1ceb028 100644 --- a/accera/acc-opt/test/value_mlir_test.cpp +++ b/accera/acc-opt/test/value_mlir_test.cpp @@ -115,7 +115,7 @@ TEST_CASE("function_decl1") .Parameters(Value{ ValueType::Float, MemoryLayout{ { 10 } } }) 
.Define([](Value) {}); CHECK(f3); - // CHECK: accv.func nested @f4_{{[0-9]+}}(%arg0: memref<3x4xf64, #map{{[0-9]*}}>) + // CHECK: accv.func nested @f4_{{[0-9]+}}(%arg0: memref<3x4xf64>) // COM: CHECK: accv.func @f4_{{[0-9]+}}(%arg0: memref<3x4xf64>) // CHECK-NEXT: return // CHECK-NEXT: } @@ -311,9 +311,9 @@ TEST_CASE("mlir_test3") // COM: Doesn't result in emitted code CHECK_NOTHROW(MakeScalar()); - // CHECK-NEXT: [[v0:%[a-z0-9_]+]] = "accv.alloc"() : () -> memref<100xf32, 3> + // CHECK-NEXT: [[v0:%[a-z0-9_]+]] = "accv.alloc"() {allocType = 0 : i64} : () -> memref<100xf32, 3> CHECK_NOTHROW(MakeVector(100)); - // CHECK-NEXT: [[v1:%[a-z0-9_]+]] = "accv.alloc"() : () -> memref<2x3xi16 + // CHECK-NEXT: [[v1:%[a-z0-9_]+]] = "accv.alloc"() {allocType = 0 : i64} : () -> memref<2x3xi16 CHECK_NOTHROW(MakeMatrix(2, 3)); // CHECK-NEXT: return // CHECK-NEXT: } @@ -325,7 +325,7 @@ TEST_CASE("mlir_test3") // CHECK-LABEL: module @mlir_test4 { // CHECK-NEXT: accv.module "mlir_test4" { -// CHECK-NEXT: accv.func nested @foo_{{[0-9]+}}(%arg0: memref<10x10xi32, [[MAP:#map[0-9]*]]>) +// CHECK-NEXT: accv.func nested @foo_{{[0-9]+}}(%arg0: memref<10x10xi32>) // COM: CHECK-NEXT: accv.func @foo_{{[0-9]+}}(%arg0: memref<10x10xi32>) attributes {args_symbol = ["{{[a-z0-9_]+}}"], exec_target = 0 : i64, sym_visibility = "nested"} { // CHECK-NEXT: [[c0:%c[0-9]+]] = arith.constant 0 : index // CHECK-NEXT: [[c10_1:%c[0-9_]+]] = arith.constant 10 : index @@ -372,7 +372,7 @@ TEST_CASE("mlir_test5") .Define([](Scalar i) { CHECK_NOTHROW(StaticAllocate("foo", std::vector{ 1, 2, 3, 4 })); - // CHECK-NEXT: "accv.alloc"() : () -> memref<100xf32, 3> + // CHECK-NEXT: "accv.alloc"() {allocType = 0 : i64} : () -> memref<100xf32, 3> CHECK_NOTHROW(MakeVector(100)); // CHECK-NEXT: return @@ -473,7 +473,7 @@ TEST_CASE("mlir_test11") // CHECK-NEXT: [[c0_0:%c[0-9a-z_]+]] = arith.constant 0 : i32 // CHECK-NEXT: [[c4_0:%c[0-9a-z_]+]] = arith.constant 4 // CHECK-NEXT: [[c4_1:%c[0-9a-z_]+]] = arith.constant 4 - // CHECK-NEXT: [[v0:%[a-z0-9_]+]] = "accv.alloc"() {sym_name = "a"} : () -> memref<1xi32, 3> + // CHECK-NEXT: [[v0:%[a-z0-9_]+]] = "accv.alloc"() {allocType = 0 : i64, sym_name = "a"} : () -> memref<1xi32, 3> Scalar a = MakeVector(1, "a")[0]; Scalar c = 4; // CHECK-NEXT: %[[v1:[a-z0-9_]+]] = arith.index_cast [[c0_0]] : i32 to index @@ -844,10 +844,10 @@ TEST_CASE("mlir_schedule_test_4") // COM: CHECK: memref.subview %arg0[0, %{{[a-z0-9_]+}}] [10, 1] [10, 1] : memref<10x10xf32, #map0> to memref<10xf32, #map3> // COM: CHECK: memref.subview %arg0[%{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}] [3, 4] [1, 1] : memref<10x10xf32, #map0> to memref<3x4xf32, #map4> // COM: CHECK-NEXT: accv.func @MatrixView_{{[0-9]+}}(%arg0: memref<10x10xf32 -// CHECK: "accv.slice"(%arg0, %{{[0-9]+}}, %{{[0-9]+}}) {sliceDimensions = [0, 1]} : (memref<10x10xf32, #map0>, index, index) -> memref -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [0]} : (memref<10x10xf32, #map0>, index) -> memref<10xf32, #map1> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [1]} : (memref<10x10xf32, #map0>, index) -> memref<10xf32, #map2> -// CHECK: "accv.view"(%arg0, %{{[0-9]+}}, %{{[0-9]+}}) : (memref<10x10xf32, #map0>, !accv.range, !accv.range) -> memref<3x4xf32, #map3> +// CHECK: "accv.slice"(%arg0, %{{[0-9]+}}, %{{[0-9]+}}) {sliceDimensions = [0, 1]} : (memref<10x10xf32>, index, index) -> memref +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [0]} : (memref<10x10xf32>, index) -> memref<10xf32, #map0> +// CHECK: "accv.slice"(%arg0, 
%{{[a-z0-9_]+}}) {sliceDimensions = [1]} : (memref<10x10xf32>, index) -> memref<10xf32, #map1> +// CHECK: "accv.view"(%arg0, %{{[0-9]+}}, %{{[0-9]+}}) : (memref<10x10xf32>, !accv.range, !accv.range) -> memref<3x4xf32, #map2> TEST_CASE("mlir_matrix_view_test") { DeclareFunction("MatrixView") @@ -874,13 +874,13 @@ TEST_CASE("mlir_matrix_view_test") // COM: CHECK: memref.subview %arg0[0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}] [5, 1, 1] [150, 15, 1] : memref<5x10x15xf32, #map0> to memref<5xf32, #map7> // COM: CHECK: memref.subview %arg0[%{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}] [3, 2, 1] [1, 1, 1] : memref<5x10x15xf32, #map0> to memref<3x2x1xf32, #map8> // COM: CHECK-NEXT: accv.func @TensorView_{{[0-9]+}}(%arg0: memref<5x10x15xf32 -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [0]} : (memref<5x10x15xf32, #map0>, index) -> memref<10x15xf32, #map1> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [1]} : (memref<5x10x15xf32, #map0>, index) -> memref<5x15xf32, #map2> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [2]} : (memref<5x10x15xf32, #map0>, index) -> memref<5x10xf32, #map3> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [0, 1]} : (memref<5x10x15xf32, #map0>, index, index) -> memref<15xf32, #map4> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [0, 2]} : (memref<5x10x15xf32, #map0>, index, index) -> memref<10xf32, #map5> -// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [1, 2]} : (memref<5x10x15xf32, #map0>, index, index) -> memref<5xf32, #map6> -// CHECK: "accv.view"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) : (memref<5x10x15xf32, #map0>, !accv.range, !accv.range, !accv.range) -> memref<3x2x1xf32, #map7> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [0]} : (memref<5x10x15xf32>, index) -> memref<10x15xf32, #map0> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [1]} : (memref<5x10x15xf32>, index) -> memref<5x15xf32, #map1> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}) {sliceDimensions = [2]} : (memref<5x10x15xf32>, index) -> memref<5x10xf32, #map2> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [0, 1]} : (memref<5x10x15xf32>, index, index) -> memref<15xf32, #map3> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [0, 2]} : (memref<5x10x15xf32>, index, index) -> memref<10xf32, #map4> +// CHECK: "accv.slice"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) {sliceDimensions = [1, 2]} : (memref<5x10x15xf32>, index, index) -> memref<5xf32, #map5> +// CHECK: "accv.view"(%arg0, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}, %{{[a-z0-9_]+}}) : (memref<5x10x15xf32>, !accv.range, !accv.range, !accv.range) -> memref<3x2x1xf32, #map6> TEST_CASE("mlir_tensor_view_test") { DeclareFunction("TensorView") @@ -957,8 +957,8 @@ TEST_CASE("mlir_intrinsic_test") // COM: CHECK-NEXT: %[[v8:[0-9]+]] = "accv.get_element"(%[[v4]]) : (memref) -> f32 // COM: CHECK-NEXT: "accv.copy"(%[[v8]], %[[v6]]) : (f32, memref) -> () // CHECK-NEXT: [[v2:%[0-9]+]] = "accv.bin_op"([[v0]], %[[v1]]) {predicate = 0 : i64} : (index, index) -> index -// CHECK-NEXT: [[v3:%[0-9]+]] = "accv.slice"(%arg0, [[v0]], [[v2]]) {sliceDimensions = [0, 1]} : (memref<8x18xf32, #map0>, index, index) -> memref -// CHECK-NEXT: [[v4:%[0-9]+]] = "accv.slice"(%arg1, [[v0]], %[[v1]]) {sliceDimensions = [0, 1]} : (memref<8x10xf32, #map1>, index, index) -> memref +// CHECK-NEXT: 
[[v3:%[0-9]+]] = "accv.slice"(%arg0, [[v0]], [[v2]]) {sliceDimensions = [0, 1]} : (memref<8x18xf32>, index, index) -> memref +// CHECK-NEXT: [[v4:%[0-9]+]] = "accv.slice"(%arg1, [[v0]], %[[v1]]) {sliceDimensions = [0, 1]} : (memref<8x10xf32>, index, index) -> memref // CHECK-NEXT: [[v5:%[0-9]+]] = "accv.get_element"([[v3]]) : (memref) -> f32 // CHECK-NEXT: "accv.copy"([[v5]], [[v4]]) : (f32, memref) -> () TEST_CASE("mlir_index_arithmetic_test") @@ -1024,7 +1024,7 @@ TEST_CASE("mlir_scalar_float_test") // COM: CHECK-NEXT: scf.if %[[v4]] { // CHECK-NEXT: [[v0:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v1:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK-NEXT: [[v2:%[0-9]+]] = "accv.slice"(%[[C]], [[v0]], [[v1]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32, #map0>, index, index) -> memref + // CHECK-NEXT: [[v2:%[0-9]+]] = "accv.slice"(%[[C]], [[v0]], [[v1]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32>, index, index) -> memref // CHECK-NEXT: [[v3:%[0-9]+]] = "accv.get_element"([[v2]]) : (memref) -> f32 // CHECK-NEXT: [[v4:%[0-9]+]] = "accv.cmp"([[v3]], %[[A]]) {predicate = 1 : i64} : (f32, f32) -> i1 // CHECK-NEXT: scf.if [[v4]] { @@ -1045,7 +1045,7 @@ TEST_CASE("mlir_scalar_float_test") // CHECK: [[v3:%[0-9]+]] = "accv.bin_op"([[v2]], [[CST0]]) {predicate = 0 : i64} : (f32, f32) -> f32 // CHECK-NEXT: [[v0:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v1:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK-NEXT: [[Cslice:%[0-9]+]] = "accv.slice"(%[[C]], [[v0]], [[v1]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32, #map0>, index, index) -> memref + // CHECK-NEXT: [[Cslice:%[0-9]+]] = "accv.slice"(%[[C]], [[v0]], [[v1]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32>, index, index) -> memref // CHECK-NEXT: "accv.copy"([[v3]], [[Cslice]]) : (f32, memref) -> () C(idx, idx) = B[idx] + Cast(c, A.GetType()); @@ -1059,7 +1059,7 @@ TEST_CASE("mlir_scalar_float_test") // CHECK-NEXT: [[v0:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v1:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v2:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK-NEXT: [[Dslice:%[0-9]+]] = "accv.slice"(%[[D]], [[v0]], [[v1]], [[v2]]) {sliceDimensions = [0, 1, 2]} : (memref<1000x1000x1000xf32, #map1>, index, index, index) -> memref + // CHECK-NEXT: [[Dslice:%[0-9]+]] = "accv.slice"(%[[D]], [[v0]], [[v1]], [[v2]]) {sliceDimensions = [0, 1, 2]} : (memref<1000x1000x1000xf32>, index, index, index) -> memref auto dVal = D(idx, idx, idx); // CHECK-NEXT: %[[v3:[0-9]+]] = "accv.get_element"([[Dslice]]) : (memref) -> f32 @@ -1077,7 +1077,7 @@ TEST_CASE("mlir_scalar_float_test") // CHECK-NEXT: [[v0:%[0-9]+]] = "accv.bin_op"(%[[IDX]], [[c2_0]]) {predicate = 0 : i64} : (i32, i32) -> i32 // CHECK-DAG: [[v1:%[0-9]+]] = arith.index_cast [[v0]] : i32 to index // CHECK-DAG: [[v2:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK: [[v3:%[0-9]+]] = "accv.slice"(%[[C]], [[v1]], [[v2]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32, #map0>, index, index) -> memref + // CHECK: [[v3:%[0-9]+]] = "accv.slice"(%[[C]], [[v1]], [[v2]]) {sliceDimensions = [0, 1]} : (memref<100x100xf32>, index, index) -> memref // CHECK-NEXT: [[v4:%[0-9]+]] = "accv.get_element"([[v3]]) : (memref) -> f32 // CHECK-NEXT: "accv.copy"([[v4]], [[Dslice]]) : (f32, memref) -> () dVal = C(idx + c, idx); @@ -1100,7 +1100,7 @@ TEST_CASE("mlir_scalar_float_test") // CHECK-NEXT: [[v3:%[0-9]+]] = arith.index_cast %[[IDX]] : 
i32 to index // CHECK-NEXT: [[v4:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index // CHECK-NEXT: [[v5:%[0-9]+]] = arith.index_cast %[[IDX]] : i32 to index - // CHECK-NEXT: [[Eslice:%[0-9]+]] = "accv.slice"(%[[E]], [[v2]], [[v3]], [[v4]], [[v5]]) {sliceDimensions = [0, 1, 2, 3]} : (memref<10000x10000x10000x10000xf32, #map2>, index, index, index, index) -> memref + // CHECK-NEXT: [[Eslice:%[0-9]+]] = "accv.slice"(%[[E]], [[v2]], [[v3]], [[v4]], [[v5]]) {sliceDimensions = [0, 1, 2, 3]} : (memref<10000x10000x10000x10000xf32>, index, index, index, index) -> memref auto eVal = E(idx, idx, idx, idx); // CHECK-NEXT: %[[v7:[0-9]+]] = "accv.get_element"([[Eslice]]) : (memref) -> f32 @@ -2287,11 +2287,11 @@ TEST_CASE("jit_float_cached_matrix_multiply_test") // JIT-LABEL: A*B: Print("A*B:\n"s); - // JIT-NEXT: 20832.000000 21328.000000 21824.000000 22320.000000 22816.000000 23312.000000 23808.000000 24304.000000 - // JIT-NEXT: 21824.000000 22352.000000 22880.000000 23408.000000 23936.000000 24464.000000 24992.000000 25520.000000 - // JIT-NEXT: 22816.000000 23376.000000 23936.000000 24496.000000 25056.000000 25616.000000 26176.000000 26736.000000 - // JIT-NEXT: 23808.000000 24400.000000 24992.000000 25584.000000 26176.000000 26768.000000 27360.000000 27952.000000 - // JIT-NEXT: 24800.000000 25424.000000 26048.000000 26672.000000 27296.000000 27920.000000 28544.000000 29168.000000 + // JIT: 20832.000000 21328.000000 21824.000000 22320.000000 22816.000000 23312.000000 23808.000000 24304.000000 + // JIT: 21824.000000 22352.000000 22880.000000 23408.000000 23936.000000 24464.000000 24992.000000 25520.000000 + // JIT: 22816.000000 23376.000000 23936.000000 24496.000000 25056.000000 25616.000000 26176.000000 26736.000000 + // JIT: 23808.000000 24400.000000 24992.000000 25584.000000 26176.000000 26768.000000 27360.000000 27952.000000 + // JIT: 24800.000000 25424.000000 26048.000000 26672.000000 27296.000000 27920.000000 28544.000000 29168.000000 Print(C); }); SUCCEED(); @@ -2404,7 +2404,7 @@ TEST_CASE("jit_matrix_transpose_test") .Public(true) .Decorated(false) .Define([=]() { - // COM: CHECK: [[m:%[0-9]+]] = "accv.alloc"() : () -> memref<3x4xf32, #map0, 3> + // COM: CHECK: [[m:%[0-9]+]] = "accv.alloc"() {allocType = 0 : i64} : () -> memref<3x4xf32, #map0, 3> Matrix m = MakeMatrix(M, N); CHECK(m.GetMatrixLayout() == Matrix::MatrixLayout::rowMajor); @@ -2990,7 +2990,7 @@ TEST_CASE("jit_array_reorder_test1") // COM: CHECK: [[map1:#map[0-9]+]] = affine_map<(d0, d1, d2) -> // COM: CHECK-LABEL: module @jit_array_reorder_test2 { // COM: CHECK-NEXT: accv.module "jit_array_reorder_test2" { -// COM: CHECK: %0 = "accv.alloc"() +// COM: CHECK: %0 = "accv.alloc"() {allocType = 0 : i64} // COM: CHECK-SAME: () -> memref<2x3x4xi32, [[map0]], 3> // COM: CHECK: %1 = memref.transpose %0 (d0, d1, d2) -> (d1, d2, d0) // COM: JIT-LABEL: @jit_array_reorder_test2 diff --git a/accera/acc-translate/CMakeLists.txt b/accera/acc-translate/CMakeLists.txt index 9fb7df89..10d90976 100644 --- a/accera/acc-translate/CMakeLists.txt +++ b/accera/acc-translate/CMakeLists.txt @@ -3,6 +3,15 @@ # Licensed under the MIT License. See LICENSE in the project root for license information. 
#################################################################################################### +# setup for using LLVM and MLIR +list(APPEND CMAKE_MODULE_PATH "${LLVM_DIR}") +list(APPEND CMAKE_MODULE_PATH "${MLIR_CMAKE_DIR}") +include(TableGen) +include(AddLLVM) +include(AddMLIR) + +add_subdirectory(src) + set(util_name acc-translate) set(target_src @@ -18,6 +27,7 @@ set(target_src src/Target/Cpp/AMDGPU.cpp src/Target/Cpp/VectorDialectCppPrinter.cpp src/Target/Cpp/LLVMDialectCppPrinter.cpp + src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp ) set(target_include @@ -33,6 +43,7 @@ set(target_include src/Target/Cpp/AMDGPU.h src/Target/Cpp/VectorDialectCppPrinter.h src/Target/Cpp/LLVMDialectCppPrinter.h + src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h ) @@ -45,6 +56,9 @@ source_group("include" FILES ${util_include}) add_executable(${util_name} ${util_src} ${util_include}) target_include_directories(${util_name} PRIVATE ${ACCERA_ROOT}/accera) +get_property(dialect_libs GLOBAL PROPERTY MLIR_DIALECT_LIBS) +get_property(translation_libs GLOBAL PROPERTY MLIR_TRANSLATION_LIBS) + target_link_libraries( ${util_name} PRIVATE MLIROptLib @@ -53,6 +67,8 @@ target_link_libraries( transforms value mlirHelpers + ${translation_libs} + ${dialect_libs} ) copy_shared_libraries(${util_name}) diff --git a/accera/acc-translate/src/AcceraTranslateMain.cpp b/accera/acc-translate/src/AcceraTranslateMain.cpp index 8360b341..4fc427f4 100644 --- a/accera/acc-translate/src/AcceraTranslateMain.cpp +++ b/accera/acc-translate/src/AcceraTranslateMain.cpp @@ -5,9 +5,13 @@ //////////////////////////////////////////////////////////////////////////////////////////////////// #include +#include #include #include +#include +#include + #include #include #include @@ -20,6 +24,8 @@ #include "Target/Cpp/TranslateToCpp.h" +#include "Target/LLVMIR/IntrinsicToLLVMIRTranslation.h" + using namespace mlir; @@ -50,11 +56,33 @@ inline void registerArgoTranslations() return true; }(); } + +void registerAcceraToLLVMIRTranslation() { + TranslateFromMLIRRegistration registration( + "acc-to-llvmir", + [](ModuleOp module, llvm::raw_ostream &output) { + llvm::LLVMContext llvmContext; + auto llvmModule = translateModuleToLLVMIR(module, llvmContext); + if (!llvmModule) + return failure(); + + llvmModule->print(output, nullptr); + return success(); + }, + [](DialectRegistry ®istry) { + registerAllDialects(registry); + accera::ir::GetDialectRegistry().appendTo(registry); + accera::transforms::intrinsics::registerIntrinsicsDialectTranslation(registry); + registerAllToLLVMIRTranslations(registry); + }); +} } // namespace int main(int argc, char** argv) { registerArgoTranslations(); + registerAcceraToLLVMIRTranslation(); + mlir::registerAllTranslations(); return failed(mlirTranslateMain(argc, argv, "acc-translate")); } diff --git a/accera/acc-translate/src/CMakeLists.txt b/accera/acc-translate/src/CMakeLists.txt new file mode 100644 index 00000000..a7edbd82 --- /dev/null +++ b/accera/acc-translate/src/CMakeLists.txt @@ -0,0 +1,7 @@ +#################################################################################################### +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. See LICENSE in the project root for license information. 
+#################################################################################################### + +add_subdirectory(Target) + diff --git a/accera/acc-translate/src/Target/CMakeLists.txt b/accera/acc-translate/src/Target/CMakeLists.txt new file mode 100644 index 00000000..6ce7b2ba --- /dev/null +++ b/accera/acc-translate/src/Target/CMakeLists.txt @@ -0,0 +1,6 @@ +#################################################################################################### +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. See LICENSE in the project root for license information. +#################################################################################################### + +add_subdirectory(LLVMIR) diff --git a/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp index 5505582b..d1ec2a10 100644 --- a/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp @@ -203,7 +203,7 @@ namespace cpp_printer const auto srcMemSpace = srcMemrefType.getMemorySpaceAsInt(); auto elementType = srcMemrefType.getElementType(); AffineDialectCppPrinter* affineDialectPrinter = dynamic_cast(printer->getDialectPrinter("Affine")); - auto srcMap = srcMemrefType.getLayout().getAffineMap(); + auto srcMap = mlir::getStridedLinearLayoutMap(srcMemrefType); const auto srcRowMajor = mlir::canonicalizeStridedLayout(srcMemrefType).getLayout().isIdentity(); auto dstMemrefType = blockLoadOp.dest().getType().cast(); diff --git a/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp index b70c2731..9c02dcdf 100644 --- a/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp @@ -47,7 +47,12 @@ void AffineMapVisitor::visit(Type type) } else if (auto memRefType = type.dyn_cast()) { - visit(AffineMapAttr::get(memRefType.getLayout().getAffineMap())); + // Flatten the memref layout map to a N-D -> 1-D map + // This will convert the map for an identity mapped layout like memref<16x16xf32> + // from (d0, d1) -> (d0, d1) + // to (d0, d1) -> (d0 * 16 + d1) + auto stridedLinearLayoutMap = mlir::getStridedLinearLayoutMap(memRefType); + visit(AffineMapAttr::get(stridedLinearLayoutMap)); } else if (auto shapedType = type.dyn_cast()) { diff --git a/accera/acc-translate/src/Target/LLVMIR/CMakeLists.txt b/accera/acc-translate/src/Target/LLVMIR/CMakeLists.txt new file mode 100644 index 00000000..58994a74 --- /dev/null +++ b/accera/acc-translate/src/Target/LLVMIR/CMakeLists.txt @@ -0,0 +1,24 @@ +add_mlir_translation_library(IntrinsicToLLVMIRTranslation + IntrinsicToLLVMIRTranslation.cpp + + ADDITIONAL_HEADER_DIRS + ${ACCERA_BIN_DIR}/accera/ir/include + + DEPENDS + MLIRAcceraIntrinsics + AcceraIntrinsicsConversionsIncGen + + LINK_COMPONENTS + Core + + LINK_LIBS PUBLIC + MLIRIR + MLIRAcceraIntrinsics + MLIRLLVMIR + MLIRSupport + MLIRTargetLLVMIRExport + ) + +target_include_directories(IntrinsicToLLVMIRTranslation PUBLIC + ${ACCERA_BIN_DIR}/ir/include +) diff --git a/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp b/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp new file mode 100644 index 00000000..e1647492 --- /dev/null +++ b/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.cpp @@ -0,0 +1,50 @@ 
+//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +#include "IntrinsicToLLVMIRTranslation.h" + +#include + +#include "mlir/IR/Operation.h" +#include "mlir/Target/LLVMIR/ModuleTranslation.h" + +#include "llvm/IR/IRBuilder.h" +#include "llvm/IR/IntrinsicsX86.h" + +using namespace mlir; +using namespace mlir::LLVM; +using namespace accera::transforms::intrinsics; + +namespace { +class IntrinsicsDialectLLVMIRTranslationInterface + : public LLVMTranslationDialectInterface { +public: + using LLVMTranslationDialectInterface::LLVMTranslationDialectInterface; + + /// Translates the given operation to LLVM IR using the provided IR builder + /// and saving the state in `moduleTranslation`. + LogicalResult + convertOperation(Operation *op, llvm::IRBuilderBase &builder, + LLVM::ModuleTranslation &moduleTranslation) const final { + Operation &opInst = *op; +#include "intrinsics/AcceraIntrinsicsConversions.inc" + + return failure(); + } +}; +} // namespace + +void accera::transforms::intrinsics::registerIntrinsicsDialectTranslation(DialectRegistry ®istry) { + registry.insert(); + registry.addDialectInterface(); +} + +void accera::transforms::intrinsics::registerIntrinsicsDialectTranslation(MLIRContext &context) { + DialectRegistry registry; + registerIntrinsicsDialectTranslation(registry); + context.appendDialectRegistry(registry); +} diff --git a/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h b/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h new file mode 100644 index 00000000..5a27797b --- /dev/null +++ b/accera/acc-translate/src/Target/LLVMIR/IntrinsicToLLVMIRTranslation.h @@ -0,0 +1,27 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. +//////////////////////////////////////////////////////////////////////////////////////////////////// + +#pragma once + +namespace mlir +{ + +class DialectRegistry; +class MLIRContext; + +} // namespace mlir + +namespace accera::transforms::intrinsics +{ + +/// Register the Intrinsic dialect and the translation from it to the LLVM IR +/// in the given registry; +void registerIntrinsicsDialectTranslation(mlir::DialectRegistry& registry); + +/// Register the Intrinsic dialect and the translation from it in the registry +/// associated with the given context. 
+void registerIntrinsicsDialectTranslation(mlir::MLIRContext& context); + +} // namespace accera::transforms::intrinsics diff --git a/accera/accc/accc.py b/accera/accc/accc.py index 08788acd..5871498b 100644 --- a/accera/accc/accc.py +++ b/accera/accc/accc.py @@ -98,7 +98,7 @@ def bstr(val): DEFAULT_ACC_TRANSLATE_ARGS = [] -DEFAULT_MLIR_TRANSLATE_ARGS = ["--mlir-print-op-on-diagnostic", "--mlir-to-llvmir"] +DEFAULT_MLIR_TRANSLATE_ARGS = ["--mlir-print-op-on-diagnostic", "--acc-to-llvmir"] LLVM_TOOLING_OPTS = { SystemTarget.HOST.value: ["-O3", "-fp-contract=fast", "-mcpu=native"], @@ -120,9 +120,17 @@ def bstr(val): ], } -DEFAULT_OPT_ARGS = [] +DEFAULT_LLVM_TOOLING_OPTS = [ + '--enable-unsafe-fp-math', + '--enable-no-infs-fp-math', + '--enable-no-nans-fp-math', + '--enable-no-signed-zeros-fp-math', + '--enable-no-trapping-fp-math' +] -DEFAULT_LLC_ARGS = ["-relocation-model=pic"] +DEFAULT_OPT_ARGS = DEFAULT_LLVM_TOOLING_OPTS + [] + +DEFAULT_LLC_ARGS = DEFAULT_LLVM_TOOLING_OPTS + ["-relocation-model=pic"] def get_default_deploy_shared_libraries(target=CPU_TARGET): @@ -818,7 +826,7 @@ def translate_mlir_with_mlir_translate( stdout = None stderr = None for module_file_set in self.module_file_sets: - mlir_translate_exe = os.path.abspath(ACCCConfig.mlir_translate) + mlir_translate_exe = os.path.abspath(ACCCConfig.acc_translate) full_mlir_translate_args = [] # empty list every iteration full_mlir_translate_args += mlir_translate_args or DEFAULT_MLIR_TRANSLATE_ARGS full_mlir_translate_args += [f'-o="{module_file_set.translated_ll_filepath}"'] diff --git a/accera/ir/CMakeLists.txt b/accera/ir/CMakeLists.txt index f0d6607a..6881b8fe 100644 --- a/accera/ir/CMakeLists.txt +++ b/accera/ir/CMakeLists.txt @@ -32,6 +32,12 @@ set(include include/TranslateToHeader.h ) +set(intrinsics_src + src/intrinsics/AcceraIntrinsicsDialect.cpp + ) +set(intrinsics_include + include/intrinsics/AcceraIntrinsicsDialect.h) + set(accvalue_src src/value/ValueDialect.cpp src/value/ValueCanonicalization.cpp @@ -113,6 +119,21 @@ set(argo_include include/argo/Utils.h ) +add_mlir_dialect_library(MLIRAcceraIntrinsics # This is an accera dialect, but the add_mlir_dialect() cmake function prepends "MLIR" + ${intrinsics_src} + + ADDITIONAL_HEADER_DIRS + ${CMAKE_CURRENT_SOURCE_DIR}/include + + DEPENDS + MLIRAcceraIntrinsicsIncGen + + LINK_LIBS PUBLIC + MLIRIR + ) + +InstallAcceraLibrary(MLIRAcceraIntrinsics) + # This is supposed to be overriden on the command line As of LLVM 8.0.1, the # possible values within the list are: AArch64 AMDGPU ARM BPF Hexagon Lanai Mips # MSP430 NVPTX PowerPC Sparc SystemZ WebAssembly X86 XCore @@ -160,6 +181,7 @@ set(src ${accexec_src} ${accera_src} ${argo_src} + ${intrinsics_src} ) set(include @@ -169,6 +191,7 @@ set(include ${accexec_include} ${accera_include} ${argo_include} + ${intrinsics_include} build/LLVMEmitterTargets.h ) @@ -182,6 +205,15 @@ target_include_directories( $ ) +target_include_directories( + MLIRAcceraIntrinsics PRIVATE ${CMAKE_CURRENT_BINARY_DIR} include + PUBLIC + $ + $ + $ + $ +) + target_include_directories(${library_name} SYSTEM PUBLIC ${LLVM_INCLUDE_DIRS}) target_link_libraries( ${library_name} @@ -207,6 +239,8 @@ add_dependencies( AcceraOpsIncGen ValueAttrsIncGen ValueOpsIncGen + MLIRAcceraIntrinsicsIncGen + MLIRAcceraIntrinsics ArgoOpsIncGen ArgoStructuredOpsIncGen diff --git a/accera/ir/include/CMakeLists.txt b/accera/ir/include/CMakeLists.txt index 16667e3f..a1adb102 100644 --- a/accera/ir/include/CMakeLists.txt +++ b/accera/ir/include/CMakeLists.txt @@ -9,5 +9,6 @@ 
add_subdirectory(nest) add_subdirectory(exec) add_subdirectory(accera) add_subdirectory(value) +add_subdirectory(intrinsics) add_subdirectory(argo) diff --git a/accera/ir/include/Common.td b/accera/ir/include/Common.td index 2ece7891..bb04addb 100644 --- a/accera/ir/include/Common.td +++ b/accera/ir/include/Common.td @@ -84,6 +84,12 @@ def acc_NumericType : def acc_ScalarOrVectorNumericType : AnyTypeOf<[acc_NumericType, VectorOf<[acc_NumericType]>]>; +def acc_IntegerOrIntegerVectorNumericType : + AnyTypeOf<[AnyInteger, VectorOf<[AnyInteger]>]>; + +def acc_FloatOrFloatVectorNumericType : + AnyTypeOf<[AnyFloat, VectorOf<[AnyFloat]>]>; + class acc_Scalarlike : AnyTypeOf<[type, acc_ContainerOfTypeWithNumElements<[type], 1>]>; diff --git a/accera/ir/include/IRUtil.h b/accera/ir/include/IRUtil.h index 94bacdbc..8dfc3555 100644 --- a/accera/ir/include/IRUtil.h +++ b/accera/ir/include/IRUtil.h @@ -259,6 +259,7 @@ namespace util std::vector AffineValueMapToAffineApplyOps(mlir::OpBuilder& builder, mlir::Location loc, mlir::AffineValueMap affineValueMap); mlir::AffineValueMap SimplifyAffineValueMap(mlir::AffineValueMap affineValueMap); + mlir::Type CloneTypeWithNewElementType(mlir::Type type, mlir::Type newElementType); mlir::Type GetElementType(mlir::Type type); int64_t GetUniqueId(mlir::Operation* where); @@ -358,6 +359,8 @@ namespace util mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::StoreOp op); mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::LoadOp op); + void EraseOps(std::stack& opStack, mlir::PatternRewriter& rewriter); + struct TempOpCleanupGuard { TempOpCleanupGuard(std::stack* opStack, mlir::PatternRewriter& rewriter); @@ -400,11 +403,11 @@ namespace util mlir::Value GetGPUIndex(value::Processor idxType, mlir::OpBuilder& builder, mlir::Location& loc, ir::value::ExecutionRuntime execRuntime = ir::value::ExecutionRuntime::CUDA); - int64_t GetBlockDimSize(mlir::gpu::BlockDimOp op); - int64_t GetGridDimSize(mlir::gpu::GridDimOp op); + std::optional GetBlockDimSize(mlir::gpu::BlockDimOp op); + std::optional GetGridDimSize(mlir::gpu::GridDimOp op); - int64_t GetBlockDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId); - int64_t GetGridDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId); + std::optional GetBlockDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId); + std::optional GetGridDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId); // Gets the flattened thread ID of the current GPU thread within the context of the current block mlir::Value GetCurrentGPUBlockThreadID(mlir::OpBuilder& builder, mlir::Location loc); diff --git a/accera/ir/include/intrinsics/AcceraIntrinsics.td b/accera/ir/include/intrinsics/AcceraIntrinsics.td new file mode 100644 index 00000000..9173db9f --- /dev/null +++ b/accera/ir/include/intrinsics/AcceraIntrinsics.td @@ -0,0 +1,69 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +#ifndef ACCERA_intrinsic_OPS +#define ACCERA_intrinsic_OPS + +include "mlir/Dialect/LLVMIR/LLVMOpBase.td" +include "mlir/Interfaces/InferTypeOpInterface.td" + +def AcceraIntrinsics_Dialect : Dialect { + let name = "accintr"; + let cppNamespace = "::accera::ir::intrinsics"; +} + +// Implements the LLVM_IntrOpBase interface (from mlir/Dialect/LLVMIR/LLVMOpBase.td) +// rather than LLVM_OneResultIntrOp because we don't want to put this op in the llvm dialect. +// Otherwise it will screw up how the conversion is handled later in acc-translate. +// However, we still want the other args to be like those in LLVM_OneResultIntrOp and LLVM_IntrOp +def accintr_VpmaddwdOp : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [], // overloadedOperands + [NoSideEffect], // traits + 1>, // num results + Arguments<(ins LLVM_Type, LLVM_Type)>; + + +// TODO : this may not be needed when we have multi-dimensional reductions supporting max/min +def accintr_VmaxpsOp : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [], // overloadedOperands + [NoSideEffect], // traits + 1>, // num results + Arguments<(ins LLVM_Type, LLVM_Type)>; + +def accintr_VminpsOp : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [], // overloadedOperands + [NoSideEffect], // traits + 1>, // num results + Arguments<(ins LLVM_Type, LLVM_Type)>; + +// TODO : remove after the next llvm update. There is a new math::roundeven op that we can use +def accintr_RoundEvenOp : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [0], // overloadedOperands + [NoSideEffect, SameOperandsAndResultType], // traits + 1>, // num results + Arguments<(ins LLVM_Type)>; + +def accintr_RoundF32VecAVX2 : LLVM_IntrOpBase/llvm/include/llvm/IR/IntrinsicsX86.td ) + [], // overloadedResults + [], // overloadedOperands + [NoSideEffect], // traits + 1>, // num results + Arguments<(ins LLVM_Type)>; + +#endif // ACCERA_intrinsic_OPS diff --git a/accera/ir/include/intrinsics/AcceraIntrinsicsDialect.h b/accera/ir/include/intrinsics/AcceraIntrinsicsDialect.h new file mode 100644 index 00000000..833f4e6b --- /dev/null +++ b/accera/ir/include/intrinsics/AcceraIntrinsicsDialect.h @@ -0,0 +1,18 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. +//////////////////////////////////////////////////////////////////////////////////////////////////// + +#pragma once + +#include "mlir/IR/BuiltinTypes.h" +#include "mlir/IR/Dialect.h" +#include "mlir/IR/OpDefinition.h" +#include "mlir/IR/OpImplementation.h" +#include "mlir/Interfaces/InferTypeOpInterface.h" +#include "mlir/Interfaces/SideEffectInterfaces.h" + +#include "intrinsics/AcceraIntrinsicsDialect.h.inc" + +#define GET_OP_CLASSES +#include "intrinsics/AcceraIntrinsics.h.inc" diff --git a/accera/ir/include/intrinsics/CMakeLists.txt b/accera/ir/include/intrinsics/CMakeLists.txt new file mode 100644 index 00000000..9487f83b --- /dev/null +++ b/accera/ir/include/intrinsics/CMakeLists.txt @@ -0,0 +1,10 @@ +#################################################################################################### +# Copyright (c) Microsoft Corporation. All rights reserved. 
+# Licensed under the MIT License. See LICENSE in the project root for license information. +#################################################################################################### + +add_mlir_dialect(AcceraIntrinsics accintr) + +set(LLVM_TARGET_DEFINITIONS AcceraIntrinsics.td) +mlir_tablegen(AcceraIntrinsicsConversions.inc -gen-llvmir-conversions) +add_public_tablegen_target(AcceraIntrinsicsConversionsIncGen) diff --git a/accera/ir/include/value/ValueAttrs.td b/accera/ir/include/value/ValueAttrs.td index 7f2fec66..d3da5522 100644 --- a/accera/ir/include/value/ValueAttrs.td +++ b/accera/ir/include/value/ValueAttrs.td @@ -60,10 +60,12 @@ def ProcessorAttr : I64EnumAttr<"Processor", "processor for loop mapping", [ def MEMORY_ALLOC_GLOBAL : I64EnumAttrCase<"Global", 0>; def MEMORY_ALLOC_STACK : I64EnumAttrCase<"Stack", 1>; +def MEMORY_ALLOC_HEAP : I64EnumAttrCase<"Heap", 2>; +def MEMORY_ALLOC_THREAD_LOCAL : I64EnumAttrCase<"ThreadLocal", 3>; // TODO : include in enum below and plumb through to python DSL and add appropriate lowering rewrite def MemoryAllocTypeAttr : I64EnumAttr< "MemoryAllocType", "Describes the memory type in which an allocation resides.", - [ MEMORY_ALLOC_GLOBAL, MEMORY_ALLOC_STACK]> { + [ MEMORY_ALLOC_GLOBAL, MEMORY_ALLOC_STACK, MEMORY_ALLOC_HEAP]> { let cppNamespace = "::accera::ir::value"; } diff --git a/accera/ir/include/value/ValueDialect.h b/accera/ir/include/value/ValueDialect.h index b8a6eff4..1e5c2b41 100644 --- a/accera/ir/include/value/ValueDialect.h +++ b/accera/ir/include/value/ValueDialect.h @@ -7,6 +7,7 @@ #pragma once #include +#include #include #include #include @@ -15,6 +16,7 @@ #include #include #include +#include #include #include @@ -55,6 +57,7 @@ using mlir::FloatType; using mlir::FuncOp; using mlir::FunctionType; using mlir::IndexType; +using mlir::InferTypeOpInterface; using mlir::IntegerAttr; using mlir::Location; using mlir::LogicalResult; @@ -99,6 +102,7 @@ const mlir::StringRef RawPointerAPIAttrName = "accv.emit_raw_pointer_api"; const mlir::StringRef HeaderDeclAttrName = "accv.emit_header_decl"; const mlir::StringRef FunctionTagsAttrName = "accv.function_tags"; const mlir::StringRef NoInlineAttrName = "accv.no_inline"; +const mlir::StringRef NoInlineIntoAttrName = "accv.no_inline_into"; const mlir::StringRef BaseNameAttrName = "accv.base_name"; const mlir::StringRef DynamicArgSizeReferencesAttrName = "accv.dyn_arg_size_refs"; const mlir::StringRef UsagesAttrName = "accv.usages"; diff --git a/accera/ir/include/value/ValueOps.td b/accera/ir/include/value/ValueOps.td index 5f33cace..fa34447d 100644 --- a/accera/ir/include/value/ValueOps.td +++ b/accera/ir/include/value/ValueOps.td @@ -12,9 +12,12 @@ include "ir/include/value/ValueBase.td" include "ir/include/value/ValueAttrs.td" include "mlir/Interfaces/ControlFlowInterfaces.td" +include "mlir/Interfaces/InferTypeOpInterface.td" include "mlir/IR/FunctionInterfaces.td" include "mlir/Dialect/Affine/IR/AffineMemoryOpInterfaces.td" +include "mlir/Dialect/LLVMIR/LLVMOpBase.td" + def accv_ValueLambdaOp : accv_Op<"lambda", [ SymbolTable, Symbol, @@ -291,11 +294,15 @@ def accv_BINARY_OP_DIV : I64EnumAttrCase<"DIV", 3>; def accv_BINARY_OP_MOD : I64EnumAttrCase<"MOD", 4>; def accv_BINARY_OP_AND : I64EnumAttrCase<"LOGICAL_AND", 5>; def accv_BINARY_OP_OR : I64EnumAttrCase<"LOGICAL_OR", 6>; +def accv_BINARY_OP_MAX : I64EnumAttrCase<"MAX", 7>; +def accv_BINARY_OP_MIN : I64EnumAttrCase<"MIN", 8>; def accv_BinaryOpPredicateAttr : I64EnumAttr< "BinaryOpPredicate", "", - [accv_BINARY_OP_ADD, 
accv_BINARY_OP_SUB, accv_BINARY_OP_MUL, accv_BINARY_OP_DIV, accv_BINARY_OP_MOD, - accv_BINARY_OP_AND, accv_BINARY_OP_OR]> { + [accv_BINARY_OP_ADD, accv_BINARY_OP_SUB, + accv_BINARY_OP_MUL, accv_BINARY_OP_DIV, accv_BINARY_OP_MOD, + accv_BINARY_OP_AND, accv_BINARY_OP_OR, + accv_BINARY_OP_MAX, accv_BINARY_OP_MIN]> { let cppNamespace = "::accera::ir::value"; } @@ -374,6 +381,24 @@ def accv_CmpOp : accv_Op<"cmp", }]; } +// TODO : remove after the next llvm update. There is a new math::roundeven op that we can use +// TODO : add more control for rounding modes other than "roundeven" +def accv_RoundOp : accv_Op<"round", [NoSideEffect]> { + let description = [{ + Rounds a given floating point value to an integer of the same bitwidth according to the currently set rounding mode. + }]; + + let arguments = (ins acc_FloatOrFloatVectorNumericType:$val); + let results = (outs acc_IntegerOrIntegerVectorNumericType:$result); + + let extraClassDeclaration = [{ + static bool SupportsVectorization(int count) { + // TODO : generalize this for more target types than AVX-2 + return count == 8; + } + }]; +} + def accv_CopyOp : accv_Op<"copy"> { let description = [{ Copies the data in the input view into the output view. @@ -671,7 +696,7 @@ def accv_MemRefCastOp : accv_Op<"memref_cast", [SameOperandsAndResultShape]> { }]; } -def accv_CastOp : accv_Op<"cast"> { +def accv_CastOp : accv_Op<"cast", [NoSideEffect]> { let summary = "casting operation"; let description = [{ The `accv.cast` operation converts an element to an element of another type. @@ -1493,4 +1518,39 @@ def accv_MMAStoreSyncOp : accv_Op<"wmma_store_sync", [ let verifier = [{ return ::verify(*this); }]; } +// TODO : move to new dialect? +def accv_vpmaddwd : accv_Op<"vpmaddwd", [NoSideEffect]>{ + let summary = "vpmaddwd intrinsic operation"; + + let description = [{ + The `accv.vpmaddwd` operation lowers to the vpmaddwd LLVM intrinsic. + }]; + + let arguments = (ins AnyVector:$lhs, AnyVector:$rhs); // TODO : shape verification + let results = (outs AnyVector:$result); +} + +def accv_vmaxps : accv_Op<"vmaxps", [NoSideEffect]>{ + let summary = "vmaxps intrinsic operation"; + + let description = [{ + The `accv.vmaxps` operation lowers to the vmaxps LLVM intrinsic. + }]; + + let arguments = (ins AnyVector:$lhs, AnyVector:$rhs); // TODO : shape verification + let results = (outs AnyVector:$result); +} + +def accv_vminps : accv_Op<"vminps", [NoSideEffect]>{ + let summary = "vminps intrinsic operation"; + + let description = [{ + The `accv.vminps` operation lowers to the vminps LLVM intrinsic. 
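For context, a minimal sketch of the DSL-level pattern that exercises these max/min intrinsic ops, modeled on the vectorized max/min test added later in this change; the shapes, the 8-wide split, and the package/function names are illustrative choices, not requirements:
```python
from accera import Array, Nest, Package, ScalarType, max, min

M, N = 128, 256
A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N))
B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N))
C_max = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))
C_min = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))

nest = Nest((M, N))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    # Element-wise max/min; an 8-wide vectorized split gives these a chance to lower to vmaxps/vminps
    C_max[i, j] = max(A[i, j], B[i, j])
    C_min[i, j] = min(A[i, j], B[i, j])

sched = nest.create_schedule()
ii, jj = sched.tile({i: 4, j: 8})
sched.reorder(i, j, ii, jj)
plan = sched.create_plan()
plan.vectorize(ii)

package = Package()
package.add(plan, args=(A, B, C_max, C_min), base_name="vectorized_max_min")
package.build("vectorized_max_min", format=Package.Format.HAT_DYNAMIC, mode=Package.Mode.RELEASE, output_dir="build")
```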
+ }]; + + let arguments = (ins AnyVector:$lhs, AnyVector:$rhs); // TODO : shape verification + let results = (outs AnyVector:$result); +} + + #endif // ACCERA_accv_OPS diff --git a/accera/ir/src/DialectRegistry.cpp b/accera/ir/src/DialectRegistry.cpp index 6e6b63f8..e5d96fca 100644 --- a/accera/ir/src/DialectRegistry.cpp +++ b/accera/ir/src/DialectRegistry.cpp @@ -9,6 +9,7 @@ #include "nest/LoopNestOps.h" #include "accera/AcceraOps.h" #include "value/ValueDialect.h" +#include "intrinsics/AcceraIntrinsicsDialect.h" #include #include @@ -38,6 +39,7 @@ mlir::DialectRegistry& GetDialectRegistry() registry.insert SimplifyAffineValueMapToConstant(mlir::AffineValueMap affineValueMap) + { + auto simplified = SimplifyAffineValueMap(affineValueMap); + auto map = simplified.getAffineMap(); + if (map.isSingleConstant()) + { + return map.getSingleConstantResult(); + } + return std::nullopt; + } + + template + mlir::Type CloneTypeWithNewElementType(ShapedTy type, mlir::Type newElementType) + { + typename ShapedTy::Builder builder(type); + builder.setElementType(newElementType); + + return builder; + } + + mlir::Type CloneTypeWithNewElementType(mlir::Type type, mlir::Type newElementType) + { + auto result = + mlir::TypeSwitch(type) + .Case([&](mlir::MemRefType memrefType) { + return CloneTypeWithNewElementType(memrefType, newElementType); + }) + .Case([&](mlir::VectorType vectorType) { + return CloneTypeWithNewElementType(vectorType, newElementType); + }) + .Default([&](mlir::Type) { + return newElementType; + }); + return result; + } + mlir::Type GetElementType(mlir::Type type) { auto result = @@ -734,42 +770,42 @@ namespace util if (forOp.getLowerBoundMap().getNumResults() != 1) return mlir::failure(); + mlir::OpBuilder::InsertionGuard insertGuard(rewriter); + rewriter.setInsertionPoint(forOp); // Replaces all IV uses to its single iteration value. auto iv = forOp.getInductionVar(); - auto* parentBlock = forOp->getBlock(); + mlir::Value ivValueReplacement; if (!iv.use_empty()) { if (forOp.hasConstantLowerBound()) { - mlir::OpBuilder topBuilder(forOp->getParentOfType().getBody()); - auto constOp = topBuilder.create( + ivValueReplacement = rewriter.create( forOp.getLoc(), forOp.getConstantLowerBound()); - iv.replaceAllUsesWith(constOp); } else { auto lbOperands = forOp.getLowerBoundOperands(); auto lbMap = forOp.getLowerBoundMap(); - mlir::OpBuilder builder(parentBlock, mlir::Block::iterator(forOp)); - if (lbMap == builder.getDimIdentityMap()) + if (lbMap == rewriter.getDimIdentityMap()) { // No need of generating an affine.apply. - iv.replaceAllUsesWith(lbOperands[0]); + ivValueReplacement = lbOperands[0]; } else { - auto affineApplyOp = - builder.create(forOp.getLoc(), lbMap, lbOperands); - iv.replaceAllUsesWith(affineApplyOp); + ivValueReplacement = + rewriter.create(forOp.getLoc(), lbMap, lbOperands); } } + iv.replaceAllUsesWith(ivValueReplacement); } + // Move the loop body operations, except for its terminator, to the loop's // containing block. 
- rewriter.eraseOp(forOp.getBody()->getTerminator()); - parentBlock->getOperations().splice(mlir::Block::iterator(forOp), - forOp.getBody()->getOperations()); + // Erase the terminator so we don't merge it into the parent block + rewriter.eraseOp(forOp.getBody()->getTerminator()); + rewriter.mergeBlockBefore(forOp.getBody(), forOp, mlir::ValueRange{ ivValueReplacement }); rewriter.eraseOp(forOp); return mlir::success(); @@ -900,6 +936,17 @@ namespace util return GetMemRefIndexToMemoryLocationMap(context, op); } + void EraseOps(std::stack& opStack, mlir::PatternRewriter& rewriter) + { + while (!opStack.empty()) + { + auto eraseOp = opStack.top(); + assert(eraseOp->use_empty()); + rewriter.eraseOp(eraseOp); + opStack.pop(); + } + } + TempOpCleanupGuard::TempOpCleanupGuard(std::stack* opStack, mlir::PatternRewriter& rewriter) : _opStack(opStack), _rewriter(rewriter) @@ -907,13 +954,7 @@ namespace util TempOpCleanupGuard::~TempOpCleanupGuard() { - while (!_opStack->empty()) - { - auto eraseOp = _opStack->top(); - assert(eraseOp->use_empty()); - _rewriter.eraseOp(eraseOp); - _opStack->pop(); - } + EraseOps(*_opStack, _rewriter); } mlir::Attribute MemorySpaceToAttribute(const value::MemorySpace& memorySpace, mlir::MLIRContext* context) @@ -944,14 +985,25 @@ namespace util mlir::Type ToSignlessMLIRType(mlir::OpBuilder& builder, mlir::Type type) { - if (type.isIntOrFloat()) - { - if (auto width = type.getIntOrFloatBitWidth(); type.isInteger(width)) - { - return builder.getIntegerType(width); - } - } - return type; // pass-through, no signless change + auto result = + mlir::TypeSwitch(type) + .Case([&](mlir::MemRefType memrefType) -> mlir::Type { + return CloneTypeWithNewElementType(memrefType, ToSignlessMLIRType(builder, memrefType.getElementType())); + }) + .Case([&](mlir::VectorType vectorType) -> mlir::Type { + return CloneTypeWithNewElementType(vectorType, ToSignlessMLIRType(builder, vectorType.getElementType())); + }) + .Default([&](mlir::Type t) -> mlir::Type { + if (t.isIntOrFloat()) + { + if (auto width = t.getIntOrFloatBitWidth(); t.isInteger(width)) + { + return builder.getIntegerType(width); + } + } + return t; // pass-through, no signless change + }); + return result; } mlir::Value ToSignlessMLIRValue(mlir::OpBuilder& builder, mlir::Value value) @@ -1067,7 +1119,7 @@ namespace util }); } - int64_t GetBlockDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId) + std::optional GetBlockDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId) { if (auto gpuFunc = where->getParentOfType()) { @@ -1082,8 +1134,7 @@ namespace util mlir::Operation* vLambdaOp = where->getParentOfType(); if (vFuncOp == nullptr && vLambdaOp == nullptr) { - assert(false && "Can only resolve block dim size inside of a gpu::GPUFuncOp, ir::value::ValueFuncOp, or ir::value::ValueLambdaOp"); - return -1; + return std::nullopt; } // Prefer using the ValueLambdaOp as inner loopnests will be a ValueLambdaOp nested inside of a ValueFuncOp auto op = vLambdaOp != nullptr ? 
vLambdaOp : vFuncOp; @@ -1094,7 +1145,7 @@ namespace util } } - int64_t GetGridDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId) + std::optional GetGridDimSize(mlir::Operation* where, mlir::gpu::Dimension dimId) { if (auto gpuFunc = where->getParentOfType()) { @@ -1109,8 +1160,7 @@ namespace util mlir::Operation* vLambdaOp = where->getParentOfType(); if (vFuncOp == nullptr && vLambdaOp == nullptr) { - assert(false && "Can only resolve grid dim size inside of a gpu::GPUFuncOp, ir::value::ValueFuncOp, or ir::value::ValueLambdaOp"); - return -1; + return std::nullopt; } auto op = vLambdaOp != nullptr ? vLambdaOp : vFuncOp; auto gpuParams = GetGPUFuncLaunchInfo(op); @@ -1120,12 +1170,12 @@ namespace util } } - int64_t GetBlockDimSize(mlir::gpu::BlockDimOp op) + std::optional GetBlockDimSize(mlir::gpu::BlockDimOp op) { return GetBlockDimSize(op, op.dimension()); } - int64_t GetGridDimSize(mlir::gpu::GridDimOp op) + std::optional GetGridDimSize(mlir::gpu::GridDimOp op) { return GetGridDimSize(op, op.dimension()); } @@ -1147,9 +1197,9 @@ namespace util auto blockDimXOp = GetGPUIndex(vir::Processor::BlockDimX, builder, loc); auto blockDimYOp = GetGPUIndex(vir::Processor::BlockDimY, builder, loc); auto blockDimZOp = GetGPUIndex(vir::Processor::BlockDimZ, builder, loc); - if (GetBlockDimSize(blockDimZOp.getDefiningOp()) == 1) // 2D or 1D block + if (*(GetBlockDimSize(blockDimZOp.getDefiningOp())) == 1) // 2D or 1D block { - if (GetBlockDimSize(blockDimYOp.getDefiningOp()) == 1) + if (*(GetBlockDimSize(blockDimYOp.getDefiningOp())) == 1) { // 1D block auto flattenedTidMap = mlir::AffineMap::get(0, 1, threadXSym); diff --git a/accera/ir/src/intrinsics/AcceraIntrinsicsDialect.cpp b/accera/ir/src/intrinsics/AcceraIntrinsicsDialect.cpp new file mode 100644 index 00000000..8454ba30 --- /dev/null +++ b/accera/ir/src/intrinsics/AcceraIntrinsicsDialect.cpp @@ -0,0 +1,32 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +#include "ir/include/intrinsics/AcceraIntrinsicsDialect.h" + +#include "mlir/Dialect/LLVMIR/LLVMTypes.h" +#include "mlir/IR/Builders.h" +#include "mlir/IR/OpImplementation.h" +#include "mlir/IR/TypeUtilities.h" +#include "mlir/Interfaces/InferTypeOpInterface.h" + +using namespace mlir; + +#include "intrinsics/AcceraIntrinsicsDialect.cpp.inc" + +namespace accera::ir::intrinsics +{ + +void AcceraIntrinsicsDialect::initialize() +{ + addOperations< +#define GET_OP_LIST +#include "intrinsics/AcceraIntrinsics.cpp.inc" + >(); +} + +} // namespace accera::ir::intrinsics + +#define GET_OP_CLASSES +#include "intrinsics/AcceraIntrinsics.cpp.inc" diff --git a/accera/ir/src/nest/LoopNestAffineConstraints.cpp b/accera/ir/src/nest/LoopNestAffineConstraints.cpp index 2d0ddd21..3bf3ae44 100644 --- a/accera/ir/src/nest/LoopNestAffineConstraints.cpp +++ b/accera/ir/src/nest/LoopNestAffineConstraints.cpp @@ -95,11 +95,11 @@ struct SplitLoopInfo IdWrapper largestMainLoopIVId; }; -SplitLoopInfo AddSplitPartitionHelper(LoopNestAffineConstraints& cst, - const Index& loopIndex, - mlir::OpBuilder& builder, - mlir::Location loc, - int64_t stepSize) +std::optional AddSplitPartitionHelper(LoopNestAffineConstraints& cst, + const Index& loopIndex, + mlir::OpBuilder& builder, + mlir::Location loc, + int64_t stepSize) { // Get the [begin, end) range for this loop id LoopNestAffineConstraints resolveRangeCst = cst.Clone(); @@ -107,8 +107,20 @@ SplitLoopInfo AddSplitPartitionHelper(LoopNestAffineConstraints& cst, auto [beginValueMap, endValueMap] = resolveRangeCst.GetLowerAndUpperBound(loopIndex, builder, loc); // Produce a begin and end value using affine apply ops - mlir::Value beginVal = mlir::makeComposedAffineApply(builder, loc, beginValueMap.getAffineMap(), beginValueMap.getOperands()); - mlir::Value endVal = mlir::makeComposedAffineApply(builder, loc, endValueMap.getAffineMap(), endValueMap.getOperands()); + auto beginApplyOp = mlir::makeComposedAffineApply(builder, loc, beginValueMap.getAffineMap(), beginValueMap.getOperands()); + auto endApplyOp = mlir::makeComposedAffineApply(builder, loc, endValueMap.getAffineMap(), endValueMap.getOperands()); + + // If either the begin or end values are empty, then we've recursed into an empty part of the space and we should bail out without creating a loop + auto beginMap = beginApplyOp.getAffineMap(); + auto endMap = endApplyOp.getAffineMap(); + + if (beginMap.isEmpty() || endMap.isEmpty()) + { + return std::nullopt; + } + + mlir::Value beginVal = beginApplyOp.getResult(); + mlir::Value endVal = endApplyOp.getResult(); auto partitionInfo = MakeSplitPartition(builder, beginVal, endVal, stepSize); @@ -300,14 +312,19 @@ namespace loopnest auto levelScopedConstraints = Clone(); auto loopId = levelScopedConstraints.GetId(index); - auto partitionInfo = AddSplitPartitionHelper(levelScopedConstraints, - index, - builder, - loc, - splitSize); - + auto partitionInfoOpt = AddSplitPartitionHelper(levelScopedConstraints, + index, + builder, + loc, + splitSize); std::vector partitionedLoopConstraints; + if (!partitionInfoOpt.has_value()) + { + return partitionedLoopConstraints; + } + auto partitionInfo = *partitionInfoOpt; + // Main loop partition { // Fork the constraints for inside the main loop @@ -338,8 +355,7 @@ namespace loopnest // Set loop id equal to partition value inside the cleanup loop cleanupScopedConstraints.SetEqual(loopId, partitionInfo.partitionValueId); - // Bound 
loopId >= partition value. This is a looser constraint than we put on the mainScopedConstraints, but it is helpful - // for getting a simpler loop bound + // Bound loopId >= partition value. cleanupResolveConstraints.AddLowerBound(loopId, partitionInfo.partitionValueId); LoopPartitionConstraints cleanupPartitionConstraints(cleanupResolveConstraints, cleanupScopedConstraints); diff --git a/accera/ir/src/nest/LoopNestBuilder.cpp b/accera/ir/src/nest/LoopNestBuilder.cpp index e3ca7a96..3681d259 100644 --- a/accera/ir/src/nest/LoopNestBuilder.cpp +++ b/accera/ir/src/nest/LoopNestBuilder.cpp @@ -627,7 +627,7 @@ namespace loopnest // --> (0..1: S1), (0..N-1: S2), (N1-..N: S2, S3) // prefix of last partition matches entirety of second: move // --> (0..1: S1), (0..N: S2), (N1-..N: S3) - if (schedule.IsDone()) + if (schedule.IsDone() || loops.empty()) { return; } diff --git a/accera/python/accera/Debug.py b/accera/python/accera/Debug.py index f37ea762..0a5758a4 100644 --- a/accera/python/accera/Debug.py +++ b/accera/python/accera/Debug.py @@ -37,8 +37,6 @@ def add_check_allclose(package: Package, array: Array, atol: float = 1e-5, targe resolved_shape = [0 if isinstance(s, Dimension) else s for s in shape] shape_str = '_'.join(map(str, resolved_shape)) - shape = [Dimension(role=Dimension.Role.OUTPUT, value=x) if isinstance(x, Dimension) else x for x in shape] - # placeholders actual = Array(role=Array.Role.INPUT, element_type=element_type, shape=shape, layout=layout) desired = Array(role=Array.Role.INPUT, element_type=element_type, shape=shape, layout=layout) diff --git a/accera/python/accera/Package.py b/accera/python/accera/Package.py index 744951f4..a2c0ff4b 100644 --- a/accera/python/accera/Package.py +++ b/accera/python/accera/Package.py @@ -30,6 +30,11 @@ @singledispatch def _convert_arg(arg: _lang_python._lang._Valor): + if isinstance(arg, lang.Dimension): + arg._native_dim = _lang_python._lang.Scalar(arg) + return arg._native_dim + if isinstance(arg, _lang_python._lang.Scalar): + return _lang_python._lang.Scalar(arg) if arg.layout == _lang_python._MemoryLayout(): return _lang_python._lang.Scalar(arg) else: @@ -224,7 +229,7 @@ def add( base_name: str = "", parameters: Union[dict, List[dict]] = {}, function_opts: dict = {}, - auxiliary: dict = {}, + auxiliary: dict = {} ) -> Union["accera.Function", List["accera.Function"]]: """Adds a function to the package. If multiple parameters are provided, generates and adds them according to the parameter grid. @@ -242,6 +247,16 @@ def add( auxiliary: A dictionary of auxiliary metadata to include in the HAT package. 
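For illustration, a rough sketch of how `add` is typically called with the behavior this change introduces (passing `function_opts` such as `no_inline_into`/`public`, and the rejection of TEMP-role arrays in `args`); the array and function names below are placeholders:
```python
from accera import Array, Nest, Package, ScalarType

M, N = 16, 16
A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N))
B = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))
scratch = Array(role=Array.Role.TEMP, element_type=ScalarType.float32, shape=(M, N))  # internal-only storage

nest = Nest(shape=(M, N))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    B[i, j] += A[i, j]

package = Package()

# Internal helper: keep it out of the public API and do not inline other functions into it
package.add(nest, args=(A, B), base_name="accumulate_helper",
            function_opts={"no_inline_into": True, "public": False})

# TEMP arrays are defined inside a function, so listing one in args raises ValueError:
# package.add(nest, args=(A, scratch), base_name="bad_args")
```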
""" + # TEMP arrays in the args list are a programming error because they are meant to be internally defined in a function + # Note: this does not prevent TEMP arrays from being passed as an argument to a function, but they cannot be the + # api-defining arguments for the function + temp_array_pos = [] + for idx, arg in enumerate(args): + if isinstance(arg, lang.Array) and arg.role == lang.Array.Role.TEMP: + temp_array_pos.append(idx) + if len(temp_array_pos) > 0: + raise ValueError(f"Error in package.add() for function {base_name}: args includes TEMP array at positions {temp_array_pos}") + heuristic_parameters_dict = {} if isinstance(source, lang.Plan): heuristic_parameters_dict = self._create_mapping_of_heuristic_parameters_with_possible_values(source) @@ -274,7 +289,7 @@ def _add_function( base_name: str = "", parameters: dict = {}, function_opts: dict = {}, - auxiliary: dict = {}, + auxiliary: dict = {} ) -> "accera.Function": """Adds a function to the package. @@ -385,7 +400,7 @@ def compute_arg_size_references(args, SENTINEL_VALUE=-1): if isinstance(source, lang.Plan): self._dynamic_dependencies.update(source._dynamic_dependencies) source = source._create_function( - args, public=True, no_inline=function_opts.get("no_inline", False) + args, **function_opts ) # fall-through @@ -395,9 +410,8 @@ def compute_arg_size_references(args, SENTINEL_VALUE=-1): # due to the fall-through, we only need to validate here validate_target(source.target) - native_array_dim_args = [arg._get_native_array() if isinstance(arg, lang.Array) else arg._native_dim for arg in args ] + native_array_dim_args = [arg._get_native_array() if isinstance(arg, lang.Array) else arg._native_dim if isinstance(arg, lang.Dimension) else arg for arg in args ] - assert source.public source.name = get_function_name(source.target) source.base_name = base_name source.auxiliary = auxiliary_metadata @@ -422,15 +436,13 @@ def wrapper_fn(args): wrapped_func = lang.Function( name=name, base_name=base_name, - public=True, - decorated=function_opts.get("decorated", False), - no_inline=function_opts.get("no_inline", False), args=tuple(map(_convert_arg, args)), arg_size_references=compute_arg_size_references(args), requested_args=args, definition=wrapper_fn, auxiliary=auxiliary_metadata, target=Target.HOST, + **function_opts ) self._fns[name] = wrapped_func @@ -599,11 +611,9 @@ def build( if target.runtime in [Target.Runtime.CUDA, Target.Runtime.ROCM]: format |= Package.Format.HAT_SOURCE else: - format |= ( - Package.Format.HAT_STATIC - if cross_compile - else Package.Format.HAT_DYNAMIC - ) + format |= Package.Format.HAT_STATIC + if not cross_compile: + format |= Package.Format.HAT_DYNAMIC dynamic_link = bool(format & Package.Format.DYNAMIC_LIBRARY) if cross_compile and dynamic_link: @@ -805,14 +815,26 @@ def build( hat_file.Serialize(header_path) - if dynamic_link and (format & Package.Format.DYNAMIC_LIBRARY): - dyn_hat_path = f"{path_root}_dyn{extension}" - hat.create_dynamic_package(header_path, dyn_hat_path) - shutil.move(dyn_hat_path, header_path) - elif not cross_compile and (format & Package.Format.STATIC_LIBRARY): + if not cross_compile and (format & Package.Format.STATIC_LIBRARY): lib_hat_path = f"{path_root}_lib{extension}" hat.create_static_package(header_path, lib_hat_path) + + lib_hat_file = hat_file.Deserialize(lib_hat_path) + lib_hat_file.dependencies.auxiliary["static"] = lib_hat_file.dependencies.link_target + lib_hat_file.Serialize() + shutil.move(lib_hat_path, header_path) + + if dynamic_link: + dyn_hat_path = 
f"{path_root}_dyn{extension}" + hat.create_dynamic_package(header_path, dyn_hat_path) + + dyn_hat_file = hat_file.Deserialize(dyn_hat_path) + dyn_hat_file.dependencies.auxiliary["dynamic"] = dyn_hat_file.dependencies.link_target + dyn_hat_file.Serialize() + + shutil.move(dyn_hat_path, header_path) + # TODO: plumb cross-compilation of static libs return proj.module_file_sets diff --git a/accera/python/accera/Targets.py b/accera/python/accera/Targets.py index 746e6d96..63e4b72e 100644 --- a/accera/python/accera/Targets.py +++ b/accera/python/accera/Targets.py @@ -459,6 +459,7 @@ class Architecture(Enum): ["Intel E5-1650 v3", "Haswell", "Xeon E5", 3.5, 3.8, 6, 12, [48, 256, 15 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], ["Intel E5-1660 v3", "Haswell", "Xeon E5", 3.0, 3.5, 8, 16, [48, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], ["Intel E5-1680 v3", "Haswell", "Xeon E5", 3.2, 3.8, 8, 16, [48, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel E5-2620 v3", "Haswell", "Xeon E5", 2.4, 3.2, 6, 12, [48, 256, 15 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # AMD Zen # ref: https://en.wikipedia.org/wiki/Zen_(first_generation) diff --git a/accera/python/accera/__init__.py b/accera/python/accera/__init__.py index 38482454..2d1f167c 100644 --- a/accera/python/accera/__init__.py +++ b/accera/python/accera/__init__.py @@ -15,10 +15,10 @@ from .Package import Package from .lang import * -from ._lang_python import CompilerOptions, ScalarType, _GetTargetDeviceFromName +from ._lang_python import CompilerOptions, ScalarType, _GetTargetDeviceFromName, AllocateFlags from ._lang_python import ( abs, max, min, ceil, floor, sqrt, exp, log, log10, log2, sin, cos, tan, sinh, cosh, tanh, logical_and, logical_or, - logical_not, cast + logical_not, cast, round, remainderf ) # Global initialization diff --git a/accera/python/accera/lang/Array.py b/accera/python/accera/lang/Array.py index 3190e05c..b33bfc1c 100644 --- a/accera/python/accera/lang/Array.py +++ b/accera/python/accera/lang/Array.py @@ -8,7 +8,7 @@ from enum import Enum, auto from functools import partial -from .._lang_python import ScalarType, _MemoryLayout +from .._lang_python import ScalarType, _MemoryLayout, AllocateFlags from .._lang_python._lang import Array as NativeArray from .Layout import Layout, MemoryMapLayout from ..Parameter import DelayedParameter @@ -36,7 +36,8 @@ def __init__( element_type: Union["accera.ScalarType", type] = None, layout: Union["accera.Array.Layout", Tuple[int]] = Layout.FIRST_MAJOR, offset: int = 0, - shape: Tuple[Union[int, DelayedParameter, Dimension]] = None + shape: Tuple[Union[int, DelayedParameter, Dimension]] = None, + flags: "accera.AllocateFlags" = AllocateFlags.NONE ): """Creates an Array @@ -74,6 +75,7 @@ def __init__( self._shape = shape self._native_array = None self._delayed_calls = {} + self._flags = flags if self._role == Array.Role.CONST: if self._data is None: @@ -156,6 +158,10 @@ def role(self): def element_type(self): return self._element_type + @property + def flags(self): + return self._flags + @property def _value(self): if self._native_array: @@ -267,7 +273,7 @@ def _allocate(self): return # already contains data # Note: we are blowing away the original Value and replacing with a new allocated Value - self._native_array = NativeArray(Allocate(type=self._element_type, layout=self._layout)) + self._native_array = 
NativeArray(Allocate(type=self._element_type, layout=self._layout, flags=self._flags)) assert (not self._value.is_empty) diff --git a/accera/python/accera/lang/Dimension.py b/accera/python/accera/lang/Dimension.py index d4232098..ffaa90b4 100644 --- a/accera/python/accera/lang/Dimension.py +++ b/accera/python/accera/lang/Dimension.py @@ -17,6 +17,7 @@ class Dimension: class Role(Enum): "Defines the Dimension role" INPUT = (auto()) #: An input dimension (immutable and provided as an Accera function argument). + INPUT_OUTPUT = auto() #: An input/output dimension (mutable and updated by an Accera function). OUTPUT = auto() #: An output dimension (mutable and updated by an Accera function). def __init__( @@ -30,6 +31,7 @@ def __init__( self._role = role if value: + self._value = value if self._role != Dimension.Role.OUTPUT: raise ValueError("Only output dimension can accept the optional value to initialize itself") self._native_dim = value._native_dim if isinstance(value, Dimension) else Scalar(value) @@ -40,6 +42,21 @@ def __init__( def role(self): return self._role + @property + def value(self): + return self._value + + @value.setter + def value(self, val): + self._value = val + if self._role != Dimension.Role.OUTPUT: + raise ValueError("Only output dimension can accept the optional value to initialize itself") + self._native_dim = val._native_dim if isinstance(val, Dimension) else Scalar(val) + + @role.setter + def role(self, val): + self._role = val + def __eq__(self, other): return id(self) == id(other) diff --git a/accera/python/accera/lang/Function.py b/accera/python/accera/lang/Function.py index 51f7fb6a..9a86e300 100644 --- a/accera/python/accera/lang/Function.py +++ b/accera/python/accera/lang/Function.py @@ -44,21 +44,25 @@ def _(arg: Array): ) return arg._get_native_array() # unpack - -def role_to_usage(role): +def role_to_usage(arg): from .._lang_python import _FunctionParameterUsage - if role == Array.Role.INPUT or role == Dimension.Role.INPUT: - return _FunctionParameterUsage.INPUT + if isinstance(arg, Array) or isinstance(arg, Dimension): + role = arg.role + if role == Array.Role.INPUT or role == Dimension.Role.INPUT: + return _FunctionParameterUsage.INPUT + elif role == Dimension.Role.OUTPUT: + return _FunctionParameterUsage.OUTPUT + else: + return _FunctionParameterUsage.INPUT_OUTPUT else: - return _FunctionParameterUsage.INPUT_OUTPUT - + return _FunctionParameterUsage.INPUT @dataclass class Function: name: str = "" # base_name + _ + generated unique_id base_name: str = "" - public: bool = False + public: bool = True external: bool = False decorated: bool = True # do we want to expose this? 
requested_args: tuple = () # args as provided into Package.add @@ -66,7 +70,8 @@ class Function: arg_size_references: tuple = () # references from array args to dimension arg positions for dynamically sized arrays param_overrides: dict = field(default_factory=dict) # overrides for constants definition: Callable = None - no_inline: bool = False + no_inline: bool = False # no_inline == True means that this function cannot be inlined into other functions + no_inline_into: bool = False # no_inline_into == True means that this function cannot have other functions inlined into it auxiliary: dict = field(default_factory=dict) target: Target = Target.HOST output_verifiers: list = field(default_factory=list) @@ -87,13 +92,14 @@ def _emit(self): delayed_param.set_value(value) if self.args: - usages = [role_to_usage(arg.role) for arg in self.requested_args] + usages = [role_to_usage(arg) for arg in self.requested_args] self._native_fn.parameters(self.args, usages, self.arg_size_references) if self.output_verifiers: self._native_fn.outputVerifiers(self.output_verifiers) self._native_fn.inlinable(not self.no_inline) + self._native_fn.inlinable_into(not self.no_inline_into) sig = signature(self.definition) diff --git a/accera/python/accera/lang/Nest.py b/accera/python/accera/lang/Nest.py index 1c07dfa6..49988a45 100644 --- a/accera/python/accera/lang/Nest.py +++ b/accera/python/accera/lang/Nest.py @@ -152,7 +152,7 @@ def _get_captures_to_replace(self, logic_fn, context: NativeLoopNestContext): if v.role == Array.Role.TEMP: temp_array = NativeArray( - Allocate(type=v.element_type, layout=v.layout) + Allocate(type=v.element_type, layout=v.layout, flags=v.flags) ) captures_to_replace[k] = context.mapping[value_id] = temp_array elif v.role == Array.Role.CONST: @@ -208,6 +208,8 @@ def _build_native_context(self, context: NativeLoopNestContext): elif isinstance(x, Dimension): x._native_dim = Scalar(y) logic_args[id(x)] = x._native_dim + elif isinstance(x, Scalar): + logic_args[id(x)] = Scalar(y) else: logic_args[id(x)] = y diff --git a/accera/python/accera/lang/Plan.py b/accera/python/accera/lang/Plan.py index 40db40e2..afb81716 100644 --- a/accera/python/accera/lang/Plan.py +++ b/accera/python/accera/lang/Plan.py @@ -958,6 +958,18 @@ def _is_valid_block_size(self, block_dims) -> bool: block_size = block_dims[0] * block_dims[1] * block_dims[2] return block_size <= max_threads + def _erase_loops(self, indices: List[LoopIndex]): + for index in indices: + self._add_index_attr(index, "_erase") + + self._commands.append( + partial(self._erase_loops_delayed, indices) + ) + + def _erase_loops_delayed(self, indices: List[LoopIndex], context: NativeLoopNestContext): + for index in indices: + context.plan._erase_loop(context.mapping[id(index)]) + def _build_native_context(self, context: NativeLoopNestContext): target = self._target @@ -1067,7 +1079,7 @@ def nest_wrapper_fn(*args: List[List[_Valor]]): def _create_function( - plan: "Plan", args: List[Union[Array, Dimension]], public: bool = True, no_inline: bool = False + plan: "Plan", args: List[Union[Array, Dimension]], public: bool = True, **kwargs ) -> Function: from secrets import token_hex @@ -1078,8 +1090,8 @@ def _create_function( args=args, public=public, definition=_build_native_nest(plan, args), - no_inline=no_inline, target=plan._target, + **kwargs ) diff --git a/accera/python/accera/lang/__init__.py b/accera/python/accera/lang/__init__.py index e9fb73c3..3924507e 100644 --- a/accera/python/accera/lang/__init__.py +++ b/accera/python/accera/lang/__init__.py 
@@ -15,4 +15,4 @@ from .Function import Function from .LogicFunction import logic_function, LogicFunction from .LoopIndex import LoopIndex -from .Dimension import Dimension +from .Dimension import Dimension, create_dimensions \ No newline at end of file diff --git a/accera/python/accera/test/dsl_tests.py b/accera/python/accera/test/dsl_tests.py index aedc70d3..944713af 100644 --- a/accera/python/accera/test/dsl_tests.py +++ b/accera/python/accera/test/dsl_tests.py @@ -26,10 +26,12 @@ DEV_MODE = True sys.path.insert(1, os.getcwd()) -from accera import ScalarType, Array, Function, Nest, Target, Package, algorithms +from accera import ScalarType, Array, Function, Nest, Target, Package, algorithms, Dimension, cast, AllocateFlags from accera.test import verifiers from accera.test.test_utils import expectedFailure, FailedReason +INTERNAL_FUNCTION_OPTS = { "no_inline_into": True, "public": False } + TEST_MODE = Package.Mode.DEBUG if DEV_MODE else Package.Mode.RELEASE TEST_FORMAT = Package.Format.MLIR_DYNAMIC if DEV_MODE else Package.Format.HAT_DYNAMIC TEST_PACKAGE_DIR = "test_acccgen" @@ -451,6 +453,149 @@ def _(): correctness_check_values=correctness_check_values, ) + def test_array_vectorize_cast(self) -> None: + A = Array( + shape=(256, 32), + role=Array.Role.INPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.uint8, + ) + B = Array( + shape=(256, 32), + role=Array.Role.INPUT_OUTPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.int16, + ) + + nest = Nest(shape=(256, 32)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i, j] = A[i, j] + + sched = nest.create_schedule() + ii = sched.split(i, 4) + jj = sched.split(j, 16) + sched.reorder(i, j, ii, jj) + plan = sched.create_plan() + plan.vectorize(ii) # ii to in-place-unroll ii and vectorize jj + + A_test = np.random.random((256, 32)).astype(np.uint8) + B_test = np.random.random((256, 32)).astype(np.int16) + B_expected = np.ndarray((256, 32)).astype(np.int16) + B_expected[:,:] = A_test[:,:] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_expected), + } + self._verify_nest( + plan, + (A, B), + "test_array_vectorize_cast", + correctness_check_values=correctness_check_values + ) + + def test_interleaved_vectorize_cast(self) -> None: + shape = (64, 32, 8, 2) + A = Array( + shape=shape, + role=Array.Role.INPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.uint8, + ) + B = Array( + shape=shape, + role=Array.Role.INPUT_OUTPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.int16, + ) + + nest = Nest(shape=shape) + i, j, k, l = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i, j, k, l] = A[i, j, k, l] + + sched = nest.create_schedule() + plan = sched.create_plan() + plan.vectorize(k) + + A_test = np.random.random(shape).astype(np.uint8) + B_test = np.random.random(shape).astype(np.int16) + B_expected = np.ndarray(shape).astype(np.int16) + B_expected[:,:,:,:] = A_test[:,:,:,:] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_expected), + } + self._verify_nest( + plan, + (A, B), + "test_interleaved_vectorize_cast", + correctness_check_values=correctness_check_values + ) + + + def test_interleaved_vectorize_store(self) -> None: + M = 32 + N = 48 + M_tile = 2 + N_tile = 16 + input_shape = (M, N) + output_shape = (M // M_tile, N // N_tile, N_tile, M_tile) + A = Array( + shape=input_shape, + role=Array.Role.INPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.uint8, + ) + B = Array( + 
shape=output_shape, + role=Array.Role.INPUT_OUTPUT, + layout=Array.Layout.FIRST_MAJOR, + element_type=ScalarType.uint8, + ) + + nest = Nest(shape=output_shape) + i_outer, j_outer, j_inner, i_inner = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i_outer, j_outer, j_inner, i_inner] = A[i_outer*M_tile + i_inner, j_outer*N_tile + j_inner] + + sched = nest.create_schedule() + plan = sched.create_plan() + plan.vectorize(j_inner) + + A_test = np.random.random(input_shape).astype(np.uint8) + B_test = np.random.random(output_shape).astype(np.uint8) + B_expected = np.ndarray(output_shape).astype(np.uint8) + for i_outer in range(0, M, M_tile): + i_outer_idx = i_outer // M_tile + for j_outer in range(0, N, N_tile): + j_outer_idx = j_outer // N_tile + for j_inner in range(0, N_tile): + full_j = j_outer + j_inner + for i_inner in range(0, M_tile): + full_i = i_outer + i_inner + B_expected[i_outer_idx, j_outer_idx, j_inner, i_inner] = A_test[full_i, full_j] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_expected), + } + self._verify_nest( + plan, + (A, B), + "test_interleaved_vectorize_store", + correctness_check_values=correctness_check_values + ) + + def test_subarray(self) -> None: package = Package() @@ -1087,7 +1232,102 @@ def _(): self._verify_helper(package, test_name, function.name, correctness_check_values) + + def test_output_array_range_node1(self) -> None: + from accera import Dimension, create_dimensions, floor, cast + from accera._lang_python._lang import Scalar + + Start = Scalar(ScalarType.float32) + Limit = Scalar(ScalarType.float32) + Delta = Scalar(ScalarType.float32) + + InputDim = create_dimensions() + InputDim.role = Dimension.Role.INPUT + OutputDims = Array(shape=(1,), element_type=ScalarType.int64, role=Array.Role.INPUT_OUTPUT) + Output = Array(shape=(InputDim, ), role=Array.Role.INPUT_OUTPUT) + Output_Start = Array(shape=(1,), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT) + + nest1 = Nest((1, )) + @nest1.iteration_logic + def _(): + OutputDims[0] = cast(floor((Limit - Start) / Delta), ScalarType.int64) + + nest2 = Nest([InputDim]) + i = nest2.get_indices() + @nest2.iteration_logic + def _(): + Output[i] = Output_Start[0] + Output_Start[0] += Delta + + # Generate a function like: + # range_get_size(float start, float limit, float delta, int64_t* output_dim); + # range_get_result(int64_t input_dim, float* output, float* start, float delta); + + package = Package() + # BUGBUG: dim args ordered first due to issue with Debug mode + package.add(nest1, args=(Start, Limit, Delta, OutputDims), base_name=f"range_get_size") + package.add(nest2, args=(InputDim, Output, Output_Start, Delta), base_name=f"range_get_result") + + package.build("test_output_array_range_node1", format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR) + + + def test_output_array_range_node2(self) -> None: + from accera import Dimension, create_dimensions, floor, cast + from accera._lang_python._lang import Scalar + + Start = Scalar(ScalarType.float32) + Limit = Scalar(ScalarType.float32) + Delta = Scalar(ScalarType.float32) + + InputDim = create_dimensions() + InputDim.role = Dimension.Role.INPUT + + OutputDims = Array(shape=(1,), element_type=ScalarType.int64, role=Array.Role.INPUT_OUTPUT) + Output = Array(shape=(InputDim, ), role=Array.Role.INPUT_OUTPUT) + Output_Start = Array(shape=(1,), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT) + Output_Start_Tmp = Array(shape=(1,), 
element_type=ScalarType.float32, role=Array.Role.TEMP) + + nest1 = Nest((1, )) + @nest1.iteration_logic + def _(): + OutputDims[0] = cast(floor((Limit - Start) / Delta), ScalarType.int64) + + nest2 = Nest((1, )) + @nest2.iteration_logic + def _(): + Output_Start[0] = Start + + nest3 = Nest([InputDim]) + i = nest3.get_indices() + @nest3.iteration_logic + def _(): + Output[i] = Output_Start[0] + Output_Start[0] += Delta + + # Generate a function like: + # range_get_size(float start, float limit, float delta, int64_t* output_dim); + # ini_start(float* output_Start, float start); + # get_result(int64_t input_dim, float* output, float* start, float delta); + # range_get_output_array(int64_t input_dim, float* output, float start, float delta); + + package = Package() + # BUGBUG: dim args ordered first due to issue with Debug mode + package.add(nest1, args=(Start, Limit, Delta, OutputDims), base_name=f"range_get_size") + ini_start_fn = package.add(nest2, args=(Output_Start, Start), base_name=f"ini_start") + get_result_fn = package.add(nest3, args=(InputDim, Output, Output_Start, Delta), base_name=f"get_result") + + nest4 = Nest((1, )) + @nest4.iteration_logic + def _(): + ini_start_fn(Output_Start_Tmp, Start) + get_result_fn(InputDim, Output, Output_Start_Tmp, Delta) + + # BUGBUG: dim args ordered first due to issue with Debug mode + package.add(nest4, args=(InputDim, Output, Start, Delta), base_name=f"range_get_output_array") + + package.build("test_output_array_range_node2", format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR) + class DSLTest_02SimpleAffineLoopNests(unittest.TestCase): def _create_nest(self, shape: Tuple[int], type=ScalarType.float32) -> Tuple: @@ -1100,19 +1340,21 @@ def _create_nest(self, shape: Tuple[int], type=ScalarType.float32) -> Tuple: return Nest(shape=(M, N, S)), A, B, C - def _build_nest(self, nest, args: Tuple[Array], package_name, correctness_check_values=None) -> None: + def _build_nest(self, nest, args: Tuple[Array], package_name, correctness_check_values=None, quiet=True) -> None: # helper function to build a nest so that we can focus on the logic function # create a HAT package and add the nest to it package = Package() function = package.add(nest, args, base_name=package_name) # build the HAT package - with verifiers.VerifyPackage(self, package_name, TEST_PACKAGE_DIR) as v: + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + with verifiers.VerifyPackage(self, package_name, output_dir) as v: package.build( package_name, format=TEST_FORMAT, mode=TEST_MODE, - output_dir=TEST_PACKAGE_DIR, + output_dir=output_dir, + _quiet=quiet ) if correctness_check_values: v.check_correctness( @@ -1317,6 +1559,324 @@ def _(): self._build_nest(nest, [A, B, C], f"test_intrinsics_{t.name}") + + def test_round_intrinsic(self) -> None: + from accera import round as accround + + M = 16 + N = 8 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N)) + B = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N)) + + nest = Nest((M, N)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i, j] = accround(A[i, j]) + + A_test = np.random.uniform(low=-1000.0, high=1000.0, size=A.shape).astype(np.float32) + # Ensure there's at least one element which tests the roundeven behavior in both directions + A_test[0, 0] = 1.5 # Should round up to 2 + A_test[0, 1] = 2.5 # Should round down to 2 + B_test = np.zeros(B.shape).astype(np.int32) + + B_ref = 
A_test.round().astype(np.int32) + self.assertEqual(B_ref[0, 0], 2) + self.assertEqual(B_ref[0, 1], 2) + + correctness_check_values = { + "pre": [A_test, B_test], + "post": [A_test, B_ref] + } + + self._build_nest(nest, [A, B], "test_round_intrinsic", correctness_check_values=correctness_check_values) + + + def test_round_intrinsic_vectorized(self) -> None: + from accera import round as accround + + M = 256 + N = 128 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N)) + B = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N)) + + nest = Nest((M, N)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i, j] = accround(A[i, j]) + + sched = nest.create_schedule() + ii, jj = sched.tile({i: 4, j: 8}) + sched.reorder(i, j, ii, jj) + plan = sched.create_plan() + plan.vectorize(ii) + + A_test = np.random.uniform(low=-1000.0, high=1000.0, size=A.shape).astype(np.float32) + # Ensure there's at least one element which tests the roundeven behavior in both directions + A_test[0, 0] = 1.5 # Should round up to 2 + A_test[0, 1] = 2.5 # Should round down to 2 + B_test = np.zeros(B.shape).astype(np.int32) + + B_ref = A_test.round().astype(np.int32) + self.assertEqual(B_ref[0, 0], 2) + self.assertEqual(B_ref[0, 1], 2) + + correctness_check_values = { + "pre": [A_test, B_test], + "post": [A_test, B_ref] + } + + self._build_nest(plan, [A, B], "test_round_intrinsic_vectorized", correctness_check_values=correctness_check_values) + + + # TODO : fix this test - it appears to abort on just the linux buddy build machine + # def test_remainderf_intrinsic_rounding(self) -> None: + # from accera import remainderf, cast + + # M = 16 + # N = 8 + + # A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, N)) + # B = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N)) + + # nest = Nest((M, N)) + # i, j = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # B[i, j] = cast(A[i, j] - remainderf(A[i, j], 1.0), ScalarType.int32) + + # A_test = np.random.uniform(low=-1000.0, high=1000.0, size=A.shape).astype(np.float32) + # # Ensure there's at least one element which tests the roundeven behavior in both directions + # A_test[0, 0] = 1.5 # Should round up to 2 + # A_test[0, 1] = 2.5 # Should round down to 2 + # B_test = np.zeros(B.shape).astype(np.int32) + + # B_ref = A_test.round().astype(np.int32) + # self.assertEqual(B_ref[0, 0], 2) + # self.assertEqual(B_ref[0, 1], 2) + + # correctness_check_values = { + # "pre": [A_test, B_test], + # "post": [A_test, B_ref] + # } + + # self._build_nest(nest, [A, B], "test_remainderf_intrinsic_rounding", correctness_check_values=correctness_check_values) + + + def test_vectorized_max_min(self) -> None: + from accera import max, min + + M = 128 + N = 256 + + package = Package() + func_names = [] + package_name = "test_vectorized_max_min" + correctness_check_values = {} + for t in [ScalarType.float32]: + fn_name = f"test_vectorized_max_min_{t.name}" + func_names.append(fn_name) + + nest = Nest((M, N)) + A = Array(role=Array.Role.INPUT, element_type=t, shape=(M, N)) + B = Array(role=Array.Role.INPUT, element_type=t, shape=(M, N)) + C_max = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=(M, N)) + C_min = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=(M, N)) + + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + C_max[i, j] = max(A[i, j], B[i, j]) + C_min[i, j] = min(A[i, j], B[i, j]) + + sched = nest.create_schedule() + ii, 
jj = sched.tile({i: 4, j: 8}) + sched.reorder(i, j, ii, jj) + plan = sched.create_plan() + plan.vectorize(ii) + function = package.add(plan, args=(A, B, C_max, C_min), base_name=fn_name) + + A_test = np.random.random(A.shape).astype(np.float32) + B_test = np.random.random(B.shape).astype(np.float32) + C_max_test = np.random.random(C_max.shape).astype(np.float32) + C_min_test = np.random.random(C_min.shape).astype(np.float32) + + C_max_ref = np.maximum(A_test, B_test) + C_min_ref = np.minimum(A_test, B_test) + + correctness_check_values[fn_name] = { + "pre": [A_test, B_test, C_max_test, C_min_test], + "post": [A_test, B_test, C_max_ref, C_min_ref] + } + + # build the HAT package + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build( + package_name, + format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, + mode=Package.Mode.RELEASE, + output_dir=output_dir + ) + for fn_name in func_names: + if fn_name in correctness_check_values: + v.check_correctness( + function.name, + before=correctness_check_values[fn_name]["pre"], + after=correctness_check_values[fn_name]["post"], + ) + + + def test_vectorized_single_max_min_block(self) -> None: + # In this test we're trying to find the single max and single min value of a 2-D array. + # To vectorize this, we'll want to compute several maxs and mins in paralle and then reduce them + # Note: This type of reduction can't be achieved with caching, so we manually construct a pattern similar to caching + from accera import max, min + + M = 128 + N = 256 + + M_outer_tile = 8 + M_tile = 4 + N_tile = 8 + + package = Package() + func_names = [] + package_name = "test_vectorized_single_max_min_block" + correctness_check_values = {} + for t in [ScalarType.float32]: + fn_name = f"{package_name}_{t.name}" + func_names.append(fn_name) + + A = Array(role=Array.Role.INPUT, element_type=t, shape=(M, N)) + A_max = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=(1, )) + A_min = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=(1, )) + + A_max_cache = Array(role=Array.Role.TEMP, element_type=t, shape=(M_tile, N_tile), flags=AllocateFlags.STACK) + A_min_cache = Array(role=Array.Role.TEMP, element_type=t, shape=(M_tile, N_tile), flags=AllocateFlags.STACK) + + io_A_max_cache = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=A_max_cache.shape) + io_A_min_cache = Array(role=Array.Role.INPUT_OUTPUT, element_type=t, shape=A_min_cache.shape) + + outer_i_dim = Dimension() + outer_j_dim = Dimension() + + # inner compute nest + + inner_nest = Nest((M_tile, N_tile)) + inner_i, inner_j = inner_nest.get_indices() + @inner_nest.iteration_logic + def _(): + i = outer_i_dim + inner_i + j = outer_j_dim + inner_j + io_A_max_cache[inner_i, inner_j] = max(io_A_max_cache[inner_i, inner_j], A[i, j]) + io_A_min_cache[inner_i, inner_j] = min(io_A_min_cache[inner_i, inner_j], A[i, j]) + + inner_sched = inner_nest.create_schedule() + inner_plan = inner_sched.create_plan() + inner_plan.vectorize(inner_i) + inner_fn = package.add(inner_plan, args=(A, io_A_max_cache, io_A_min_cache, outer_i_dim, outer_j_dim), base_name=f"{fn_name}_inner", function_opts=INTERNAL_FUNCTION_OPTS) + + # Outer nest + outer_nest = Nest((M, N)) + outer_i, outer_j = outer_nest.get_indices() + @outer_nest.iteration_logic + def _(): + inner_fn(A, io_A_max_cache, io_A_min_cache, outer_i, outer_j) + + outer_sched = outer_nest.create_schedule() + outer_ii = outer_sched.split(outer_i, M_outer_tile) + outer_iii, 
outer_jj = outer_sched.tile({outer_ii: M_tile, outer_j: N_tile}) + outer_sched.reorder(outer_i, outer_j, outer_ii, outer_iii, outer_jj) + outer_plan = outer_sched.create_plan() + outer_plan._erase_loops([outer_iii, outer_jj]) + outer_fn = package.add(outer_plan, args=(A, io_A_max_cache, io_A_min_cache), base_name=f"{fn_name}_outer", function_opts=INTERNAL_FUNCTION_OPTS) + + + # Cache zeroing nests + + def _make_init_fn(package: Package, outer_arr: Array, arr: Array, base_name: str): + zero_nest = Nest(arr.shape) + indices = zero_nest.get_indices() + @zero_nest.iteration_logic + def _(): + arr[indices] = outer_arr[indices] + + return package.add(zero_nest, args=(outer_arr, arr), base_name=base_name, function_opts=INTERNAL_FUNCTION_OPTS) + + zero_max_cache_fn = _make_init_fn(package, A, io_A_max_cache, "max_cache_zeroing") + zero_min_cache_fn = _make_init_fn(package, A, io_A_min_cache, "min_cache_zeroing") + + # Cache reducing nests + + def _make_cache_reduce_fn(package: Package, cache: Array, outer_arr: Array, base_name: str, use_max): + reduce_nest = Nest(cache.shape) + indices = reduce_nest.get_indices() + if use_max: + @reduce_nest.iteration_logic + def _(): + outer_arr[0] = max(outer_arr[0], cache[indices]) + else: + @reduce_nest.iteration_logic + def _(): + outer_arr[0] = min(outer_arr[0], cache[indices]) + + return package.add(reduce_nest, args=(cache, outer_arr), base_name=base_name, function_opts=INTERNAL_FUNCTION_OPTS) + + reduce_max_cache_fn = _make_cache_reduce_fn(package, io_A_max_cache, A_max, "max_cache_reduce", True) + reduce_min_cache_fn = _make_cache_reduce_fn(package, io_A_min_cache, A_min, "min_cache_reduce", False) + + # outer nest + + top_nest = Nest((1,)) + + @top_nest.iteration_logic + def _(): + zero_max_cache_fn(A, A_max_cache) + zero_min_cache_fn(A, A_min_cache) + outer_fn(A, A_max_cache, A_min_cache) + reduce_max_cache_fn(A_max_cache, A_max) + reduce_min_cache_fn(A_min_cache, A_min) + + function = package.add(top_nest, args=(A, A_max, A_min), base_name=fn_name) + + A_test = np.random.random(A.shape).astype(np.float32) + A_max_test = np.random.random(A_max.shape).astype(np.float32) + A_min_test = np.random.random(A_min.shape).astype(np.float32) + + A_max_ref = np.max(A_test).reshape((1,)) + A_min_ref = np.min(A_test).reshape((1,)) + + correctness_check_values[fn_name] = { + "pre": [A_test, A_max_test, A_min_test], + "post": [A_test, A_max_ref, A_min_ref] + } + + # build the HAT package + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build( + package_name, + format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, + mode=Package.Mode.RELEASE, + output_dir=output_dir + ) + for fn_name in func_names: + if fn_name in correctness_check_values: + v.check_correctness( + function.name, + before=correctness_check_values[fn_name]["pre"], + after=correctness_check_values[fn_name]["post"], + ) + + def test_intrinsics_float(self) -> None: from accera import ( abs, @@ -1461,11 +2021,11 @@ def _(): schedule = nest.create_schedule() ii = schedule.split(i, 4) - iii = schedule.split(i, 2) - iiii = schedule.split(ii, 2) + iii = schedule.split(ii, 2) + iiii = schedule.split(iii, 2) for index in [ii, iii, iiii]: self.assertIsNotNone(index) - self.assertEqual(schedule._indices, [i, iii, ii, iiii, j, k]) + self.assertEqual(schedule._indices, [i, ii, iii, iiii, j, k]) self._verify_schedule(schedule, [A, B, C], "test_schedule_split1") # split size does not divide the dimension size @@ -1966,17 +2526,14 @@ 
def _(): class DSLTest_04Fusing(unittest.TestCase): - def _verify_schedule( - self, schedule, args: Tuple[Array], package_name, correctness_check_values, quiet=True + def _verify_func( + self, package, function, package_name, correctness_check_values, quiet=True, mode=TEST_MODE ) -> None: - # create a HAT package and add the function to it - package = Package() - function = package.add(schedule, args, base_name="fusing_test") output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name # build the HAT package with verifiers.VerifyPackage(self, package_name, output_dir) as v: - package.build(package_name, format=TEST_FORMAT, mode=TEST_MODE, output_dir=output_dir, _quiet=quiet) + package.build(package_name, format=TEST_FORMAT, mode=mode, output_dir=output_dir, _quiet=quiet) if correctness_check_values: v.check_correctness( function.name, @@ -1984,6 +2541,15 @@ def _verify_schedule( after=correctness_check_values["post"], ) + def _verify_schedule( + self, schedule, args: Tuple[Array], package_name, correctness_check_values, quiet=True + ) -> None: + # create a HAT package and add the function to it + package = Package() + function = package.add(schedule, args, base_name="fusing_test") + self._verify_func(package, function, package_name, correctness_check_values, quiet) + + def test_full_iteration_space_fusing(self) -> None: from accera import fuse, Nest @@ -2763,7 +3329,7 @@ def _(): @nest1.iteration_logic def _(): - C[i1, j1] = C[i1, j1] * 0.2 + C[i1, j1] = C[i1, j1] * 0.1 schedule1 = nest1.create_schedule() ii1, jj1 = schedule1.tile({ i1: M_tile, j1: N_tile }) @@ -2816,6 +3382,298 @@ def _(): self._verify_schedule(plan, (A, B, C), "test_hierarchical_partial_fuse", None) + def test_nested_nests_matmul(self): + test_name = "test_nested_nests_matmul" + + M = 20 + N = 32 + K = 12 + M_tile = 4 + N_tile = 16 + K_tile = 3 + + package = Package() + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + B_temp = Array(role=Array.Role.TEMP, element_type=ScalarType.float32, shape=(K_tile, N_tile)) + io_B_temp = Array(role=Array.Role.INPUT_OUTPUT, element_type=B_temp.element_type, shape=B_temp.shape) + + i_tile_idx = Dimension() + j_tile_idx = Dimension() + k_tile_idx = Dimension() + + pack_b_nest = Nest([K_tile, N_tile]) + pb_k, pb_j = pack_b_nest.get_indices() + + @pack_b_nest.iteration_logic + def _pack_b(): + full_k = pb_k + k_tile_idx + full_j = pb_j + j_tile_idx + io_B_temp[pb_k, pb_j] = B[full_k, full_j] + + pack_b_fn = package.add(pack_b_nest, args=(B, io_B_temp, j_tile_idx, k_tile_idx), base_name="pack_b_tile_fn") + + matmul_nest = Nest([M_tile, N_tile, K_tile]) + mm_i, mm_j, mm_k = matmul_nest.get_indices() + + @matmul_nest.iteration_logic + def _matmul(): + full_i = mm_i + i_tile_idx + full_j = mm_j + j_tile_idx + full_k = mm_k + k_tile_idx + C[full_i, full_j] += A[full_i, full_k] * io_B_temp[mm_k, mm_j] + + matmul_sched = matmul_nest.create_schedule() + mm_jj = matmul_sched.split(mm_j, 8) + matmul_sched.reorder(mm_k, mm_i, mm_j, mm_jj) + matmul_plan = matmul_sched.create_plan() + matmul_plan.vectorize(mm_jj) + matmul_fn = package.add(matmul_plan, args=(A, B, C, io_B_temp, i_tile_idx, j_tile_idx, k_tile_idx), base_name="matmul_tile_fn") + + tile_nest = Nest([M, N, K]) + i, j, k = tile_nest.get_indices() + + @tile_nest.iteration_logic + def _tile_logic(): + pack_b_fn(B, B_temp, j, k) + 
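# Each (i, j, k) reaching this body is a tile origin: pack_b_fn fills B_temp for the current tile and matmul_fn consumes it over the M_tile x N_tile x K_tile interior, which is why the inner ii/jj/kk loops are erased from the outer schedule below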
matmul_fn(A, B, C, B_temp, i, j, k) + + tile_sched = tile_nest.create_schedule() + ii, jj, kk = tile_sched.tile(dict(zip([i, j, k], [M_tile, N_tile, K_tile]))) + tile_sched.reorder(i, j, k, ii, jj, kk) + tile_plan = tile_sched.create_plan() + tile_plan._erase_loops([ii, jj, kk]) + full_fn = package.add(tile_plan, args=(A, B, C), base_name="full_matmul_fn") + + A_test = np.random.random(A.shape).astype(np.float32) + B_test = np.random.random(B.shape).astype(np.float32) + C_test = np.random.random(C.shape).astype(np.float32) + + A_ref = A_test + B_ref = B_test + C_ref = A_test @ B_test + C_test + + correctness_check_values = { + "pre": [A_test, B_test, C_test], + "post": [A_ref, B_ref, C_ref], + } + self._verify_func(package, full_fn, test_name, correctness_check_values, quiet=False, mode=Package.Mode.RELEASE) + + + def test_nested_nests_matmul_boundary(self): + test_name = "test_nested_nests_matmul_boundary" + from accera import min, Dimension + + M = 20 + N = 32 + K = 12 + M_tile = 4 + N_tile = 12 # 32 doesn't divide 12 so we should have an 8 element boundary in the N dimension + K_tile = 3 + + package = Package() + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + B_temp = Array(role=Array.Role.TEMP, element_type=ScalarType.float32, shape=(K_tile, N_tile)) + io_B_temp = Array(role=Array.Role.INPUT_OUTPUT, element_type=B_temp.element_type, shape=B_temp.shape) + + i_tile_idx = Dimension() + j_tile_idx = Dimension() + k_tile_idx = Dimension() + + n_tile_dim = Dimension() + + pack_b_nest = Nest([K_tile, n_tile_dim]) + pb_k, pb_j = pack_b_nest.get_indices() + + @pack_b_nest.iteration_logic + def _pack_b(): + full_k = pb_k + k_tile_idx + full_j = pb_j + j_tile_idx + io_B_temp[pb_k, pb_j] = B[full_k, full_j] + + pack_b_fn = package.add(pack_b_nest, args=(n_tile_dim, B, io_B_temp, j_tile_idx, k_tile_idx), base_name="pack_b_tile_fn") + + matmul_nest = Nest([M_tile, n_tile_dim, K_tile]) + mm_i, mm_j, mm_k = matmul_nest.get_indices() + + @matmul_nest.iteration_logic + def _matmul(): + full_i = mm_i + i_tile_idx + full_j = mm_j + j_tile_idx + full_k = mm_k + k_tile_idx + C[full_i, full_j] += A[full_i, full_k] * io_B_temp[mm_k, mm_j] + + matmul_sched = matmul_nest.create_schedule() + mm_jj = matmul_sched.split(mm_j, 8) + matmul_sched.reorder(mm_k, mm_i, mm_j, mm_jj) + matmul_plan = matmul_sched.create_plan() + matmul_fn = package.add(matmul_plan, args=(n_tile_dim, A, B, C, io_B_temp, i_tile_idx, j_tile_idx, k_tile_idx), base_name="matmul_tile_fn") + + tile_nest = Nest([M, N, K]) + i, j, k = tile_nest.get_indices() + + @tile_nest.iteration_logic + def _tile_logic(): + n_tile_extent = min(cast(N_tile, ScalarType.index), cast(N, ScalarType.index) - j) + pack_b_fn(n_tile_extent, B, B_temp, j, k) + matmul_fn(n_tile_extent, A, B, C, B_temp, i, j, k) + + tile_sched = tile_nest.create_schedule() + ii, jj, kk = tile_sched.tile(dict(zip([i, j, k], [M_tile, N_tile, K_tile]))) + tile_sched.reorder(i, j, k, ii, jj, kk) + tile_plan = tile_sched.create_plan() + tile_plan._erase_loops([ii, jj, kk]) + full_fn = package.add(tile_plan, args=(A, B, C), base_name="full_matmul_fn") + + A_test = np.random.random(A.shape).astype(np.float32) + B_test = np.random.random(B.shape).astype(np.float32) + C_test = np.random.random(C.shape).astype(np.float32) + + A_ref = A_test + B_ref = B_test + C_ref = A_test @ B_test + C_test + 
+ correctness_check_values = { + "pre": [A_test, B_test, C_test], + "post": [A_ref, B_ref, C_ref], + } + self._verify_func(package, full_fn, test_name, correctness_check_values, quiet=False, mode=Package.Mode.RELEASE) + + + def test_double_nested_nests_matmul_boundary(self): + test_name = "test_double_nested_nests_matmul_boundary" + from accera import min, Dimension + + M = 20 + N = 32 + K = 12 + M_tile = 4 + N_tile = 12 # 32 doesn't divide 12 so we should have an 8 element boundary in the N dimension + N_kernel_tile = 8 # Doesn't divide N_tile so we should have a 4 element boundary in the N dimension in the outer main loop and no inner boundary in the outer boundary loop + K_tile = 3 + + package = Package() + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + B_temp = Array(role=Array.Role.TEMP, element_type=ScalarType.float32, shape=(K_tile, N_tile)) + io_B_temp = Array(role=Array.Role.INPUT_OUTPUT, element_type=B_temp.element_type, shape=B_temp.shape) + + n_tile_dim = Dimension() + n_kernel_dim = Dimension() + + i_tile_idx = Dimension() + j_tile_idx = Dimension() + k_tile_idx = Dimension() + + i_kernel_idx = Dimension() + j_kernel_idx = Dimension() + k_kernel_idx = Dimension() + + pack_b_nest = Nest([K_tile, n_tile_dim]) + pb_k, pb_j = pack_b_nest.get_indices() + + @pack_b_nest.iteration_logic + def _pack_b(): + full_k = pb_k + k_tile_idx + full_j = pb_j + i_tile_idx + io_B_temp[pb_k, pb_j] = B[full_k, full_j] + + pack_b_fn = package.add( + pack_b_nest, + args=(n_tile_dim, B, io_B_temp, i_tile_idx, k_tile_idx), + base_name="pack_b_tile_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + matmul_kernel_nest = Nest((n_kernel_dim,)) + mmk_j = matmul_kernel_nest.get_indices() + + @matmul_kernel_nest.iteration_logic + def _matmul(): + tile_j = mmk_j + j_kernel_idx + + full_i = i_kernel_idx + i_tile_idx + full_j = tile_j + j_tile_idx + full_k = k_kernel_idx + k_tile_idx + C[full_i, full_j] += A[full_i, full_k] * io_B_temp[k_kernel_idx, tile_j] + + matmul_kernel_sched = matmul_kernel_nest.create_schedule() + mmk_jj = matmul_kernel_sched.split(mmk_j, N_kernel_tile) + matmul_kernel_sched.reorder(mmk_j, mmk_jj) + matmul_kernel_plan = matmul_kernel_sched.create_plan() + matmul_kernel_fn = package.add(matmul_kernel_plan, + args=(n_kernel_dim, + A, B, C, io_B_temp, + i_tile_idx, j_tile_idx, k_tile_idx, + i_kernel_idx, j_kernel_idx, k_kernel_idx), + base_name="matmul_kernel_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + + matmul_tile_nest = Nest([M_tile, n_tile_dim, K_tile]) + mm_i, mm_j, mm_k = matmul_tile_nest.get_indices() + + @matmul_tile_nest.iteration_logic + def _matmul(): + n_kernel_extent = min(cast(N_kernel_tile, ScalarType.index), n_tile_dim - mm_j) + matmul_kernel_fn(n_kernel_extent, + A, B, C, io_B_temp, + i_tile_idx, j_tile_idx, k_tile_idx, + mm_i, mm_j, mm_k) + + matmul_tile_sched = matmul_tile_nest.create_schedule() + mm_jj = matmul_tile_sched.split(mm_j, N_tile) + mm_jjj = matmul_tile_sched.split(mm_jj, N_kernel_tile) + matmul_tile_sched.reorder(mm_k, mm_i, mm_j, mm_jj, mm_jjj) + matmul_tile_plan = matmul_tile_sched.create_plan() + matmul_tile_plan._erase_loops([mm_jjj]) + matmul_tile_fn = package.add( + matmul_tile_plan, + args=(n_tile_dim, A, B, C, io_B_temp, i_tile_idx, j_tile_idx, k_tile_idx), + base_name="matmul_tile_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + + tile_nest = 
Nest([M, N, K]) + i, j, k = tile_nest.get_indices() + + @tile_nest.iteration_logic + def _tile_logic(): + n_tile_extent = min(cast(N_tile, ScalarType.index), cast(N, ScalarType.index) - j) + pack_b_fn(n_tile_extent, B, B_temp, j, k) + matmul_tile_fn(n_tile_extent, A, B, C, B_temp, i, j, k) + + tile_sched = tile_nest.create_schedule() + ii, jj, kk = tile_sched.tile(dict(zip([i, j, k], [M_tile, N_tile, K_tile]))) + tile_sched.reorder(i, j, k, ii, jj, kk) + tile_plan = tile_sched.create_plan() + tile_plan._erase_loops([ii, jj, kk]) + full_fn = package.add(tile_plan, args=(A, B, C), base_name="full_matmul_fn") + + A_test = np.random.random(A.shape).astype(np.float32) + B_test = np.random.random(B.shape).astype(np.float32) + C_test = np.random.random(C.shape).astype(np.float32) + + A_ref = A_test + B_ref = B_test + C_ref = A_test @ B_test + C_test + + correctness_check_values = { + "pre": [A_test, B_test, C_test], + "post": [A_ref, B_ref, C_ref], + } + self._verify_func(package, full_fn, test_name, correctness_check_values, quiet=False, mode=Package.Mode.RELEASE) + + class DSLTest_05Targets(unittest.TestCase): def test_known_targets(self) -> None: intel_name = "Intel 6400" diff --git a/accera/python/accera/test/smoke_tests.py b/accera/python/accera/test/smoke_tests.py index 514cd437..22784704 100644 --- a/accera/python/accera/test/smoke_tests.py +++ b/accera/python/accera/test/smoke_tests.py @@ -42,8 +42,11 @@ DEV_MODE = True sys.path.insert(1, os.getcwd()) -from accera import Package, ScalarType, Nest, Array, Constants, Scalar, fuse, create_parameters +INTERNAL_FUNCTION_OPTS = { "no_inline_into": True, "public": False } + +from accera import Package, ScalarType, Nest, Array, Constants, Scalar, fuse, create_parameters, Dimension, cast from accera._lang_python._lang import _MemorySpace, _MMASchedulingPolicy, _MMAShape +from accera import min as accmin from accera.samples import MatrixMultiplication from accera.test import verifiers from accera.test.test_utils import expectedFailure, FailedReason @@ -2843,6 +2846,475 @@ def _(): self._verify_matrix_multiplication_function(function, package, test_name, check_correctness=check_correctness) + # TODO : move vpmaddwd tests to a different test file + def test_signextend_int16_matmul_vpmaddwd(self): + from accera import AllocateFlags + test_name = "test_signextend_int16_matmul_vpmaddwd" + + def inout_array(arr: Array): + # Copy the array info but change it to input-output role for use in an inner function declaration + return Array(role=Array.Role.INPUT_OUTPUT, element_type=arr.element_type, shape=arr.shape) + + M = 240 + N = 256 + K = 256 + + M_tile = 24 + N_tile = 128 + K_tile = 128 + + M_kernel_tile = 6 + N_kernel_tile = 16 + + N_vector_tile = 8 + K_vector_tile = 2 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.int16, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.uint8, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N), layout=Array.Layout.FIRST_MAJOR) + + A_cache = Array(role=Array.Role.TEMP, + element_type=ScalarType.int16, + shape=(M_tile, K_tile), + layout=Array.Layout.FIRST_MAJOR, + flags=AllocateFlags.HEAP) + B_cache = Array(role=Array.Role.TEMP, + element_type=ScalarType.uint8, + shape=(N_tile // N_kernel_tile, K_tile // K_vector_tile, N_kernel_tile, K_vector_tile), + layout=Array.Layout.FIRST_MAJOR, + flags=AllocateFlags.HEAP) + + C_cache = Array(role=Array.Role.TEMP, + 
element_type=ScalarType.int32, + shape=(M_kernel_tile, N_kernel_tile), + layout=Array.Layout.FIRST_MAJOR, + flags=AllocateFlags.STACK) # Stack allocate the small accumulation cache + + io_A_cache = inout_array(A_cache) + io_B_cache = inout_array(B_cache) + io_C_cache = inout_array(C_cache) + + B_ext = Array(role=Array.Role.TEMP, + element_type=ScalarType.int16, + shape=(N_kernel_tile, K_vector_tile), + layout=Array.Layout.FIRST_MAJOR, + flags=AllocateFlags.STACK) + + io_B_ext = inout_array(B_ext) + + m_tile_dim = Dimension() + n_tile_dim = Dimension() + k_tile_dim = Dimension() + m_kernel_dim = Dimension() + n_kernel_dim = Dimension() + k_kernel_dim = Dimension() + m_vector_dim = Dimension() + + i_tile_idx = Dimension() + j_tile_idx = Dimension() + k_tile_idx = Dimension() + i_kernel_idx = Dimension() + j_kernel_idx = Dimension() + k_kernel_idx = Dimension() + i_vector_idx = Dimension() + + package = Package() + + ### Matmul inner kernel tile + mmi_nest = Nest(shape=(n_kernel_dim, k_kernel_dim)) + mmi_j, mmi_k = mmi_nest.get_indices() + @mmi_nest.iteration_logic + def _matmul_inner(): + mmi_i = i_vector_idx + tile_i = i_kernel_idx + mmi_i + tile_j = j_kernel_idx + mmi_j + tile_k = k_kernel_idx + mmi_k + io_C_cache[mmi_i, mmi_j] += io_A_cache[tile_i, tile_k] * io_B_ext[mmi_j, mmi_k] + + mmi_sched = mmi_nest.create_schedule() + mmi_jj, mmi_kk = mmi_sched.tile(dict(zip([mmi_j, mmi_k], [N_kernel_tile, K_vector_tile]))) + mmi_jjj = mmi_sched.split(mmi_jj, N_vector_tile) + mmi_sched.reorder(mmi_j, mmi_k, mmi_jj, mmi_jjj, mmi_kk) + mmi_plan = mmi_sched.create_plan() + mmi_plan.vectorize(mmi_jjj) + mmi_fn = package.add(mmi_plan, + args=(n_kernel_dim, k_kernel_dim, + io_A_cache, io_B_ext, io_C_cache, + i_kernel_idx, j_kernel_idx, k_kernel_idx, i_vector_idx), + base_name="matmul_kernel", + function_opts=INTERNAL_FUNCTION_OPTS) + + ### B element zero extend + bext_nest = Nest((n_kernel_dim, k_kernel_dim)) + bext_j, bext_k = bext_nest.get_indices() + @bext_nest.iteration_logic + def _bext(): + tile_j = j_kernel_idx + tile_k = k_kernel_idx + io_B_ext[bext_j, bext_k] = io_B_cache[tile_j / N_kernel_tile, tile_k / K_vector_tile, bext_j, bext_k] + + bext_sched = bext_nest.create_schedule() + bext_jj, bext_kk = bext_sched.tile(dict(zip([bext_j, bext_k], [N_kernel_tile, K_vector_tile]))) + bext_jjj = bext_sched.split(bext_jj, N_vector_tile) + bext_sched.reorder(bext_j, bext_k, bext_jj, bext_jjj, bext_kk) + bext_plan = bext_sched.create_plan() + bext_plan.vectorize(bext_jjj) + bext_fn = package.add(bext_plan, + args=(n_kernel_dim, k_kernel_dim, + io_B_cache, io_B_ext, + j_kernel_idx, k_kernel_idx), + base_name="b_ext_kernel", + function_opts=INTERNAL_FUNCTION_OPTS) + + + ### Matmul outer kernel tile + mmo_nest = Nest(shape=(m_kernel_dim, k_tile_dim)) + mmo_i, mmo_k = mmo_nest.get_indices() + @mmo_nest.iteration_logic + def _matmul(): + + ### NOTE: The order of operands in this accmin is important + # When we vectorize a min statement that is either always true or always false, we can simplify it better. 
+ # accmin internally uses "less-than" as the min operator, so here we order (k_tile_dim - mmo_k, K_vector_tile) because: + # k_tile_dim - mmo_k < K_vector_tile + # Is false for k_tile_dim - mmo_k >= K_vector_tile + # And importantly for vectorization it is therefore false for the entire K_vector_tile inner split and gets simplified + k_kernel_extent = accmin(k_tile_dim - mmo_k, cast(K_vector_tile, ScalarType.index)) + + bext_fn(n_kernel_dim, k_kernel_extent, io_B_cache, B_ext, j_kernel_idx, mmo_k) + mmi_fn(n_kernel_dim, k_kernel_extent, io_A_cache, B_ext, io_C_cache, i_kernel_idx, j_kernel_idx, mmo_k, mmo_i) + + mmo_sched = mmo_nest.create_schedule() + mmo_ii, mmo_kk = mmo_sched.tile(dict(zip([mmo_i, mmo_k], [M_kernel_tile, K_tile]))) + mmo_kkk = mmo_sched.split(mmo_kk, K_vector_tile) + mmo_sched.reorder(mmo_k, mmo_i, mmo_kk, mmo_ii, mmo_kkk) + mmo_plan = mmo_sched.create_plan() + mmo_plan._erase_loops([mmo_kkk]) + mmo_fn = package.add(mmo_plan, + args=(m_kernel_dim, n_kernel_dim, k_tile_dim, + io_A_cache, io_B_cache, io_C_cache, + i_kernel_idx, j_kernel_idx), + base_name="matmul_kernel", + function_opts=INTERNAL_FUNCTION_OPTS) + + + ### C cache init + cci_nest = Nest(shape=(M_kernel_tile, N_kernel_tile)) + cci_i, cci_j = cci_nest.get_indices() + @cci_nest.iteration_logic + def _cci(): + io_C_cache[cci_i, cci_j] = 0 + + cci_sched = cci_nest.create_schedule() + cci_plan = cci_sched.create_plan() + cci_fn = package.add(cci_plan, args=(io_C_cache,), base_name="c_cache_init_kernel", function_opts=INTERNAL_FUNCTION_OPTS) + + ### C cache reduce + ccr_nest = Nest(shape=(m_kernel_dim, n_kernel_dim)) + ccr_i, ccr_j = ccr_nest.get_indices() + @ccr_nest.iteration_logic + def _ccr(): + global_i = i_tile_idx + i_kernel_idx + ccr_i + global_j = j_tile_idx + j_kernel_idx + ccr_j + C[global_i, global_j] += io_C_cache[ccr_i, ccr_j] + + ccr_sched = ccr_nest.create_schedule() + ccr_ii, ccr_jj = ccr_sched.tile(dict(zip([ccr_i, ccr_j], [M_kernel_tile, N_kernel_tile]))) + ccr_sched.reorder(ccr_i, ccr_j, ccr_ii, ccr_jj) + ccr_plan = ccr_sched.create_plan() + ccr_plan.vectorize(ccr_ii) + ccr_fn = package.add(ccr_plan, + args=(m_kernel_dim, n_kernel_dim, + C, io_C_cache, + i_tile_idx, j_tile_idx, + i_kernel_idx, j_kernel_idx), + base_name="c_cache_reduce_kernel", + function_opts=INTERNAL_FUNCTION_OPTS) + + ### A cache pack + pa_nest = Nest(shape=(m_tile_dim, k_tile_dim)) + pa_i, pa_k = pa_nest.get_indices() + @pa_nest.iteration_logic + def _pack_a(): + global_i = i_tile_idx + pa_i + global_k = k_tile_idx + pa_k + io_A_cache[pa_i, pa_k] = A[global_i, global_k] + + pa_sched = pa_nest.create_schedule() + pa_ii, pa_kk = pa_sched.tile(dict(zip([pa_i, pa_k], [M_tile, K_tile]))) + pa_sched.reorder(pa_i, pa_k, pa_ii, pa_kk) + pa_plan = pa_sched.create_plan() + pa_fn = package.add(pa_plan, + args=(m_tile_dim, k_tile_dim, + A, io_A_cache, + i_tile_idx, k_tile_idx), + base_name="pack_a", + function_opts=INTERNAL_FUNCTION_OPTS) + + + ### B cache pack + pb_nest = Nest(shape=(n_tile_dim, k_tile_dim)) + pb_j, pb_k = pb_nest.get_indices() + @pb_nest.iteration_logic + def _pack_b(): + global_j = j_tile_idx + pb_j + global_k = k_tile_idx + pb_k + io_B_cache[pb_j / N_kernel_tile, pb_k / K_vector_tile, pb_j % N_kernel_tile, pb_k % K_vector_tile] = B[global_k, global_j] + + pb_sched = pb_nest.create_schedule() + pb_jj, pb_kk = pb_sched.tile(dict(zip([pb_j, pb_k], [N_tile, K_tile]))) + pb_jjj, pb_kkk = pb_sched.tile(dict(zip([pb_jj, pb_kk], [N_vector_tile, K_vector_tile]))) + pb_sched.reorder(pb_j, pb_k, pb_jj, pb_kk, pb_jjj, pb_kkk) 
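# Illustrative sketch (plain NumPy, hypothetical `pack_b_tile_reference`): the pack_b nest above
# stores each (N_kernel_tile x K_vector_tile) block of a B tile contiguously, which is the layout the
# zero-extend and matmul kernels read back via [j / N_kernel_tile, k / K_vector_tile, j % N_kernel_tile, k % K_vector_tile].
import numpy as np

def pack_b_tile_reference(B_full, j0, k0, N_tile, K_tile, N_kernel_tile, K_vector_tile):
    # Equivalent index mapping for one (K_tile x N_tile) tile of B whose top-left corner is (k0, j0)
    packed = np.empty((N_tile // N_kernel_tile, K_tile // K_vector_tile, N_kernel_tile, K_vector_tile),
                      dtype=B_full.dtype)
    for j in range(N_tile):
        for k in range(K_tile):
            packed[j // N_kernel_tile, k // K_vector_tile, j % N_kernel_tile, k % K_vector_tile] = B_full[k0 + k, j0 + j]
    return packed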
+ pb_plan = pb_sched.create_plan() + pb_plan.vectorize(pb_jjj) + pb_fn = package.add(pb_plan, + args=(n_tile_dim, k_tile_dim, + B, io_B_cache, + j_tile_idx, k_tile_idx), + base_name="pack_b", + function_opts=INTERNAL_FUNCTION_OPTS) + + + compute_kernel_nest = Nest(shape=(1,)) + @compute_kernel_nest.iteration_logic + def _hack(): + cci_fn(C_cache) # Don't need to range-clamp this, we can just zero out the full buffer every time + mmo_fn(m_kernel_dim, n_kernel_dim, k_tile_dim, io_A_cache, io_B_cache, C_cache, i_kernel_idx, j_kernel_idx) + ccr_fn(m_kernel_dim, n_kernel_dim, C, C_cache, i_tile_idx, j_tile_idx, i_kernel_idx, j_kernel_idx) + + compute_kernel_sched = compute_kernel_nest.create_schedule() + compute_kernel_plan = compute_kernel_sched.create_plan() + compute_kernel_fn = package.add(compute_kernel_plan, + args=( + m_kernel_dim, n_kernel_dim, k_tile_dim, + io_A_cache, io_B_cache, C, + i_tile_idx, j_tile_idx, k_tile_idx, + i_kernel_idx, j_kernel_idx), + base_name="compute_kernel_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + tile_nest = Nest(shape=(m_tile_dim, n_tile_dim)) + tile_i, tile_j = tile_nest.get_indices() + + @tile_nest.iteration_logic + def _tile(): + m_kernel_extent = accmin(m_tile_dim - tile_i, cast(M_kernel_tile, ScalarType.index)) + n_kernel_extent = accmin(n_tile_dim - tile_j, cast(N_kernel_tile, ScalarType.index)) + compute_kernel_fn(m_kernel_extent, n_kernel_extent, k_tile_dim, + io_A_cache, io_B_cache, C, + i_tile_idx, j_tile_idx, k_tile_idx, + tile_i, tile_j) + + tile_sched = tile_nest.create_schedule() + tile_ii, tile_jj = tile_sched.tile({ tile_i: M_tile, tile_j: N_tile }) + tile_iii, tile_jjj = tile_sched.tile({ tile_ii: M_kernel_tile, tile_jj: N_kernel_tile }) + tile_sched.reorder(tile_i, tile_j, tile_ii, tile_jj, tile_iii, tile_jjj) + tile_plan = tile_sched.create_plan() + tile_plan._erase_loops([tile_iii, tile_jjj]) + tile_fn = package.add(tile_plan, + args=(m_tile_dim, n_tile_dim, k_tile_dim, + io_A_cache, io_B_cache, C, + i_tile_idx, j_tile_idx, k_tile_idx), + base_name="tile_fn", + function_opts=INTERNAL_FUNCTION_OPTS) + + + global_nest = Nest(shape=(M, N, K)) + global_i, global_j, global_k = global_nest.get_indices() + + @global_nest.iteration_logic + def _tile(): + m_tile_extent = accmin(M - global_i, cast(M_tile, ScalarType.index)) + n_tile_extent = accmin(N - global_j, cast(N_tile, ScalarType.index)) + k_tile_extent = accmin(K - global_k, cast(K_tile, ScalarType.index)) + + pa_fn(m_tile_extent, k_tile_extent, A, A_cache, global_i, global_k) + pb_fn(n_tile_extent, k_tile_extent, B, B_cache, global_j, global_k) + tile_fn(m_tile_extent, n_tile_extent, k_tile_extent, A_cache, B_cache, C, global_i, global_j, global_k) + + global_sched = global_nest.create_schedule() + global_ii, global_jj, global_kk = global_sched.tile({ global_i: M_tile, global_j: N_tile, global_k: K_tile }) + global_sched.reorder(global_i, global_j, global_k, global_ii, global_jj, global_kk) + global_plan = global_sched.create_plan() + global_plan._erase_loops([global_ii, global_jj, global_kk]) + + function = package.add(global_plan, args=(A, B, C), base_name=test_name) + + A_test = np.random.random((M, K)).astype(np.int16) + B_test = np.random.random((K, N)).astype(np.uint8) + C_test = np.random.random((M, N)).astype(np.int32) + + correctness_check_values = { + "pre": (A_test, B_test, C_test), + "post": (A_test, B_test, C_test + A_test @ B_test), + } + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + + # build the HAT package + with verifiers.VerifyPackage(self, test_name, 
output_dir) as v: + package.build(test_name, format=Package.Format.DEFAULT | Package.Format.MLIR, mode=Package.Mode.RELEASE, output_dir=output_dir, _quiet=False) + v.check_correctness( + function.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], + ) + + + def test_int16_matmul_vpmaddwd(self): + test_name = "test_int16_matmul_vpmaddwd" + M = 240 + N = 256 + K = 256 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.int16, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.int16, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(M, N), layout=Array.Layout.FIRST_MAJOR) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + ii, jj, kk = schedule.tile({ i: 24, j: 128, k: 128 }) + iii, jjj, kkk = schedule.tile({ ii: 6, jj: 16, kk: 4 }) + jjjj, kkkk = schedule.tile({ jjj: 8, kkk: 2 }) + + schedule.reorder(i, j, k, + ii, jj, kk, + kkk, iii, jjj, + jjjj, kkkk) + + plan = schedule.create_plan() + plan.cache(A, index = ii, element_type = ScalarType.int16, vectorize=False) + plan.cache(B, index = jjjj, trigger_index = jj, layout = Array.Layout.LAST_MAJOR, vectorize=False) + plan.cache(C, iii) + plan.vectorize(jjjj) + + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + A_test = np.random.random((M, K)).astype(np.int16) + B_test = np.random.random((K, N)).astype(np.int16) + C_test = np.random.random((M, N)).astype(np.int32) + + correctness_check_values = { + "pre": (A_test, B_test, C_test), + "post": (A_test, B_test, C_test + A_test @ B_test), + } + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + + # build the HAT package + with verifiers.VerifyPackage(self, test_name, output_dir) as v: + package.build(test_name, format=Package.Format.DEFAULT, mode=Package.Mode.RELEASE, output_dir=output_dir, _quiet=False) + v.check_correctness( + function.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], + ) + + + + def test_int32_horizontal_vector_add(self): + test_name = "test_int32_horizontal_vector_add" + M = 256 + N = 16 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.int32, shape=(M, N), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.int32, shape=(M,), layout=Array.Layout.FIRST_MAJOR) + + nest = Nest(shape=(M, N)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i] += A[i, j] + + schedule = nest.create_schedule() + + plan = schedule.create_plan() + plan.vectorize(j) + + package = Package() + function = package.add(plan, args=(A, B), base_name=test_name) + + A_test = np.random.random((M, N)).astype(np.int32) + B_test = np.random.random((M,)).astype(np.int32) + + B_ref = np.zeros((M,)).astype(np.int32) + B_ref[:] = B_test[:] + for j in range(N): + B_ref[:] += A_test[:, j] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_ref), + } + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + + # build the HAT package + with verifiers.VerifyPackage(self, test_name, output_dir) as v: + package.build(test_name, format=Package.Format.DEFAULT, mode=Package.Mode.RELEASE, output_dir=output_dir, _quiet=False) + v.check_correctness( + function.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], 
+ ) + + def test_int16_to_int32_horizontal_vector_add_simple(self): + test_name = "test_int16_to_int32_horizontal_vector_add_simple" + M = 256 + N = 16 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.int16, shape=(M, N), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.int32, shape=(M,), layout=Array.Layout.FIRST_MAJOR) + + nest = Nest(shape=(M, N)) + i, j = nest.get_indices() + + @nest.iteration_logic + def _(): + B[i] += A[i, j] + + schedule = nest.create_schedule() + ii = schedule.split(i, 4) + schedule.reorder(i, ii, j) + plan = schedule.create_plan() + plan.vectorize(ii) + + package = Package() + function = package.add(plan, args=(A, B), base_name=test_name) + + A_test = np.random.random((M, N)).astype(np.int16) + B_test = np.random.random((M,)).astype(np.int32) + + B_ref = np.zeros((M,)).astype(np.int32) + B_ref[:] = B_test[:] + for j in range(N): + B_ref[:] += A_test[:, j] + + correctness_check_values = { + "pre": (A_test, B_test), + "post": (A_test, B_ref), + } + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + + # build the HAT package + with verifiers.VerifyPackage(self, test_name, output_dir) as v: + package.build(test_name, format=Package.Format.DEFAULT, mode=Package.Mode.RELEASE, output_dir=output_dir, _quiet=False) + v.check_correctness( + function.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], + ) + + # Cache widening the type def test_matmul_input_cache_element_type_widen(self) -> None: test_name = "test_matmul_input_cache_element_type_widen" @@ -4165,13 +4637,13 @@ def file_check_fn(verifier): # Function decl checker.check_label('accv.func nested @test_gpu_cache_different_input_layouts_') checker.check_same( - '%[[Array_A:[a-z0-9_]+]]: memref<4x2560x2048xf32, affine_map<(d0, d1, d2) -> (d0 * 5242880 + d1 * 2048 + d2)>>' + '%[[Array_A:[a-z0-9_]+]]: memref<4x2560x2048xf32>' ) checker.check_same( '%[[Array_B:[a-z0-9_]+]]: memref<4x2048x1536xf32, affine_map<(d0, d1, d2) -> (d0 + d1 * 4 + d2 * 8192)>>' ) checker.check_same( - '%[[Array_C:[a-z0-9_]+]]: memref<4x2560x1536xf32, affine_map<(d0, d1, d2) -> (d0 * 3932160 + d1 * 1536 + d2)>>' + '%[[Array_C:[a-z0-9_]+]]: memref<4x2560x1536xf32>' ) # Block X/Y @@ -4184,8 +4656,6 @@ def file_check_fn(verifier): # Loops outside of cache regions checker.check('affine.for %[[b_iv:[a-z0-9_]+]] = 0 to 4 {') - checker.check('affine.for %[[Block_X_iv:[a-z0-9_]+]] = 0 to 1 {') - checker.check('affine.for %[[Block_Y_iv:[a-z0-9_]+]] = 0 to 1 {') checker.check('affine.for %[[k_iv:[a-z0-9_]+]] = 0 to 2048 step 512 {') checker.check('affine.for %[[kk_iv:[a-z0-9_]+]] = 0 to 512 step 32 {') @@ -4194,10 +4664,8 @@ def file_check_fn(verifier): checker.check('%[[Thread_X:[0-9_]+]] = gpu.thread_id x') checker.check('%[[Thread_Y:[0-9_]+]] = gpu.thread_id y') checker.check('affine.for %[[lpt_iv:[a-z0-9_]+]] = 0 to 2 {') - checker.check('affine.for %[[Thread_X_iv:[a-z0-9_]+]] = 0 to 1 {') - checker.check('affine.for %[[Thread_Y_iv:[a-z0-9_]+]] = 0 to 1 {') checker.check( - '%[[Loaded_A_Val:[0-9_]+]] = affine.load %[[Array_A]][%[[b_iv]], symbol(%[[Block_X]]) * 16 + symbol(%[[Thread_X]]) - (symbol(%[[Block_X]]) floordiv 160) * 2560, %[[lpt_iv]] * 16 + %[[k_iv]] + %[[kk_iv]] + symbol(%[[Thread_Y]])] : memref<4x2560x2048xf32, affine_map<(d0, d1, d2) -> (d0 * 5242880 + d1 * 2048 + d2)>>' + '%[[Loaded_A_Val:[0-9_]+]] = affine.load %[[Array_A]][%[[b_iv]], symbol(%[[Block_X]]) * 16 + symbol(%[[Thread_X]]) - (symbol(%[[Block_X]]) floordiv 160) * 2560, 
%[[lpt_iv]] * 16 + %[[k_iv]] + %[[kk_iv]] + symbol(%[[Thread_Y]])] : memref<4x2560x2048xf32>' ) checker.check( 'affine.store %[[Loaded_A_Val]], %[[Cache_A]][0, symbol(%[[Thread_X]]), %[[lpt_iv]] * 16 + symbol(%[[Thread_Y]])] : memref<1x16x32xf32, 3>' @@ -4208,8 +4676,6 @@ def file_check_fn(verifier): checker.check('%[[Thread_X:[0-9_]+]] = gpu.thread_id x') checker.check('%[[Thread_Y:[0-9_]+]] = gpu.thread_id y') checker.check('affine.for %[[lpt_iv:[a-z0-9_]+]] = 0 to 2 {') - checker.check('affine.for %[[Thread_X_iv:[a-z0-9_]+]] = 0 to 1 {') - checker.check('affine.for %[[Thread_Y_iv:[a-z0-9_]+]] = 0 to 1 {') checker.check( '%[[Loaded_B_Val:[0-9_]+]] = affine.load %[[Array_B]][%[[b_iv]], %[[k_iv]] + %[[kk_iv]] + symbol(%[[Thread_Y]]) * 16 + symbol(%[[Thread_X]]) - (symbol(%[[Thread_Y]]) floordiv 2) * 32, %[[lpt_iv]] * 8 + symbol(%[[Block_Y]]) * 16 - (symbol(%[[Block_Y]]) floordiv 96) * 1536 + symbol(%[[Thread_Y]]) floordiv 2 - ((%[[lpt_iv]] * 8 + symbol(%[[Thread_Y]]) floordiv 2) floordiv 16) * 16] : memref<4x2048x1536xf32, affine_map<(d0, d1, d2) -> (d0 + d1 * 4 + d2 * 8192)>>' ) @@ -5058,5 +5524,102 @@ def _(): ) + def test_loop_erase_hack(self) -> None: + # We want to fuse two nests along at least one dimension that only one of them should actually have, but for positioning reasons + # it must exist in both. We therefore fuse along all the dimensions and erase the inner unfused loops that we don't actually need + + M = 256 + N = 128 + K = 512 + M_tile = 32 + N_tile = 16 + K_tile = 8 + A = Array(role=Array.Role.INPUT, shape=(M, K)) + B = Array(role=Array.Role.INPUT, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, shape=(M, N)) + + # Create nest0 and schedule + nest0 = Nest(shape=(M, N, K)) + i0, j0, k0 = nest0.get_indices() + + @nest0.iteration_logic + def _(): + C[i0, j0] += A[i0, k0] * B[k0, j0] + + schedule0 = nest0.create_schedule() + ii0, jj0, kk0 = schedule0.tile({ i0: M_tile, j0: N_tile, k0: K_tile }) + schedule0.reorder(i0, j0, k0, ii0, jj0, kk0) + + # Create nest1 and schedule1 + nest1 = Nest(shape=(M, N, K)) + i1, j1, k1 = nest1.get_indices() + + @nest1.iteration_logic + def _(): + C[i1, j1] = C[i1, j1] * Scalar(0.2) + + schedule1 = nest1.create_schedule() + ii1, jj1, kk1 = schedule1.tile({ i1: M_tile, j1: N_tile, k1: K_tile }) + schedule1.reorder(i1, j1, k1, ii1, jj1, kk1) + + schedule = fuse((schedule0, schedule1), partial=3) + f, i, j, k, ii0, jj0, kk0, ii1, jj1, kk1 = schedule.get_indices() + schedule.reorder(i, j, k, f, ii0, jj0, kk0, ii1, jj1, kk1) + plan = schedule.create_plan() + plan._erase_loops([kk1]) + + # Create a package and add our function definition to it + package_name = "test_loop_erase_hack" + package = Package() + package.add(plan, args=(A, B, C), base_name="test_loop_erase_hack") + + # Build the HAT package + with verifiers.VerifyPackage(self, package_name, TEST_PACKAGE_DIR): + package.build(package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=TEST_PACKAGE_DIR) + + def test_dynamic_size_redundant_split(self) -> None: + package_name = "test_dynamic_size_redundant_split" + split_size = 32 + + m_extent = Dimension() + input_arr = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(m_extent,)) + output_arr = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(m_extent,)) + + nest = Nest((m_extent,)) + i = nest.get_indices() + @nest.iteration_logic + def _(): + output_arr[i] += input_arr[i] + + sched = nest.create_schedule() + ii = sched.split(i, split_size) + iii = sched.split(ii, 
split_size) + sched.reorder(i, ii, iii) + plan = sched.create_plan() + + # Create a package and add our function definition to it + package = Package() + + fn = package.add(plan, args=(m_extent, input_arr, output_arr), base_name=package_name) + + M_test = np.int64(67) + input_test = np.random.random((M_test,)).astype(np.float32) + output_test = np.random.random((M_test,)).astype(np.float32) + correctness_check_values = { + "pre": [M_test, input_test, output_test], + "post": [M_test, input_test, output_test + input_test], + } + + # Build the HAT package + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build(package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=output_dir, _quiet=False) + + v.check_correctness( + fn.name, + before=correctness_check_values["pre"], + after=correctness_check_values["post"], + ) + if __name__ == '__main__': unittest.main(verbosity=10) diff --git a/accera/python/lib/src/ContainerTypes.cpp b/accera/python/lib/src/ContainerTypes.cpp index b4157f53..67c82e77 100644 --- a/accera/python/lib/src/ContainerTypes.cpp +++ b/accera/python/lib/src/ContainerTypes.cpp @@ -38,8 +38,11 @@ namespace .value("float32", value::ValueType::Float, "4 byte floating point") .value("float64", value::ValueType::Double, "8 byte floating point"); - py::enum_(subModule, "AllocateFlags", "An enumeration of allocation flags") + py::enum_(module, "AllocateFlags", "An enumeration of allocation flags") .value("NONE", value::AllocateFlags::None) + .value("GLOBAL", value::AllocateFlags::Global) + .value("STACK", value::AllocateFlags::Stack) + .value("HEAP", value::AllocateFlags::Heap) .value("THREAD_LOCAL", value::AllocateFlags::ThreadLocal); } @@ -154,6 +157,8 @@ General constructor. 
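# Illustrative note (Python side, sketch only): the GLOBAL/STACK/HEAP values bound here are what the
# tests above pass when declaring TEMP arrays to control where the temporary buffer is allocated, e.g.:
#     from accera import Array, ScalarType, AllocateFlags
#     scratch = Array(role=Array.Role.TEMP, element_type=ScalarType.float32,
#                     shape=(16, 16), flags=AllocateFlags.HEAP)    # or AllocateFlags.STACK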
module.def("cast", [](value::Scalar s, value::ValueType type) { return value::Cast(s, type); }); + module.def("round", &value::Round); + module.def("remainderf", &value::Remainderf); } void DefineArrayClass(py::module& module) diff --git a/accera/python/lib/src/ExecutionPlanTypes.cpp b/accera/python/lib/src/ExecutionPlanTypes.cpp index cadf661b..60f303d5 100644 --- a/accera/python/lib/src/ExecutionPlanTypes.cpp +++ b/accera/python/lib/src/ExecutionPlanTypes.cpp @@ -203,7 +203,8 @@ namespace .def("emit_runtime_init_packing", py::overload_cast(&value::Plan::EmitRuntimeInitPacking), "target"_a, "packing_func_name"_a, "packed_buf_size_func_name"_a, "indexing"_a = value::CacheIndexing::GlobalToPhysical) .def("pack_and_embed_buffer", py::overload_cast(&value::Plan::PackAndEmbedBuffer), "target"_a, "constant_data_buffer"_a, "wrapper_fn_name"_a, "packed_buffer_name"_a, "indexing"_a = value::CacheIndexing::GlobalToPhysical) .def("vectorize", &value::Plan::Vectorize, "i"_a, "vectorization_info"_a) - .def("parallelize", &value::Plan::Parallelize, "indices"_a, "num_threads"_a, "policy"_a); + .def("parallelize", &value::Plan::Parallelize, "indices"_a, "num_threads"_a, "policy"_a) + .def("_erase_loop", &value::Plan::_EraseLoop, "index"_a); py::class_(module, "_GPUExecutionPlan") .def(py::init([](value::GPUPlan& plan) { diff --git a/accera/python/lib/src/PackagingTypes.cpp b/accera/python/lib/src/PackagingTypes.cpp index 929dfb91..7548949e 100644 --- a/accera/python/lib/src/PackagingTypes.cpp +++ b/accera/python/lib/src/PackagingTypes.cpp @@ -106,12 +106,13 @@ ARM: fp16, neon, vfp3, d16, vfp4, hwdiv-arm, hwdiv .def(py::init(), "name"_a, "options"_a = value::CompilerOptions{}) .def( "Allocate", - [](value::MLIRContext& c, value::ValueType type, const util::MemoryLayout& layout, size_t alignment) { - return c.Allocate(type, layout, alignment); + [](value::MLIRContext& c, value::ValueType type, const util::MemoryLayout& layout, size_t alignment, value::AllocateFlags flags) { + return c.Allocate(type, layout, alignment, flags); }, "type"_a, "layout"_a, - "alignment"_a = 0) + "alignment"_a = 0, + "_flags"_a = value::AllocateFlags::None) .def("Print", &value::MLIRContext::print, "Prints the module") .def("Save", &value::MLIRContext::save, "filename"_a) .def("Verify", &value::MLIRContext::verify) @@ -160,6 +161,14 @@ Sets whether this function should be decorated (mangled) "inlinable"_a, py::return_value_policy::reference_internal, "Sets whether the function is allowed to be inlined.") + .def( + "inlinable_into", [](value::FunctionDeclaration& fn, bool inlinable_into) { + (void)fn.InlineInto(inlinable_into ? 
value::FunctionInlining::always : value::FunctionInlining::never); + return fn; + }, + "inlinable_into"_a, + py::return_value_policy::reference_internal, + "Sets whether other functions are allowed to be inlined into this function.") .def("addTag", &value::FunctionDeclaration::AddTag, "addTag"_a, py::return_value_policy::reference_internal, "A tag to add to a function as an attribute.") .def("baseName", &value::FunctionDeclaration::BaseName, "baseName"_a, py::return_value_policy::reference_internal, "Sets the base name for this function to use as an alias in the generated header file.") .def("outputVerifiers", &value::FunctionDeclaration::OutputVerifiers, "outputVerifiers"_a, py::return_value_policy::reference_internal, "Sets the verification functions for output checking, one per output argument.") diff --git a/accera/transforms/include/AcceraPasses.h b/accera/transforms/include/AcceraPasses.h index 6d9d913f..265aaa94 100644 --- a/accera/transforms/include/AcceraPasses.h +++ b/accera/transforms/include/AcceraPasses.h @@ -22,6 +22,7 @@ #include "value/ValueToLLVMLoweringPass.h" #include "value/ValueToStandardLoweringPass.h" +#include #include #include #include diff --git a/accera/transforms/include/AcceraPasses.td b/accera/transforms/include/AcceraPasses.td index 54e266d3..64ffbafd 100644 --- a/accera/transforms/include/AcceraPasses.td +++ b/accera/transforms/include/AcceraPasses.td @@ -257,6 +257,7 @@ def ConvertValueToLLVM : accModulePass<"value-to-llvm"> { let constructor = "accera::transforms::value::createValueToLLVMPass()"; let dependentDialects = [ "mlir::StandardOpsDialect", + "accera::ir::intrinsics::AcceraIntrinsicsDialect", "mlir::LLVM::LLVMDialect" ]; // Match std-to-llvm options so we can pass through arguments diff --git a/accera/transforms/include/affine/AffineSimplifications.h b/accera/transforms/include/affine/AffineSimplifications.h index 6705c2f4..f50e3098 100644 --- a/accera/transforms/include/affine/AffineSimplifications.h +++ b/accera/transforms/include/affine/AffineSimplifications.h @@ -16,6 +16,7 @@ using OwningRewritePatternList = RewritePatternSet; namespace accera::transforms::affine { -void populateAcceraAffineSimplificationPatterns(mlir::OwningRewritePatternList& patterns); +void populateAcceraAffineExprSimplificationPatterns(mlir::OwningRewritePatternList& patterns); +void populateAcceraAffineLoopSimplificationPatterns(mlir::OwningRewritePatternList& patterns); std::unique_ptr createAffineSimplificationPass(); } // namespace accera::transforms::affine diff --git a/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h b/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h index 12eb804b..dfcdebe9 100644 --- a/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h +++ b/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h @@ -28,6 +28,7 @@ void populateExecutionPlanAdjustHierarchicalCacheRegionPositionPatterns(mlir::Re void populateExecutionPlanAdjustCacheMappingPositionPatterns(mlir::RewritePatternSet& patterns); void populateExecutionPlanMaxElementCacheRegionPatterns(mlir::RewritePatternSet& patterns); void populateExecutionPlanVectorizePatterns(bool printVectorizationDetails, mlir::RewritePatternSet& patterns); +void populateExecutionPlanVectorizeUnrollPatterns(bool printVectorizationDetails, mlir::RewritePatternSet& patterns); void populateExecutionPlanTensorizePatterns(mlir::RewritePatternSet& patterns); void populateExecutionPlanParallelizePatterns(mlir::RewritePatternSet& patterns); void 
populateExecutionPlanScaleHoistingPatterns(mlir::RewritePatternSet& patterns); diff --git a/accera/transforms/include/util/RangeValueUtilities.h b/accera/transforms/include/util/RangeValueUtilities.h index 3a8caccf..cc06fd1a 100644 --- a/accera/transforms/include/util/RangeValueUtilities.h +++ b/accera/transforms/include/util/RangeValueUtilities.h @@ -74,7 +74,9 @@ class RangeValueAnalysis RangeValue resolveRangeValue(mlir::gpu::GridDimOp op); RangeValue resolveRangeValue(accera::ir::value::WarpIdOp op); RangeValue resolveRangeValue(llvm::Instruction::BinaryOps binOp, mlir::Operation* op); + RangeValue resolveRangeValue(llvm::Instruction::BinaryOps binOp, const llvm::SmallVectorImpl& operandRanges); RangeValue resolveRangeValue(mlir::AffineForOp op); + RangeValue resolveRangeValue(mlir::AffineApplyOp op); RangeValue resolveRangeValue(mlir::scf::ForOp op); RangeValue resolveRangeValue(mlir::Operation* op); }; diff --git a/accera/transforms/include/util/VectorizationUtil.h b/accera/transforms/include/util/VectorizationUtil.h index b6dcb812..f78ce303 100644 --- a/accera/transforms/include/util/VectorizationUtil.h +++ b/accera/transforms/include/util/VectorizationUtil.h @@ -42,8 +42,10 @@ class VectorizedOpMap std::map _vectorizedOps; }; -mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, - mlir::PatternRewriter& rewriter); + +mlir::LogicalResult TryVectorizeKnownSubgraph(mlir::AffineForOp affineForOp, + mlir::PatternRewriter& rewriter); + std::optional VectorizeOp(mlir::PatternRewriter& rewriter, mlir::Operation* op, diff --git a/accera/transforms/include/value/RangeValueOptimizePass.h b/accera/transforms/include/value/RangeValueOptimizePass.h index 6557b681..7a201fb1 100644 --- a/accera/transforms/include/value/RangeValueOptimizePass.h +++ b/accera/transforms/include/value/RangeValueOptimizePass.h @@ -12,9 +12,12 @@ namespace mlir { class Pass; +class RewritePatternSet; } // namespace mlir namespace accera::transforms::value { +void populateRangeValueOptimizePatterns(mlir::RewritePatternSet& patterns); + std::unique_ptr createRangeValueOptimizePass(); } // namespace accera::transforms::value diff --git a/accera/transforms/src/AcceraPasses.cpp b/accera/transforms/src/AcceraPasses.cpp index 920de35a..fbb729dc 100644 --- a/accera/transforms/src/AcceraPasses.cpp +++ b/accera/transforms/src/AcceraPasses.cpp @@ -151,6 +151,7 @@ void addAcceraToLLVMPassPipeline(OpPassManager& pm, const AcceraPassPipelineOpti pmAdaptor.addPass(value::createValueFuncToTargetPass()); pmAdaptor.addPass(createSymbolDCEPass()); + pmAdaptor.addPass(affine::createAffineSimplificationPass()); auto funcOpPM = pmAdaptor.nestPassManager([&]() -> OpPassManager& { return pm.nest().nest(); }); funcOpPM.addPass(createConvertLinalgToAffineLoopsPass()); diff --git a/accera/transforms/src/affine/AffineSimplifications.cpp b/accera/transforms/src/affine/AffineSimplifications.cpp index 363add2b..64d249bd 100644 --- a/accera/transforms/src/affine/AffineSimplifications.cpp +++ b/accera/transforms/src/affine/AffineSimplifications.cpp @@ -7,11 +7,14 @@ #include "affine/AffineSimplifications.h" #include "util/RangeValueUtilities.h" +#include "nest/LoopNestToValue.h" +#include "value/RangeValueOptimizePass.h" #include #include #include +#include #include #include #include @@ -242,17 +245,32 @@ mlir::AffineExpr RunOnBinaryOpSubExpr(mlir::AffineExprKind exprKind, mlir::Affin mlir::AffineValueMap GetAffineValueMap(mlir::AffineStoreOp& storeOp) { - return mlir::AffineValueMap(storeOp.getAffineMap(), storeOp.getOperands()); + 
return mlir::AffineValueMap(storeOp.getAffineMap(), storeOp.getMapOperands()); } mlir::AffineValueMap GetAffineValueMap(mlir::AffineLoadOp& loadOp) { - return mlir::AffineValueMap(loadOp.getAffineMap(), loadOp.getOperands()); + return mlir::AffineValueMap(loadOp.getAffineMap(), loadOp.getMapOperands()); } mlir::AffineValueMap GetAffineValueMap(mlir::AffineApplyOp& applyOp) { return applyOp.getAffineValueMap(); } +template +bool AllOperandDefsAreInScope(AffineOpTy op) +{ + auto operands = op.getMapOperands(); + for (auto operand : operands) + { + mlir::Operation* defOp = GetDefiningOpOrForLoop(operand); + if (defOp == nullptr) + { + return false; + } + } + return true; +} + void ReplaceOpUsingNewValueMap(PatternRewriter& rewriter, mlir::AffineLoadOp loadOp, mlir::AffineValueMap newAffineValueMap) { rewriter.replaceOpWithNewOp(loadOp, loadOp.memref(), newAffineValueMap.getAffineMap(), newAffineValueMap.getOperands()); @@ -275,6 +293,11 @@ struct SmallNumeratorTermFloorDivSimplification : public OpRewritePattern { // See docs/Reference/gpu_caching_mod.md for a proof of the equivalence this simplification leverages + if (!AllOperandDefsAreInScope(affineOp)) + { + return failure(); + } + AffineSimplifyHelper helper(affineOp); auto loc = affineOp.getLoc(); @@ -487,6 +515,11 @@ struct PropagateGPUConstants : public OpRewritePattern LogicalResult matchAndRewrite(AffineOpTy affineOp, PatternRewriter& rewriter) const final { + if (!AllOperandDefsAreInScope(affineOp)) + { + return failure(); + } + auto loc = affineOp.getLoc(); std::vector opsToErase; @@ -499,15 +532,21 @@ struct PropagateGPUConstants : public OpRewritePattern { auto handleBlockDimOp = [&](gpu::BlockDimOp blockDimOp) { auto dimSize = GetBlockDimSize(blockDimOp); - mlir::Value dimSizeConstantOp = rewriter.create(loc, dimSize, rewriter.getI64Type()); - affineOp->replaceUsesOfWith(operand, dimSizeConstantOp); - replaced = true; + if (dimSize.has_value()) + { + mlir::Value dimSizeConstantOp = rewriter.create(loc, *dimSize, rewriter.getI64Type()); + affineOp->replaceUsesOfWith(operand, dimSizeConstantOp); + replaced = true; + } }; auto handleGridDimOp = [&](gpu::GridDimOp gridDimOp) { auto dimSize = GetGridDimSize(gridDimOp); - mlir::Value dimSizeConstantOp = rewriter.create(loc, dimSize, rewriter.getI64Type()); - affineOp->replaceUsesOfWith(operand, dimSizeConstantOp); - replaced = true; + if (dimSize.has_value()) + { + mlir::Value dimSizeConstantOp = rewriter.create(loc, *dimSize, rewriter.getI64Type()); + affineOp->replaceUsesOfWith(operand, dimSizeConstantOp); + replaced = true; + } }; mlir::TypeSwitch(definingOp) .Case([&](gpu::BlockDimOp blockDimOp) { @@ -542,6 +581,67 @@ struct PropagateGPUConstants : public OpRewritePattern } }; +struct AffineForOpSimplifyBounds : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + LogicalResult matchAndRewrite(AffineForOp affineForOp, PatternRewriter& rewriter) const final + { + RangeValueAnalysis rangeValue; + + auto lowerBound = affineForOp.getLowerBound(); + auto initLowerBoundMap = lowerBound.getMap(); + std::vector lowerBoundOperands(lowerBound.operandBegin(), lowerBound.operandEnd()); + mlir::AffineValueMap lowerBoundValueMap(initLowerBoundMap, lowerBoundOperands); + lowerBoundValueMap = SimplifyAffineValueMap(lowerBoundValueMap); + auto simplifiedLowerBoundMap = lowerBoundValueMap.getAffineMap(); + + auto upperBound = affineForOp.getUpperBound(); + auto initUpperBoundMap = upperBound.getMap(); + std::vector upperBoundOperands(upperBound.operandBegin(), 
upperBound.operandEnd()); + mlir::AffineValueMap upperBoundValueMap(initUpperBoundMap, upperBoundOperands); + upperBoundValueMap = SimplifyAffineValueMap(upperBoundValueMap); + auto simplifiedUpperBoundMap = upperBoundValueMap.getAffineMap(); + + rewriter.updateRootInPlace(affineForOp, [&]() + { + if (simplifiedLowerBoundMap.isSingleConstant()) + { + auto lowerBoundConst = simplifiedLowerBoundMap.getSingleConstantResult(); + affineForOp.setConstantLowerBound(lowerBoundConst); + } + else + { + affineForOp.setUpperBound(upperBoundValueMap.getOperands(), upperBoundValueMap.getAffineMap()); + } + + if (simplifiedUpperBoundMap.isSingleConstant()) + { + auto upperBoundConst = simplifiedUpperBoundMap.getSingleConstantResult(); + affineForOp.setConstantUpperBound(upperBoundConst); + } + else + { + affineForOp.setLowerBound(lowerBoundValueMap.getOperands(), lowerBoundValueMap.getAffineMap()); + } + }); + + if (affineForOp.hasConstantBounds()) + { + auto constantTripCountOpt = mlir::getConstantTripCount(affineForOp); + if (constantTripCountOpt.getValue() == 0) + { + rewriter.eraseOp(affineForOp); + return success(); + } + return PromoteIfSingleIteration(rewriter, affineForOp); + } + + // Didn't remove the loop, but possibly modified it. Let another rewrite try to simplify it + return failure(); + } +}; + struct AffineSimplificationPass : public accera::transforms::AcceraAffineSimplificationBase { void runOnOperation() final @@ -549,12 +649,24 @@ struct AffineSimplificationPass : public accera::transforms::AcceraAffineSimplif auto* context = &getContext(); auto op = getOperation(); - mlir::GreedyRewriteConfig singleIterationConfig; - singleIterationConfig.maxIterations = 1; + { + mlir::GreedyRewriteConfig singleIterationConfig; + singleIterationConfig.maxIterations = 1; + + OwningRewritePatternList patterns(context); + accera::transforms::affine::populateAcceraAffineExprSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(op, std::move(patterns), singleIterationConfig); + } - OwningRewritePatternList patterns(context); - accera::transforms::affine::populateAcceraAffineSimplificationPatterns(patterns); - (void)applyPatternsAndFoldGreedily(op, std::move(patterns), singleIterationConfig); + // Apply RangeValueOptimize and affine value map simplification to try to simplify possibly-dynamic loop bounds + { + mlir::GreedyRewriteConfig topDownConfig; + topDownConfig.useTopDownTraversal = true; + + OwningRewritePatternList patterns(context); + accera::transforms::affine::populateAcceraAffineLoopSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(op, std::move(patterns), topDownConfig); + } } }; @@ -562,7 +674,8 @@ struct AffineSimplificationPass : public accera::transforms::AcceraAffineSimplif namespace accera::transforms::affine { -void populateAcceraAffineSimplificationPatterns(mlir::OwningRewritePatternList& patterns) + +void populateAcceraAffineExprSimplificationPatterns(mlir::OwningRewritePatternList& patterns) { patterns.insert>(patterns.getContext()); patterns.insert>(patterns.getContext()); @@ -575,6 +688,12 @@ void populateAcceraAffineSimplificationPatterns(mlir::OwningRewritePatternList& patterns.insert>(patterns.getContext()); } +void populateAcceraAffineLoopSimplificationPatterns(mlir::OwningRewritePatternList& patterns) +{ + patterns.insert(patterns.getContext()); + accera::transforms::value::populateRangeValueOptimizePatterns(patterns); +} + std::unique_ptr createAffineSimplificationPass() { return std::make_unique(); diff --git 
a/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp b/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp index 6ee9dd80..4806c6a4 100644 --- a/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp +++ b/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp @@ -5910,12 +5910,12 @@ LogicalResult VectorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine return failure(); } - // First, match and rewrite the special case for vectorizing int16 matmul - auto result = vectorizeInt16MatMul(affineForOp, rewriter); - if (succeeded(result)) + // First, check if we have a custom match and rewrite pattern for this exact subgraph + auto knownSubgraphResult = TryVectorizeKnownSubgraph(affineForOp, rewriter); + if (succeeded(knownSubgraphResult)) { RemoveVectorizationInfo(affineForOp); - return result; + return knownSubgraphResult; } auto vectorInfo = GetVectorizationInfo(affineForOp); @@ -5935,6 +5935,23 @@ LogicalResult VectorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine return success(); } + // If this isn't the innermost loop in the nest and we don't have custom handling for this pattern, + // then in-place unroll the loops between this loop and the innermost loop and vectorize the innermost loop + SmallVector nestedLoops; + mlir::getPerfectlyNestedLoops(nestedLoops, affineForOp); + if (nestedLoops.size() > 1) + { + RemoveVectorizationInfo(affineForOp); + for (unsigned loopIdx = 0; loopIdx < nestedLoops.size() - 1; loopIdx++) + { + InPlaceUnrollInfo inPlaceUnrollInfo{ 0 }; // 0 for full unroll + SetInPlaceUnrollInfo(nestedLoops[loopIdx], inPlaceUnrollInfo); + } + auto vecInfoAttr = VectorizationInfoAttr::get(vectorInfo, rewriter.getContext()); + nestedLoops[nestedLoops.size() - 1]->setAttr(VectorizationInfoAttr::getKeyName(), vecInfoAttr); + return failure(); + } + auto affineForOpIV = affineForOp.getInductionVar(); if (affineForOpIV.use_empty()) @@ -7850,6 +7867,7 @@ void ExecutionPlanVectorizationPass::runOnOperation() RewritePatternSet patterns(&getContext()); accera::transforms::executionPlan::populateExecutionPlanVectorizePatterns(printVecOpDetails, patterns); + accera::transforms::executionPlan::populateExecutionPlanVectorizeUnrollPatterns(printVecOpDetails, patterns); (void)applyPatternsAndFoldGreedily(operation, std::move(patterns)); } @@ -8073,8 +8091,12 @@ void populateExecutionPlanAdjustCacheMappingPositionPatterns(mlir::RewritePatter void populateExecutionPlanVectorizePatterns(bool printVectorizationDetails, mlir::RewritePatternSet& patterns) { - patterns.insert(patterns.getContext(), printVectorizationDetails); + patterns.insert(patterns.getContext(), printVectorizationDetails); +} + +void populateExecutionPlanVectorizeUnrollPatterns(bool printVectorizationDetails, mlir::RewritePatternSet& patterns) +{ + patterns.insert(patterns.getContext(), printVectorizationDetails); } void populateExecutionPlanTensorizePatterns(mlir::RewritePatternSet& patterns) diff --git a/accera/transforms/src/nest/LoopNestToValue.cpp b/accera/transforms/src/nest/LoopNestToValue.cpp index 275750ad..088e2a63 100644 --- a/accera/transforms/src/nest/LoopNestToValue.cpp +++ b/accera/transforms/src/nest/LoopNestToValue.cpp @@ -814,7 +814,19 @@ LogicalResult ScheduledLoopOpRewrite::matchAndRewrite(ScheduledLoopOp op, Patter auto scheduledLoopOpAttrs = op->getAttrs(); for (auto& attr : scheduledLoopOpAttrs) { - bodyLoop->setAttr(attr.getName(), attr.getValue()); + // HACK: Don't copy the domain attribute in case we later inline a dynamically-sized 
domain into a statically-sized region and the domain doesn't adjust correctly for serialization + // (we also no longer need the domain after building out the loopnest) + if (attr.getName() != "domain") + { + bodyLoop->setAttr(attr.getName(), attr.getValue()); + } + } + // Hack for erasing loops + if (bodyLoop->hasAttr("_erase")) + { + bodyLoop.setConstantLowerBound(0); + bodyLoop.setConstantUpperBound(1); + bodyLoop.setStep(1); } auto bodyLoopRegion = &bodyLoop.region(); diff --git a/accera/transforms/src/nest/LoopNestToValueFunc.cpp b/accera/transforms/src/nest/LoopNestToValueFunc.cpp index be5f7840..0a76b3cc 100644 --- a/accera/transforms/src/nest/LoopNestToValueFunc.cpp +++ b/accera/transforms/src/nest/LoopNestToValueFunc.cpp @@ -281,7 +281,7 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB { RewritePatternSet patterns(context); - affinetr::populateAcceraAffineSimplificationPatterns(patterns); + affinetr::populateAcceraAffineExprSimplificationPatterns(patterns); (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns), singleIterationConfig); snapshotter.Snapshot("AcceraAffineSimplification", vFuncOp); } @@ -308,6 +308,21 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB snapshotter.Snapshot("ExecutionPlanVectorize_Canonicalize", vFuncOp); } + { + RewritePatternSet patterns(context); + tr::populateLoopSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); + snapshotter.Snapshot("LoopSimplification", vFuncOp); + } + + { + RewritePatternSet patterns(context); + xptr::populateExecutionPlanVectorizeUnrollPatterns(printVecOpDetails, patterns); + utilir::FillCanonicalPatternsRecursively(vFuncOp, patterns); + (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); + snapshotter.Snapshot("ExecutionPlanVectorizeUnroll_Canonicalize", vFuncOp); + } + { RewritePatternSet patterns(context); tr::populateLoopOptimizationPatterns(patterns); diff --git a/accera/transforms/src/util/RangeValueUtilities.cpp b/accera/transforms/src/util/RangeValueUtilities.cpp index 50e7fc53..b328e1f8 100644 --- a/accera/transforms/src/util/RangeValueUtilities.cpp +++ b/accera/transforms/src/util/RangeValueUtilities.cpp @@ -41,19 +41,31 @@ namespace RangeValue resolveThreadIdRange(Operation* op, gpu::Dimension dimId) { auto upperBound = GetBlockDimSize(op, dimId); - return RangeValue(0, upperBound - 1); // -1 because RangeValue will add 1 to the upper bound and the thread id never takes on the upperBound value + if (upperBound.has_value()) + { + return RangeValue(0, *upperBound - 1); // -1 because RangeValue will add 1 to the upper bound and the thread id never takes on the upperBound value + } + return RangeValue(); } RangeValue resolveBlockIdRange(Operation* op, gpu::Dimension dimId) { auto upperBound = GetGridDimSize(op, dimId); - return RangeValue(0, upperBound - 1); // -1 because RangeValue will add 1 to the upper bound and the block id never takes on the upperBound value + if (upperBound.has_value()) + { + return RangeValue(0, *upperBound - 1); // -1 because RangeValue will add 1 to the upper bound and the block id never takes on the upperBound value + } + return RangeValue(); } RangeValue resolveGridDimRange(Operation* op, gpu::Dimension dimId) { auto upperBound = GetGridDimSize(op, dimId); - return RangeValue(upperBound, upperBound); + if (upperBound.has_value()) + { + return RangeValue(*upperBound, *upperBound); + } + return RangeValue(); } } // namespace @@ -285,7 +297,12 @@ 
RangeValue RangeValueAnalysis::resolveRangeValue(mlir::gpu::GridDimOp op) RangeValue RangeValueAnalysis::resolveRangeValue(WarpIdOp op) { const mlir::gpu::Dimension dim{ op.dimension() }; - auto upperBound = GetBlockDimSize(op, dim); + auto upperBoundOpt = GetBlockDimSize(op, dim); + if (!upperBoundOpt.has_value()) + { + return RangeValue(); + } + auto upperBound = *upperBoundOpt; if (dim == mlir::gpu::Dimension::x) { auto [warpSizeX, warpSizeY] = ResolveWarpSize(ResolveExecutionRuntime(op)).value(); @@ -298,11 +315,114 @@ RangeValue RangeValueAnalysis::resolveRangeValue(WarpIdOp op) RangeValue RangeValueAnalysis::resolveRangeValue(Instruction::BinaryOps binOp, mlir::Operation* op) { auto operands = resolveOperands(op); + return resolveRangeValue(binOp, operands); +} + +RangeValue RangeValueAnalysis::resolveRangeValue(Instruction::BinaryOps binOp, const llvm::SmallVectorImpl& operands) +{ return operands[0].binaryOp(binOp, operands[1]); } + +RangeValue RangeValueAnalysis::resolveRangeValue(AffineApplyOp op) +{ + auto affineValueMap = util::AffineApplyToAffineValueMap(op); + auto simplified = util::SimplifyAffineValueMap(affineValueMap); + auto map = simplified.getAffineMap(); + assert(map.getNumResults() == 1 && "Affine apply can't have multiple expressions"); + auto expr = map.getResult(0); + auto operands = simplified.getOperands(); + for (auto operand : operands) + { + if (!hasRange(operand)) + { + if (auto defOp = GetDefiningOpOrForLoop(operand)) + { + addOperation(defOp); + } + } + } + std::vector dimOperands(operands.begin(), operands.begin() + map.getNumDims()); + std::vector symbolOperands(operands.begin() + map.getNumDims(), operands.end()); + mlir::DenseMap subExprRanges; + // Post-order traversal of the expression tree + expr.walk([&](mlir::AffineExpr subExpr) { + if (auto dimExpr = subExpr.dyn_cast()) + { + auto idx = dimExpr.getPosition(); + auto rv = getRange(dimOperands[idx]); + subExprRanges.insert({ subExpr, rv }); + } + if (auto symExpr = subExpr.dyn_cast()) + { + auto idx = symExpr.getPosition(); + auto rv = getRange(symbolOperands[idx]); + subExprRanges.insert({ subExpr, rv }); + } + if (auto constExpr = subExpr.dyn_cast()) + { + RangeValue rv(constExpr.getValue(), constExpr.getValue()); + subExprRanges.insert({ subExpr, rv }); + } + if (auto binOpExpr = subExpr.dyn_cast()) + { + auto lhs = binOpExpr.getLHS(); + auto rhs = binOpExpr.getRHS(); + auto lhsIt = subExprRanges.find(lhs); + assert(lhsIt != subExprRanges.end()); + auto lhsRv = lhsIt->second; + auto rhsIt = subExprRanges.find(rhs); + assert(rhsIt != subExprRanges.end()); + auto rhsRv = rhsIt->second; + + Instruction::BinaryOps llvmBinOp; + switch (binOpExpr.getKind()) + { + case mlir::AffineExprKind::Add: + llvmBinOp = Instruction::BinaryOps::Add; + break; + case mlir::AffineExprKind::Mul: + llvmBinOp = Instruction::BinaryOps::Mul; + break; + case mlir::AffineExprKind::Mod: + llvmBinOp = Instruction::BinaryOps::SRem; + break; + case mlir::AffineExprKind::FloorDiv: + llvmBinOp = Instruction::BinaryOps::SDiv; + break; + case mlir::AffineExprKind::CeilDiv: + assert(false); // Unsupported currently - no matching llvm bin op + break; + default: + assert(false); + break; + } + llvm::SmallVector operandRanges{ lhsRv, rhsRv }; + auto rv = resolveRangeValue(llvmBinOp, operandRanges); + subExprRanges.insert({ subExpr, rv }); + } + }); + + // Find the root expr in the map and return its computed RangeValue + auto it = subExprRanges.find(expr); + assert(it != subExprRanges.end()); + return it->second; +} + RangeValue 
RangeValueAnalysis::resolveRangeValue(AffineForOp op) { - return op.hasConstantBounds() ? RangeValue(op.getConstantLowerBound(), op.getConstantUpperBound() - op.getStep()) : RangeValue(); + if (op.hasConstantBounds()) + { + auto lb = op.getConstantLowerBound(); + auto ub = op.getConstantUpperBound(); + auto step = op.getStep(); + + auto range = ub - lb; + auto remainder = range % step; + auto largestInductionVarValue = (remainder > 0) ? (ub - remainder) : (ub - step); + + return RangeValue(lb, largestInductionVarValue); + } + return RangeValue(); } RangeValue RangeValueAnalysis::resolveRangeValue(scf::ForOp op) { @@ -314,7 +434,22 @@ RangeValue RangeValueAnalysis::resolveRangeValue(scf::ForOp op) RangeValue lowerBound = resolveRangeValue(op.getLowerBound().getDefiningOp()); RangeValue upperBound = resolveRangeValue(op.getUpperBound().getDefiningOp()); - return lowerBound.isConstant() && upperBound.isConstant() ? RangeValue(lowerBound.range.getLower(), upperBound.range.getUpper() - 1) : RangeValue(); + RangeValue stepSize = resolveRangeValue(op.getStep().getDefiningOp()); + + bool isConstantRangeStep = lowerBound.isConstant() && upperBound.isConstant() && stepSize.isConstant(); + if (isConstantRangeStep) + { + auto lb = lowerBound.range.getLower(); + auto ub = upperBound.range.getUpper(); + auto step = stepSize.range.getLower(); + + auto range = ub - lb; + auto remainder = range.srem(step); + auto largestInductionVarValue = (remainder.sgt(0)) ? (ub - remainder) : (ub - step); + + return RangeValue(lb, largestInductionVarValue); + } + return RangeValue(); } RangeValue RangeValueAnalysis::resolveRangeValue(mlir::Operation* op) { @@ -335,6 +470,7 @@ RangeValue RangeValueAnalysis::resolveRangeValue(mlir::Operation* op) .Case([&](arith::DivUIOp op) { return resolveRangeValue(Instruction::BinaryOps::UDiv, op); }) .Case([&](scf::ForOp op) { return resolveRangeValue(op); }) .Case([&](AffineForOp op) { return resolveRangeValue(op); }) + .Case([&](AffineApplyOp op) { return resolveRangeValue(op); }) .Default([&](mlir::Operation*) { return RangeValue(); }); } diff --git a/accera/transforms/src/util/VectorizationUtil.cpp b/accera/transforms/src/util/VectorizationUtil.cpp index 9558ce24..84c91987 100644 --- a/accera/transforms/src/util/VectorizationUtil.cpp +++ b/accera/transforms/src/util/VectorizationUtil.cpp @@ -38,6 +38,9 @@ namespace v = accera::ir::value; #define DEBUG_TYPE "vectorization-util" +// TODO : plumb through a sufficient target enum / bitmap so we can dynamically enable/disable vpmaddwd and other pattern matchers +#define MATCH_VPMADDWD_INTRINSIC 1 + namespace accera::transforms { @@ -123,6 +126,8 @@ bool CanVectorizeOp(mlir::Operation* op, .Case([](mlir::math::AbsOp) { return true; }) // .Case([&](mlir::AffineApplyOp) { return true; }) // TODO: either enable or remove this .Case([](mlir::math::ExpOp) { return true; }) + .Case([](v::CastOp) { return true; }) + .Case([vectorSize](v::RoundOp) { return v::RoundOp::SupportsVectorization(vectorSize); }) .Case([](v::BitcastOp) { return true; }) .Case([](v::BinOp) { return true; }) .Case([](v::CmpOp) { return true; }) @@ -263,19 +268,101 @@ std::optional VectorizeConstantOp(mlir::PatternRewriter& rewri return constVec; } +// TODO de-dupe some internals with GetConstantStrideBetweenUnrolledAccesses +template +std::optional GetConstantStrideBetweenAccesses(mlir::PatternRewriter& rewriter, + LhsOpType lhsAccessOp, + RhsOpType rhsAccessOp) +{ + std::stack tempOps; + ir::util::TempOpCleanupGuard tempOpGuard(&tempOps, rewriter); + + auto 
lhsAccessMapComposition = ir::util::GetIndexToMemoryLocationMap(rewriter.getContext(), lhsAccessOp); + auto rhsAccessMapComposition = ir::util::GetIndexToMemoryLocationMap(rewriter.getContext(), rhsAccessOp); + + // For dynamically shaped memrefs, currently we only handle identity-mapped memrefs, + // general dynamic memref support will come later. + auto lhsMemRefType = lhsAccessOp.memref().getType().template cast(); + if (!lhsMemRefType.hasStaticShape()) + { + if (!ir::util::HasIdentityLayout(lhsAccessOp.memref())) + { + return std::nullopt; + } + } + + auto rhsMemRefType = rhsAccessOp.memref().getType().template cast(); + if (!rhsMemRefType.hasStaticShape()) + { + if (!ir::util::HasIdentityLayout(rhsAccessOp.memref())) + { + return std::nullopt; + } + } + + // Re-check if there is no static shape and collect the symbols now that we know we won't be returning std::nullopt + // because ir::util::GetIdentityMemrefStrideSymbols() does a non-trivial amount of work that me may as well not waste + std::vector lhsStrideSymbols; + std::vector rhsStrideSymbols; + if (!lhsMemRefType.hasStaticShape()) + { + lhsStrideSymbols = ir::util::GetIdentityMemrefStrideSymbols(rewriter, lhsAccessOp.getLoc(), lhsAccessOp.memref()); + } + if (!rhsMemRefType.hasStaticShape()) + { + rhsStrideSymbols = ir::util::GetIdentityMemrefStrideSymbols(rewriter, rhsAccessOp.getLoc(), rhsAccessOp.memref()); + } + + std::vector lhsIndicesVec(lhsAccessOp.indices().begin(), lhsAccessOp.indices().end()); + std::vector rhsIndicesVec(rhsAccessOp.indices().begin(), rhsAccessOp.indices().end()); + + // Append any dynamic stride symbols since we're dealing with a flattened layout map + lhsIndicesVec.insert(lhsIndicesVec.end(), lhsStrideSymbols.begin(), lhsStrideSymbols.end()); + rhsIndicesVec.insert(rhsIndicesVec.end(), rhsStrideSymbols.begin(), rhsStrideSymbols.end()); + + auto lhsAccess = ir::util::MultiDimAffineApply(rewriter, lhsAccessOp.getLoc(), lhsAccessMapComposition, lhsIndicesVec); + auto rhsAccess = ir::util::MultiDimAffineApply(rewriter, rhsAccessOp.getLoc(), rhsAccessMapComposition, rhsIndicesVec); + assert(lhsAccess.size() == 1); + assert(rhsAccess.size() == 1); + tempOps.push(lhsAccess[0].getDefiningOp()); + tempOps.push(rhsAccess[0].getDefiningOp()); + + mlir::AffineExpr diffExpr = rewriter.getAffineDimExpr(1) - rewriter.getAffineDimExpr(0); + auto diffMap = mlir::AffineMap::get(2, 0, diffExpr); + + mlir::SmallVector compareAccesses{ lhsAccess[0], rhsAccess[0] }; + mlir::fullyComposeAffineMapAndOperands(&diffMap, &compareAccesses); + + assert(diffMap.getNumResults() == 1); + auto resultExpr = diffMap.getResult(0); + if (resultExpr.isa()) + { + auto constExpr = resultExpr.dyn_cast(); + return constExpr.getValue(); + } + + // There isn't a constant difference between memory accesses + return std::nullopt; +} + template -bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, - OpType op, - std::vector& laneMappings, - int64_t vectorSize) +std::optional GetConstantStrideBetweenUnrolledAccesses(mlir::PatternRewriter& rewriter, + OpType op, + std::vector& laneMappings, + int64_t vectorSize) { // Create some unrolled clones in-memory and see whether they are accessing memory-sequential elements in the MemRef + std::stack tempOps; + ir::util::TempOpCleanupGuard tempOpGuard(&tempOps, rewriter); + auto loc = op.getLoc(); std::vector temporaryClones; temporaryClones.reserve(vectorSize); for (int64_t i = 0; i < vectorSize; ++i) { - temporaryClones.push_back(mlir::dyn_cast(rewriter.clone(*op.getOperation(), 
laneMappings[i]))); + auto newTempOp = mlir::dyn_cast(rewriter.clone(*op.getOperation(), laneMappings[i])); + tempOps.push(newTempOp); // Useful for automatic cleanup + temporaryClones.push_back(newTempOp); // Needed for ordered comparison } // Check if the temporary clones are all accessing sequential memory @@ -289,12 +376,12 @@ bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, { if (!ir::util::HasIdentityLayout(op.memref())) { - return false; + return std::nullopt; } strideSymbols = ir::util::GetIdentityMemrefStrideSymbols(rewriter, loc, op.memref()); } - bool sequential = true; + std::optional stride = std::nullopt; for (int64_t unrollIdx = 1; unrollIdx < vectorSize; ++unrollIdx) { std::vector prevIndicesVec(temporaryClones[unrollIdx - 1].indices().begin(), temporaryClones[unrollIdx - 1].indices().end()); @@ -308,6 +395,8 @@ bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, auto currentAccess = ir::util::MultiDimAffineApply(rewriter, loc, accessMapComposition, currentIndicesVec); assert(prevAccess.size() == 1); assert(currentAccess.size() == 1); + tempOps.push(prevAccess[0].getDefiningOp()); + tempOps.push(currentAccess[0].getDefiningOp()); mlir::AffineExpr diffExpr = rewriter.getAffineDimExpr(1) - rewriter.getAffineDimExpr(0); auto diffMap = mlir::AffineMap::get(2, 0, diffExpr); @@ -320,31 +409,53 @@ bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, if (resultExpr.isa()) { auto constExpr = resultExpr.dyn_cast(); - if (constExpr.getValue() != 1) + if (!stride.has_value()) + { + stride = constExpr.getValue(); + } + else if (constExpr.getValue() != *stride) { - // There is a constant difference between sequential op memory accesses - // but the stride is not 1, so the memory isn't contiguous and therefore - // it's not safe to replace all of the memory ops with a single vector op - sequential = false; - break; + // The strides aren't consistent + return std::nullopt; } } else { // There isn't a constant difference between sequential op memory accesses - // so it's not necessarily safe to convert all of the memory ops into a single - // vector op - sequential = false; - break; + return std::nullopt; } } - // Clean up the temporary clones - for (auto& clone : temporaryClones) - { - rewriter.eraseOp(clone); - } - return sequential; + return stride; +} + +template +bool DoesUnrolledAccessHaveStride(mlir::PatternRewriter& rewriter, + OpType op, + std::vector& laneMappings, + int64_t vectorSize, + int64_t stride) +{ + auto strideOpt = GetConstantStrideBetweenUnrolledAccesses(rewriter, op, laneMappings, vectorSize); + return strideOpt.has_value() && *strideOpt == stride; +} + +template +bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, + OpType op, + std::vector& laneMappings, + int64_t vectorSize) +{ + return DoesUnrolledAccessHaveStride(rewriter, op, laneMappings, vectorSize, 1 /* stride */); +} + +template +bool IsUnrolledAccessConstant(mlir::PatternRewriter& rewriter, + OpType op, + std::vector& laneMappings, + int64_t vectorSize) +{ + return DoesUnrolledAccessHaveStride(rewriter, op, laneMappings, vectorSize, 0 /* stride */); } mlir::Value FlattenMemRefCast(mlir::OpBuilder& builder, mlir::Location loc, mlir::Value memref) @@ -488,6 +599,42 @@ std::optional VectorizeStoreOp(mlir::PatternRewriter& rewriter, } } +mlir::vector::LoadOp VectorizeAffineLoadOpHelper(mlir::PatternRewriter& rewriter, + mlir::AffineLoadOp op, + int64_t vectorSize) +{ + auto memRefType = op.getMemRefType(); + auto elementType = memRefType.getElementType(); + 
auto vectorType = mlir::VectorType::get({ vectorSize }, elementType); + mlir::AffineLoadOpAdaptor adaptor{ op }; + std::vector indices(adaptor.indices().begin(), adaptor.indices().end()); + + auto [flatCastMemRef, flattenedPos] = FlattenAccess(rewriter, op, indices); + return rewriter.create(op.getLoc(), vectorType, flatCastMemRef, mlir::ValueRange{ flattenedPos }); +} + +mlir::vector::StoreOp VectorizeAffineStoreOpHelper(mlir::PatternRewriter& rewriter, + mlir::AffineStoreOp op, + mlir::Value vecValToStore, + int64_t vectorSize) +{ + mlir::AffineStoreOpAdaptor adaptor{ op }; + std::vector indices(adaptor.indices().begin(), adaptor.indices().end()); + + auto [flatCastMemRef, flattenedPos] = FlattenAccess(rewriter, op, indices); + return rewriter.create(op.getLoc(), vecValToStore, flatCastMemRef, mlir::ValueRange{ flattenedPos }); +} + +mlir::vector::StoreOp VectorizeAffineStoreOpHelper(mlir::PatternRewriter& rewriter, + mlir::AffineStoreOp op, + mlir::BlockAndValueMapping valueMapping, + int64_t vectorSize) +{ + auto scalarStoreVal = op.getValueToStore(); + assert(valueMapping.contains(scalarStoreVal)); + return VectorizeAffineStoreOpHelper(rewriter, op, valueMapping.lookup(scalarStoreVal), vectorSize); +} + std::optional VectorizeAffineLoadOp(mlir::PatternRewriter& rewriter, mlir::AffineLoadOp op, const VectorizedOpMap& vectorizedOps, @@ -505,24 +652,34 @@ std::optional VectorizeAffineLoadOp(mlir::PatternRewriter& rewrite std::vector baseIndices(adaptor.indices().begin(), adaptor.indices().end()); mlir::Value result; - if (IsUnrolledAccessSequential(rewriter, op, laneMappings, vectorSize)) - { - // We know these reads are sequential, but mlir::vector::LoadOp only operates on memrefs where the minor - // dimension has unit stride, so cast the memref to a flat buffer and load from that shape - auto [flatCastMemref, flattenedPosition] = FlattenAccess(rewriter, op, baseIndices); - result = rewriter.create(op.getLoc(), vectorType, flatCastMemref, mlir::ValueRange{ flattenedPosition }); - } - else + auto strideOpt = GetConstantStrideBetweenUnrolledAccesses(rewriter, op, laneMappings, vectorSize); + if (strideOpt.has_value()) { - // Fall back to many loads and stores into a vector - auto zero = rewriter.create(loc, elementType, rewriter.getZeroAttr(elementType)); - result = rewriter.create(loc, vectorType, zero); - for (int64_t i = 0; i < vectorSize; ++i) + int64_t stride = *strideOpt; + if (stride == 1) { - auto elementLoad = rewriter.clone(*op.getOperation(), laneMappings[i]); - result = rewriter.create(loc, elementLoad->getResult(0), result, rewriter.create(loc, i)); + // We know these reads are sequential, but mlir::vector::LoadOp only operates on memrefs where the minor + // dimension has unit stride, so cast the memref to a flat buffer and load from that shape + auto [flatCastMemref, flattenedPosition] = FlattenAccess(rewriter, op, baseIndices); + result = rewriter.create(op.getLoc(), vectorType, flatCastMemref, mlir::ValueRange{ flattenedPosition }); + return result; + } + else if (stride == 0) + { + // Broadcast a single loaded element + auto clonedLoadOp = mlir::dyn_cast(rewriter.clone(*op.getOperation())); // The original op will likely get discarded as part of successful vectorization + result = rewriter.create(loc, vectorType, clonedLoadOp.getResult()); + return result; } } + // Fall back to many loads and stores into a vector + auto zero = rewriter.create(loc, elementType, rewriter.getZeroAttr(elementType)); + result = rewriter.create(loc, vectorType, zero); + for (int64_t i = 0; i 
< vectorSize; ++i) + { + auto elementLoad = rewriter.clone(*op.getOperation(), laneMappings[i]); + result = rewriter.create(loc, elementLoad->getResult(0), result, rewriter.create(loc, i)); + } return result; } @@ -534,16 +691,28 @@ std::optional VectorizeAffineStoreOp(mlir::PatternRewriter& rewrit int64_t step, int64_t vectorSize) { + [[maybe_unused]] auto loc = op.getLoc(); + // Get (vector) value to store from map mlir::AffineStoreOpAdaptor adaptor{ op }; auto scalarValue = op.getValueToStore(); - auto vecOp = vectorizedOps.Lookup(scalarValue.getDefiningOp()); + auto scalarValueDefOp = scalarValue.getDefiningOp(); + auto vecOp = vectorizedOps.Lookup(scalarValueDefOp); if (!vecOp) { - return std::nullopt; + if (mlir::isa(scalarValueDefOp)) + { + // If it's a constant being stored, just broadcast it to a vector and store that + auto vectorType = mlir::VectorType::get({ vectorSize }, scalarValue.getType()); + mlir::Value broadcastVal = rewriter.create(loc, vectorType, scalarValue); + vecOp = VectorizedOp(broadcastVal); + } + else + { + return std::nullopt; + } } - [[maybe_unused]] auto loc = op.getLoc(); auto memRefType = op.getMemRefType(); [[maybe_unused]] auto elementType = memRefType.getElementType(); @@ -647,6 +816,53 @@ std::optional VectorizeShiftLeftOp(mlir::PatternRewriter& rewr return result; } +// TODO : de-dupe with cast and other simple vectorizable ops +std::optional VectorizeAccRoundOp(mlir::PatternRewriter& rewriter, + v::RoundOp op, + const VectorizedOpMap& vectorizedOps, + std::vector& laneMappings, + mlir::Value inductionVar, + int64_t step, + int64_t vectorSize) +{ + // Get (vector) arguments from map + auto inputOp = op.val(); + auto input = GetVectorizedPredecessor(rewriter, inputOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); + if (!input) + { + return std::nullopt; + } + + auto loc = op.getLoc(); + auto scalarResultType = op.getResult().getType(); + auto resultType = mlir::VectorType::get({ vectorSize }, scalarResultType); + auto result = rewriter.create(loc, resultType, input->GetVectorResult()); + return result; +} + +std::optional VectorizeAccCastOp(mlir::PatternRewriter& rewriter, + v::CastOp op, + const VectorizedOpMap& vectorizedOps, + std::vector& laneMappings, + mlir::Value inductionVar, + int64_t step, + int64_t vectorSize) +{ + // Get (vector) arguments from map + auto inputOp = op.source(); + auto input = GetVectorizedPredecessor(rewriter, inputOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); + if (!input) + { + return std::nullopt; + } + + auto loc = op.getLoc(); + auto scalarResultType = op.getResult().getType(); + auto resultType = mlir::VectorType::get({ vectorSize }, scalarResultType); + auto result = rewriter.create(loc, resultType, input->GetVectorResult()); + return result; +} + std::optional VectorizeFPToSIOp(mlir::PatternRewriter& rewriter, mlir::arith::FPToSIOp op, const VectorizedOpMap& vectorizedOps, @@ -757,7 +973,23 @@ std::optional VectorizeBinOp(mlir::PatternRewriter& rewriter, assert(lhs->HasVectorType() == rhs->HasVectorType()); // TODO : do we need to support the case where one operand is a vector and the other is a series of unrolled values? 
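// Note on the branch below: when the operands were vectorized, MAX and MIN at the 8-lane
// vector width are emitted as dedicated vector max/min ops instead of going through the
// generic BinOp lowering, presumably so the backend can select a single packed max/min
// instruction; all other predicates (and other vector widths) fall through to the generic
// vectorized BinOp path.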
if (lhs->HasVectorType()) { - mlir::Value result = rewriter.create(loc, predicate, lhs->GetVectorResult(), rhs->GetVectorResult()); + mlir::Value result; + auto vectorTy = lhs->GetVectorResult().getType(); + if (vectorSize == 8) + { + // Special-case max and min for better codegen + if (predicate == v::BinaryOpPredicate::MAX) + { + result = rewriter.create(loc, vectorTy, lhs->GetVectorResult(), rhs->GetVectorResult()); + return result; + } + else if (predicate == v::BinaryOpPredicate::MIN) + { + result = rewriter.create(loc, vectorTy, lhs->GetVectorResult(), rhs->GetVectorResult()); + return result; + } + } + result = rewriter.create(loc, predicate, lhs->GetVectorResult(), rhs->GetVectorResult()); return result; } else @@ -905,9 +1137,15 @@ std::optional VectorizeOp(mlir::PatternRewriter& rewriter, .Case([&](v::CmpOp cmpOp) { return VectorizeCmpOp(rewriter, cmpOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); }) + .Case([&](v::CastOp castOp) { + return VectorizeAccCastOp(rewriter, castOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); + }) .Case([&](v::BitcastOp bitcastOp) { return VectorizeBitcastOp(rewriter, bitcastOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); }) + .Case([&](v::RoundOp roundOp) { + return VectorizeAccRoundOp(rewriter, roundOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); + }) .Case([&](v::ReferenceGlobalOp refGlobalOp) { return VectorizeReferenceGlobalOp(rewriter, refGlobalOp, vectorizedOps, laneMappings, inductionVar, step, vectorSize); }) @@ -928,161 +1166,1039 @@ std::optional VectorizeOp(mlir::PatternRewriter& rewriter, return resultOp; } -mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, - mlir::PatternRewriter& rewriter) +// TODO : support multi-dim vector reductions +mlir::LogicalResult vectorizeHorizontalReduction(mlir::AffineForOp affineForOp, mlir::PatternRewriter& rewriter) { + // Try to match a pattern like: + // for indices + // for i: + // x = load(input[..., i]) : memref -> T1 + // y = load(output[...]) : memref (doesn't depend on i) -> T1 + // z = x + y + // store(z, output[...]) : (same position as load) + + // And replace it with: + // flat_input = reinterpret_cast input to flat + // flat_output = reinterpret_cast output to flat + // x = vector_load(flat_input, flatten_input_pos(..., i)) : vector + // y = affine_load(output[...]) : T1 + // z = vector.reduction "add" + // affine_store(z, output[...]) + + // Note: the 'add' operation above can also be many other ops + // See enum values from /mlir/include/mlir/Dialect/Vector/IR/VectorOps.td + // e.g. add, mul, minui, minsi, minf, maxui, maxsi, maxf, and, or, xor + + // Also allow for the loaded values to be cast before the sum + + // So we need to check for the: + // - this affine for op is the innermost loop + // - the loop has constant bounds (TODO: relax this check) + // And the ops in the loop are: + // - loop-sequential load + // - loop-constant load from location Y + // - BinOp of the loaded values + // - store BinOp result to location Y + // Implement the matcher auto reportMatchFailure = [&](mlir::Operation* op, std::string message) -> LogicalResult { llvm::dbgs() << "While processing " << *op << ". 
" << message << "\n"; return rewriter.notifyMatchFailure(op, message); }; - std::stack matchedOps; + std::stack matchedOps; std::stack tempOps; + ir::util::TempOpCleanupGuard(&tempOps, rewriter); - // Match jj and kk loop in int16 matmul for vectorization rewrite rules SmallVector loops; mlir::getPerfectlyNestedLoops(loops, affineForOp); - if (loops.size() != 2) // there should be exactly 2 loops in the nest + if (loops.size() != 1) // there should be exactly 1 loops in the nest being vectorized { return failure(); } - for (auto& loop : loops) + // TODO : support dynamic loops that operate over contiguous memory + if (!affineForOp.hasConstantBounds() || affineForOp.getConstantLowerBound() != 0) { - if (!loop.hasConstantBounds() || loop.getConstantLowerBound() != 0) - { - return failure(); - } + return failure(); } - // order of nested loops we are looking for is - // jj {0 to 8} followed by kk {0 to 2} - auto outerLoop = loops.front(); // jj loop - int64_t jj_begin = outerLoop.getConstantLowerBound(); - int64_t jj_end = outerLoop.getConstantUpperBound(); - int64_t jj_step = outerLoop.getStep(); - int64_t jj_numIters = (jj_end - jj_begin) / jj_step; - if (jj_numIters != 8) - return failure(); - auto jj_inductionVar = outerLoop.getInductionVar(); + int64_t begin = affineForOp.getConstantLowerBound(); + int64_t end = affineForOp.getConstantUpperBound(); + int64_t step = affineForOp.getStep(); + int64_t numIters = (end - begin) / step; + auto inductionVar = affineForOp.getInductionVar(); - auto innerLoop = loops.back(); // the innermost loop, kk - int64_t kk_begin = innerLoop.getConstantLowerBound(); - int64_t kk_end = innerLoop.getConstantUpperBound(); - int64_t kk_step = innerLoop.getStep(); - int64_t kk_numIters = (kk_end - kk_begin) / kk_step; - if (kk_numIters != 2) - return failure(); - auto kk_inductionVar = innerLoop.getInductionVar(); + int64_t unrollMax = std::min(numIters, (end - begin)); + auto vectorSize = unrollMax; // iterate on loop body from begin to end to match the ops list - auto innerLoopBodyIter = innerLoop.getBody()->begin(); - auto innerLoopBodyEnd = innerLoop.getBody()->end(); - - // TODO: deal with case where we load B before A (allow C[i,j] += B[k,j] * A[i,k]) - // TODO: ensure we're storing the updated C value back into the same location (disallow C[m,n] = C[i,j] + A[i,k] * B[k,j]) + auto loopBodyIter = affineForOp.getBody()->begin(); + auto loopBodyEnd = affineForOp.getBody()->end(); - // 1. load from A matrix - if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + // 1. 
load from lhs array + if (loopBodyIter == loopBodyEnd || !isa(*loopBodyIter)) { - return reportMatchFailure(affineForOp, "Failed to match the load from A Op"); + return reportMatchFailure(affineForOp, "Failed to match the lhs load op"); } - auto loadAOp = cast(*innerLoopBodyIter); - auto elementBitWidthA = loadAOp.getMemRefType().getElementTypeBitWidth(); - if (elementBitWidthA != 16) + + auto lhsLoadOp = cast(*loopBodyIter++); + auto lhsLoadVal = lhsLoadOp.getResult(); // Keep the laoded val separate from the current lhs val for mapping later + auto lhsVal = lhsLoadVal; + matchedOps.push(lhsLoadOp); + + // Set up sequential mappings for the loop + std::vector laneMappings(unrollMax); + for (int64_t idx = begin; idx < end; idx += step) { - return failure(); + auto offsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (idx * step)); + auto offsetInductionVar = rewriter.create(lhsLoadOp.getLoc(), offsetMap, ValueRange{ inductionVar }); + tempOps.push(offsetInductionVar); + laneMappings[idx].map(inductionVar, offsetInductionVar); } - matchedOps.push(loadAOp); - // verify load from A looks like A[*,kk] or A[kk,*] - int loadA_kIndex = -1; - for (auto en : llvm::enumerate(loadAOp.indices())) + bool lhsLoadIsLoopSequential = IsUnrolledAccessSequential(rewriter, lhsLoadOp, laneMappings, unrollMax); + bool lhsLoadIsLoopConstant = IsUnrolledAccessConstant(rewriter, lhsLoadOp, laneMappings, unrollMax); + + // 1a. (optional) cast + v::CastOp lhsLoadCastOp; + mlir::Type lhsCastType; + if (isa(*loopBodyIter)) { - auto i = en.value(); - if (i == kk_inductionVar) + lhsLoadCastOp = cast(*loopBodyIter++); + if (lhsLoadCastOp.source() != lhsVal) { - if (loadA_kIndex != -1) - { - return reportMatchFailure(affineForOp, "Failed to match the load from A Op (too many 'k' indicies)"); - } - loadA_kIndex = en.index(); + return reportMatchFailure(affineForOp, "Cast after lhs load isn't casting the loaded value"); } + auto castedValue = lhsLoadCastOp.result(); + lhsCastType = castedValue.getType(); + lhsVal = castedValue; + matchedOps.push(lhsLoadCastOp); } - if (loadA_kIndex == -1) + // 2. load from rhs array + if (loopBodyIter == loopBodyEnd || !isa(*loopBodyIter)) { - return reportMatchFailure(affineForOp, "Failed to match the load from A Op (no 'k' index)"); + return reportMatchFailure(affineForOp, "Failed to match the rhs load op"); } - // 2. load from B matrix - innerLoopBodyIter++; - if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + auto rhsLoadOp = cast(*loopBodyIter++); + auto rhsLoadVal = rhsLoadOp.getResult(); + auto rhsVal = rhsLoadVal; + matchedOps.push(rhsLoadOp); + + bool rhsLoadIsLoopSequential = IsUnrolledAccessSequential(rewriter, rhsLoadOp, laneMappings, unrollMax); + bool rhsLoadIsLoopConstant = IsUnrolledAccessConstant(rewriter, rhsLoadOp, laneMappings, unrollMax); + + // 2a. (optional) cast + v::CastOp rhsLoadCastOp(nullptr); + mlir::Type rhsCastType; + if (isa(*loopBodyIter)) + { + rhsLoadCastOp = cast(*loopBodyIter++); + if (rhsLoadCastOp.source() != rhsVal) + { + return reportMatchFailure(affineForOp, "Cast after rhs load isn't casting the loaded value"); + } + auto castedValue = rhsLoadCastOp.result(); + rhsCastType = castedValue.getType(); + rhsVal = castedValue; + matchedOps.push(rhsLoadCastOp); + } + + // 3. 
bin op + if (loopBodyIter == loopBodyEnd || !isa(*loopBodyIter)) { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op"); + return reportMatchFailure(affineForOp, "Failed to match the bin op"); } - auto loadBOp = cast(innerLoopBodyIter); - auto elementBitWidthB = loadBOp.getMemRefType().getElementTypeBitWidth(); - if (elementBitWidthB != 16) + auto binOp = cast(*loopBodyIter++); + auto binOpVal = binOp.getResult(); + bool lhsRhsLineUp = (binOp.lhs() == lhsVal) && (binOp.rhs() == rhsVal); + bool lhsRhsSwap = (binOp.lhs() == rhsVal) && (binOp.rhs() == lhsVal); + if (!lhsRhsLineUp && !lhsRhsSwap) { - return failure(); + return reportMatchFailure(affineForOp, "Bin op isn't using loaded lhs and rhs values"); } - matchedOps.push(loadBOp); + matchedOps.push(binOp); - // verify load from B looks like B[kk,jj] or B[jj,kk] - int loadB_kIndex = -1; - int loadB_jIndex = -1; - for (auto en : llvm::enumerate(loadBOp.indices())) + auto elementType = binOpVal.getType(); + + // Get the bin op combining kind and verify that it has a vector reduction counterpart + mlir::vector::CombiningKind reductionKind; + // TODO : support AND, OR, MIN, MAX, and XOR as accera bin ops (accera has LOGICAL_AND and LOGICAL_OR, can those be used here?) + switch (binOp.getPredicate()) { - auto i = en.value(); - if (i == kk_inductionVar) + case v::BinaryOpPredicate::ADD: + reductionKind = mlir::vector::CombiningKind::ADD; + break; + case v::BinaryOpPredicate::MUL: + reductionKind = mlir::vector::CombiningKind::MUL; + break; + case v::BinaryOpPredicate::MAX: + if (elementType.isIntOrFloat()) { - if (loadB_kIndex != -1) + if (elementType.isIntOrIndex()) + { + if (elementType.isUnsignedInteger()) + { + reductionKind = mlir::vector::CombiningKind::MAXUI; + } + else + { + reductionKind = mlir::vector::CombiningKind::MAXSI; + } + } + else { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op (too many 'k' indicies)"); + reductionKind = mlir::vector::CombiningKind::MAXF; } - loadB_kIndex = en.index(); } - else if (i == jj_inductionVar) + else + { + return reportMatchFailure(binOp, "'Max' bin op with the given element type cannot be turned into a vector reduction"); + } + break; + case v::BinaryOpPredicate::MIN: + if (elementType.isIntOrFloat()) { - if (loadB_jIndex != -1) + if (elementType.isIntOrIndex()) + { + if (elementType.isUnsignedInteger()) + { + reductionKind = mlir::vector::CombiningKind::MINUI; + } + else + { + reductionKind = mlir::vector::CombiningKind::MINSI; + } + } + else { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op (too many 'j' indicies)"); + reductionKind = mlir::vector::CombiningKind::MINF; } - loadB_jIndex = en.index(); } + else + { + return reportMatchFailure(binOp, "'Min' bin op with the given element type cannot be turned into a vector reduction"); + } + break; + default: + return reportMatchFailure(binOp, "Bin op predicate type cannot be turned into a vector reduction"); } - if (loadB_kIndex == -1) + // 4. 
store to output array + if (loopBodyIter == loopBodyEnd || !isa(*loopBodyIter)) { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op (no 'k' index)"); + return reportMatchFailure(affineForOp, "Failed to match the store op"); } - if (loadB_jIndex == -1) + auto storeOp = cast(*loopBodyIter++); + auto storeMemRefType = storeOp.getMemRefType(); + auto storeElementType = storeMemRefType.getElementType(); + auto storedVal = storeOp.value(); + matchedOps.push(storeOp); + + // Check that the value being stored is the result of the BinOp + if (storedVal != binOpVal) { - return reportMatchFailure(affineForOp, "Failed to match the load from B Op (no 'j' index)"); + return reportMatchFailure(storeOp, "Store op isn't storing the result of the bin op"); } - // 3. muliply A * B - innerLoopBodyIter++; - if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + // Check that store is constant wrt to the loop + bool storeIsLoopConstant = IsUnrolledAccessConstant(rewriter, storeOp, laneMappings, unrollMax); + if (!storeIsLoopConstant) { - return reportMatchFailure(affineForOp, "Failed to match the binary A*B multiplication op"); + return reportMatchFailure(storeOp, "Store op isn't constant wrt the loop being vectorized"); } - auto mulAB = cast(*innerLoopBodyIter); - if (mulAB.predicate() != v::BinaryOpPredicate::MUL) + + // Check which load is sequential wrt the loop and which is constant and which one is being stored to + + mlir::AffineLoadOp outputLoadOp; + if (storeOp.getMemRef() == lhsLoadOp.getMemRef()) { - return reportMatchFailure(mulAB, "Failed to match the multiplication op"); + if (!lhsLoadIsLoopConstant) + { + return reportMatchFailure(lhsLoadOp, "LHS load op isn't constant wrt the loop being vectorized but is the same memref being stored to"); + } + if (!rhsLoadIsLoopSequential) + { + return reportMatchFailure(rhsLoadOp, "RHS load op isn't sequential when LHS load is constant"); + } + outputLoadOp = lhsLoadOp; } - // Check that the operands for the multiply op are in fact the loads from A and B - if (!((mulAB.lhs() == loadAOp && mulAB.rhs() == loadBOp) || (mulAB.rhs() == loadAOp && mulAB.lhs() == loadBOp))) + else if (storeOp.getMemRef() == rhsLoadOp.getMemRef()) { - return reportMatchFailure(mulAB, "Failed to match the multiplication operands"); + if (!rhsLoadIsLoopConstant) + { + return reportMatchFailure(rhsLoadOp, "RHS load op isn't constant wrt the loop being vectorized but is the same memref being stored to"); + } + if (!lhsLoadIsLoopSequential) + { + return reportMatchFailure(lhsLoadOp, "LHS load op isn't sequential when RHS load is constant"); + } + outputLoadOp = rhsLoadOp; + } + else + { + return reportMatchFailure(storeOp, "Store op isn't storing to the same memref as either load"); } - matchedOps.push(mulAB); - // 4. 
sign-extend / cast result of A * B + // Check that the output load and store are at the same position + + auto strideOpt = GetConstantStrideBetweenAccesses(rewriter, outputLoadOp, storeOp); + if (!strideOpt.has_value() || *strideOpt != 0) + { + return reportMatchFailure(storeOp, "Output load and store ops aren't at the same location"); + } + + // At this point we've verified: + // - this affine for op is the innermost loop + // - the loop has constant bounds + // And the ops in the loop are: + // - loop-sequential load + // - loop-constant load from location Y + // - BinOp of the loaded values + // - store BinOp result to location Y + + // Check that all that remains are optionally redundant load-stores and the yield op + + // match the final pair of redundant load and store ops + if (loopBodyIter != loopBodyEnd && isa(*loopBodyIter)) + { + auto loadOp = cast(*loopBodyIter++); + matchedOps.push(loadOp); + if (loopBodyIter != loopBodyEnd && isa(*loopBodyIter)) + { + auto storeOp = cast(*loopBodyIter++); + if (storeOp.getMemRef() != loadOp.getMemRef()) + { + return reportMatchFailure(storeOp, "Extraneous load/store aren't to the same memref"); + } + + auto strideOpt = GetConstantStrideBetweenAccesses(rewriter, loadOp, storeOp); + if (!strideOpt.has_value() || *strideOpt != 0) + { + return reportMatchFailure(storeOp, "Extraneous load/store aren't to the same location"); + } + + matchedOps.push(storeOp); + } + else + { + return reportMatchFailure(loadOp, "Failed to match extraneous store"); + } + } + + // Ignore the yield op at the end + if (loopBodyIter != loopBodyEnd && isa(*loopBodyIter)) + { + (void)loopBodyIter++; + } + + if (loopBodyIter != loopBodyEnd) + { + LLVM_DEBUG(llvm::dbgs() << "Found additional instructions after the store"); + return failure(); + } + + // Set the insertion point to the end of the loop (just before the terminator) + mlir::OpBuilder::InsertionGuard guard(rewriter); + rewriter.setInsertionPoint(affineForOp.getBody(), affineForOp.getBody()->getTerminator()->getIterator()); + + // Now replace the matched ops with the vector load and reduction sequence + mlir::BlockAndValueMapping mappings; + + // LHS Load + mlir::Value vecLhsVal; + if (lhsLoadIsLoopSequential) + { + auto lhsLoadVecOp = VectorizeAffineLoadOpHelper(rewriter, lhsLoadOp, vectorSize); + vecLhsVal = lhsLoadVecOp.getResult(); + mappings.map(lhsLoadVal, vecLhsVal); + } + else + { + vecLhsVal = mlir::cast(rewriter.clone(*lhsLoadOp.getOperation(), mappings)); + } + mappings.map(lhsLoadVal, vecLhsVal); + + // Optional cast + if (lhsLoadCastOp) + { + // Create a vector cast + auto castVecType = mlir::VectorType::get({ vectorSize }, lhsCastType); + vecLhsVal = rewriter.create(lhsLoadCastOp.getLoc(), vecLhsVal, castVecType); + } + mappings.map(lhsVal, vecLhsVal); + + // RHS Load + mlir::Value vecRhsVal; + if (rhsLoadIsLoopSequential) + { + auto rhsLoadVecOp = VectorizeAffineLoadOpHelper(rewriter, rhsLoadOp, vectorSize); + vecRhsVal = rhsLoadVecOp.getResult(); + mappings.map(rhsLoadVal, vecRhsVal); + } + else + { + vecRhsVal = mlir::cast(rewriter.clone(*rhsLoadOp.getOperation(), mappings)); + } + mappings.map(rhsLoadVal, vecRhsVal); + + // Optional cast + if (rhsLoadCastOp) + { + // Create a vector cast + auto castVecType = mlir::VectorType::get({ vectorSize }, rhsCastType); + vecRhsVal = rewriter.create(rhsLoadCastOp.getLoc(), vecRhsVal, castVecType); + } + mappings.map(rhsVal, vecRhsVal); + + // Now create the appropriate vector reduce given the bin op type and apply it to either the LHS vector val or RHS vector 
val, whichever is the loaded vector + auto vectorValToReduce = lhsLoadIsLoopSequential ? vecLhsVal : vecRhsVal; + auto reduceOp = rewriter.create(binOp.getLoc(), storeElementType, mlir::vector::stringifyEnum(reductionKind), vectorValToReduce, mlir::ValueRange{} /* optional accumulate values */); + + mlir::Value reducedVal = reduceOp.getResult(); + auto scalarValThatWasReduced = lhsLoadIsLoopSequential ? lhsVal : rhsVal; + mappings.map(scalarValThatWasReduced, reducedVal); + + // Now we're left with two scalars, since we've reduced one vector to a scalar and the other value was a scalar to begin with. + // Clone the original bin op now that we've vector reduced either the LHS or RHS side and are left with 2 vectors + // At this point, in our mappings we've replaces the original lhsVal and rhsVal with either their cloned scalar versions, + // or the result of the vector reduce + auto finalBinOp = mlir::cast(rewriter.clone(*binOp.getOperation(), mappings)); + mappings.map(binOp, finalBinOp); + + // Clone the final store op + rewriter.clone(*storeOp.getOperation(), mappings); + + // Set the step size for the vectorized loops such that they each have a single iteration and will later get simplified away while replacing any IV usage with their begin value + affineForOp.setStep(step * numIters); + + // Erase the original non-vectorized ops + ir::util::EraseOps(matchedOps, rewriter); + + return mlir::success(); +} + +// TODO : de-dupe with part of vectorizeInt16Matmul matcher +mlir::LogicalResult vectorizeSequentialCast(mlir::AffineForOp affineForOp, mlir::PatternRewriter& rewriter) +{ + // Try to match a pattern like: + // for jj: + // for kk: + // x = load(input[..., jj, kk]) : memref<...x M x N, T1> + // y = cast(x, T2) : T2 + // store(y, output[..., jj, kk]) : memref<...x M x N, T2> + + // And replace it with: + // flat_input = reinterpret_cast input to flat + // flat_output = reinterpret_cast output to flat + // x = vector_load(flat_input, flatten_input_pos(..., jj, kk)) : vector<(M*N)xT1> + // y = cast(x, T2) : vector<(M*N)xT2> + // vector_store(y, flat_output, flatten_output_pos(..., jj, kk)) + + // So we need to check: + // - there are 2 nested loops (TODO : generalize this) + // - the loops have constant bounds (TODO: relax this check) + // - the innermost loop contains a sequential load + // - the innermost loop contains a cast of the loaded value + // - the innermost loop contains a sequential store of the cast value + // - there are no other ops in the innermost loop (other than a loop terminator op) + + // Implement the matcher + auto reportMatchFailure = [&](mlir::Operation* op, std::string message) -> LogicalResult { + llvm::dbgs() << "While processing " << *op << ". 
" << message << "\n"; + return rewriter.notifyMatchFailure(op, message); + }; + + std::stack matchedOps; + std::stack tempOps; + ir::util::TempOpCleanupGuard(&tempOps, rewriter); + + // Match j and k loop + SmallVector loops; + mlir::getPerfectlyNestedLoops(loops, affineForOp); + if (loops.size() != 2) // there should be exactly 2 loops in the nest + { + return failure(); + } + + // TODO : support dynamic loops that operate over contiguous memory + for (auto& loop : loops) + { + if (!loop.hasConstantBounds() || loop.getConstantLowerBound() != 0) + { + return failure(); + } + } + + auto outerLoop = loops.front(); // jj loop + int64_t jj_begin = outerLoop.getConstantLowerBound(); + int64_t jj_end = outerLoop.getConstantUpperBound(); + int64_t jj_step = outerLoop.getStep(); + int64_t jj_numIters = (jj_end - jj_begin) / jj_step; + auto jj_inductionVar = outerLoop.getInductionVar(); + + auto innerLoop = loops.back(); // the innermost loop, kk + int64_t kk_begin = innerLoop.getConstantLowerBound(); + int64_t kk_end = innerLoop.getConstantUpperBound(); + int64_t kk_step = innerLoop.getStep(); + int64_t kk_numIters = (kk_end - kk_begin) / kk_step; + auto kk_inductionVar = innerLoop.getInductionVar(); + + // iterate on loop body from begin to end to match the ops list + auto innerLoopBodyIter = innerLoop.getBody()->begin(); + auto innerLoopBodyEnd = innerLoop.getBody()->end(); + + // 1. load from input array + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the input load op"); + } + + auto loadOp = cast(*innerLoopBodyIter); + auto loadedVal = loadOp.getResult(); + matchedOps.push(loadOp); + + // 2. cast loaded input value + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the cast op"); + } + + auto castOp = cast(*innerLoopBodyIter); + auto castedValue = castOp.result(); + auto castResultType = castedValue.getType(); + matchedOps.push(castOp); + + if (castOp.source() != loadedVal) + { + return reportMatchFailure(affineForOp, "Cast op isn't casting the loaded value"); + } + + // 3. 
store cast value + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the store op"); + } + + auto storeOp = cast(*innerLoopBodyIter); + matchedOps.push(storeOp); + + if (storeOp.value() != castedValue) + { + return reportMatchFailure(affineForOp, "Store op isn't storing the cast value"); + } + + // Ignore the yield op at the end + innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + (void)innerLoopBodyIter++; + } + + if (innerLoopBodyIter != innerLoopBodyEnd) + { + LLVM_DEBUG(llvm::dbgs() << "Found additional instructions after the store"); + return failure(); + } + + // Check if the input loads and output writes are sequential + int64_t unrollMax_jj = std::min(jj_numIters, (jj_end - jj_begin)); + int64_t unrollMax_kk = std::min(kk_numIters, (kk_end - kk_begin)); + + // create lanemappings for jj * kk + std::vector laneMappings(unrollMax_kk * unrollMax_jj); + auto loadLoc = loadOp.getLoc(); + + for (int64_t jj_idx = jj_begin; jj_idx < jj_end; jj_idx += jj_step) + { + auto jjOffsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (jj_idx * jj_step)); + auto offsetInductionVar_jj = rewriter.create(loadLoc, jjOffsetMap, ValueRange{ jj_inductionVar }); + tempOps.push(offsetInductionVar_jj); + for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) + { + auto kkOffsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (kk_idx * kk_step)); + auto offsetInductionVar_kk = rewriter.create(loadLoc, kkOffsetMap, ValueRange{ kk_inductionVar }); + tempOps.push(offsetInductionVar_kk); + BlockAndValueMapping& operandMap = laneMappings[jj_idx * unrollMax_kk + kk_idx]; + operandMap.map(kk_inductionVar, offsetInductionVar_kk); + operandMap.map(jj_inductionVar, offsetInductionVar_jj); + } + } + + int64_t vectorSize = unrollMax_jj * unrollMax_kk; + + if (!IsUnrolledAccessSequential(rewriter, loadOp, laneMappings, vectorSize)) + { + return reportMatchFailure(loadOp, "Failed: isUnrolledAcessSequential for load op"); + } + if (!IsUnrolledAccessSequential(rewriter, storeOp, laneMappings, vectorSize)) + { + return reportMatchFailure(storeOp, "Failed: isUnrolledAcessSequential for store op"); + } + + // At this point we know: + // - there are 2 nested loops + // - the loops have constant bounds + // - the innermost loop contains a load that is sequential wrt the 2 loops + // - the innermost loop contains a cast of the loaded value + // - the innermost loop contains a store of the cast value that is sequential wrt the 2 loops + // - there are no other ops in the innermost loop (other than a loop terminator op) + + // So now we can create the new vectorized version of the loops + + // Set the insertion point to the end of the inner loop (just before the terminator) + mlir::OpBuilder::InsertionGuard guard(rewriter); + rewriter.setInsertionPoint(innerLoop.getBody(), innerLoop.getBody()->getTerminator()->getIterator()); + + // 1. 
create vector load of the input + auto inputMemRefType = loadOp.getMemRefType(); + auto inputElementType = inputMemRefType.getElementType(); + auto inputVectorType = mlir::VectorType::get({ vectorSize }, inputElementType); + mlir::AffineLoadOpAdaptor loadAdaptor{ loadOp }; + std::vector loadIndices(loadAdaptor.indices().begin(), loadAdaptor.indices().end()); + + auto [flatCastInputMemRef, flattenedInputPos] = FlattenAccess(rewriter, loadOp, loadIndices); + auto loadVecOp = rewriter.create(loadOp.getLoc(), inputVectorType, flatCastInputMemRef, mlir::ValueRange{ flattenedInputPos }); + + // 2. create a cast op of the loaded vector + auto castResultVecType = mlir::VectorType::get({ vectorSize }, castResultType); + mlir::Value castVecVal = rewriter.create(castOp.getLoc(), loadVecOp, castResultVecType); + + // 3. create a vector store op of the casted value + mlir::AffineStoreOpAdaptor storeAdaptor{ storeOp }; + std::vector storeIndices(storeAdaptor.indices().begin(), storeAdaptor.indices().end()); + + auto [flatCastOutputMemRef, flattenedOutputPos] = FlattenAccess(rewriter, storeOp, storeIndices); + rewriter.create(storeOp.getLoc(), castVecVal, flatCastOutputMemRef, mlir::ValueRange{ flattenedOutputPos }); + + // Set the step size for the vectorized loops such that they each have a single iteration and will later get simplified away while replacing any IV usage with their begin value + outerLoop.setStep(jj_step * jj_numIters); + innerLoop.setStep(kk_step * kk_numIters); + + // Erase the original non-vectorized ops + ir::util::EraseOps(matchedOps, rewriter); + + return mlir::success(); +} + +mlir::LogicalResult vectorizeTwoRowInterleavedPack(mlir::AffineForOp affineForOp, + mlir::PatternRewriter& rewriter) +{ + // TODO : generalize this beyond 2 rows + + // Try to match a pattern like: + // for jj: + // for kk = 0 ... 2: + // x = load(input[..., kk, jj]) : memref<...x N x M> + // store(x, output[..., jj, kk]) : memref<...x M x N> + + // And replace it with: + // flat_input = reinterpret_cast input to flat + // loaded_vec_0 = vector_load(flat_input, flatten_input_pos(..., 0, i)) // vector + // loaded_vec_1 = vector_load(flat_input, flatten_input_pos(..., 1, i)) // vector + // interleaved = vector.shuffle loaded_vec_0, loaded_vec_1 [0, M, 1, M+1, 2, M+2, ...] + // flat_output = reinterpret_cast output to flat + // vector_store(interleaved, flat_output, flatten_output_pos(..., 0, 0)) + + // So we need to check: + // - there are 2 nested loops (TODO : generalize this) + // - the loops have constant bounds (TODO: relax this check) + // - the innermost loop contains a load that is sequential wrt the outer loop + // - the innermost loop contains a store that is sequential wrt both loops + // - there are no other ops in the innermost loop (other than a loop terminator op) + + // Implement the matcher + auto reportMatchFailure = [&](mlir::Operation* op, std::string message) -> LogicalResult { + llvm::dbgs() << "While processing " << *op << ". 
" << message << "\n"; + return rewriter.notifyMatchFailure(op, message); + }; + + std::stack matchedOps; + std::stack tempOps; + ir::util::TempOpCleanupGuard(&tempOps, rewriter); + + // Match j and k loop + SmallVector loops; + mlir::getPerfectlyNestedLoops(loops, affineForOp); + if (loops.size() != 2) // there should be exactly 2 loops in the nest + { + return failure(); + } + + // TODO : support dynamic loops that operate over contiguous memory + for (auto& loop : loops) + { + if (!loop.hasConstantBounds() || loop.getConstantLowerBound() != 0) + { + return failure(); + } + } + + auto outerLoop = loops.front(); // jj loop + int64_t jj_begin = outerLoop.getConstantLowerBound(); + int64_t jj_end = outerLoop.getConstantUpperBound(); + int64_t jj_step = outerLoop.getStep(); + int64_t jj_numIters = (jj_end - jj_begin) / jj_step; + auto jj_inductionVar = outerLoop.getInductionVar(); + + auto innerLoop = loops.back(); // the innermost loop, kk + int64_t kk_begin = innerLoop.getConstantLowerBound(); + int64_t kk_end = innerLoop.getConstantUpperBound(); + int64_t kk_step = innerLoop.getStep(); + int64_t kk_numIters = (kk_end - kk_begin) / kk_step; + if (kk_numIters != 2) + return failure(); + auto kk_inductionVar = innerLoop.getInductionVar(); + + int64_t unrollMax_jj = std::min(jj_numIters, (jj_end - jj_begin)); + int64_t unrollMax_kk = std::min(kk_numIters, (kk_end - kk_begin)); + + // iterate on loop body from begin to end to match the ops list + auto innerLoopBodyIter = innerLoop.getBody()->begin(); + auto innerLoopBodyEnd = innerLoop.getBody()->end(); + + // 1. load from input array + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the input load op"); + } + + auto loadOp = cast(*innerLoopBodyIter); + auto loadLoc = loadOp.getLoc(); + auto loadedVal = loadOp.getResult(); + matchedOps.push(loadOp); + + // 2. 
store value + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the store op"); + } + + auto storeOp = cast(*innerLoopBodyIter); + matchedOps.push(storeOp); + + if (storeOp.value() != loadedVal) + { + return reportMatchFailure(affineForOp, "Store op isn't storing the loaded value"); + } + + // Ignore the yield op at the end + innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + (void)innerLoopBodyIter++; + } + + if (innerLoopBodyIter != innerLoopBodyEnd) + { + LLVM_DEBUG(llvm::dbgs() << "Found additional instructions after the store"); + return failure(); + } + + // Create two sets of lane mappings: one just for jj and one for jj and kk together + + // create lanemappings for jj + std::vector jj_laneMappings(unrollMax_jj); + + // create lanemappings for jj and kk + std::vector jj_kk_laneMappings(unrollMax_kk * unrollMax_jj); + + for (int64_t jj_idx = jj_begin; jj_idx < jj_end; jj_idx += jj_step) + { + auto jjOffsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (jj_idx * jj_step)); + auto offsetInductionVar_jj = rewriter.create(loadLoc, jjOffsetMap, ValueRange{ jj_inductionVar }); + tempOps.push(offsetInductionVar_jj); + BlockAndValueMapping& jj_operandMap = jj_laneMappings[jj_idx]; + jj_operandMap.map(jj_inductionVar, offsetInductionVar_jj); + for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) + { + auto kkOffsetMap = mlir::AffineMap::get(1, 0, rewriter.getAffineDimExpr(0) + (kk_idx * kk_step)); + auto offsetInductionVar_kk = rewriter.create(loadLoc, kkOffsetMap, ValueRange{ kk_inductionVar }); + tempOps.push(offsetInductionVar_kk); + BlockAndValueMapping& jj_kk_operandMap = jj_kk_laneMappings[jj_idx * unrollMax_kk + kk_idx]; + jj_kk_operandMap.map(kk_inductionVar, offsetInductionVar_kk); + jj_kk_operandMap.map(jj_inductionVar, offsetInductionVar_jj); + } + } + + // Check if the input load is sequential wrt the jj loop + int64_t inputVectorSize = unrollMax_jj; + if (!IsUnrolledAccessSequential(rewriter, loadOp, jj_laneMappings, inputVectorSize)) + { + return reportMatchFailure(loadOp, "Failed: isUnrolledAcessSequential for load op"); + } + + // Check if the output store is sequential wrt the jj and kk loops + int64_t outputVectorSize = unrollMax_jj * unrollMax_kk; + if (!IsUnrolledAccessSequential(rewriter, storeOp, jj_kk_laneMappings, outputVectorSize)) + { + return reportMatchFailure(storeOp, "Failed: isUnrolledAcessSequential for store op"); + } + + // At this point we know: + // - there are 2 nested loops, the inner of which has 2 iterations + // - the loops have constant bounds + // - the innermost loop contains a load that is sequential wrt the outer loop + // - the innermost loop contains a store of the loaded value that is sequential wrt the 2 loops + // - there are no other ops in the innermost loop (other than a loop terminator op) + + // So now we can create the new vectorized version of the loops + + // Set the insertion point to the end of the inner loop (just before the terminator) + mlir::OpBuilder::InsertionGuard guard(rewriter); + rewriter.setInsertionPoint(innerLoop.getBody(), innerLoop.getBody()->getTerminator()->getIterator()); + + // 1. 
create vector load of the input rows + auto inputMemRefType = loadOp.getMemRefType(); + auto inputElementType = inputMemRefType.getElementType(); + auto inputVectorType = mlir::VectorType::get({ inputVectorSize }, inputElementType); + + std::vector loadedVecs; + // Clone the load op for each iteration of the kk loop and vectorize each of those loads wrt the jj loop + for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) + { + auto unrolledInductionVar_kk = rewriter.create(loadLoc, kk_idx); + tempOps.push(unrolledInductionVar_kk); + mlir::BlockAndValueMapping kIterMapping; + kIterMapping.map(kk_inductionVar, unrolledInductionVar_kk); + auto clonedLoadOp = mlir::cast(rewriter.clone(*(loadOp.getOperation()), kIterMapping)); + tempOps.push(clonedLoadOp); + + mlir::AffineLoadOpAdaptor loadAdaptor{ clonedLoadOp }; + std::vector loadIndices(loadAdaptor.indices().begin(), loadAdaptor.indices().end()); + + auto [flatCastInputMemRef, flattenedInputPos] = FlattenAccess(rewriter, clonedLoadOp, loadIndices); + mlir::Value loadedVec = rewriter.create(loadOp.getLoc(), inputVectorType, flatCastInputMemRef, mlir::ValueRange{ flattenedInputPos }); + loadedVecs.push_back(loadedVec); + } + assert(loadedVecs.size() == 2); // Eventually we could relax this, but vector.shuffle ops require precisely 2 vectors, so if we relax this we need to create a sequence of shuffles + + // 2. create a vector.shuffle op to interleave the input rows + std::vector interleaveMask; + interleaveMask.reserve(outputVectorSize); + for (unsigned colIdx = 0; colIdx < unrollMax_jj; ++colIdx) + { + // The vector.shuffle mask should be like { 0, N, 1, N+1, 2, N+2, ... } where the jj loop has N iterations + interleaveMask.push_back(colIdx); + interleaveMask.push_back(colIdx + unrollMax_jj); + } + + auto outputMemRefType = storeOp.getMemRefType(); + auto outputElementType = outputMemRefType.getElementType(); + auto outputVectorType = mlir::VectorType::get({ outputVectorSize }, outputElementType); + auto shuffledRowsOp = rewriter.create(loadLoc, outputVectorType, loadedVecs[0], loadedVecs[1], rewriter.getI64ArrayAttr(interleaveMask)); + + // 3. create a vector store op of the interleaved rows + mlir::AffineStoreOpAdaptor storeAdaptor{ storeOp }; + std::vector storeIndices(storeAdaptor.indices().begin(), storeAdaptor.indices().end()); + + auto [flatCastOutputMemRef, flattenedOutputPos] = FlattenAccess(rewriter, storeOp, storeIndices); + rewriter.create(storeOp.getLoc(), shuffledRowsOp, flatCastOutputMemRef, mlir::ValueRange{ flattenedOutputPos }); + + // Set the step size for the vectorized loops such that they each have a single iteration and will later get simplified away while replacing any IV usage with their begin value + outerLoop.setStep(jj_step * jj_numIters); + innerLoop.setStep(kk_step * kk_numIters); + + // Erase the original non-vectorized ops + ir::util::EraseOps(matchedOps, rewriter); + + return mlir::success(); +} + +mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, + mlir::PatternRewriter& rewriter) +{ + // Implement the matcher + auto reportMatchFailure = [&](mlir::Operation* op, std::string message) -> LogicalResult { + llvm::dbgs() << "While processing " << *op << ". 
" << message << "\n"; + return rewriter.notifyMatchFailure(op, message); + }; + + std::stack matchedOps; + std::stack tempOps; + + // Match jj and kk loop in int16 matmul for vectorization rewrite rules + SmallVector loops; + mlir::getPerfectlyNestedLoops(loops, affineForOp); + if (loops.size() != 2) // there should be exactly 2 loops in the nest + { + return failure(); + } + + for (auto& loop : loops) + { + if (!loop.hasConstantBounds() || loop.getConstantLowerBound() != 0) + { + return failure(); + } + } + + // order of nested loops we are looking for is + // jj {0 to 8} followed by kk {0 to 2} + auto outerLoop = loops.front(); // jj loop + int64_t jj_begin = outerLoop.getConstantLowerBound(); + int64_t jj_end = outerLoop.getConstantUpperBound(); + int64_t jj_step = outerLoop.getStep(); + int64_t jj_numIters = (jj_end - jj_begin) / jj_step; + if (jj_numIters != 8) + return failure(); + auto jj_inductionVar = outerLoop.getInductionVar(); + + auto innerLoop = loops.back(); // the innermost loop, kk + int64_t kk_begin = innerLoop.getConstantLowerBound(); + int64_t kk_end = innerLoop.getConstantUpperBound(); + int64_t kk_step = innerLoop.getStep(); + int64_t kk_numIters = (kk_end - kk_begin) / kk_step; + if (kk_numIters != 2) + return failure(); + auto kk_inductionVar = innerLoop.getInductionVar(); + + // get unroll max for jj and kk + int64_t unrollMax_jj = std::min(jj_numIters, (jj_end - jj_begin)); + int64_t unrollMax_kk = std::min(kk_numIters, (kk_end - kk_begin)); + int64_t vectorSize = unrollMax_jj * unrollMax_kk; + + // create IV map for jj and kk + auto inductionVarMap_jj = AffineMap::get(1, 1, rewriter.getAffineDimExpr(0) + jj_step * rewriter.getAffineSymbolExpr(0)); + auto inductionVarMap_kk = AffineMap::get(1, 1, rewriter.getAffineDimExpr(0) + kk_step * rewriter.getAffineSymbolExpr(0)); + + // create lanemappings for jj, kk, and jj * kk + std::vector laneMappings_jj(unrollMax_jj); + std::vector laneMappings_kk(unrollMax_kk); + std::vector laneMappings_jj_kk(unrollMax_kk * unrollMax_jj); + + for (int64_t jj_idx = jj_begin; jj_idx < jj_end; jj_idx += jj_step) + { + auto offset_jj = rewriter.create(outerLoop.getLoc(), jj_idx); + auto offsetInductionVar_jj = rewriter.create(outerLoop.getLoc(), inductionVarMap_jj, ValueRange{ jj_inductionVar, offset_jj }); + tempOps.push(offset_jj); + tempOps.push(offsetInductionVar_jj); + laneMappings_jj[jj_idx].map(jj_inductionVar, offsetInductionVar_jj); + for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) + { + auto offset_kk = rewriter.create(innerLoop.getLoc(), kk_idx); + auto offsetInductionVar_kk = rewriter.create(innerLoop.getLoc(), inductionVarMap_kk, ValueRange{ kk_inductionVar, offset_kk }); + tempOps.push(offset_kk); + tempOps.push(offsetInductionVar_kk); + laneMappings_jj_kk[jj_idx * unrollMax_kk + kk_idx].map(kk_inductionVar, offsetInductionVar_kk); + laneMappings_jj_kk[jj_idx * unrollMax_kk + kk_idx].map(jj_inductionVar, offsetInductionVar_jj); + if (jj_idx == jj_begin) + { + // Only map for the first iter of jj + laneMappings_kk[kk_idx].map(kk_inductionVar, offsetInductionVar_kk); + } + } + } + + // iterate on loop body from begin to end to match the ops list + auto innerLoopBodyIter = innerLoop.getBody()->begin(); + auto innerLoopBodyEnd = innerLoop.getBody()->end(); + + // TODO: ensure we're storing the updated C value back into the same location (disallow C[m,n] = C[i,j] + A[i,k] * B[k,j]) + + // TODO : de-dupe between first and second cases + + // 1. 
load from first matrix + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the load from the first array"); + } + auto firstLoad = cast(*innerLoopBodyIter); + auto firstElementType = firstLoad.getMemRefType().getElementType(); + matchedOps.push(firstLoad); + + // 1a. Optionally allow casting the A value to an int16 if it is not an int16 already + bool castFirstLoad = false; + mlir::Value firstLoadVal = firstLoad.getResult(); + if (firstElementType != rewriter.getIntegerType(16)) + { + innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + castFirstLoad = true; + auto castOp = cast(*innerLoopBodyIter); + firstLoadVal = castOp.result(); + auto castResultType = firstLoadVal.getType(); + matchedOps.push(castOp); + if (castResultType != rewriter.getIntegerType(16)) + { + return reportMatchFailure(affineForOp, "First load element is not an int16 or cast to an int16"); + } + } + else + { + return reportMatchFailure(affineForOp, "First load is not from an int16 array"); + } + } + + // 2. load from second matrix + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the load from the second array"); + } + auto secondLoad = cast(innerLoopBodyIter); + auto secondElementType = secondLoad.getMemRefType().getElementType(); + matchedOps.push(secondLoad); + + // 2a. Optionally allow casting the B value to an int16 if it is not an int16 already + bool castSecondLoad = false; + mlir::Value secondLoadVal = secondLoad.getResult(); + if (secondElementType != rewriter.getIntegerType(16)) + { + innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + castSecondLoad = true; + auto castOp = cast(*innerLoopBodyIter); + secondLoadVal = castOp.result(); + auto castResultType = secondLoadVal.getType(); + matchedOps.push(castOp); + if (castResultType != rewriter.getIntegerType(16)) + { + return reportMatchFailure(affineForOp, "Second load element is not an int16 or cast to an int16"); + } + } + else + { + return reportMatchFailure(affineForOp, "Second load is not from an int16 array"); + } + } + + // If a load is sequential wrt the inner loop and constant wrt the outer loop, then we want to load the elements and broadcast them to fill a 16-element buffer + // If a load is sequential wrt both loops, then we simply want to load the data + + bool broadcastFirstLoad = IsUnrolledAccessSequential(rewriter, firstLoad, laneMappings_kk, unrollMax_kk) && IsUnrolledAccessConstant(rewriter, firstLoad, laneMappings_jj, unrollMax_jj); + bool broadcastSecondLoad = IsUnrolledAccessSequential(rewriter, secondLoad, laneMappings_kk, unrollMax_kk) && IsUnrolledAccessConstant(rewriter, secondLoad, laneMappings_jj, unrollMax_jj); + + int64_t firstLoadVecSize = vectorSize; + int64_t secondLoadVecSize = vectorSize; + + // 3. 
multiply A * B + innerLoopBodyIter++; + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) + { + return reportMatchFailure(affineForOp, "Failed to match the binary A*B multiplication op"); + } + auto mulAB = cast(*innerLoopBodyIter); + if (mulAB.predicate() != v::BinaryOpPredicate::MUL) + { + return reportMatchFailure(mulAB, "Failed to match the multiplication op"); + } + // Check that the operands for the multiply op are in fact the loads from A and B + if (!((mulAB.lhs() == firstLoadVal && mulAB.rhs() == secondLoadVal) || (mulAB.rhs() == firstLoadVal && mulAB.lhs() == secondLoadVal))) + { + return reportMatchFailure(mulAB, "Failed to match the multiplication operands"); + } + matchedOps.push(mulAB); + + // 4. sign-extend / cast result of A * B innerLoopBodyIter++; if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) { @@ -1104,6 +2220,11 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, { return failure(); } + if (!IsUnrolledAccessSequential(rewriter, loadCOp, laneMappings_jj, vectorSize / 2)) + { + return reportMatchFailure(loadCOp, "Failed: IsUnrolledAccessSequential for C load"); + } + matchedOps.push(loadCOp); // 6. add C + (A * B) @@ -1136,6 +2257,10 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, { return reportMatchFailure(storeCOp, "Failed to match the store into C"); } + if (!IsUnrolledAccessSequential(rewriter, storeCOp, laneMappings_jj, vectorSize / 2)) + { + return reportMatchFailure(storeCOp, "Failed: IsUnrolledAccessSequential for C store"); + } matchedOps.push(storeCOp); // 8. match the final pair of redundant load and store ops @@ -1172,68 +2297,6 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, return failure(); } - // Instantiate a TempOpCleanupGuard so that all the matched ops will get cleaned up - ir::util::TempOpCleanupGuard matchedOpsGuard(&matchedOps, rewriter); - //ir::util::TempOpCleanupGuard tempOpsGuard(&tempOps, rewriter); - - // Check if elements of B are sequential - // get unroll max for jj and kk - int64_t unrollMax_jj = std::min(jj_numIters, (jj_end - jj_begin)); - int64_t unrollMax_kk = std::min(kk_numIters, (kk_end - kk_begin)); - - // create IV map for jj and kk - auto inductionVarMap_jj = AffineMap::get(1, 1, rewriter.getAffineDimExpr(0) + jj_step * rewriter.getAffineSymbolExpr(0)); - auto inductionVarMap_kk = AffineMap::get(1, 1, rewriter.getAffineDimExpr(0) + kk_step * rewriter.getAffineSymbolExpr(0)); - - // create lanemappings for jj * kk - std::vector laneMappings(unrollMax_kk * unrollMax_jj); - auto locB = loadBOp.getLoc(); - - for (int64_t jj_idx = jj_begin; jj_idx < jj_end; jj_idx += jj_step) - { - auto offset_jj = rewriter.create(locB, jj_idx); - auto offsetInductionVar_jj = rewriter.create(locB, inductionVarMap_jj, ValueRange{ jj_inductionVar, offset_jj }); - tempOps.push(offset_jj); - tempOps.push(offsetInductionVar_jj); - for (int64_t kk_idx = kk_begin; kk_idx < kk_end; kk_idx += kk_step) - { - auto offset_kk = rewriter.create(locB, kk_idx); - auto offsetInductionVar_kk = rewriter.create(locB, inductionVarMap_kk, ValueRange{ kk_inductionVar, offset_kk }); - tempOps.push(offset_kk); - tempOps.push(offsetInductionVar_kk); - BlockAndValueMapping& operandMap = laneMappings[jj_idx * unrollMax_kk + kk_idx]; - operandMap.map(kk_inductionVar, offsetInductionVar_kk); - operandMap.map(jj_inductionVar, offsetInductionVar_jj); - } - } - - int64_t vectorSize = 16; - auto memRefTypeB = loadBOp.getMemRefType(); - auto elementTypeB = 
memRefTypeB.getElementType(); - auto vectorTypeB = mlir::VectorType::get({ vectorSize }, elementTypeB); - mlir::AffineLoadOpAdaptor adaptorB{ loadBOp }; - std::vector baseIndicesB(adaptorB.indices().begin(), adaptorB.indices().end()); - - mlir::Value loadBVecOp; - if (!IsUnrolledAccessSequential(rewriter, loadBOp, laneMappings, vectorSize)) - { - return reportMatchFailure(loadBOp, "Failed: isUnrolledAcessSequential for B"); - } - - // Check if elements of output array, Y are sequential - // create lanemappings for jj - std::vector laneMappingsC(unrollMax_jj); - auto loc_loadCOp = loadCOp.getLoc(); - for (int64_t jj_idx = 0; jj_idx < unrollMax_jj; ++jj_idx) - { - auto offset_jj = rewriter.create(loc_loadCOp, jj_idx); - auto offsetInductionVar_jj = rewriter.create(loc_loadCOp, inductionVarMap_jj, ValueRange{ jj_inductionVar, offset_jj }); - tempOps.push(offset_jj); - tempOps.push(offsetInductionVar_jj); - BlockAndValueMapping& operandMapC = laneMappingsC[jj_idx]; - operandMapC.map(jj_inductionVar, offsetInductionVar_jj); - } - auto memRefTypeC = loadCOp.getMemRefType(); auto elementTypeC = memRefTypeC.getElementType(); auto vectorTypeC = mlir::VectorType::get({ vectorSize / 2 }, elementTypeC); @@ -1241,12 +2304,6 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, std::vector baseIndicesC(adaptorC.indices().begin(), adaptorC.indices().end()); mlir::Value loadCVecOp; - if (!IsUnrolledAccessSequential(rewriter, loadCOp, laneMappingsC, vectorSize / 2)) - { - return reportMatchFailure(loadCOp, "Failed: isUnrolledAcessSequential for C"); - } - - // Set the insertion point to the end of the inner loop (just before the terminator) mlir::OpBuilder::InsertionGuard guard(rewriter); rewriter.setInsertionPoint(innerLoop.getBody(), innerLoop.getBody()->getTerminator()->getIterator()); @@ -1279,112 +2336,135 @@ mlir::LogicalResult vectorizeInt16MatMul(mlir::AffineForOp affineForOp, // Implement the rewriter by stiching together a list of vector instructions, vector of 16 elements in this case // 1. create vector.load A - auto memRefType = loadAOp.getMemRefType(); - auto elementType = memRefType.getElementType(); - auto vectorType = mlir::VectorType::get({ vectorSize }, elementType); - mlir::AffineLoadOpAdaptor adaptorA{ loadAOp }; - std::vector baseIndicesA(adaptorA.indices().begin(), adaptorA.indices().end()); - // Ignoring the sequential access check for elements of A because that's not required. - - auto [flatCastMemRef, flattenedPos] = FlattenAccess(rewriter, loadAOp, baseIndicesA); - auto loadAVecOp = rewriter.create(loadAOp.getLoc(), vectorType, flatCastMemRef, mlir::ValueRange{ flattenedPos }); - - // 2. 
create vector.shuffle op for A: alternate between A[0,0] and A[0,1] - auto locA = loadAOp.getLoc(); auto i16Type = rewriter.getIntegerType(16); - auto vecType = mlir::VectorType::get({ vectorSize }, i16Type); + auto i32Type = rewriter.getIntegerType(32); + auto fullVecType = mlir::VectorType::get({ vectorSize }, i16Type); auto altElemsMask = rewriter.getI64ArrayAttr({ 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1 }); auto halfVecType = mlir::VectorType::get({ vectorSize / 2 }, i16Type); auto oddMask = rewriter.getI64ArrayAttr({ 1, 3, 5, 7, 9, 11, 13, 15 }); auto evenMask = rewriter.getI64ArrayAttr({ 0, 2, 4, 6, 8, 10, 12, 14 }); - auto shuffledAOp = rewriter.create(locA, vecType, loadAVecOp, loadAVecOp, altElemsMask); + auto loadCastBroadcastExtractVec = [&](mlir::AffineLoadOp loadOp, int64_t loadVecSize, mlir::Type loadElementType, bool cast, bool broadcast) -> std::tuple { + auto loadOpVectorType = mlir::VectorType::get({ loadVecSize }, loadElementType); + mlir::AffineLoadOpAdaptor loadOpAdaptor{ loadOp }; + std::vector loadOpIndices(loadOpAdaptor.indices().begin(), loadOpAdaptor.indices().end()); + auto [flatCastMemRef, flattenedPos] = FlattenAccess(rewriter, loadOp, loadOpIndices); + mlir::Value loadVecVal = rewriter.create(loadOp.getLoc(), loadOpVectorType, flatCastMemRef, mlir::ValueRange{ flattenedPos }); + if (cast) + { + // 1a. sign-extend loaded vector values + auto castLoadVecType = mlir::VectorType::get({ loadVecSize }, i16Type); + loadVecVal = rewriter.create(loadOp.getLoc(), loadVecVal, castLoadVecType); + } + if (broadcast) + { + // 1b. create vector.shuffle op for first load: alternate between A[0,0] and A[0,1] + loadVecVal = rewriter.create(loadOp.getLoc(), fullVecType, loadVecVal, loadVecVal, altElemsMask); + } - // 3. create vector shuffle op for A to pick odd and even elements separately - auto vecLoadA_oddShuffleOp = rewriter.create(locA, halfVecType, shuffledAOp, shuffledAOp, oddMask); - auto vecLoadA_evenShuffleOp = rewriter.create(locA, halfVecType, shuffledAOp, shuffledAOp, evenMask); + // 2. Now extract the odds and evens + mlir::Value oddShuffleVal = rewriter.create(loadOp.getLoc(), halfVecType, loadVecVal, loadVecVal, oddMask); + mlir::Value evenShuffleVal = rewriter.create(loadOp.getLoc(), halfVecType, loadVecVal, loadVecVal, evenMask); - // 4. 
create vector load op for B - if (IsUnrolledAccessSequential(rewriter, loadBOp, laneMappings, vectorSize)) + return { loadVecVal, oddShuffleVal, evenShuffleVal }; + }; + + + // If there's only one broadcasted load, make sure it happens first for better vpmaddwd matching + mlir::Value firstLoadVec; + mlir::Value firstLoadOdds; + mlir::Value firstLoadEvens; + mlir::Value secondLoadVec; + mlir::Value secondLoadOdds; + mlir::Value secondLoadEvens; + + if (broadcastFirstLoad == broadcastSecondLoad || broadcastFirstLoad) { - auto [flatCastMemRefB, flattenedPosB] = FlattenAccess(rewriter, loadBOp, baseIndicesB); - loadBVecOp = rewriter.create(loadBOp.getLoc(), vectorTypeB, flatCastMemRefB, mlir::ValueRange{ flattenedPosB }); + auto [firstLoadVecVal, firstLoadOddVal, firstLoadEvenVal] = loadCastBroadcastExtractVec(firstLoad, firstLoadVecSize, firstElementType, castFirstLoad, broadcastFirstLoad); + auto [secondLoadVecVal, secondLoadOddVal, secondLoadEvenVal] = loadCastBroadcastExtractVec(secondLoad, secondLoadVecSize, secondElementType, castSecondLoad, broadcastSecondLoad); + firstLoadVec = firstLoadVecVal; + firstLoadOdds = firstLoadOddVal; + firstLoadEvens = firstLoadEvenVal; + secondLoadVec = secondLoadVecVal; + secondLoadOdds = secondLoadOddVal; + secondLoadEvens = secondLoadEvenVal; } else { - return failure(); + // broadcastFirstLoad == false and broadcastSecondLoad == true + auto [firstLoadVecVal, firstLoadOddVal, firstLoadEvenVal] = loadCastBroadcastExtractVec(secondLoad, secondLoadVecSize, secondElementType, castSecondLoad, broadcastSecondLoad); + auto [secondLoadVecVal, secondLoadOddVal, secondLoadEvenVal] = loadCastBroadcastExtractVec(firstLoad, firstLoadVecSize, firstElementType, castFirstLoad, broadcastFirstLoad); + firstLoadVec = firstLoadVecVal; + firstLoadOdds = firstLoadOddVal; + firstLoadEvens = firstLoadEvenVal; + secondLoadVec = secondLoadVecVal; + secondLoadOdds = secondLoadOddVal; + secondLoadEvens = secondLoadEvenVal; } - // 5. create shuffled ops (odd and even) for loadBVecOp - auto vecLoadB_oddShuffleOp = rewriter.create(locB, halfVecType, loadBVecOp, loadBVecOp, oddMask); - auto vecLoadB_evenShuffleOp = rewriter.create(locB, halfVecType, loadBVecOp, loadBVecOp, evenMask); - - // 6. Sign-extend all ops for further arithmetic operations - auto i32Type = rewriter.getIntegerType(32); auto bigVecType = mlir::VectorType::get({ vectorSize / 2 }, i32Type); - auto sextA_oddOp = rewriter.create(rewriter.getUnknownLoc(), vecLoadA_oddShuffleOp, bigVecType); - auto sextA_evenOp = rewriter.create(rewriter.getUnknownLoc(), vecLoadA_evenShuffleOp, bigVecType); - auto sextB_oddOp = rewriter.create(rewriter.getUnknownLoc(), vecLoadB_oddShuffleOp, bigVecType); - auto sextB_evenOp = rewriter.create(rewriter.getUnknownLoc(), vecLoadB_evenShuffleOp, bigVecType); - // 7. binOp.mul for sign-extended even shuffled elements of A and B + // TODO : plumb this from the DSL +#if MATCH_VPMADDWD_INTRINSIC + // (3-5). Create results using vpmaddwd intrinsic + auto accumOp = rewriter.create(outerLoop.getLoc(), bigVecType, firstLoadVec, secondLoadVec); +#else + // 3. 
Sign-extend all ops for further arithmetic operations + // auto i32Type = rewriter.getIntegerType(32); + auto sextA_oddOp = rewriter.create(rewriter.getUnknownLoc(), firstLoadOdds, bigVecType); + auto sextA_evenOp = rewriter.create(rewriter.getUnknownLoc(), firstLoadEvens, bigVecType); + auto sextB_oddOp = rewriter.create(rewriter.getUnknownLoc(), secondLoadOdds, bigVecType); + auto sextB_evenOp = rewriter.create(rewriter.getUnknownLoc(), secondLoadEvens, bigVecType); + + // 4. binOp.mul for sign-extended even shuffled elements of A and B // A[00] * B[0], A[00] * B[2], A[00] * B[4] ... auto vecMulAB_even = rewriter.create(mulAB.getLoc(), sextA_evenOp, sextB_evenOp); // A[01] * B[1], A[01] * B[3], A[01] * B[5] ... auto vecMulAB_odd = rewriter.create(mulAB.getLoc(), sextA_oddOp, sextB_oddOp); - // 8. Add odd/even sign-extended results - auto accABOp = rewriter.create(rewriter.getUnknownLoc(), vecMulAB_even, vecMulAB_odd); + // 5. Add odd/even sign-extended results + auto accumOp = rewriter.create(rewriter.getUnknownLoc(), vecMulAB_even, vecMulAB_odd); +#endif - // 9. Vectorize affine.load of C - if (IsUnrolledAccessSequential(rewriter, loadCOp, laneMappingsC, vectorSize / 2)) - { - // TODO: substitute 0 for jj here - auto [flatCastMemRefC, flattenedPosC] = FlattenAccess(rewriter, loadCOp, baseIndicesC); - loadCVecOp = rewriter.create(loadCOp.getLoc(), vectorTypeC, flatCastMemRefC, mlir::ValueRange{ flattenedPosC }); - } - else - { - return failure(); - } + // 6. Vectorize affine.load of C + auto [flatCastMemRefC, flattenedPosC] = FlattenAccess(rewriter, loadCOp, baseIndicesC); + loadCVecOp = rewriter.create(loadCOp.getLoc(), vectorTypeC, flatCastMemRefC, mlir::ValueRange{ flattenedPosC }); - // 10. Add accABOp to vecLoadC - auto finalAccOp = rewriter.create(accOp.getLoc(), loadCVecOp, accABOp); - - // 11. store final accumulated result to vectorized C - // Verify again if the memory access is sequential and then vectorize the store op - std::vector laneMappingsStoreC(unrollMax_jj); - auto loc_storeCOp = storeCOp.getLoc(); - for (int64_t jj_idx = 0; jj_idx < unrollMax_jj; ++jj_idx) - { - auto offset_jj = rewriter.create(loc_storeCOp, jj_idx); - auto offsetInductionVar_jj = rewriter.create(loc_storeCOp, inductionVarMap_jj, ValueRange{ jj_inductionVar, offset_jj }); - tempOps.push(offset_jj); - tempOps.push(offsetInductionVar_jj); - BlockAndValueMapping& operandMapStoreC = laneMappingsStoreC[jj_idx]; - operandMapStoreC.map(jj_inductionVar, offsetInductionVar_jj); - } + // 7. Add accumOp to vecLoadC + auto finalAccOp = rewriter.create(accOp.getLoc(), loadCVecOp, accumOp); + // 8. 
store final accumulated result to vectorized C mlir::AffineStoreOpAdaptor adaptorStoreC{ storeCOp }; std::vector baseIndicesStoreC(adaptorStoreC.indices().begin(), adaptorStoreC.indices().end()); mlir::vector::StoreOp storeCVecOp; - if (IsUnrolledAccessSequential(rewriter, storeCOp, laneMappingsStoreC, vectorSize / 2)) - { - auto [flatCastMemRefStoreC, flattenedPosStoreC] = FlattenAccess(rewriter, storeCOp, baseIndicesStoreC); - storeCVecOp = rewriter.create(storeCOp.getLoc(), finalAccOp.getResult(), flatCastMemRefStoreC, mlir::ValueRange{ flattenedPosStoreC }); - } - else - { - return failure(); - } + auto [flatCastMemRefStoreC, flattenedPosStoreC] = FlattenAccess(rewriter, storeCOp, baseIndicesStoreC); + + rewriter.create(storeCOp.getLoc(), finalAccOp.getResult(), flatCastMemRefStoreC, mlir::ValueRange{ flattenedPosStoreC }); // Set the step size for the vectorized loops to be the vector size in that dimension outerLoop.setStep(jj_step * jj_numIters); innerLoop.setStep(kk_step * kk_numIters); - + + ir::util::EraseOps(matchedOps, rewriter); + return mlir::success(); } +mlir::LogicalResult TryVectorizeKnownSubgraph(mlir::AffineForOp affineForOp, + mlir::PatternRewriter& rewriter) +{ + // TODO : convert these to rewrite pattern structs with benefit weights + if (succeeded(vectorizeHorizontalReduction(affineForOp, rewriter))) + return success(); + if (succeeded(vectorizeSequentialCast(affineForOp, rewriter))) + return success(); + if (succeeded(vectorizeTwoRowInterleavedPack(affineForOp, rewriter))) + return success(); + if (succeeded(vectorizeInt16MatMul(affineForOp, rewriter))) + return success(); + return failure(); +} + } // namespace accera::transforms diff --git a/accera/transforms/src/value/RangeValueOptimizePass.cpp b/accera/transforms/src/value/RangeValueOptimizePass.cpp index e6190cb9..4ede182f 100644 --- a/accera/transforms/src/value/RangeValueOptimizePass.cpp +++ b/accera/transforms/src/value/RangeValueOptimizePass.cpp @@ -1,7 +1,7 @@ //////////////////////////////////////////////////////////////////////////////////////////////////// // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. See LICENSE in the project root for license information. 
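For reference, the two-row interleaved pack rewrite above builds its vector.shuffle mask as { 0, N, 1, N+1, ... } for an N-iteration jj loop. A minimal standalone sketch of that mask construction (plain C++ for illustration only; `InterleaveMask` is a hypothetical helper, not an Accera API):
```
// Interleave two N-element rows: result = { row0[0], row1[0], row0[1], row1[1], ... }
// The shuffle mask indexes the concatenation of the two loaded vectors, so entry
// 2*col picks row0[col] (index col) and entry 2*col+1 picks row1[col] (index col + N).
#include <cstdint>
#include <vector>

std::vector<int64_t> InterleaveMask(int64_t n)
{
    std::vector<int64_t> mask;
    mask.reserve(2 * n);
    for (int64_t col = 0; col < n; ++col)
    {
        mask.push_back(col);     // element col of the first loaded row
        mask.push_back(col + n); // element col of the second loaded row
    }
    return mask;
}
```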
-// Authors: Abdul Dakkak +// Authors: Abdul Dakkak, Mason Remy //////////////////////////////////////////////////////////////////////////////////////////////////// #include "AcceraPasses.h" @@ -12,7 +12,9 @@ #include #include +#include #include +#include #include #include #include @@ -39,6 +41,7 @@ #include #include +#include #define DEBUG_TYPE "value-optimize" @@ -55,101 +58,248 @@ using llvm::Instruction; namespace { -struct RangeValueOptimizePass : public ConvertRangeValueOptimizeBase + +enum class CmpIOpClassification : int { - void runOnOperation() final + Unknown, + AlwaysFalse, + AlwaysTrue +}; + +// TODO : de-dupe with value-to-std +static arith::CmpIPredicate CmpOpPredicateToCmpIPredicate(accera::ir::value::CmpOpPredicate pred) +{ +#define MAP_PREDICATE(v1, v2) \ + case accera::ir::value::CmpOpPredicate::v1: \ + return arith::CmpIPredicate::v2 + + switch (pred) { - rangeValue = &getAnalysis(); - - // now we use them to classify the comparison operation - auto ctx = &getContext(); - OpBuilder builder(ctx); - Type i1Ty = builder.getI1Type(); - getOperation()->walk([&](arith::CmpIOp op) { - auto classification = classifyCmpIOp(op); - if (classification != CmpIOpClassification::Unknown) - { - builder.setInsertionPoint(op); - Value val = builder.create(op->getLoc(), i1Ty, builder.getBoolAttr(classification == CmpIOpClassification::AlwaysTrue)); - op.replaceAllUsesWith(val); - op.erase(); - } - }); + MAP_PREDICATE(EQ, eq); + MAP_PREDICATE(GE, sge); + MAP_PREDICATE(GT, sgt); + MAP_PREDICATE(LE, sle); + MAP_PREDICATE(LT, slt); + MAP_PREDICATE(NE, ne); + default: + assert(false); + } + +#undef MAP_PREDICATE +} + +CmpIOpClassification classifyCmpIOp(RangeValueAnalysis& rangeValue, arith::CmpIOp op) +{ + auto predicate = op.getPredicate(); + auto lhs = op.getLhs(); + auto rhs = op.getRhs(); + if (!rangeValue.hasRange(lhs) || !rangeValue.hasRange(rhs)) + { + return CmpIOpClassification::Unknown; + } + auto lhsRange = rangeValue.getRange(lhs); + auto rhsRange = rangeValue.getRange(rhs); + if (lhsRange.isFullSet() || rhsRange.isFullSet()) + { + return CmpIOpClassification::Unknown; + } + + switch (predicate) + { + case arith::CmpIPredicate::slt: + if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLT, rhsRange)) + { + return CmpIOpClassification::AlwaysTrue; + } + else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGE, rhsRange)) + { + return CmpIOpClassification::AlwaysFalse; + } + break; + case arith::CmpIPredicate::sle: + if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLE, rhsRange)) + { + return CmpIOpClassification::AlwaysTrue; + } + else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGT, rhsRange)) + { + return CmpIOpClassification::AlwaysFalse; + } + break; + case arith::CmpIPredicate::sgt: + if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGT, rhsRange)) + { + return CmpIOpClassification::AlwaysTrue; + } + else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLE, rhsRange)) + { + return CmpIOpClassification::AlwaysFalse; + } + break; + case arith::CmpIPredicate::sge: + if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGE, rhsRange)) + { + return CmpIOpClassification::AlwaysTrue; + } + else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLT, rhsRange)) + { + return CmpIOpClassification::AlwaysFalse; + } + break; + default: + break; + } + + return CmpIOpClassification::Unknown; +} + +std::optional GetConstantCmpIOpResult(arith::CmpIOp cmpIOp) +{ + RangeValueAnalysis rangeValueAnalysis(cmpIOp); + auto classification = classifyCmpIOp(rangeValueAnalysis, cmpIOp); + if (classification != CmpIOpClassification::Unknown) + { + 
return classification == CmpIOpClassification::AlwaysTrue; + } + return std::nullopt; +} + +LogicalResult RewriteConstantCmpIOpCommon(PatternRewriter& rewriter, arith::CmpIOp cmpIOp, mlir::Operation* opToReplace = nullptr) +{ + if (!opToReplace) + { + opToReplace = cmpIOp; + } + + auto constantCmpIOpResultOpt = GetConstantCmpIOpResult(cmpIOp); + + if (constantCmpIOpResultOpt.has_value()) + { + Type i1Ty = rewriter.getI1Type(); + rewriter.replaceOpWithNewOp(opToReplace, i1Ty, rewriter.getBoolAttr(*constantCmpIOpResultOpt)); + return mlir::success(); + } + return mlir::failure(); +} + +struct ConstantCmpIOpRewrite : public mlir::OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + LogicalResult matchAndRewrite(arith::CmpIOp op, PatternRewriter& rewriter) const final + { + return RewriteConstantCmpIOpCommon(rewriter, op); } +}; -private: - enum CmpIOpClassification : int +struct ConstantAcceraCmpOpRewrite : public mlir::OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + LogicalResult matchAndRewrite(accera::ir::value::CmpOp op, PatternRewriter& rewriter) const final { - Unknown, - AlwaysFalse, - AlwaysTrue - }; + std::stack tempOps; + TempOpCleanupGuard guard(&tempOps, rewriter); - CmpIOpClassification classifyCmpIOp(arith::CmpIOp op) + // TODO : de-dupe with value-to-std conversion + auto lhs = op.lhs(); + auto rhs = op.rhs(); + + auto pred = op.getPredicate(); + if (util::GetElementType(lhs.getType()).isa()) + { + // Doesn't support CmpFOp classification currently + return failure(); + } + auto stdCmpIOp = rewriter.create(op.getLoc(), CmpOpPredicateToCmpIPredicate(pred), lhs, rhs); + tempOps.push(stdCmpIOp.getOperation()); + + return RewriteConstantCmpIOpCommon(rewriter, stdCmpIOp, op); + } +}; + +struct ConstantAcceraMaxMinOpRewrite : public mlir::OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + LogicalResult matchAndRewrite(BinOp op, PatternRewriter& rewriter) const final { + // If the Bin op is a max or a min, then check if it is always equal to one of its operands + // i.e. 
if we have z = max(x, y), and x <= y always, then replace max(x, y) with y + // To do this, check: + // (x <= y), and + // (x >= y) + // If the former is always true, then replace max(x, y) with y, min(x, y) with x + // If the latter is always true, then replace max(x, y) with x, min(x, y) with y + // If neither are always true, then don't replace the max or min op + // We have to check both to handle the case where a '<' or '>' check doesn't capture that the point where they are equal doesn't change which operand is the replacement value of the max/min and to avoid an operand ordering bias + auto predicate = op.getPredicate(); - auto lhs = op.getLhs(); - auto rhs = op.getRhs(); - if (!rangeValue->hasRange(lhs) || !rangeValue->hasRange(rhs)) + if (predicate != BinaryOpPredicate::MAX && predicate != BinaryOpPredicate::MIN) { - return CmpIOpClassification::Unknown; + return failure(); } - auto lhsRange = rangeValue->getRange(lhs); - auto rhsRange = rangeValue->getRange(rhs); - if (lhsRange.isFullSet() || rhsRange.isFullSet()) + std::stack tempOps; + TempOpCleanupGuard guard(&tempOps, rewriter); + + auto lhs = op.lhs(); + auto rhs = op.rhs(); + + if (util::GetElementType(lhs.getType()).isa()) { - return CmpIOpClassification::Unknown; + // Doesn't support CmpFOp classification currently + return failure(); } + auto LEQCmpIOp = rewriter.create(op.getLoc(), arith::CmpIPredicate::sle, lhs, rhs); + tempOps.push(LEQCmpIOp.getOperation()); + auto LEQconstantResultOpt = GetConstantCmpIOpResult(LEQCmpIOp); - switch (predicate) + if (LEQconstantResultOpt.has_value() && *LEQconstantResultOpt) { - case arith::CmpIPredicate::slt: - if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLT, rhsRange)) - { - return CmpIOpClassification::AlwaysTrue; - } - else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGE, rhsRange)) - { - return CmpIOpClassification::AlwaysFalse; - } - break; - case arith::CmpIPredicate::sle: - if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLE, rhsRange)) - { - return CmpIOpClassification::AlwaysTrue; - } - else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGT, rhsRange)) - { - return CmpIOpClassification::AlwaysFalse; - } - break; - case arith::CmpIPredicate::sgt: - if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGT, rhsRange)) + if (predicate == BinaryOpPredicate::MAX) { - return CmpIOpClassification::AlwaysTrue; + rewriter.replaceOp(op, mlir::ValueRange{ rhs }); } - else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLE, rhsRange)) + else { - return CmpIOpClassification::AlwaysFalse; + rewriter.replaceOp(op, mlir::ValueRange{ lhs }); } - break; - case arith::CmpIPredicate::sge: - if (lhsRange.icmp(CmpInst::Predicate::ICMP_SGE, rhsRange)) + return success(); + } + + auto GEQCmpIOp = rewriter.create(op.getLoc(), arith::CmpIPredicate::sge, lhs, rhs); + tempOps.push(GEQCmpIOp.getOperation()); + auto GEQconstantResultOpt = GetConstantCmpIOpResult(GEQCmpIOp); + + if (GEQconstantResultOpt.has_value() && *GEQconstantResultOpt) + { + if (predicate == BinaryOpPredicate::MAX) { - return CmpIOpClassification::AlwaysTrue; + rewriter.replaceOp(op, mlir::ValueRange{ lhs }); } - else if (lhsRange.icmp(CmpInst::Predicate::ICMP_SLT, rhsRange)) + else { - return CmpIOpClassification::AlwaysFalse; + rewriter.replaceOp(op, mlir::ValueRange{ rhs }); } - break; - default: - break; + return success(); } + return failure(); + } +}; - return CmpIOpClassification::Unknown; + +struct RangeValueOptimizePass : public ConvertRangeValueOptimizeBase +{ + void runOnOperation() final + { + auto context = &getContext(); + auto operation = 
getOperation(); + + mlir::GreedyRewriteConfig topDownConfig; // Handle outer simplifications first as they will resolve to constants need for inner simplifications + topDownConfig.useTopDownTraversal = true; + + mlir::RewritePatternSet patterns(context); + accera::transforms::value::populateRangeValueOptimizePatterns(patterns); + util::FillCanonicalPatternsRecursively(operation, patterns); + (void)applyPatternsAndFoldGreedily(operation, std::move(patterns), topDownConfig); } - RangeValueAnalysis* rangeValue = nullptr; }; } // namespace @@ -157,6 +307,13 @@ struct RangeValueOptimizePass : public ConvertRangeValueOptimizeBase(patterns.getContext()); +} + std::unique_ptr createRangeValueOptimizePass() { return std::make_unique(); diff --git a/accera/transforms/src/value/ValueFuncToTargetPass.cpp b/accera/transforms/src/value/ValueFuncToTargetPass.cpp index 530f57e8..b7e3953b 100644 --- a/accera/transforms/src/value/ValueFuncToTargetPass.cpp +++ b/accera/transforms/src/value/ValueFuncToTargetPass.cpp @@ -220,7 +220,8 @@ struct ValueLambdaRewritePattern : mlir::OpRewritePattern // gpu functions fail since hiprtc does not call the host launcher function // but instead calls the kernel directly. llvm::SetVector capturedValuesSet; - for (auto&& v : op->getParentOfType().getArguments()) + auto parentFuncOp = op->getParentOfType(); + for (auto&& v : parentFuncOp.getArguments()) { capturedValuesSet.insert(v); } @@ -306,6 +307,11 @@ struct ValueLambdaRewritePattern : mlir::OpRewritePattern mapValueTypeAttr(vFuncOp, valueMapper); + if (parentFuncOp->hasAttr(ir::NoInlineIntoAttrName)) + { + vFuncOp->setAttr(ir::NoInlineIntoAttrName, rewriter.getUnitAttr()); + } + rewriter.eraseOp(op); } }; @@ -324,6 +330,13 @@ struct ValueLaunchFuncOpInlinerPattern : OpRewritePattern // Don't inline calls from RawPointerAPI functions return failure(); } + if (parentFnOp->getAttr(ir::NoInlineIntoAttrName)) + { + // If this launch op is inside of a function that is not inlinable-into, then don't inline the function we're calling + // By doing this, only the outer publically-visible function will have its internal calls inlined and we won't + // wind up bloating our module with function contents that will never be invoked + return failure(); + } if (auto attr = parentFnOp->getAttrOfType(vir::ValueFuncOp::getExecTargetAttrName()); attr && target == attr) diff --git a/accera/transforms/src/value/ValueSimplifyPass.cpp b/accera/transforms/src/value/ValueSimplifyPass.cpp index d9ef2dfc..b72d80a8 100644 --- a/accera/transforms/src/value/ValueSimplifyPass.cpp +++ b/accera/transforms/src/value/ValueSimplifyPass.cpp @@ -448,7 +448,7 @@ struct IndexCombinationBinOpLowering : public OpRewritePattern combinationExpr = lhsExpr % rhsExpr; break; default: - assert(false); + return failure(); } auto map = mlir::AffineMap::get(nextDimIdx, 0, combinationExpr); rewriter.replaceOpWithNewOp(op, map, exprInputs); diff --git a/accera/transforms/src/value/ValueToLLVMLoweringPass.cpp b/accera/transforms/src/value/ValueToLLVMLoweringPass.cpp index ec3e28f8..00df5544 100644 --- a/accera/transforms/src/value/ValueToLLVMLoweringPass.cpp +++ b/accera/transforms/src/value/ValueToLLVMLoweringPass.cpp @@ -8,6 +8,7 @@ #include #include +#include #include #include #include @@ -555,6 +556,89 @@ struct MemrefAllocOpLowering : public ConvertOpToLLVMPattern } }; +// TODO : de-dupe these lowerings, all 2-arg-1-result vector intrinsics appear to have the same lowering +struct VpmaddwdOpLowering : public ValueLLVMOpConversionPattern +{ + using 
ValueLLVMOpConversionPattern::ValueLLVMOpConversionPattern; + + LogicalResult matchAndRewrite( + vpmaddwd op, + OpAdaptor adaptor, + ConversionPatternRewriter& rewriter) const override + { + LLVMTypeConverter llvmTypeConverter(rewriter.getContext()); + auto outputVecType = op.getType().cast(); + auto outputVecLLVMType = llvmTypeConverter.convertType(outputVecType); + rewriter.replaceOpWithNewOp(op, outputVecLLVMType, op.lhs(), op.rhs()); + return success(); + } +}; + +struct VmaxpsOpLowering : public ValueLLVMOpConversionPattern +{ + using ValueLLVMOpConversionPattern::ValueLLVMOpConversionPattern; + + LogicalResult matchAndRewrite( + vmaxps op, + OpAdaptor adaptor, + ConversionPatternRewriter& rewriter) const override + { + LLVMTypeConverter llvmTypeConverter(rewriter.getContext()); + auto outputVecType = op.getType().cast(); + auto outputVecLLVMType = llvmTypeConverter.convertType(outputVecType); + rewriter.replaceOpWithNewOp(op, outputVecLLVMType, op.lhs(), op.rhs()); + return success(); + } +}; + +struct VminpsOpLowering : public ValueLLVMOpConversionPattern +{ + using ValueLLVMOpConversionPattern::ValueLLVMOpConversionPattern; + + LogicalResult matchAndRewrite( + vminps op, + OpAdaptor adaptor, + ConversionPatternRewriter& rewriter) const override + { + LLVMTypeConverter llvmTypeConverter(rewriter.getContext()); + auto outputVecType = op.getType().cast(); + auto outputVecLLVMType = llvmTypeConverter.convertType(outputVecType); + rewriter.replaceOpWithNewOp(op, outputVecLLVMType, op.lhs(), op.rhs()); + return success(); + } +}; + +struct RoundOpLowering : public ValueLLVMOpConversionPattern +{ + using ValueLLVMOpConversionPattern::ValueLLVMOpConversionPattern; + + LogicalResult matchAndRewrite( + RoundOp op, + OpAdaptor adaptor, + ConversionPatternRewriter& rewriter) const override + { + LLVMTypeConverter llvmTypeConverter(rewriter.getContext()); + auto outputType = llvmTypeConverter.convertType(op.getType()); + + auto inputType = op.val().getType(); + if (inputType.isa()) + { + rewriter.replaceOpWithNewOp(op, outputType, op.val()); + } + else + { + mlir::Value roundedFPVal = rewriter.create(op.getLoc(), op.val()); + + // Create arithmetic dialect cast ops with the expectation that other arithmetic dialect ops are getting lowered as part of this pass + auto signlessOutputType = util::ToSignlessMLIRType(rewriter, op.getType()); + mlir::Value roundedSIVal = rewriter.create(op.getLoc(), roundedFPVal, signlessOutputType); + rewriter.replaceOpWithNewOp(op, op.getType(), roundedSIVal); + } + return success(); + } +}; + + struct ValueToLLVMLoweringPass : public ConvertValueToLLVMBase { ValueToLLVMLoweringPass(bool useBarePtrCallConv, bool emitCWrappers, unsigned indexBitwidth, bool useAlignedAlloc, llvm::DataLayout dataLayout, const IntraPassSnapshotOptions& snapshotteroptions = {}) : @@ -1281,6 +1365,7 @@ void ValueToLLVMLoweringPass::runOnModule() snapshotter.Snapshot("Initial", moduleOp); target.addLegalOp(); + target.addLegalDialect(); // Set pass parameter values with command line options inherited from ConvertValueToLLVMBase mlir::LowerToLLVMOptions options(&getContext()); @@ -1328,16 +1413,28 @@ void ValueToLLVMLoweringPass::runOnModule() snapshotter.Snapshot("BarePtrConversion", moduleOp); { + auto intermediateTarget = target; + intermediateTarget.addLegalDialect(); + intermediateTarget.addLegalDialect(); + RewritePatternSet patterns(&getContext()); populateValueToLLVMPatterns(llvmTypeConverter, patterns); populateLinalgToLLVMConversionPatterns(llvmTypeConverter, patterns); 
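For reference, the `vpmaddwd` op lowered above (and the odd/even shuffle + sign-extend + multiply + add sequence emitted by the int16 matmul rewrite when the intrinsic path is not taken) produces one 32-bit lane from each adjacent pair of 16-bit lanes. A minimal scalar reference model for the 16-to-8 lane case used above (plain C++ for illustration only; `PmaddwdReference` is a hypothetical name, not an Accera or LLVM API):
```
// out[j] = (int32)a[2j] * (int32)b[2j] + (int32)a[2j+1] * (int32)b[2j+1]
#include <array>
#include <cstdint>

std::array<int32_t, 8> PmaddwdReference(const std::array<int16_t, 16>& a,
                                         const std::array<int16_t, 16>& b)
{
    std::array<int32_t, 8> out{};
    for (int j = 0; j < 8; ++j)
    {
        // Sign-extend each int16 operand to int32 before multiplying, then add the adjacent pair
        out[j] = static_cast<int32_t>(a[2 * j]) * static_cast<int32_t>(b[2 * j]) +
                 static_cast<int32_t>(a[2 * j + 1]) * static_cast<int32_t>(b[2 * j + 1]);
    }
    return out;
}
```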
populateVectorToLLVMConversionPatterns(llvmTypeConverter, patterns, /*reassociateFPReductions*/ true); + + // Subset of LowerVectorToLLVMPass patterns + vector::populateVectorToVectorCanonicalizationPatterns(patterns); + vector::populateVectorBroadcastLoweringPatterns(patterns); + vector::populateVectorMaskOpLoweringPatterns(patterns); + vector::populateVectorShapeCastLoweringPatterns(patterns); + vector::populateVectorTransposeLoweringPatterns(patterns); + vector::populateVectorTransferLoweringPatterns(patterns, /*maxTransferRank=*/1); vector::populateVectorContractLoweringPatterns(patterns, vector::VectorTransformsOptions{}.setVectorTransferSplit(mlir::vector::VectorTransferSplit::VectorTransfer)); vector::populateVectorMaskMaterializationPatterns(patterns, true); - if (failed(applyPartialConversion(moduleOp, target, std::move(patterns)))) + if (failed(applyPartialConversion(moduleOp, intermediateTarget, std::move(patterns)))) { signalPassFailure(); } @@ -1353,6 +1450,15 @@ void ValueToLLVMLoweringPass::runOnModule() populateMemRefToLLVMConversionPatterns(llvmTypeConverter, patterns); populateStdToLLVMConversionPatterns(llvmTypeConverter, patterns); arith::populateArithmeticToLLVMConversionPatterns(llvmTypeConverter, patterns); + arith::populateArithmeticExpandOpsPatterns(patterns); + + // Subset of LowerVectorToLLVMPass patterns + vector::populateVectorToVectorCanonicalizationPatterns(patterns); + vector::populateVectorBroadcastLoweringPatterns(patterns); + vector::populateVectorMaskOpLoweringPatterns(patterns); + vector::populateVectorShapeCastLoweringPatterns(patterns); + vector::populateVectorTransposeLoweringPatterns(patterns); + vector::populateVectorTransferLoweringPatterns(patterns, /*maxTransferRank=*/1); populateVectorToLLVMConversionPatterns(llvmTypeConverter, patterns, /*reassociateFPReductions*/ true); vector::populateVectorContractLoweringPatterns(patterns, vector::VectorTransformsOptions{}.setVectorTransferSplit(mlir::vector::VectorTransferSplit::VectorTransfer)); @@ -1413,6 +1519,10 @@ void populateLocalValueToLLVMPatterns(mlir::LLVMTypeConverter& typeConverter, ml PrintFOpLowering, GetTimeOpLowering, RangeOpLowering, + VpmaddwdOpLowering, + VmaxpsOpLowering, + VminpsOpLowering, + RoundOpLowering, MemrefAllocOpLowering>(typeConverter, context); } diff --git a/accera/transforms/src/value/ValueToStandardLoweringPass.cpp b/accera/transforms/src/value/ValueToStandardLoweringPass.cpp index 70ab243e..12da3410 100644 --- a/accera/transforms/src/value/ValueToStandardLoweringPass.cpp +++ b/accera/transforms/src/value/ValueToStandardLoweringPass.cpp @@ -472,23 +472,40 @@ struct AllocOpLowering : public OpRewritePattern auto memrefType = op.getType(); auto allocType = op.allocType().getValueOr(vir::MemoryAllocType::Global); + OpBuilder::InsertionGuard guard(rewriter); + auto parentFuncOp = op->getParentOfType(); + mlir::memref::AllocOp allocOp; + mlir::Block* parentBlock; + mlir::Value allocatedMemref; switch (allocType) { case vir::MemoryAllocType::Global: { - if (memrefType.getNumDynamicDims() == 0) - { - auto globalOp = irutil::CreateGlobalBufferOp(rewriter, op, MemRefType::Builder{ memrefType }.setLayout({}), kGlobalOpSymNameFormat); - rewriter.replaceOpWithNewOp(op, memrefType, globalOp.sym_name()); - } - else - { - rewriter.replaceOpWithNewOp(op, memrefType, op.getOperation()->getOperands(), op.alignmentAttr()); - } + if (memrefType.getNumDynamicDims() == 0) + { + auto globalOp = irutil::CreateGlobalBufferOp(rewriter, op, MemRefType::Builder{ memrefType }.setLayout({}), 
kGlobalOpSymNameFormat); + rewriter.replaceOpWithNewOp(op, memrefType, globalOp.sym_name()); } - break; + else + { + rewriter.replaceOpWithNewOp(op, memrefType, op.getOperation()->getOperands(), op.alignmentAttr()); + } + } + break; case vir::MemoryAllocType::Stack: + // Create the stack allocation at the beginning of the function + rewriter.setInsertionPointToStart(&parentFuncOp.front()); rewriter.replaceOpWithNewOp(op, MemRefType::Builder{ memrefType }.setLayout({}), mlir::ValueRange{}, op.alignmentAttr()); break; + case vir::MemoryAllocType::Heap: + allocOp = rewriter.replaceOpWithNewOp(op, memrefType, op.getOperation()->getOperands(), op.alignmentAttr()); + + // Create a dealloc op at the end of the block containing this alloc op + parentBlock = allocOp->getBlock(); + rewriter.setInsertionPoint(parentBlock->getTerminator()); + + allocatedMemref = allocOp.getResult(); + rewriter.create(allocOp.getLoc(), allocatedMemref); + break; default: llvm_unreachable("Unknown alloc type"); } @@ -506,19 +523,19 @@ struct AllocOpLowering : public OpRewritePattern using ValueCastOp = vir::CastOp; struct CastOpLowering : public OpRewritePattern { -#define CAST_FROM_TO_WITH_OP_IF(testFromType, testToType, castOp, conditional) \ - if (fromType && toType && fromType.isa() && toType.isa() && conditional) \ - { \ - mlir::Value castValue = rewriter.create(op.getLoc(), signlessFromValue, signlessToType); \ - if (toType.isIntOrIndex()) \ - { \ - rewriter.replaceOpWithNewOp(op, toType, castValue); \ - } \ - else \ - { \ - rewriter.replaceOp(op, { castValue } ); \ - } \ - return success(); \ +#define CAST_FROM_TO_WITH_OP_IF(testFromType, testToType, castOp, conditional) \ + if (fromType && toType && fromElementType.isa() && toElementType.isa() && conditional) \ + { \ + mlir::Value castValue = rewriter.create(op.getLoc(), signlessFromValue, signlessToType); \ + if (toType.isIntOrIndex()) \ + { \ + rewriter.replaceOpWithNewOp(op, toType, castValue); \ + } \ + else \ + { \ + rewriter.replaceOp(op, { castValue }); \ + } \ + return success(); \ } #define CAST_FROM_TO_WITH_OP(testFromType, testToType, castOp) CAST_FROM_TO_WITH_OP_IF(testFromType, testToType, castOp, true); @@ -532,10 +549,17 @@ struct CastOpLowering : public OpRewritePattern auto fromType = op.source().getType(); auto toType = op.result().getType(); - assert(fromType.isIntOrIndexOrFloat() && "Can only cast from an int, index, or float type"); - assert(toType.isIntOrIndexOrFloat() && "Can only cast to an int, index, or float type"); + auto isFromTypeVector = fromType.isa(); + auto isToTypeVector = toType.isa(); + assert(isFromTypeVector == isToTypeVector && "Can only cast vectors to vectors or scalars to scalars"); + + auto fromElementType = util::GetElementType(fromType); + auto toElementType = util::GetElementType(toType); + + assert(fromElementType.isIntOrIndexOrFloat() && "Can only cast from an int, index, or float type"); + assert(toElementType.isIntOrIndexOrFloat() && "Can only cast to an int, index, or float type"); - if (fromType == toType) + if (fromElementType == toElementType) { // No casting needed rewriter.replaceOp(op, { op.source() }); @@ -545,42 +569,43 @@ struct CastOpLowering : public OpRewritePattern auto signlessFromValue = accera::ir::util::ToSignlessMLIRValue(rewriter, op.source()); auto signlessToType = accera::ir::util::ToSignlessMLIRType(rewriter, toType); - auto unsignedFromType = fromType.isUnsignedInteger(); - auto unsignedToType = toType.isUnsignedInteger(); + auto unsignedFromElementType = 
fromElementType.isUnsignedInteger(); + auto unsignedToElementType = toElementType.isUnsignedInteger(); // Integer casts - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::TruncIOp, (fromType.getIntOrFloatBitWidth() > toType.getIntOrFloatBitWidth())); - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::ExtSIOp, (fromType.getIntOrFloatBitWidth() < toType.getIntOrFloatBitWidth() && !unsignedToType)); - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::ExtUIOp, (fromType.getIntOrFloatBitWidth() < toType.getIntOrFloatBitWidth() && unsignedToType)); - if (fromType.isa() && toType.isa() && (fromType.getIntOrFloatBitWidth() == toType.getIntOrFloatBitWidth())) + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::TruncIOp, (fromElementType.getIntOrFloatBitWidth() > toElementType.getIntOrFloatBitWidth())); + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::ExtSIOp, (fromElementType.getIntOrFloatBitWidth() < toElementType.getIntOrFloatBitWidth() && !unsignedFromElementType)); + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::IntegerType, mlir::arith::ExtUIOp, (fromElementType.getIntOrFloatBitWidth() < toElementType.getIntOrFloatBitWidth() && unsignedFromElementType)); + if (fromElementType.isa() && toElementType.isa() && (fromElementType.getIntOrFloatBitWidth() == toElementType.getIntOrFloatBitWidth())) { - rewriter.replaceOpWithNewOp(op, toType, signlessFromValue); + rewriter.replaceOpWithNewOp(op, toElementType, signlessFromValue); return success(); } // Float casts - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::FloatType, mlir::arith::SIToFPOp, (!unsignedFromType)); - CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::FloatType, mlir::arith::UIToFPOp, (unsignedFromType)); + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::FloatType, mlir::arith::SIToFPOp, (!unsignedFromElementType)); + CAST_FROM_TO_WITH_OP_IF(mlir::IntegerType, mlir::FloatType, mlir::arith::UIToFPOp, (unsignedFromElementType)); - CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::IntegerType, mlir::arith::FPToSIOp, (!unsignedToType)); - CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::IntegerType, mlir::arith::FPToUIOp, (unsignedToType)); + CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::IntegerType, mlir::arith::FPToSIOp, (!unsignedToElementType)); + CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::IntegerType, mlir::arith::FPToUIOp, (unsignedToElementType)); - CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::FloatType, mlir::arith::TruncFOp, (fromType.getIntOrFloatBitWidth() > toType.getIntOrFloatBitWidth())); - CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::FloatType, mlir::arith::ExtFOp, (fromType.getIntOrFloatBitWidth() < toType.getIntOrFloatBitWidth())); + CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::FloatType, mlir::arith::TruncFOp, (fromElementType.getIntOrFloatBitWidth() > toElementType.getIntOrFloatBitWidth())); + CAST_FROM_TO_WITH_OP_IF(mlir::FloatType, mlir::FloatType, mlir::arith::ExtFOp, (fromElementType.getIntOrFloatBitWidth() < toElementType.getIntOrFloatBitWidth())); // Index casts CAST_FROM_TO_WITH_OP(mlir::IntegerType, mlir::IndexType, mlir::arith::IndexCastOp); CAST_FROM_TO_WITH_OP(mlir::IndexType, mlir::IntegerType, mlir::arith::IndexCastOp); - if (fromType.isa() && toType.isa()) + auto i64IntermediateType = accera::ir::util::CloneTypeWithNewElementType(op.source().getType(), rewriter.getI64Type()); + if (fromElementType.isa() && toElementType.isa()) { - auto int64Value = rewriter.create(loc, 
op.source(), rewriter.getI64Type()); // index->int64 - rewriter.replaceOpWithNewOp(op, int64Value, toType); // int64->fp + auto int64Value = rewriter.create(loc, op.source(), i64IntermediateType); // index->int64 + rewriter.replaceOpWithNewOp(op, int64Value, toElementType); // int64->fp return success(); } - if (fromType.isa() && toType.isa()) + if (fromElementType.isa() && toElementType.isa()) { - auto int64Value = rewriter.create(loc, op.source(), rewriter.getI64Type()); // fp->int64 - rewriter.replaceOpWithNewOp(op, int64Value, toType); // int64->index + auto int64Value = rewriter.create(loc, op.source(), i64IntermediateType); // fp->int64 + rewriter.replaceOpWithNewOp(op, int64Value, toElementType); // int64->index return success(); } @@ -948,7 +973,7 @@ struct ValueLaunchFuncOpRewritePattern : OpRewritePattern switch (target) { case vir::ExecutionTarget::CPU: - rewriter.replaceOpWithNewOp(op, callee, ArrayRef{}, ValueRange{ op.operands() }); + rewriter.replaceOpWithNewOp(op, callee, op.getResultTypes(), ValueRange{ op.operands() }); return success(); case vir::ExecutionTarget::GPU: auto gpuSymRef = SymbolRefAttr::get(rewriter.getContext(), callee.str() + "_module", SymbolRefAttr::get(callee)); @@ -1034,6 +1059,10 @@ LogicalResult BinOpLowering::matchAndRewrite( return rewriter.create(loc, ValueRange{ lhs, rhs }, rewriter.getNamedAttr("RelaxedPrecision", rewriter.getUnitAttr())); case BinaryOpPredicate::SUB: return rewriter.create(loc, ValueRange{ lhs, rhs }, rewriter.getNamedAttr("RelaxedPrecision", rewriter.getUnitAttr())); + case BinaryOpPredicate::MAX: + return rewriter.create(loc, ValueRange{ lhs, rhs }, rewriter.getNamedAttr("RelaxedPrecision", rewriter.getUnitAttr())); + case BinaryOpPredicate::MIN: + return rewriter.create(loc, ValueRange{ lhs, rhs }, rewriter.getNamedAttr("RelaxedPrecision", rewriter.getUnitAttr())); default: assert(false); return {}; @@ -1067,6 +1096,32 @@ LogicalResult BinOpLowering::matchAndRewrite( return rewriter.create(loc, lhs, rhs); case BinaryOpPredicate::LOGICAL_OR: return rewriter.create(loc, lhs, rhs); + case BinaryOpPredicate::MAX: + if (lhs == rhs) + { + return lhs; + } + if (elementType.isUnsignedInteger()) + { + return rewriter.create(loc, ValueRange{ lhs, rhs }); + } + else + { + return rewriter.create(loc, ValueRange{ lhs, rhs }); + } + case BinaryOpPredicate::MIN: + if (lhs == rhs) + { + return lhs; + } + if (elementType.isUnsignedInteger()) + { + return rewriter.create(loc, ValueRange{ lhs, rhs }); + } + else + { + return rewriter.create(loc, ValueRange{ lhs, rhs }); + } default: assert(false); return {}; diff --git a/accera/value/include/EmitterContext.h b/accera/value/include/EmitterContext.h index bbd6fc79..1b0846c7 100644 --- a/accera/value/include/EmitterContext.h +++ b/accera/value/include/EmitterContext.h @@ -47,6 +47,8 @@ namespace value None = 0, ThreadLocal = 1 << 0, Stack = 1 << 1, + Heap = 1 << 2, + Global = 1 << 3, }; ACCERA_DEFINE_ENUM_FLAG_OPERATORS(AllocateFlags); @@ -361,6 +363,8 @@ namespace value Scalar Cast(Scalar value, ValueType type); + Scalar Round(Scalar value); + bool IsImplicitlyCastable(ValueType source, ValueType target) const; Scalar Bitcast(Scalar value, ValueType type); @@ -496,6 +500,8 @@ namespace value virtual Scalar CastImpl(Scalar value, ValueType type) = 0; + virtual Scalar RoundImpl(Scalar value) = 0; + virtual bool IsImplicitlyCastableImpl(ValueType source, ValueType target) const = 0; virtual Scalar BitcastImpl(Scalar value, ValueType type) = 0; diff --git a/accera/value/include/FunctionDeclaration.h 
b/accera/value/include/FunctionDeclaration.h index 28b61860..a085b944 100644 --- a/accera/value/include/FunctionDeclaration.h +++ b/accera/value/include/FunctionDeclaration.h @@ -72,6 +72,10 @@ namespace value /// A FunctionInlining value specifying whether this function should be inlined or not FunctionDeclaration& Inlined(FunctionInlining shouldInline = FunctionInlining::always); + /// Sets whether other functions should be inlined into this function + /// A FunctionInlining value specifying whether this function should be inlined or not + FunctionDeclaration& InlineInto(FunctionInlining shouldInlineInto = FunctionInlining::always); + /// Sets the execution target for this function /// A ExecutionTarget value specifying where this function should execute FunctionDeclaration& Target(ExecutionTarget target); @@ -186,6 +190,9 @@ namespace value /// Returns true if the instance is inlined [[nodiscard]] FunctionInlining InlineState() const; + /// Returns true if the instance can be inlined into + [[nodiscard]] FunctionInlining InlineIntoState() const; + [[nodiscard]] ExecutionTarget Target() const { return _execTarget; } [[nodiscard]] ExecutionRuntime Runtime() const { return _execRuntime; } @@ -240,6 +247,7 @@ namespace value ExecutionTarget _execTarget; ExecutionRuntime _execRuntime = ExecutionRuntime::DEFAULT; FunctionInlining _inlineState = FunctionInlining::defaultInline; + FunctionInlining _inlineIntoState = FunctionInlining::defaultInline; bool _isDecorated = true; bool _isPublic = false; bool _isEmpty = true; diff --git a/accera/value/include/MLIREmitterContext.h b/accera/value/include/MLIREmitterContext.h index fc739cb8..700f7723 100644 --- a/accera/value/include/MLIREmitterContext.h +++ b/accera/value/include/MLIREmitterContext.h @@ -176,6 +176,8 @@ namespace value Scalar CastImpl(Scalar value, ValueType type) override; + Scalar RoundImpl(Scalar value) override; + bool IsImplicitlyCastableImpl(ValueType source, ValueType target) const override; Scalar BitcastImpl(Scalar value, ValueType type) override; diff --git a/accera/value/include/Plan.h b/accera/value/include/Plan.h index a821ab35..93d82342 100644 --- a/accera/value/include/Plan.h +++ b/accera/value/include/Plan.h @@ -179,6 +179,8 @@ namespace value /// The policy used to schedule work across the threads. 
void Parallelize(std::vector indices, int64_t numThreads, ParallelizationPolicy policy); + void _EraseLoop(const value::ScalarIndex& index); + private: friend class Schedule; Plan(Schedule& sched, ExecutionRuntime execRuntime = ExecutionRuntime::DEFAULT); diff --git a/accera/value/include/ScalarOperations.h b/accera/value/include/ScalarOperations.h index ed1bfccd..e9607fa3 100644 --- a/accera/value/include/ScalarOperations.h +++ b/accera/value/include/ScalarOperations.h @@ -52,7 +52,8 @@ namespace value Scalar Tanh(Scalar s); Scalar Square(Scalar s); - Scalar Round(Scalar s); // Note: not implemented + Scalar Round(Scalar s); + Scalar Remainderf(Scalar numer, Scalar denom); Scalar Floor(Scalar s); Scalar Ceil(Scalar s); Scalar CopySign(Scalar s1, Scalar s2); // Note: not implemented diff --git a/accera/value/include/ValueType.h b/accera/value/include/ValueType.h index 247a4814..cd7eb0b3 100644 --- a/accera/value/include/ValueType.h +++ b/accera/value/include/ValueType.h @@ -87,8 +87,14 @@ namespace value divide, /// Remainder operation modulus, + /// Logical AND operation logicalAnd, - logicalOr + /// Logical OR operation + logicalOr, + /// Max operation + max, + /// Min operation + min }; enum class ValueLogicalOperation diff --git a/accera/value/src/EmitterContext.cpp b/accera/value/src/EmitterContext.cpp index 8e56ce2f..460b14d2 100644 --- a/accera/value/src/EmitterContext.cpp +++ b/accera/value/src/EmitterContext.cpp @@ -265,6 +265,11 @@ namespace value return CastImpl(value, type); } + Scalar EmitterContext::Round(Scalar value) + { + return RoundImpl(value); + } + bool EmitterContext::IsImplicitlyCastable(ValueType source, ValueType target) const { return IsImplicitlyCastableImpl(source, target); diff --git a/accera/value/src/FunctionDeclaration.cpp b/accera/value/src/FunctionDeclaration.cpp index d7c504e0..571545b8 100644 --- a/accera/value/src/FunctionDeclaration.cpp +++ b/accera/value/src/FunctionDeclaration.cpp @@ -114,6 +114,14 @@ namespace value return *this; } + FunctionDeclaration& FunctionDeclaration::InlineInto(FunctionInlining shouldInlineInto) + { + CheckNonEmpty(); + + _inlineIntoState = shouldInlineInto; + return *this; + } + FunctionDeclaration& FunctionDeclaration::Target(ExecutionTarget target) { CheckNonEmpty(); @@ -303,6 +311,12 @@ namespace value return _inlineState; } + FunctionInlining FunctionDeclaration::InlineIntoState() const + { + CheckNonEmpty(); + return _inlineIntoState; + } + void FunctionDeclaration::CheckNonEmpty() const { if (_isEmpty) diff --git a/accera/value/src/MLIREmitterContext.cpp b/accera/value/src/MLIREmitterContext.cpp index 9914ed06..893e104d 100644 --- a/accera/value/src/MLIREmitterContext.cpp +++ b/accera/value/src/MLIREmitterContext.cpp @@ -150,6 +150,9 @@ mlir::MemRefType MemoryLayoutToMemRefType(mlir::OpBuilder& builder, const Memory // strided maps and memory spaces are not supported for variable-sized layouts auto type = layout.IsVariableSized() ? mlir::MemRefType::get(size, mlirElemType) : mlir::MemRefType::get(size, mlirElemType, stridedMap, (unsigned)layout.GetMemorySpace()); + // Canonicalize and simplify the memref map + type = mlir::canonicalizeStridedLayout(type); + // represent pointers as memrefs of memrefs (memrefs start at pointer level 1) return (pointerLevel > 1) ? 
mlir::MemRefType::get(MemRefPointerShape, type) : type; } @@ -942,6 +945,27 @@ GPUIndex MLIRContext::GetGPUIndex() } } +static accera::ir::value::MemoryAllocType AllocateFlagToAllocateType(accera::value::AllocateFlags flags) +{ +#define MAP_FLAGS(fromFlag, toFlag) \ + case accera::value::AllocateFlags::fromFlag: \ + return accera::ir::value::MemoryAllocType::toFlag + + switch (flags) + { + MAP_FLAGS(None, Global); + MAP_FLAGS(Global, Global); + MAP_FLAGS(Stack, Stack); + MAP_FLAGS(Heap, Heap); + // MAP_FLAGS(ThreadLocal, ThreadLocal); // Not implemented + default: + assert(false); + } + +#undef MAP_PREDICATE +} + + Value MLIRContext::AllocateImpl(ValueType valueType, MemoryLayout layout, size_t alignment, AllocateFlags flags, const std::vector& runtimeSizes) { auto& b = _impl->builder; @@ -975,6 +999,7 @@ Value MLIRContext::AllocateImpl(ValueType valueType, MemoryLayout layout, size_t std::transform(runtimeSizes.cbegin(), runtimeSizes.cend(), std::back_inserter(sizes), [](ScalarDimension d) { return Unwrap(d); }); mlir::Value result; + if (layout.IsVariableSized()) { result = b.create(loc, @@ -982,9 +1007,7 @@ Value MLIRContext::AllocateImpl(ValueType valueType, MemoryLayout layout, size_t alignment ? llvm::Optional{ static_cast(alignment) } : llvm::None, - static_cast(flags & AllocateFlags::Stack) - ? llvm::Optional{ accera::ir::value::MemoryAllocType::Stack } - : llvm::None, + AllocateFlagToAllocateType(flags), mlir::ValueRange{ sizes}); } else @@ -994,9 +1017,7 @@ Value MLIRContext::AllocateImpl(ValueType valueType, MemoryLayout layout, size_t alignment ? llvm::Optional{ static_cast(alignment) } : llvm::None, - static_cast(flags & AllocateFlags::Stack) - ? llvm::Optional{ accera::ir::value::MemoryAllocType::Stack } - : llvm::None); + AllocateFlagToAllocateType(flags)); } EmittableInfo& emittableInfo = StoreLocalEmittable({ result.getAsOpaquePointer(), { valueType, 1 } }); @@ -1146,6 +1167,10 @@ EmitterContext::DefinedFunction MLIRContext::CreateFunctionImpl(FunctionDeclarat { fnOp->setAttr(ir::NoInlineAttrName, b.getUnitAttr()); } + if (decl.InlineIntoState() == FunctionInlining::never) + { + fnOp->setAttr(ir::NoInlineIntoAttrName, b.getUnitAttr()); + } // Set dynamic arg size references. This is a vector>, where each entry is either a reference to another // argument's position or is -1. 
The outer vector has one entry per function argument, and each inner vector has one @@ -2008,22 +2033,22 @@ namespace auto Convert(ValueBinaryOperation op) { using namespace accera::ir::value; + +#define MAP_BIN_OP(fromEnum, toEnum) \ + case ValueBinaryOperation::fromEnum: \ + return BinaryOpPredicate::toEnum + switch (op) { - case ValueBinaryOperation::add: - return BinaryOpPredicate::ADD; - case ValueBinaryOperation::divide: - return BinaryOpPredicate::DIV; - case ValueBinaryOperation::logicalAnd: - return BinaryOpPredicate::LOGICAL_AND; - case ValueBinaryOperation::logicalOr: - return BinaryOpPredicate::LOGICAL_OR; - case ValueBinaryOperation::modulus: - return BinaryOpPredicate::MOD; - case ValueBinaryOperation::multiply: - return BinaryOpPredicate::MUL; - case ValueBinaryOperation::subtract: - return BinaryOpPredicate::SUB; + MAP_BIN_OP(add, ADD); + MAP_BIN_OP(subtract, SUB); + MAP_BIN_OP(multiply, MUL); + MAP_BIN_OP(divide, DIV); + MAP_BIN_OP(modulus, MOD); + MAP_BIN_OP(logicalAnd, LOGICAL_AND); + MAP_BIN_OP(logicalOr, LOGICAL_OR); + MAP_BIN_OP(max, MAX); + MAP_BIN_OP(min, MIN); } llvm_unreachable("Unknown binary operation"); } @@ -2221,6 +2246,20 @@ Scalar MLIRContext::BitcastImpl(Scalar value, ValueType type) throw utilities::InputException(utilities::InputExceptionErrors::invalidArgument, "Can only bitcast between types of the same size"); } +Scalar MLIRContext::RoundImpl(Scalar value) +{ + auto& builder = _impl->builder; + mlir::Value mlirValue = ResolveMLIRScalar(builder, ToMLIRValue(builder, value)); + auto loc = mlirValue.getLoc(); + + auto floatType = mlirValue.getType(); + auto width = floatType.getIntOrFloatBitWidth(); + auto intType = builder.getIntegerType(width); + + mlir::Value roundedVal = builder.create(loc, intType, mlirValue); + return Scalar(Wrap(roundedVal)); +} + namespace { mlir::ValueRange CascadingConditionBuilder( diff --git a/accera/value/src/Plan.cpp b/accera/value/src/Plan.cpp index 2fac6557..f83f8012 100644 --- a/accera/value/src/Plan.cpp +++ b/accera/value/src/Plan.cpp @@ -278,6 +278,14 @@ namespace value } } + void _EraseLoop(const value::ScalarIndex& scalarIndex) + { + auto builder = GetBuilder(); + auto symbolicIndexOp = GetIndexOp(scalarIndex); + auto index = symbolicIndexOp.getValue(); + _scheduleOp.addLoopAttribute(index, builder.getStringAttr("_erase"), builder.getUnitAttr()); + } + private: mlir::OpBuilder& GetBuilder() { @@ -408,6 +416,11 @@ namespace value _impl->Parallelize(indices, numThreads, policy); } + void Plan::_EraseLoop(const value::ScalarIndex& index) + { + _impl->_EraseLoop(index); + } + // // GPUPlan impl // diff --git a/accera/value/src/ScalarOperations.cpp b/accera/value/src/ScalarOperations.cpp index ea4f141e..a3e21220 100644 --- a/accera/value/src/ScalarOperations.cpp +++ b/accera/value/src/ScalarOperations.cpp @@ -10,6 +10,8 @@ #include "Scalar.h" #include "ValueType.h" +#include "ir/include/value/ValueDialect.h" + #include #include #include @@ -158,6 +160,24 @@ namespace value } } + Scalar Round(Scalar s) + { + return GetContext().Round(s); + } + + Scalar Remainderf(Scalar numer, Scalar denom) + { + static auto remainderfFunction = [&]() { + FunctionDeclaration remainderfDecl("remainderf"); + remainderfDecl.External(true) + .Decorated(false) + .Parameters(Value(ValueType::Float, ScalarLayout), Value(ValueType::Float, ScalarLayout)) + .Returns(Value(ValueType::Float, ScalarLayout)); + return GetContext().DeclareExternalFunction(remainderfDecl); + }(); + return Scalar(*remainderfFunction(std::vector{Wrap(UnwrapScalar(numer)), 
Wrap(UnwrapScalar(denom))})); // TODO : fix this Wrap(Unwrap(...)) pattern... it's currently needed to invoke GetElement on a sliced array + } + Scalar Ceil(Scalar s) { return ScalarOpBuilder(s); @@ -200,16 +220,12 @@ namespace value Scalar Max(Scalar s1, Scalar s2) { - std::tie(s1, s2) = Scalar::MakeTypeCompatible(s1, s2); - - return Select(s1 > s2, s1, s2); + return GetContext().BinaryOperation(ValueBinaryOperation::max, s1.GetValue(), s2.GetValue()); } Scalar Min(Scalar s1, Scalar s2) { - std::tie(s1, s2) = Scalar::MakeTypeCompatible(s1, s2); - - return Select(s1 < s2, s1, s2); + return GetContext().BinaryOperation(ValueBinaryOperation::min, s1.GetValue(), s2.GetValue()); } Scalar Clamp(Scalar s, Scalar min, Scalar max) diff --git a/docs/.bumpversion.cfg b/docs/.bumpversion.cfg index b2e58ff6..6c7b9a65 100644 --- a/docs/.bumpversion.cfg +++ b/docs/.bumpversion.cfg @@ -1,5 +1,5 @@ [bumpversion] -current_version = 1.2.12 +current_version = 1.2.13 [bumpversion:glob:**/*.md] search = Version: v{current_version} diff --git a/docs/Case Studies/CONTRIBUTING.md b/docs/Case Studies/CONTRIBUTING.md index 97f42493..29cc7480 100644 --- a/docs/Case Studies/CONTRIBUTING.md +++ b/docs/Case Studies/CONTRIBUTING.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Contributing Guide diff --git a/docs/Case Studies/README.md b/docs/Case Studies/README.md index 3add9995..a2dcce62 100644 --- a/docs/Case Studies/README.md +++ b/docs/Case Studies/README.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Accera Case Studies diff --git a/docs/Install/Building_on_MacOS.md b/docs/Install/Building_on_MacOS.md index d2daaadc..48c5cd91 100644 --- a/docs/Install/Building_on_MacOS.md +++ b/docs/Install/Building_on_MacOS.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on MacOS diff --git a/docs/Install/Building_on_Ubuntu.md b/docs/Install/Building_on_Ubuntu.md index 27c89a13..6b90f3c5 100644 --- a/docs/Install/Building_on_Ubuntu.md +++ b/docs/Install/Building_on_Ubuntu.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on Ubuntu diff --git a/docs/Install/Building_on_Windows.md b/docs/Install/Building_on_Windows.md index 56684d3b..3586c4a4 100644 --- a/docs/Install/Building_on_Windows.md +++ b/docs/Install/Building_on_Windows.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on Windows diff --git a/docs/Install/Installing_Accera_on_MacOS.md b/docs/Install/Installing_Accera_on_MacOS.md index 2dff0957..4c2d700d 100644 --- a/docs/Install/Installing_Accera_on_MacOS.md +++ b/docs/Install/Installing_Accera_on_MacOS.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on MacOS diff --git a/docs/Install/Installing_Accera_on_Ubuntu.md b/docs/Install/Installing_Accera_on_Ubuntu.md index 77654ada..47b042ea 100644 --- a/docs/Install/Installing_Accera_on_Ubuntu.md +++ b/docs/Install/Installing_Accera_on_Ubuntu.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on Ubuntu diff --git a/docs/Install/Installing_Accera_on_Windows.md b/docs/Install/Installing_Accera_on_Windows.md index bcc86673..4e69af93 100644 --- a/docs/Install/Installing_Accera_on_Windows.md +++ b/docs/Install/Installing_Accera_on_Windows.md @@ -1,5 +1,5 
@@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Installing on Windows diff --git a/docs/Install/README.md b/docs/Install/README.md index d4df0e82..ffa790cd 100644 --- a/docs/Install/README.md +++ b/docs/Install/README.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Install from PyPI The quickest way to get up and running is to install the pre-built Python packages: diff --git a/docs/Manual/00 Introduction.md b/docs/Manual/00 Introduction.md index 279b72af..d7826f8b 100644 --- a/docs/Manual/00 Introduction.md +++ b/docs/Manual/00 Introduction.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Introduction Accera is a framework with a Python-based Domain-specific Language (eDSL) that produces optimized compute-intensive code. Accera's primary focus is the optimization of affine and semi-affine nested for-loops for CPU and GPU targets. diff --git a/docs/Manual/01 Arrays and Scalars.md b/docs/Manual/01 Arrays and Scalars.md index b23d4ff6..2273acb7 100644 --- a/docs/Manual/01 Arrays and Scalars.md +++ b/docs/Manual/01 Arrays and Scalars.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 1: Arrays and Scalars diff --git a/docs/Manual/02 Simple Affine Loop Nests.md b/docs/Manual/02 Simple Affine Loop Nests.md index f07dff2b..8e8ae66e 100644 --- a/docs/Manual/02 Simple Affine Loop Nests.md +++ b/docs/Manual/02 Simple Affine Loop Nests.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 2: Simple affine loop nests This section introduces *loop nests* and their different types that are provided in Accera programming model. diff --git a/docs/Manual/03 Schedules.md b/docs/Manual/03 Schedules.md index dddbc7c7..3d34f204 100644 --- a/docs/Manual/03 Schedules.md +++ b/docs/Manual/03 Schedules.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 3: Schedules We begin with `nest` from [Section 2](<02%20Simple%20Affine%20Loop%20Nests.md>) which captures the logic of matrix-matrix multiplication. We use `nest` to create a `Schedule` that controls the execution order of the nest's iterations. Schedules are target-independent in the sense that the same schedule can be used to emit code for multiple target platforms. diff --git a/docs/Manual/04 Fusing.md b/docs/Manual/04 Fusing.md index 65fc9363..c193b768 100644 --- a/docs/Manual/04 Fusing.md +++ b/docs/Manual/04 Fusing.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 4: Fusing With `fuse` operation, multiple schedules can be combined into a single schedule representing the union of the work in the original schedules. These fused schedules can be transformed by any of the transformations presented in [Section 3](<03%20Schedules.md>). diff --git a/docs/Manual/05 Targets.md b/docs/Manual/05 Targets.md index 39bf5630..e61be97b 100644 --- a/docs/Manual/05 Targets.md +++ b/docs/Manual/05 Targets.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 5: Targets Accera is a cross compiler, which means that it can generate executable code for different target platforms. A target is described using the `Target` class. 
Accera already supports many different targets, for example: diff --git a/docs/Manual/06 Plans - Caching.md b/docs/Manual/06 Plans - Caching.md index 8d77d0b2..b0032159 100644 --- a/docs/Manual/06 Plans - Caching.md +++ b/docs/Manual/06 Plans - Caching.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 6: Plans - Caching In the previous sections, we defined the logic and then scheduled its iterations. Now, let's move on to completing the implementation with target-specific options. diff --git a/docs/Manual/07 Plans - Operations and Optimizations.md b/docs/Manual/07 Plans - Operations and Optimizations.md index 51eb82de..53f0f7a7 100644 --- a/docs/Manual/07 Plans - Operations and Optimizations.md +++ b/docs/Manual/07 Plans - Operations and Optimizations.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 7: Plans - Operations and Optimizations We can control target-specific operations and optimizations using a plan. Examples include instruction pipelining, applying SIMD vector instructions, and so on. diff --git a/docs/Manual/08 Deferred Layout of Constant Arrays.md b/docs/Manual/08 Deferred Layout of Constant Arrays.md index ce621bcb..1f050de0 100644 --- a/docs/Manual/08 Deferred Layout of Constant Arrays.md +++ b/docs/Manual/08 Deferred Layout of Constant Arrays.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 8: Deferred layout of constant arrays Let's revisit the memory layout of constant arrays. As explained in [Section 1](<01%20Arrays%20and%20Scalars.md>), the contents of constant arrays are known at compile-time, and these contents are immutable. Accera stores constant arrays in a non-standard memory layout optimized for a particular plan. In some cases, storing multiple copies of each array element may even prove advantageous (e.g., storing a matrix in row-major and column-major layouts). diff --git a/docs/Manual/09 Parameters.md b/docs/Manual/09 Parameters.md index 22325ea1..156eb632 100644 --- a/docs/Manual/09 Parameters.md +++ b/docs/Manual/09 Parameters.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 9: Parameters diff --git a/docs/Manual/10 Packages.md b/docs/Manual/10 Packages.md index be48c0c3..09cab47b 100644 --- a/docs/Manual/10 Packages.md +++ b/docs/Manual/10 Packages.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Section 10: Building Packages The `Package` class represents a collection of Accera-generated functions. Whenever a package is built, it creates a stand-alone function library that other pieces of software can use. Currently, Accera supports two package formats: HAT and MLIR. 
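For context on the `Remainderf` and `Round` scalar ops introduced in the C++ hunks above: `remainderf` follows the IEEE remainder convention (nearest-integer quotient) rather than `fmod`-style truncation, and round-to-nearest typically resolves ties to even. The patch itself does not pin down the rounding mode of the new round op, so the sketch below only illustrates the corresponding standard-library semantics in Python:

```python
import math

# IEEE remainder rounds the quotient to the nearest integer, so the result can
# be negative even for positive operands; fmod truncates the quotient instead.
print(math.remainder(5.5, 2.0))  # -0.5  (5.5 - 3 * 2.0)
print(math.fmod(5.5, 2.0))       #  1.5  (5.5 - 2 * 2.0)

# Round-to-nearest with ties-to-even, versus plain truncation toward zero.
print(round(2.5), round(3.5))    # 2 4
print(int(2.5))                  # 2
```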
diff --git a/docs/Manual/README.md b/docs/Manual/README.md index 8bbe69fc..88ba69a2 100644 --- a/docs/Manual/README.md +++ b/docs/Manual/README.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Accera v1.2.1 Manual diff --git a/docs/Reference/accera.md b/docs/Reference/accera.md index 42bc0c84..59673bea 100644 --- a/docs/Reference/accera.md +++ b/docs/Reference/accera.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference # Module functions * [`accera.cast`](functions/cast.md) `(value, type)` diff --git a/docs/Reference/classes/Array/Array.md b/docs/Reference/classes/Array/Array.md index de9fe5b1..b8eedeed 100644 --- a/docs/Reference/classes/Array/Array.md +++ b/docs/Reference/classes/Array/Array.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array(role[, data, element_type, layout, offset, shape])` Constructs an array. diff --git a/docs/Reference/classes/Array/Layout.md b/docs/Reference/classes/Array/Layout.md index adea1bea..47182bf9 100644 --- a/docs/Reference/classes/Array/Layout.md +++ b/docs/Reference/classes/Array/Layout.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array.Layout` type | description diff --git a/docs/Reference/classes/Array/Role.md b/docs/Reference/classes/Array/Role.md index 0bbe145d..0d654abb 100644 --- a/docs/Reference/classes/Array/Role.md +++ b/docs/Reference/classes/Array/Role.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array.Role` type | description diff --git a/docs/Reference/classes/Array/deferred_layout.md b/docs/Reference/classes/Array/deferred_layout.md index 20395107..e831d3d1 100644 --- a/docs/Reference/classes/Array/deferred_layout.md +++ b/docs/Reference/classes/Array/deferred_layout.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array.deferred_layout(cache)` Specifies the layout for a `Array.Role.CONST` array based on a `Cache`. For more details, see [Deferred layout of constant arrays](<../../../Manual/08%20Deferred%20Layout%20of%20Constant%20Arrays.md>) diff --git a/docs/Reference/classes/Array/sub_array.md b/docs/Reference/classes/Array/sub_array.md index 9b75a4d6..73f40fe3 100644 --- a/docs/Reference/classes/Array/sub_array.md +++ b/docs/Reference/classes/Array/sub_array.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Array.sub_array(offsets, shape[, strides])` Creates a sub-array of a specific shape from an array. The sub-array is created from elements at specified offsets and strides into the original array. 
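A small illustration of the `accera.Array` roles referenced in the hunks above; the shapes, the NumPy initializer, and the variable names are illustrative only:

```python
import accera as acc
import numpy as np

# A compile-time constant array (contents fixed when the package is built)
# and a runtime input/output buffer of the same shape.
weights = acc.Array(role=acc.Array.Role.CONST,
                    data=np.ones((8, 8), dtype=np.float32))
buffer = acc.Array(role=acc.Array.Role.INPUT_OUTPUT,
                   element_type=acc.ScalarType.float32,
                   shape=(8, 8))
```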
diff --git a/docs/Reference/classes/Dimension/Dimension.md b/docs/Reference/classes/Dimension/Dimension.md index 28e878f8..1b98fbdd 100644 --- a/docs/Reference/classes/Dimension/Dimension.md +++ b/docs/Reference/classes/Dimension/Dimension.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Dimension([role, value])` Constructs a runtime dimension size with optional initialization. diff --git a/docs/Reference/classes/Dimension/Role.md b/docs/Reference/classes/Dimension/Role.md index 62a447b1..7f0bdc85 100644 --- a/docs/Reference/classes/Dimension/Role.md +++ b/docs/Reference/classes/Dimension/Role.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Dimension.Role` type | description diff --git a/docs/Reference/classes/Nest/Nest.md b/docs/Reference/classes/Nest/Nest.md index 3508a587..89c0ea1d 100644 --- a/docs/Reference/classes/Nest/Nest.md +++ b/docs/Reference/classes/Nest/Nest.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest(shape)` Creates an affine loop nest. diff --git a/docs/Reference/classes/Nest/create_plan.md b/docs/Reference/classes/Nest/create_plan.md index 87529ae5..a04c0931 100644 --- a/docs/Reference/classes/Nest/create_plan.md +++ b/docs/Reference/classes/Nest/create_plan.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest.create_plan([target])` Creates a plan using the default schedule for the nest. diff --git a/docs/Reference/classes/Nest/create_schedule.md b/docs/Reference/classes/Nest/create_schedule.md index e73cd2b8..0eb71a7c 100644 --- a/docs/Reference/classes/Nest/create_schedule.md +++ b/docs/Reference/classes/Nest/create_schedule.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest.create_schedule()` Create a default schedule for a nest. diff --git a/docs/Reference/classes/Nest/get_indices.md b/docs/Reference/classes/Nest/get_indices.md index bc884f72..0ef8b2dc 100644 --- a/docs/Reference/classes/Nest/get_indices.md +++ b/docs/Reference/classes/Nest/get_indices.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest.get_indices()` Gets the iteration space dimensions for a nest. diff --git a/docs/Reference/classes/Nest/iteration_logic.md b/docs/Reference/classes/Nest/iteration_logic.md index 32a9b184..b318e05c 100644 --- a/docs/Reference/classes/Nest/iteration_logic.md +++ b/docs/Reference/classes/Nest/iteration_logic.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.120) +[//]: # (Version: v1.2.130) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Nest.iteration_logic(logic)` Adds an iteration logic function to a `Nest`. 
diff --git a/docs/Reference/classes/Package/Format.md b/docs/Reference/classes/Package/Format.md index 008fd9cf..4ffd1ba7 100644 --- a/docs/Reference/classes/Package/Format.md +++ b/docs/Reference/classes/Package/Format.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.Format` type | description diff --git a/docs/Reference/classes/Package/Mode.md b/docs/Reference/classes/Package/Mode.md index 9845cab3..8c5aa194 100644 --- a/docs/Reference/classes/Package/Mode.md +++ b/docs/Reference/classes/Package/Mode.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.Mode` type | description diff --git a/docs/Reference/classes/Package/Package.md b/docs/Reference/classes/Package/Package.md index 2dd36a3a..cde07921 100644 --- a/docs/Reference/classes/Package/Package.md +++ b/docs/Reference/classes/Package/Package.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.Package()` A package of functions that can be built and linked with client code. diff --git a/docs/Reference/classes/Package/Platform.md b/docs/Reference/classes/Package/Platform.md index 0dd6664f..8680cb0f 100644 --- a/docs/Reference/classes/Package/Platform.md +++ b/docs/Reference/classes/Package/Platform.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.Platform` type | description diff --git a/docs/Reference/classes/Package/add.md b/docs/Reference/classes/Package/add.md index 1c65a6d1..557c9b7a 100644 --- a/docs/Reference/classes/Package/add.md +++ b/docs/Reference/classes/Package/add.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.add(source, args[, base_name, parameters])` Adds one or more functions to the package. diff --git a/docs/Reference/classes/Package/add_description.md b/docs/Reference/classes/Package/add_description.md index b252f41d..7af2aea8 100644 --- a/docs/Reference/classes/Package/add_description.md +++ b/docs/Reference/classes/Package/add_description.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.add_description([author, license, other, version])` Adds descriptive metadata to the HAT package. diff --git a/docs/Reference/classes/Package/build.md b/docs/Reference/classes/Package/build.md index 69d41b7f..441bfd7b 100644 --- a/docs/Reference/classes/Package/build.md +++ b/docs/Reference/classes/Package/build.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Package.build(name[, format, mode, platform, tolerance, output_dir])` Builds a HAT package. 
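Pulling the `Nest`, `Array`, and `Package` pieces documented above into one place, here is a minimal hedged sketch of defining a function and building a HAT package (names and shapes are made up; the default build format is used):

```python
import accera as acc

A = acc.Array(role=acc.Array.Role.INPUT,
              element_type=acc.ScalarType.float32, shape=(16, 16))
B = acc.Array(role=acc.Array.Role.INPUT_OUTPUT,
              element_type=acc.ScalarType.float32, shape=(16, 16))

# A 16x16 loop nest that accumulates A into B.
nest = acc.Nest(shape=(16, 16))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    B[i, j] += A[i, j]

package = acc.Package()
package.add(nest, args=(A, B), base_name="accumulate_16x16")
package.build(name="hello_accera")
```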
diff --git a/docs/Reference/classes/Plan/bind.md b/docs/Reference/classes/Plan/bind.md index dd78611f..b6198a27 100644 --- a/docs/Reference/classes/Plan/bind.md +++ b/docs/Reference/classes/Plan/bind.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.bind(mapping)` Only available for targets that can execute a grid of work (such as GPUs). The `bind` function binds dimensions of the iteration space to axes of the target-specific grid (such as `v100.GridUnit.BLOCK_X`, `v100.GridUnit.THREAD_X` or `v100.GridUnit.WARP_X` on an Nvidia GPU). diff --git a/docs/Reference/classes/Plan/cache.md b/docs/Reference/classes/Plan/cache.md index 4c64ebd3..a370814f 100644 --- a/docs/Reference/classes/Plan/cache.md +++ b/docs/Reference/classes/Plan/cache.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.cache(source[, index, trigger_index, layout, level, trigger_level, max_elements, element_type, strategy, thrifty, location, double_buffer, double_buffer_location, vectorize])` Adds a caching strategy to a plan. diff --git a/docs/Reference/classes/Plan/kernelize.md b/docs/Reference/classes/Plan/kernelize.md index 611677f6..9cff8032 100644 --- a/docs/Reference/classes/Plan/kernelize.md +++ b/docs/Reference/classes/Plan/kernelize.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.kernelize(unroll_indices[, vectorize_indices])` A convenience method for a sequence of `unroll` instructions followed by a possible sequence of `vectorize` instructions. diff --git a/docs/Reference/classes/Plan/parallelize.md b/docs/Reference/classes/Plan/parallelize.md index 38cf7b02..16f3d076 100644 --- a/docs/Reference/classes/Plan/parallelize.md +++ b/docs/Reference/classes/Plan/parallelize.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.parallelize(indices[, pin, policy, max_threads])` diff --git a/docs/Reference/classes/Plan/tensorize.md b/docs/Reference/classes/Plan/tensorize.md index 84abac99..3109ce1b 100644 --- a/docs/Reference/classes/Plan/tensorize.md +++ b/docs/Reference/classes/Plan/tensorize.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.tensorize(indices, mma_shape [, use_static_offsets, num_total_passes, num_fused_passes, scheduling_policy])` Only available for targets with native matrix multiplication instruction (tensor core) support. Marks the dimensions of the iteration-space for tensorization. Only perfectly nested loops of the following form can be tensorized: diff --git a/docs/Reference/classes/Plan/unroll.md b/docs/Reference/classes/Plan/unroll.md index 921d375f..880e2e16 100644 --- a/docs/Reference/classes/Plan/unroll.md +++ b/docs/Reference/classes/Plan/unroll.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.unroll(index)` Marks a dimension of the iteration-space for unrolling. 
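Continuing the hypothetical sketch above, the `cache` and `parallelize` plan operations documented in these hunks might be applied like this (the split size and the choice to cache `A` are arbitrary):

```python
# Assumes acc, A, nest, and the indices i, j from the previous sketch.
schedule = nest.create_schedule()
ii = schedule.split(i, 64)      # work on blocks of 64 rows

plan = schedule.create_plan()
plan.cache(A, index=ii)         # cache the portion of A used inside the ii loop
plan.parallelize(indices=i)     # distribute the outer row loop across threads
```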
diff --git a/docs/Reference/classes/Plan/vectorize.md b/docs/Reference/classes/Plan/vectorize.md index db9c3c25..50875e43 100644 --- a/docs/Reference/classes/Plan/vectorize.md +++ b/docs/Reference/classes/Plan/vectorize.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Plan.vectorize(index)` Only available for targets that have SIMD registers and support vector instructions. Marks a dimension of the iteration-space for vectorization. diff --git a/docs/Reference/classes/Scalar/Scalar.md b/docs/Reference/classes/Scalar/Scalar.md index 757237d0..78c4e6e5 100644 --- a/docs/Reference/classes/Scalar/Scalar.md +++ b/docs/Reference/classes/Scalar/Scalar.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Scalar([element_type, value])` Constructs a scalar that holds a number. diff --git a/docs/Reference/classes/Schedule/create_plan.md b/docs/Reference/classes/Schedule/create_plan.md index 47fe7f20..3c6ec2f8 100644 --- a/docs/Reference/classes/Schedule/create_plan.md +++ b/docs/Reference/classes/Schedule/create_plan.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.create_plan([target])` Creates a plan for running this schedule. diff --git a/docs/Reference/classes/Schedule/is_valid_loop_order.md b/docs/Reference/classes/Schedule/is_valid_loop_order.md index 303188b6..dccd8dcb 100644 --- a/docs/Reference/classes/Schedule/is_valid_loop_order.md +++ b/docs/Reference/classes/Schedule/is_valid_loop_order.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.is_valid_loop_order(*order)` The `is_valid_loop_order` function determines if an order of indices is valid. For a description of valid schedule orders, refer to [reorder](reorder.md). diff --git a/docs/Reference/classes/Schedule/pad.md b/docs/Reference/classes/Schedule/pad.md index 642b40b0..cadbdb6e 100644 --- a/docs/Reference/classes/Schedule/pad.md +++ b/docs/Reference/classes/Schedule/pad.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.pad(index, size)` Pads the beginning of a specified dimension of the iteration-space with empty (no-op) elements. diff --git a/docs/Reference/classes/Schedule/reorder.md b/docs/Reference/classes/Schedule/reorder.md index ffec70a8..14682e79 100644 --- a/docs/Reference/classes/Schedule/reorder.md +++ b/docs/Reference/classes/Schedule/reorder.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.reorder(order, *args)` The `reorder` transformation sets the order of the indices in the schedule. 
diff --git a/docs/Reference/classes/Schedule/skew.md b/docs/Reference/classes/Schedule/skew.md index 0a8f2065..8916b7f0 100644 --- a/docs/Reference/classes/Schedule/skew.md +++ b/docs/Reference/classes/Schedule/skew.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.skew(index, reference_index [, unroll_loops_smaller_than])` Transforms a dimension with respect to a reference dimension into a parallelogram by padding with empty elements. diff --git a/docs/Reference/classes/Schedule/split.md b/docs/Reference/classes/Schedule/split.md index 6d2afd32..67db50cb 100644 --- a/docs/Reference/classes/Schedule/split.md +++ b/docs/Reference/classes/Schedule/split.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.split(index, size)` The `split` transformation takes a dimension `i` and a `size`, modifies `i`, and creates a new dimension `ii`. diff --git a/docs/Reference/classes/Schedule/tile.md b/docs/Reference/classes/Schedule/tile.md index 096ee0cf..ee5f30a6 100644 --- a/docs/Reference/classes/Schedule/tile.md +++ b/docs/Reference/classes/Schedule/tile.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Schedule.tile(shape)` The `tile` transformation is a convenience syntax that takes a tuple of indices and a tuple of sizes, and splits each index by the corresponding size. The indices involved in the split are then ordered such that all the outer indices precede all of their respective inner indices. diff --git a/docs/Reference/classes/Target/Architecture.md b/docs/Reference/classes/Target/Architecture.md index fd02faf4..d84ef329 100644 --- a/docs/Reference/classes/Target/Architecture.md +++ b/docs/Reference/classes/Target/Architecture.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target.Architecture` Defines the supported target architectures. diff --git a/docs/Reference/classes/Target/Category.md b/docs/Reference/classes/Target/Category.md index 4ca5e41e..6f828e53 100644 --- a/docs/Reference/classes/Target/Category.md +++ b/docs/Reference/classes/Target/Category.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target.Category` Defines the target processor category. diff --git a/docs/Reference/classes/Target/Model.md b/docs/Reference/classes/Target/Model.md index a67947a6..86e4a97e 100644 --- a/docs/Reference/classes/Target/Model.md +++ b/docs/Reference/classes/Target/Model.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target.Model` Defines constants for some well-known CPU models. 
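And a separate sketch of the `split`/`reorder` schedule transformations and the `unroll`/`vectorize` plan operations described above (again reusing the hypothetical `nest` and indices `i`, `j`; the 8x8 block size is arbitrary). The `kernelize` convenience call documented earlier bundles the same unroll-then-vectorize sequence:

```python
# Assumes acc, nest, and the indices i, j from the earlier sketch.
schedule = nest.create_schedule()
ii = schedule.split(i, 8)
jj = schedule.split(j, 8)
schedule.reorder(i, j, ii, jj)  # outer block indices first, 8x8 inner block last

plan = schedule.create_plan()
plan.unroll(ii)                 # fully unroll the inner row index
plan.vectorize(jj)              # vectorize the innermost column index
```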
diff --git a/docs/Reference/classes/Target/Runtime.md b/docs/Reference/classes/Target/Runtime.md index 9aa7c42b..8d75cd46 100644 --- a/docs/Reference/classes/Target/Runtime.md +++ b/docs/Reference/classes/Target/Runtime.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target.Runtime` The runtime for code generation and/or compilation. diff --git a/docs/Reference/classes/Target/Target.md b/docs/Reference/classes/Target/Target.md index a4cafcc4..8ea4d78d 100644 --- a/docs/Reference/classes/Target/Target.md +++ b/docs/Reference/classes/Target/Target.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.120) +[//]: # (Version: v1.2.130) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.Target([architecture, cache_lines, cache_sizes, category, extensions, family, frequency_GHz, known_name, model, name, num_cores, num_threads, runtime, tensor_core_info, turbo_frequency_GHz, vector_bytes, vector_registers)` diff --git a/docs/Reference/enumerations/CacheStrategy.md b/docs/Reference/enumerations/CacheStrategy.md index 930c569d..e66b8f6d 100644 --- a/docs/Reference/enumerations/CacheStrategy.md +++ b/docs/Reference/enumerations/CacheStrategy.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.CacheStrategy` type | description diff --git a/docs/Reference/enumerations/MMASchedulingPolicy.md b/docs/Reference/enumerations/MMASchedulingPolicy.md index 1e66d544..12dac678 100644 --- a/docs/Reference/enumerations/MMASchedulingPolicy.md +++ b/docs/Reference/enumerations/MMASchedulingPolicy.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.MMASchedulingPolicy` type | description diff --git a/docs/Reference/enumerations/MMAShape.md b/docs/Reference/enumerations/MMAShape.md index 474ac73d..f7debce4 100644 --- a/docs/Reference/enumerations/MMAShape.md +++ b/docs/Reference/enumerations/MMAShape.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.MMAShape` The following table shows the matrix multiplication parameters associated with the different enum values, for different data types for a single pass. So for example a single pass of the `M32xN32xK2_B1` operation would take input matrices of dimensions [32x2] (A) and [2x32] (B) to produce a matrix multiplication result of dimensions [32x32] (C). These operations can then be composed together to perform matrix multiplication of larger matrices. 
diff --git a/docs/Reference/enumerations/ScalarType.md b/docs/Reference/enumerations/ScalarType.md index 8ca323e0..c9abc606 100644 --- a/docs/Reference/enumerations/ScalarType.md +++ b/docs/Reference/enumerations/ScalarType.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.ScalarType` type | description diff --git a/docs/Reference/functions/cast.md b/docs/Reference/functions/cast.md index c103f4ae..f4969e82 100644 --- a/docs/Reference/functions/cast.md +++ b/docs/Reference/functions/cast.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.cast(value, type)` The `cast` operation converts a value from one `acc.ScalarType` to another. diff --git a/docs/Reference/functions/create_dimensions.md b/docs/Reference/functions/create_dimensions.md index 81cd263c..d74dbcdd 100644 --- a/docs/Reference/functions/create_dimensions.md +++ b/docs/Reference/functions/create_dimensions.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.create_dimensions([role])` Creates placeholder dimensions of the specified role. These represent runtime `Array` and `Nest` dimensions. diff --git a/docs/Reference/functions/create_parameter_grid.md b/docs/Reference/functions/create_parameter_grid.md index 912dbd63..eeafa7db 100644 --- a/docs/Reference/functions/create_parameter_grid.md +++ b/docs/Reference/functions/create_parameter_grid.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.create_parameter_grid(parameter_choices, [filter_func, sample, seed])` Create a parameter grid from a dictionary that maps each parameter to its possible values. diff --git a/docs/Reference/functions/create_parameters.md b/docs/Reference/functions/create_parameters.md index 30ec9bc8..2d191b95 100644 --- a/docs/Reference/functions/create_parameters.md +++ b/docs/Reference/functions/create_parameters.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.create_parameters()` Creates placeholder parameters. diff --git a/docs/Reference/functions/fuse.md b/docs/Reference/functions/fuse.md index 598d1dcf..419cad67 100644 --- a/docs/Reference/functions/fuse.md +++ b/docs/Reference/functions/fuse.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference ## `accera.fuse(schedules[, *args, partial])` The `fuse` operation combines multiple iteration spaces into a single "fused" iteration space. The fused iteration space represents the union of the work in the original spaces. 
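Finally, a hedged sketch of the runtime-dimension workflow behind `create_dimensions` documented above; the argument ordering in `package.add` (dimensions before arrays) is an assumption:

```python
import accera as acc

# Placeholder runtime dimensions: the concrete sizes become function arguments.
M, N = acc.create_dimensions()

A = acc.Array(role=acc.Array.Role.INPUT,
              element_type=acc.ScalarType.float32, shape=(M, N))
B = acc.Array(role=acc.Array.Role.INPUT_OUTPUT,
              element_type=acc.ScalarType.float32, shape=(M, N))

nest = acc.Nest(shape=(M, N))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    B[i, j] += A[i, j]

package = acc.Package()
package.add(nest, args=(M, N, A, B), base_name="dynamic_accumulate")
```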
diff --git a/docs/Reference/safety_analysis.md b/docs/Reference/safety_analysis.md index 87914010..b4d9b009 100644 --- a/docs/Reference/safety_analysis.md +++ b/docs/Reference/safety_analysis.md @@ -1,7 +1,7 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) -# Accera v1.2.12 Reference +# Accera v1.2.13 Reference # Safety Analysis diff --git a/docs/Tutorials/Hello_MatMul.md b/docs/Tutorials/Hello_MatMul.md index da8e1ce7..85f850a7 100644 --- a/docs/Tutorials/Hello_MatMul.md +++ b/docs/Tutorials/Hello_MatMul.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Hello MatMul diff --git a/docs/Tutorials/Hello_MatMul_GPU.md b/docs/Tutorials/Hello_MatMul_GPU.md index 3f1768c0..9cf9ecda 100644 --- a/docs/Tutorials/Hello_MatMul_GPU.md +++ b/docs/Tutorials/Hello_MatMul_GPU.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Hello MatMul GPU diff --git a/docs/Tutorials/Optimized_MatMul.md b/docs/Tutorials/Optimized_MatMul.md index e0737ac3..4f7f6789 100644 --- a/docs/Tutorials/Optimized_MatMul.md +++ b/docs/Tutorials/Optimized_MatMul.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) ## Optimized MatMul diff --git a/docs/Tutorials/Pi3_Cross_Compilation.md b/docs/Tutorials/Pi3_Cross_Compilation.md index 59552aa1..e4dc1742 100644 --- a/docs/Tutorials/Pi3_Cross_Compilation.md +++ b/docs/Tutorials/Pi3_Cross_Compilation.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Cross Compiling for the Raspberry Pi 3 diff --git a/docs/Tutorials/README.md b/docs/Tutorials/README.md index 44d3fd6b..b163c151 100644 --- a/docs/Tutorials/README.md +++ b/docs/Tutorials/README.md @@ -1,5 +1,5 @@ [//]: # (Project: Accera) -[//]: # (Version: v1.2.12) +[//]: # (Version: v1.2.13) # Accera Tutorials