diff --git a/.gitignore b/.gitignore
index a9551474..e57253d5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -40,6 +40,13 @@ mx.grcuda/eclipse-launches
 /projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDAListener.java
 /projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDAParser.java
 /projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDAVisitor.java
+**.log
+/scratch
+**.nvvp
+projects/resources/cuda/bin
+data/results/*
+data/nvprof_log/*
+data/pickle/*
 /projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.g4.stamp
 /projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.interp
 /projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.tokens
@@ -50,3 +57,10 @@ mx.grcuda/eclipse-launches
 tensorrt/build
 examples/tensorrt/python/logs
 examples/tensorrt/cpp/build
+venv
+out/
+*.files
+*.csv
+grcuda_token.txt
+projects/demos/image_pipeline/cuda/build
+projects/demos/image_pipeline/img_out
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 00000000..7fd0dae2
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,8 @@
+[submodule "grcuda-data"]
+	path = grcuda-data
+	url = https://github.com/AlbertoParravicini/grcuda-data.git
+	branch = master
+[submodule "projects/resources/python/plotting/segretini_matplottini"]
+	path = projects/resources/python/plotting/segretini_matplottini
+	url = git@github.com:AlbertoParravicini/segretini-matplottini.git
+	branch = master
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 00000000..215f8a9b
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,189 @@
+# 2022-06-01
+
+* Added scheduling DAG export functionality. It is now possible to retrieve a graphical version of the scheduling DAG of the execution by adding `ExportDAG` to the startup options. The graph will be exported in .dot format to the path specified by the user as the option argument.
+* This information can be leveraged to better understand the achieved runtime performance and to compare the schedules derived from different policies. Moreover, poorly written applications will result in DAGs with a low level of task parallelism regardless of the selected policy, suggesting that designers revise their applications’ logic.
+
+
+# 2022-04-15
+
+* Updated install.sh to compute the interconnection graph
+* Updated benchmark_wrapper to retrieve connection_graph correctly
+* Enabled the min-transfer-size test in benchmark_wrapper
+* Set benchmark_wrapper up for V100
+
+# 2022-02-16
+
+* Added logging of multiple computations (list of floats) on the same deviceID. This information could be used in future history-based adaptive scheduling policies.
+
+# 2022-01-26
+
+* Added mocked benchmarks: for each multi-GPU benchmark in our suite, there is a mocked version where we check that the GPU assignment is the one we expect. Added utility functions to easily test mocked benchmarks
+* Simplified the round-robin device selection policy: it works more or less as before, but it is faster to update when using a subset of devices
+* Added a threshold parameter for data-aware device selection policies. When using min-transfer-size or minmax/min-transfer-time, consider only devices that have at least 10% (or X%) of the requested data. If a device only has a very small amount of the data already available, it is not worth preferring it to other devices, and doing so can cause scheduling to converge to a single device. See B9 and B11, for example.
+* Updated the Python benchmark suite to use the new options, and optimized the initialization of B1 (it is faster now) and B9 (it did not work on matrices with 50K rows, as Python uses 32-bit array indexing)
+* Fixed a performance regression in DeviceArray access. For simple Python code that writes 160M values to a DeviceArray, execution time went from 4s to 20s when using GraalVM 21.3 instead of 21.2. Reverted GraalVM to 21.2.
Using a non-static final Logger in GrCUDAComputationalElement increased the time from 4s to 130s (not sure why, as they are not created in repeated array accesses): fixed this regression.
+
+# 2022-01-14
+
+* Modified the "new stream creation policy FIFO" to simply reuse an existing free stream, without using a FIFO policy. Using FIFO did not give any benefit (besides a more predictable stream assignment), but it was more complex (we needed both a set and a FIFO; now we just use a set for the free streams)
+* Added a device manager to track devices. This is mostly an abstraction layer over CUDARuntime, and it allows retrieving the currently active GPU, or retrieving a specific device.
+  * DeviceManager is only a "getter": it cannot change the state of the system (e.g. it does not allow changing the current GPU)
+  * Compared to the original multi-GPU branch, we have a cleaner separation. StreamManager has access to StreamPolicy, and StreamPolicy has access to DeviceManager. StreamManager still has access to the runtime (for event creation, sync, etc.), but we might completely hide CUDARuntime inside DeviceManager to have even more separation.
+* Re-added the script to build the connection graph. We might want to call it automatically from GrCUDA if the output CSV is not found; otherwise, we need to update the documentation to tell users how to use the script
+
+# 2022-01-12
+
+* Modified DeviceSelectionPolicy to select a device from a specified list of GPUs, instead of looking at all GPUs.
+That's useful because, when we want to reuse a parent's stream, we have to choose among the devices used by the parents, instead of considering all devices.
+* Added a new SelectParentStreamPolicy where we find the parents' streams that can be reused, and then look for the best device among the devices where these streams are, instead of considering all the devices in the system as in the previous policy. The old policy is still available.
+
+# 2021-12-21, Release 2
+
+* Added support for GraalVM 21.3.
+* Removed the `ProfilableElement` Boolean flag, as it was always true.
+
+# 2021-12-09
+
+* Replaced the old `isLastComputationArrayAccess` with the new device tracking API
+* The old `isLastComputationArrayAccess` was a performance optimization used to track whether the last computation on an array was an access done by the CPU (the only existing CPU computations), to skip scheduling of further array accesses done by the CPU
+* Implicitly, the API tracked whether a certain array was up-to-date on the CPU or on the GPU (for a 1-GPU system).
+* The new API that tracks the locations of arrays completely covers the old API, making it redundant. If an array is up-to-date on the CPU, we can perform reads/writes without any ComputationalElement scheduling.
+* Checking if an array is up-to-date on the CPU requires a hashset lookup. It might be optimized if necessary, using a tracking flag.
+
+# 2021-12-06
+
+* Fixed a major bug that prevented CPU reads on read-only arrays in use by the GPU. The problem appeared only on Pascal and newer devices.
+* Started integrating the API to track on which devices a certain array is currently up-to-date. Slightly modified from the original multi-GPU API.
+
+# 2021-12-05
+
+* Updated options in GrCUDA to support the new multi-GPU flags.
+* Improved initialization of ExecutionContext; it now takes GrCUDAOptionMap as a parameter.
+* Improved GrCUDAOptionMap testing, and integrated preliminary multi-GPU tests.
+* Renamed GrCUDAExecutionContext to AsyncGrCUDAExecutionContext.
+* Integrated multi-GPU features into CUDARuntime
+* Improved the interface to measure the execution time of ComputationalElements (now the role of `ProfilableElement` is clearer, and execution time logging has been moved inside ComputationalElement instead of using StreamManager)
+* Improved manual selection of the GPU
+* Unsupported tests (e.g.
tests for multi-GPU if just 1 GPU is available) are properly skipped, instead of failing or completing successfully without info
+* Temporary fix for GRCUDA-56: cuBLAS is disabled on pre-Pascal GPUs if the async scheduler is selected
+
+# 2021-11-30
+
+* Updated the Python benchmark suite to integrate multi-GPU code.
+* Minor updates in naming conventions (e.g. using snake_case instead of CamelCase)
+* We might still want to update the Python suite (for example the output dict structure), but for now this should work.
+
+# 2021-11-29
+
+* Removed the deprecation warning for Truffle's ArityException.
+* Updated the benchmark suite with the CUDA multi-GPU benchmarks. Also fixed a GPU OOB access in B9.
+
+# 2021-11-21
+
+* Enabled support for cuSPARSE
+  * Added support for CSR and COO `spmv` and `gemvi`.
+  * **Known limitation:** `gemvi` works only with single-precision floating-point arithmetic.
+
+# 2021-11-17
+
+* Added support for precise timing of kernels, for debugging and complex scheduling policies
+  * Associated a CUDA event with the start of the computation, in order to get the elapsed time from start to end
+  * Added an `ElapsedTime` function to compute the elapsed time between events, i.e. the total execution time
+  * Logging of kernel timers is controlled by the `grcuda.TimeComputation` option, which is false by default
+  * Implemented the `ProfilableElement` class to store timing values in a hash table and support future business logic
+* Updated the documentation for the use of the new `TimeComputation` option in the README
+* Considerations:
+  * `ProfilableElement` is profilable (`true`) by default, and any `ConfiguredKernel` is initialized with this configuration. To date, there isn't any use for a `ProfilableElement` that is not profilable (`false`)
+  * To date, we are tracking only the last execution of a `ConfiguredKernel` on each device.
It will be useful in the future to track all the executions and leverage this information in our scheduler
+
+# 2021-11-15
+
+* Added a read-only polyglot map to retrieve GrCUDA options. Retrieve it with `getoptions`. Option names and values are provided as strings. Find the full list of options in `GrCUDAOptions`.
+
+# 2021-11-04
+
+* Enabled the usage of TruffleLoggers for logging the execution of GrCUDA code
+  * GrCUDA is characterized by the presence of several different types of loggers, each one with its own functionality
+  * Implemented the GrCUDALogger class in order to have access to the loggers of interest when specific features are needed
+* Changed all the prints in the source code into log events, with different logging levels
+* Added documentation about logging in docs
+
+# 2021-10-13
+
+* Enabled support for cuBLAS and cuML in the async scheduler
+  * Stream management is now supported for both cuML and cuBLAS
+  * This feature can be applied to any library, by extending the `LibrarySetStreamFunction` class
+* Set TensorRT support to experimental
+  * TensorRT is currently not supported on CUDA 11.4, making it impossible to use alongside a recent version of cuML
+  * **Known limitation:** due to this incompatibility, TensorRT is currently not available on the async scheduler
+
+# 2021-09-30, Release 1
+
+## API Changes
+
+* Added an option to specify arguments in NFI kernel signatures as `const`
+  * The effect is the same as marking them as `in` in the NIDL syntax
+  * It is not strictly required to have the corresponding arguments in the CUDA kernel marked as `const`, although that's recommended
+  * Marking arguments as `const` or `in` enables the async scheduler to overlap kernels that use the same read-only arguments
+
+## New asynchronous scheduler
+
+* Added a new asynchronous scheduler for GrCUDA; enable it with `--experimental-options --grcuda.ExecutionPolicy=async`
+  * With this scheduler, GPU kernels are executed asynchronously.
Once they are launched, the host execution resumes immediately
+  * The computation is synchronized (i.e. the host thread is stalled and waits for the kernel to finish) only once GPU data are accessed by the host thread
+  * Execution of multiple kernels (operating on different data, e.g. distinct DeviceArrays) is overlapped using different streams
+  * Data transfer and execution (on different data, e.g. distinct DeviceArrays) are overlapped using different streams
+  * The scheduler supports different options; see `README.md` for the full list
+  * It is the scheduler presented in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
+
+## New features
+
+* Added a generic AbstractArray data structure, which is extended by DeviceArray, MultiDimDeviceArray, and MultiDimDeviceArrayView, and provides high-level array interfaces
+* Added an API for prefetching
+  * If enabled (and using a GPU with Pascal architecture or newer), it prefetches data to the GPU before executing a kernel, instead of relying on page faults for data transfer. It can greatly improve performance
+* Added an API for stream attachment
+  * Always enabled on GPUs with architecture older than Pascal when the async scheduler is active.
With the sync scheduler, it can be manually enabled
+  * It restricts the visibility of GPU data to the specified stream
+  * On Pascal and newer architectures it can provide a small performance benefit
+* Added `copyTo/copyFrom` functions on generic arrays (Truffle interoperable objects that expose the array API)
+  * Internally, the copy is implemented as a for loop, instead of using CUDA's `memcpy`
+  * It is still faster than copying using loops in the host languages in many cases, especially if host code is not JIT-ted
+  * It is also used for copying data to/from DeviceArrays with column-major layout, as `memcpy` cannot copy non-contiguous data
+
+## Demos, benchmarks and code samples
+
+* Added the demo used at SeptembeRSE 2021 (`demos/image_pipeline_local` and `demos/image_pipeline_web`)
+  * It shows an image processing pipeline that applies a retro look to images. We have a local version and a web version that displays results in a web page
+* Added a benchmark suite written in Graalpython, used in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
+  * It is a collection of complex multi-kernel benchmarks meant to show the benefits of asynchronous scheduling.
+
+## Miscellaneous
+
+* Added a dependency on the `grcuda-data` submodule, used to store data, results, and plots used in publications and demos.
+* Updated the name "grCUDA" to "GrCUDA". It looks better, doesn't it?
+* Added support for Java 11 along with Java 8
+* Added an option to specify the location of cuBLAS and cuML with environment variables (`LIBCUBLAS_DIR` and `LIBCUML_DIR`)
+* Refactored the package hierarchy to reflect changes to current GrCUDA (e.g.
`gpu -> runtime`)
+* Added basic support for TruffleLogger
+* Removed a number of existing deprecation warnings
+* Added around 800 unit tests, with support for extensive parametrized testing and GPU mocking
+* Updated documentation
+  * Bumped the GraalVM version to 21.2
+  * Added scripts to set up a new machine from scratch (e.g. on OCI), plus other OCI-specific utility scripts (see `oci_setup/`)
+  * Added documentation on setting up IntelliJ Idea for GrCUDA development
+  * Added documentation about the Python benchmark suite
+  * Added documentation on asynchronous scheduler options
diff --git a/LICENSE b/LICENSE
index e7da21dd..a19a3e92 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,4 +1,6 @@
-Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+Copyright (c) 2019, 2020, NVIDIA CORPORATION. All rights reserved.
+Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions
@@ -11,6 +13,12 @@ are met:
  * Neither the name of NVIDIA CORPORATION nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.
+ * Neither the name of NECSTLab nor the names of its
+   contributors may be used to endorse or promote products derived
+   from this software without specific prior written permission.
+ * Neither the name of Politecnico di Milano nor the names of its
+   contributors may be used to endorse or promote products derived
+   from this software without specific prior written permission.
 
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -25,5 +33,5 @@ OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-grCUDA depends on Truffle APIs licensed under the Universal Permissive
+GrCUDA depends on Truffle APIs licensed under the Universal Permissive
 License (UPL), Version 1.0 (https://opensource.org/licenses/UPL).
diff --git a/README.md b/README.md
index 22b16f39..9b9738ab 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,9 @@
-# grCUDA: Polyglot GPU Access in GraalVM
+# GrCUDA: Polyglot GPU Access in GraalVM
 
 This Truffle language exposes GPUs to the polyglot [GraalVM](http://www.graalvm.org). The goal is to
 
-1) make data exchange between the host language and the GPU efficient without burdening the programmer.
-
-2) allow programmers to invoke _existing_ GPU kernels from their host language.
+1. Make data exchange between the host language and the GPUs efficient without burdening the programmer.
+2. Allow programmers to invoke _existing_ GPU kernels from their host language.
 
 Supported and tested GraalVM languages:
 
@@ -15,27 +14,26 @@ Supported and tested GraalVM languages:
 - Java
 - C and Rust through the Graal Sulong Component
 
-A description of grCUDA and its the features can be found in the [grCUDA documentation](docs/grcuda.md).
+A description of GrCUDA and its features can be found in the [GrCUDA documentation](docs/grcuda.md).
 
 The [bindings documentation](docs/bindings.md) contains a tutorial that shows
 how to bind precompiled kernels to callables, compile and launch kernels.
 
 **Additional Information:**
 
-- [grCUDA: A Polyglot Language Binding for CUDA in GraalVM](https://devblogs.nvidia.com/grcuda-a-polyglot-language-binding-for-cuda-in-graalvm/). NVIDIA Developer Blog,
+- [GrCUDA: A Polyglot Language Binding for CUDA in GraalVM](https://devblogs.nvidia.com/grcuda-a-polyglot-language-binding-for-cuda-in-graalvm/). NVIDIA Developer Blog,
   November 2019.
-- [grCUDA: A Polyglot Language Binding](https://youtu.be/_lI6ubnG9FY). Presentation at Oracle CodeOne 2019, September 2019.
+- [GrCUDA: A Polyglot Language Binding](https://youtu.be/_lI6ubnG9FY).
Presentation at Oracle CodeOne 2019, September 2019.
 - [Simplifying GPU Access](https://developer.nvidia.com/gtc/2020/video/s21269-vid). Presentation at NVIDIA GTC 2020, March 2020.
+- [DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime](https://ieeexplore.ieee.org/abstract/document/9460491). Paper at IPDPS 2021 on the GrCUDA scheduler, May 2021. [Video](https://youtu.be/QkX0FHDRyxA) of the presentation.
 
-## Using grCUDA in the GraalVM
+## Using GrCUDA in the GraalVM
 
-grCUDA can be used in the binaries of the GraalVM languages (`lli`, `graalpython`,
-`js`, `R`, and `ruby)`. The JAR file containing grCUDA must be appended to the classpath
-or copied into `jre/languages/grcuda` of the Graal installation. Note that `--jvm`
-and `--polyglot` must be specified in both cases as well.
+GrCUDA can be used in the binaries of the GraalVM languages (`lli`, `graalpython`, `js`, `R`, and `ruby`).
+The JAR file containing GrCUDA must be appended to the classpath or copied into `jre/languages/grcuda` (Java 8) or `languages/grcuda` (Java 11) of the Graal installation.
+Note that `--jvm` and `--polyglot` must be specified in both cases as well.
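+For instance, assuming the GrCUDA JAR has been copied into the GraalVM installation as described above, a Node.js script could be launched as follows. The script name `my_gpu_app.js` is illustrative, and the `--vm.cp` variant is a sketch that may need adjusting to your paths and GraalVM version:
+
+```console
+# with the JAR copied into the languages folder:
+$GRAALVM_DIR/bin/node --jvm --polyglot my_gpu_app.js
+# or, appending the JAR to the classpath explicitly:
+$GRAALVM_DIR/bin/node --jvm --polyglot --vm.cp=/path/to/grcuda.jar my_gpu_app.js
+```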
-The following example shows how to create a GPU kernel and two device arrays
-in JavaScript (NodeJS) and invoke the kernel:
+The following example shows how to create a GPU kernel and two device arrays in JavaScript (NodeJS) and invoke the kernel:
 
 ```JavaScript
 // build kernel from CUDA C/C++ source code
@@ -46,7 +44,7 @@ __global__ void increment(int *arr, int n) {
     arr[idx] += 1;
   }
 }`
-const cu = Polyglot.eval('grcuda', 'CU') // get grCUDA namespace object
+const cu = Polyglot.eval('grcuda', 'CU') // get GrCUDA namespace object
 const incKernel = cu.buildkernel(
   kernelSource, // CUDA kernel source code string
   'increment', // kernel name
@@ -112,7 +110,7 @@ for i in range(num_elements):
 ```
 
 ```console
-nvcc --cubin --generate-code arch=compute_75,code=sm_75 kernel.cu
+nvcc --cubin --generate-code arch=compute_75,code=sm_75 kernel.cu
 $GRAALVM_DIR/bin/graalpython --polyglot --jvm example.py
 1
 2
@@ -120,67 +118,308 @@ $GRAALVM_DIR/bin/graalpython --polyglot --jvm example.py
 100
 ```
 
-For more details on how to invoke existing GPU kernels, see the
-Documentation on [polyglot kernel launches](docs/launchkernel.md).
+For more details on how to invoke existing GPU kernels, see the documentation on [polyglot kernel launches](docs/launchkernel.md).
 
 ## Installation
 
-grCUDA can be downloaded as a binary JAR from [grcuda/releases](https://github.com/NVIDIA/grcuda/releases) and manually copied into a GraalVM installation.
+GrCUDA can either be installed from an [existing release](#installation-from-an-existing-release), or built from the [source files](#installation-from-source-files) using the `mx` build tool.
+In both cases, it is recommended to follow these [extra steps](#additional-installations-steps) to ensure that your installation is working properly.
+
+### Installation from an existing release
-1.
Download GraalVM CE 20.0.0 for Linux `graalvm-ce-java8-linux-amd64-20.0.0.tar.gz`
-   from [GitHub](https://github.com/oracle/graal/releases) and untar it in your
+GrCUDA can be downloaded as a binary JAR from [grcuda/releases](https://github.com/necst/grcuda/releases) and manually copied into a GraalVM installation. The original version of GrCUDA is available [here](https://github.com/NVIDIA/grcuda/releases).
+
+1. Download GraalVM CE 22.1.0 for Linux `graalvm-ce-java11-linux-amd64-22.1.0.tar.gz`
+   from [GitHub](https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-22.1.0/graalvm-ce-java11-linux-amd64-22.1.0.tar.gz) and untar it in your
    installation directory.
-   ```console
-   cd
-   tar xfz graalvm-ce-java8-linux-amd64-20.0.0.tar.gz
-   export GRAALVM_DIR=`pwd`/graalvm-ce-java8-20.0.0
-   ```
+   ```console
+   cd
+   wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-22.1.0/graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
+   tar xfz graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
+   rm graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
+   export GRAALVM_DIR=`pwd`/graalvm-ce-java11-22.1.0
+   ```
-2. Download the grCUDA JAR from [grcuda/releases](https://github.com/NVIDIA/grcuda/releases)
+2. Download the GrCUDA JAR from [grcuda/releases](https://github.com/necst/grcuda/releases). If using the original release from [NVIDIA](https://github.com/NVIDIA/grcuda/releases), the latest features (e.g. the asynchronous scheduler, multi-GPU support) are not available.
+
-   ```console
-   cd $GRAALVM_DIR/jre/languages
-   mkdir grcuda
-   cp /grcuda-0.1.0.jar grcuda
-   ```
+   ```console
+   cd $GRAALVM_DIR/languages
+   mkdir grcuda
+   cp /grcuda.jar grcuda
+   ```
-3. Test grCUDA in Node.JS from GraalVM.
+3. Test GrCUDA in Node.JS from GraalVM.
-   ```console
-   cd $GRAALVM_DIR/bin
-   ./node --jvm --polyglot
-   > arr = Polyglot.eval('grcuda', 'int[5]')
-   [Array: null prototype] [ 0, 0, 0, 0, 0 ]
-   ```
+   ```console
+   cd $GRAALVM_DIR/bin
+   ./node --jvm --polyglot
+   > arr = Polyglot.eval('grcuda', 'int[5]')
+   [Array: null prototype] [ 0, 0, 0, 0, 0 ]
+   ```
 
 4. Download other GraalVM languages.
 
-   ```console
-   cd $GRAAL_VM/bin
-   ./gu available
-   ./gu install python
-   ./gu install R
-   ./gu install ruby
-   ```
+   ```console
+   cd $GRAAL_VM/bin
+   ./gu available
+   ./gu install python
+   ./gu install R
+   ./gu install ruby
+   ```
+
+### Installation from source files
+
+If you want to build GrCUDA yourself, instead of using an existing release, you will need a couple of extra steps.
+This section contains all the steps required to set up GrCUDA if your goal is to contribute to its development, or simply hack with it.
+For simplicity, let's assume that your installation is done in your home directory, `~`.
+
+If you are installing GrCUDA on a new machine, you can simply follow or execute `oci_setup/setup_machine_from_scratch.sh` first, and then `oci_setup/setup_graalvm.sh`.
+Here we repeat the same steps, with additional comments.
+The installation process has been validated with CUDA 11.4 - 11.7 and Ubuntu 20.04.
+The same `oci_setup` folder has a number of useful scripts to configure machines on OCI and easily use GrCUDA.
+
+1. First, download GraalVM 22.1 as above.
+
+   ```console
+   wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-22.1.0/graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
+   tar xfz graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
+   rm graalvm-ce-java11-linux-amd64-22.1.0.tar.gz
+   export GRAALVM_DIR=~/graalvm-ce-java11-22.1.0
+   ```
+
+2. To build GrCUDA, you also need the custom JDK that is used to build GraalVM.
+
+   ```console
+   wget https://github.com/graalvm/labs-openjdk-11/releases/download/jvmci-22.1-b01/labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz
+   tar xfz labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz
+   rm labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz
+   export JAVA_HOME=~/labsjdk-ce-11.0.15-jvmci-22.1-b01
+   ```
+
+3. GrCUDA requires the [mx build tool](https://github.com/graalvm/mx).
+Clone the mx repository and add its directory to `$PATH`, so that `mx` can be invoked from
+the command line.
+We check out the commit corresponding to the current GraalVM release.
+
+   ```console
+   git clone https://github.com/graalvm/mx.git
+   cd mx
+   git checkout 722b86b8ef87fbb297f7e33ee6014bbbd3f4a3a8
+   cd ..
+   ```

-## Instructions to build grCUDA from Sources

+4. You might also want the source files for GraalVM CE, at the commit corresponding to the current release of GraalVM. This is not required for building, but if you want to modify GrCUDA's source code, it is useful to also have access to GraalVM's code.

-grCUDA requires the [mx build tool](https://github.com/graalvm/mx). Clone the mx
-repository and add the directory into `$PATH`, such that the `mx` can be invoked from
-the command line.

+   ```console
+   git clone https://github.com/oracle/graal.git
+   cd graal
+   git checkout 84541b16ae8a8726a0e7d76c7179d94a57ed84ee
+   cd ..
+   ```
+
+5. Last but not least, build GrCUDA
+
+   ```console
+   cd
+   ./install.sh
+   ```
+
+### Additional installations steps
+
+1. **Setup your CUDA environment**
+* Install CUDA and Nvidia drivers, for example following the steps [here](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network)
+* Add the following to your environment (assuming you have installed CUDA in the default `/usr/local` location, and that you are using the `nvcc` compiler). Add these lines to `~/.bashrc` to make them permanent.
+
+```console
+export CUDA_DIR=/usr/local/cuda
+export PATH=$PATH:$CUDA_DIR/bin
+```
+
+2. **Setup your GraalVM and GrCUDA environment**
+* Add the following to your environment (assuming you have installed the releases mentioned in steps 2 and 3). Add these lines to `~/.bashrc` to make them permanent.
+
+```console
+export PATH=~/mx:$PATH
+export JAVA_HOME=~/labsjdk-ce-11.0.15-jvmci-22.1-b01
+export GRAAL_HOME=~/graalvm-ce-java11-22.1.0
+export GRAALVM_HOME=$GRAAL_HOME
+export PATH=$GRAAL_HOME/bin:$PATH
+export PATH=$JAVA_HOME/bin:$PATH
+export GRCUDA_HOME=~/grcuda
+```
+* `source ~/.bashrc` to make the changes available.
+
+3. **Install languages for GraalVM** (optional, but recommended)
+
+```console
+gu available
+gu install native-image
+gu install llvm-toolchain
+gu install python
+gu install nodejs
+gu rebuild-images polyglot
+```

-Build grCUDA and the unit tests:

+* If Graalpython is installed, create a `virtualenv` for it
 ```console
-cd
-mx build
+graalpython -m venv ~/graalpython_venv
+source ~/graalpython_venv/bin/activate
 ```

-Note that this will also checkout the graal repository.

+* Recommended: install `numpy` in Graalpython (required for running GrCUDA benchmarks)

-To run unit tests:

+```console
+graalpython -m ginstall install setuptools;
+graalpython -m ginstall install Cython;
+graalpython -m ginstall install numpy;
+```

-```bash
+4. **Run GrCUDA unit tests** using
+
+```
 mx unittest com.nvidia
+# To run a specific test, you can use
+mx unittest com.nvidia.grcuda.test.BuildKernelTest#testBuildKernelwithNFILegacytSignature
+```
+
+5. **Setup the grcuda-data submodule**
+The `grcuda-data` repository is used as a `git` submodule to store data, results, and plots for demos, benchmarks, and publications. You will need this submodule to run the full benchmark suite, and some of the demos. To set up the submodule, follow this [`README`](https://github.com/AlbertoParravicini/grcuda-data/tree/master).
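+Assuming the submodule is configured as in `.gitmodules`, a typical way to fetch it uses standard `git` submodule commands (a sketch; refer to the linked README for the authoritative steps):
+
+```console
+cd $GRCUDA_HOME
+git submodule update --init --recursive
+```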
+
+### Setup your IDE
+
+To develop GrCUDA, you will greatly benefit from having an IDE that allows jumping between symbols and debugging individual tests.
+Here, we explain how to set up IntelliJ Idea.
+
+ 1. `mx ideinit` from `$GRCUDA_HOME`, to set up the IDE
+ 2. Open Idea and select *"open project"*, then open GrCUDA
+ 3. See this [guide](https://github.com/graalvm/mx/blob/master/docs/IDE.md) to configure the syntax checker
+    * `File -> Settings -> Plugins -> Marketplace -> Search "Eclipse Code Formatter" and install it`
+ 4. In IntelliJ Idea, install the Python plugin with `Settings -> Plugin -> Search "Python"`, then do `Project Structure -> SDKs -> Create a new Python 3.8 Virtual Environment`; it is used by `mx`
+ 5. Select the right JVM. It should automatically select your `$JAVA_HOME`. Otherwise, `Project Structure -> Modules -> Set the Module SDK (under Dependencies)` of `mx` and submodules to your Java SDK (e.g. `11`). You can pick either the `labsjdk` or `graalvm`.
+    * This is also given by the `configure` option if you try to build the project in IntelliJ Idea before setting these options. Set your project Java SDK (e.g. `11`) for those missing modules
+    * When building for the first time in IntelliJ Idea, you might get errors like `cannot use --export for target 1.8`, which means that some package is being built with Java 8.
+    * For these packages, there are two possible solutions. Try either of them, and stick to the one that works for you
+
+      a. For those packages (look at the log to find them), manually specify a more recent SDK (e.g. `11`) as you did in the step above. If you get errors about missing symbols, follow IntelliJ's hints and export the requested packages
+
+      b. Remove the exports. `File -> Settings -> Build ... -> Compiler -> Java Compiler`, then remove all the `--export` flags.
+ 7. To run tests:
+
+    a. Go to `Run (top bar) -> Edit Configurations -> Edit configuration templates -> Junit`
+
+    b.
(Not always necessary) By default, Idea should use your `env`. If it does not, make sure Idea sees the same environment variables. Update the `PATH` variable so that it can find `nvcc`, and export `$GRAAL_HOME`. See `setup_machine_from_scratch.sh` to find all the environment variables.
+
+    c. Modify the template Junit test configuration, adding `-Djava.library.path="$GRAAL_HOME/lib"` (in Java 11) to the VM options, to find `trufflenfi`
+
+    d. In IntelliJ Idea, `Run -> Edit Configurations`. Create a new JUnit configuration set to `All in package`, with `com.nvidia.grcuda` as module and `com.nvidia.grcuda.test` selected below. Add `-Djava.library.path="$GRAAL_HOME/lib"` (or your version of GraalVM) if it's not already in the VM options. Specify the SDK by setting the GraalVM JRE in e.g. `$GRAAL_HOME`, if not specified already.
+
+    e. If you change something in GrCUDA, rebuild it with `./install.sh` before running tests. That's because tests that use the GPU load the `.jar` in `$GRAAL_HOME`, which is updated by `./install.sh`
+
+## Execute performance tests using Graalpython
+
+To measure the performance of GrCUDA on complex GPU applications, we have developed a custom benchmark suite, found in `projects/resources/python/benchmark`.
+The benchmark suite includes the benchmarks used in the [DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime](https://ieeexplore.ieee.org/abstract/document/9460491) paper.
+All commands are executed from `$GRCUDA_HOME/projects/resources/python/benchmark`. + +Run a single benchmark with custom settings +```console +graalpython --jvm --polyglot --experimental-options --grcuda.ExecutionPolicy=async --grcuda.DependencyPolicy=with-const --grcuda.RetrieveNewStreamPolicy=always-new --grcuda.RetrieveParentStreamPolicy=disjoint benchmark_main.py -d -i 10 -n 4800 --no_cpu_validation --reinit false --realloc false -b b10 +``` + +Run all benchmarks +```console +graalpython --jvm --polyglot benchmark_wrapper.py -d -i 30 +``` + +To run the CUDA version of all benchmarks, build it as follows. You might want to update the GPU architecture (the `-arch` flag) inside `$GRCUDA_HOME/projects/resources/cuda/Makefile` to reflect the hardware at your disposal. +```console +cd $GRCUDA_HOME/projects/resources/cuda; +make +cd -; +``` + +Run the CUDA version of all benchmarks +```console +graalpython --jvm --polyglot benchmark_wrapper.py -d -i 30 -c +``` + +To print the Java stack trace in case of exceptions, add the following flags to Graalpython +```console +graalpython --python.ExposeInternalSources --python.WithJavaStacktrace=1 --experimental-options +``` + +Profile a specific benchmark using `nvprof`. Running `nvprof` as `sudo` might not be required, see [here](https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters). +The `graalpython` benchmark offers the `--nvprof` flag: if enabled, only the real computation is profiled (and not the benchmark initialization). +Additionally, provide `nvprof` with the flag `--csv` to get CSV output, and `--log-file "bench-name_%p.csv"` to store the result. +If the flag `--print-gpu-trace` is not used, aggregated results are printed. +Additional metrics can be collected by `nvprof` with e.g. `--metrics "achieved_occupancy,sm_efficiency"` ([full list](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference)). 
+GPUs with architecture starting from Turing (e.g. GTX 1660 Super) no longer allow collecting metrics with `nvprof`; use `ncu` ([link](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)) or Nsight Compute ([link](https://developer.nvidia.com/nsight-compute)) instead. + +``` +sudo /usr/local/cuda/bin/nvprof --profile-from-start off --print-gpu-trace --profile-child-processes /path/to/graalpython --jvm --polyglot --experimental-options --grcuda.InputPrefetch --grcuda.ForceStreamAttach --grcuda.RetrieveNewStreamPolicy=always-new --grcuda.ExecutionPolicy=async --grcuda.DependencyPolicy=with-const --grcuda.RetrieveParentStreamPolicy=disjoint benchmark_main.py -d -i 10 -n 4800 --no_cpu_validation --reinit false --realloc false -b b10d --block_size_1d 256 --block_size_2d 16 --nvprof +``` + +* Benchmarks are defined in the `projects/resources/python/benchmark/bench` folder, +and you can create more benchmarks by inheriting from the `Benchmark` class. +Individual benchmarks are executed from `benchmark_main.py`, while all benchmarks are run through `benchmark_wrapper.py` +* The output of benchmarks is stored in a JSON file (by default, located in `data/results`) +* The benchmarking suite, through `benchmark_main.py`, supports the following options: + 1. `-d`, `--debug`: print to the console the results and details of each benchmark. False by default + 2. `-i`, `--num_iter`: number of times that each benchmark is executed, for each combination of options. 30 by default + 3. `-o`, `--output_path`: full path to the file where results are stored. By default results are stored in `data/results`, + and the file name is generated automatically + 4. `--realloc`: if true, allocate new memory and rebuild the GPU code at each iteration. False by default + 5. `--reinit`: if true, re-initialize the values used in each benchmark at each iteration. True by default + 6. 
`-c`, `--cpu_validation`: if present, validate the result of each benchmark using the CPU (use `--no_cpu_validation` to skip it instead) + 7. `-b`, `--benchmark`: run the benchmark only for the specified kernel. Otherwise run all benchmarks specified in `benchmark_main.py` + 8. `-n`, `--size`: specify the input size of data used as input for each benchmark. Otherwise use the sizes specified in `benchmark_main.py` + 9. `-r`, `--random`: initialize benchmarks randomly whenever possible. True by default + 10. `--number_of_gpus`: number of GPUs employed for the computation + 11. `--execution_policy`: if present, run the benchmark only with the selected execution policy + 12. `--dependency_policy`: if present, run the benchmark only with the selected dependency policy + 13. `--new_stream`: if present, run the benchmark only with the selected new stream policy + 14. `--parent_stream`: if present, run the benchmark only with the selected parent stream policy + 15. `--device_selection`: if present and the parent stream policy is data aware, run the benchmark only with the selected device selection heuristic + 16. `--force_stream_attach`: if present, force association between arrays and CUDA streams + 17. `--memory_advise_policy`: select a managed memory `memAdvise` flag, if multiple GPUs are available + 18. `--prefetch`: if true, enable automatic prefetching in the benchmarks + 19. `--block_size_1d`: number of threads per block when using 1D kernels + 20. `--block_size_2d`: number of threads per block when using 2D kernels + 21. `-g`, `--number_of_blocks`: number of blocks in the computation + 22. `-p`, `--time_phases`: measure the execution time of each phase of the benchmark; note that this introduces overheads, and might influence the total execution time. Results for each phase are meaningful only for synchronous execution + 23. `--timing`: if present, measure the execution time of each kernel + 24. `--nvprof`: if present, enable profiling when using nvprof. 
For this option to have effect, run graalpython using nvprof, with the flag `--profile-from-start off` + +## DAG Scheduling Settings +The automatic DAG scheduling of GrCUDA supports different settings that can be used for debugging or to simplify the dependency computation in some circumstances. +Starting from release 0.4.0, the automatic scheduler also supports the usage of multiple GPUs available in the system. +Different options can be provided at startup, using `--experimental-options --grcuda.OptionName=value`: + +* `EnableComputationTimers`: measure the execution time of GPU computations; `false` by default; +* `ExecutionPolicy`: this regulates the global scheduling policy; + `async` uses the DAG for asynchronous parallel execution, while `sync` executes each computation synchronously and can be used for debugging or to measure the execution time of each kernel; +* `DependencyPolicy`: choose how data dependencies between GrCUDA computations are computed; +`with-const` considers read-only parameters, while `no-const` assumes that all arguments can be modified in a computation; +* `RetrieveNewStreamPolicy`: choose how streams for new GrCUDA computations are created; + `reuse` (the default) reuses free streams whenever possible, while `always-new` creates new streams every time a computation should use a stream different from its parent; +* `RetrieveParentStreamPolicy`: choose how streams for new GrCUDA computations are obtained from parent computations; +`same-as-parent` simply reuses the stream of one of the parent computations, while `disjoint` allows parallel scheduling of multiple child computations as long as their arguments are disjoint; `multigpu-disjoint` extends the previous policy with multi-GPU support and selects the best parent device for a given computation, while `multigpu-early-disjoint` first selects the ideal GPU for the input computation, then checks if any of the reusable streams is allocated on that device; 
+* `InputPrefetch`: if `true`, prefetch the data on GPUs with architecture starting from Pascal. In most cases, it improves performance. `false` by default; +* `ForceStreamAttach`: if `true`, force association between arrays and CUDA streams. `true` by default on architectures older than Pascal, to allow concurrent CPU/GPU computation. On architectures starting from Pascal, it can improve performance, but it's `false` by default; +* `NumberOfGPUs`: set how many GPUs can be used during the computation (if available; otherwise use the maximum number of GPUs in the system). 1 by default; +* `DeviceSelectionPolicy`: choose the heuristic that manages how GPU computations are mapped to devices, if multiple GPUs are available. `single-gpu` by default; it supports five multi-GPU policies: `round-robin` simply rotates the scheduling between GPUs, `stream-aware` selects the device with fewer ongoing computations, `min-transfer-size` maximizes data locality, while `minmin-transfer-time` and `minmax-transfer-time` minimize, respectively, the minimum and the maximum total transfer time; +* `DataThreshold`: when selecting a device with data-aware DeviceSelectionPolicies, such as `min-transfer-size`, do not give priority to devices that have less than this percentage of the requested data already available. A lower percentage favors exploitation, a higher percentage favors exploration. 0.1 by default (10%); +* `MemAdvisePolicy`: select a managed memory `memAdvise` flag, if multiple GPUs are available. Options: `read-mostly`, `preferred-location`, `none` (default); +* `ExportDAG`: if present, dump the scheduling DAG in .dot format. Specify the destination path and the file name as the value of the option (e.g. `--grcuda.ExportDAG=../ExecutionDAG`); +* `BandwidthMatrix`: if present, sets the location of the CSV file that contains the estimated bandwidth between each CPU and GPU in the system, employed by topology-aware DeviceSelectionPolicies. 
By default, this is taken from `$GRCUDA_HOME/projects/resources/connection_graph/datasets/connection_graph.csv`, which is automatically computed during GrCUDA installation. + +## Publications + +If you use GrCUDA in your research, please cite the following publication(s). Thank you! + +``` +Parravicini, A., Delamare, A., Arnaboldi, M., & Santambrogio, M. D. (2021, May). DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 111-120). IEEE. ``` diff --git a/demos/image_pipeline_local/.gitignore b/demos/image_pipeline_local/.gitignore new file mode 100644 index 00000000..b0e98ab6 --- /dev/null +++ b/demos/image_pipeline_local/.gitignore @@ -0,0 +1,2 @@ +node_modules/ +img_out/ diff --git a/demos/image_pipeline_local/array_copy_performance_test.js b/demos/image_pipeline_local/array_copy_performance_test.js new file mode 100644 index 00000000..bcec4c5e --- /dev/null +++ b/demos/image_pipeline_local/array_copy_performance_test.js @@ -0,0 +1,192 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +const System = Java.type("java.lang.System"); +const cu = Polyglot.eval('grcuda', 'CU') +const { assert } = require("console"); +const cv = require("opencv4nodejs"); + +function intervalToMs(start, end) { + return (end - start) / 1e6; +} + +function mean(x) { + return x.reduce((a, b) => (a + b)) / x.length; +} + +const R = 512; +const N = R * R; +const NUM_TESTS = 120; +const DEBUG = true; + +// Create the device array; +const deviceArray = cu.DeviceArray('int', N); + +// Load the image; +const img = cv.imread("img_in/lena.jpg", cv.IMREAD_GRAYSCALE).resize(R, R); +const img_buffer = img.getData(); + +const direction = process.argv[2]; +const dirText = direction == "to" ? 
"grcuda->img" : "img->grcuda"; + +// Initialize device array; +for (let i = 0; i < N; i++) { + deviceArray[i] = 1; +} + +////////////////////////// +// COPY TO DEVICE ARRAY // +////////////////////////// + +// Copy using for loop; +function copy_for(from, to) { + const start = System.nanoTime(); + for (let i = 0; i < N; i++) { + to[i] = from[i]; + } + const end = System.nanoTime(); + const time = intervalToMs(start, end); + if (DEBUG) console.log("-- copy "+ dirText + "- forloop=" + time + " ms") + return time +} + +// Copy using copyFrom; +function copy_grcuda_from(from, to) { + const start = System.nanoTime(); + to.copyFrom(from) + const end = System.nanoTime(); + const time = intervalToMs(start, end); + if (DEBUG) console.log("-- copy "+ dirText + "- grcuda=" + time + " ms") + return time +} +function copy_grcuda_to(from, to) { + const start = System.nanoTime(); + from.copyFrom(to) + const end = System.nanoTime(); + const time = intervalToMs(start, end); + if (DEBUG) console.log("-- copy "+ dirText + "- grcuda=" + time + " ms") + return time +} + +// Copy using while; +function copy_while(from, to) { + const start = System.nanoTime(); + let i = from.length; + while(i--) to[i] = from[i]; + const end = System.nanoTime(); + const time = intervalToMs(start, end); + if (DEBUG) console.log("-- copy "+ dirText + "- while=" + time + " ms") + return time +} + +// Copy using map; +function copy_map(from, to) { + if (direction == "to") return undefined; + const start = System.nanoTime(); + let i = 0; + from.forEach(a => { + to[i++] = a; + }); + const end = System.nanoTime(); + const time = intervalToMs(start, end); + if (DEBUG) console.log("-- copy "+ dirText + "- map=" + time + " ms") + return time +} + +let from = img_buffer; +let to = deviceArray; +let copy_grcuda = copy_grcuda_from +if (direction == "to") { + from = deviceArray; + to = img_buffer; + copy_grcuda = copy_grcuda_to +} +const types = ["for", "grcuda", "while", "map"]; +const functions = [copy_for, 
copy_grcuda, copy_while, copy_map]; +const averageTimes = []; + +// Test all functions +for (let t = 0; t < types.length; t++) { + times = [] + for (let n = 0; n < NUM_TESTS; n++) { + times.push(functions[t](from, to)); + } + const averageTime = mean(times); + if (DEBUG) console.log(types[t] + "=" + averageTime + " ms"); + averageTimes.push(averageTime); + + // Check that the output is correct; + for (let n = 0; n < N; n++) { + assert(from[n] == to[n]); + } +} + +console.log("---- RESULTS ----") +for (let t = 0; t < types.length; t++) { + console.log("--" + types[t] + "=" + averageTimes[t] + " ms"); +} + +/////////////////////////////// +// OLD, WITHOUT IMAGE BUFFER // +/////////////////////////////// + +// const y = [] +// for (let i = 0; i < N; i++) { +// y[i] = i; +// } + +// console.log("copy from device array to js") +// for (n = 0; n < 30; n++) { +// const start = System.nanoTime(); +// for (let i = 0; i < N; i++) { +// y[i] = x[i]; +// } +// const end = System.nanoTime(); +// console.log("--copy - js=" + ((end - start) / 1e6) + " ms") + +// const start2 = System.nanoTime(); +// x.copyTo(y, N); +// const end2 = System.nanoTime(); +// console.log("--copy - grcuda=" + ((end2 - start2) / 1e6) + " ms") +// } + +// console.log("copy to device array from js") +// for (n = 0; n < 30; n++) { +// const start = System.nanoTime(); +// for (let i = 0; i < N; i++) { +// x[i] = y[i]; +// } +// const end = System.nanoTime(); +// console.log("--copy - js=" + ((end - start) / 1e6) + " ms") + +// const start2 = System.nanoTime(); +// x.copyFrom(y, N); +// const end2 = System.nanoTime(); +// console.log("--copy - grcuda=" + ((end2 - start2) / 1e6) + " ms") +// } \ No newline at end of file diff --git a/demos/image_pipeline_local/cuda/.gitignore b/demos/image_pipeline_local/cuda/.gitignore new file mode 100644 index 00000000..a007feab --- /dev/null +++ b/demos/image_pipeline_local/cuda/.gitignore @@ -0,0 +1 @@ +build/* diff --git a/demos/image_pipeline_local/cuda/CMakeLists.txt 
b/demos/image_pipeline_local/cuda/CMakeLists.txt new file mode 100644 index 00000000..f67889d6 --- /dev/null +++ b/demos/image_pipeline_local/cuda/CMakeLists.txt @@ -0,0 +1,21 @@ +cmake_minimum_required(VERSION 3.16) +project(image_pipeline_cuda LANGUAGES CXX CUDA) +set(CMAKE_CUDA_ARCHITECTURES 70) + +find_package(CUDA REQUIRED) +include_directories("${CUDA_INCLUDE_DIRS}") +include_directories("/usr/local/include") + +find_package(OpenCV REQUIRED) +find_package(CUDA REQUIRED) +set(CMAKE_CXX_FLAGS_RELEASE "-O3") + +# Pass options to NVCC +set( + CUDA_NVCC_FLAGS + ${CUDA_NVCC_FLAGS}; + -O3 + ) + +add_executable(image_pipeline main.cpp opencv_interface.cpp image_pipeline.cu) +target_link_libraries(image_pipeline PRIVATE cudart ${OpenCV_LIBS}) diff --git a/demos/image_pipeline_local/cuda/dvrapi_error_string.h b/demos/image_pipeline_local/cuda/dvrapi_error_string.h new file mode 100644 index 00000000..5f155dd0 --- /dev/null +++ b/demos/image_pipeline_local/cuda/dvrapi_error_string.h @@ -0,0 +1,463 @@ +/* Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#pragma once + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> + +// Error Code string definitions here +typedef struct { + char const *error_string; + int error_id; +} s_CudaErrorStr; + +/** + * Error codes + */ +static s_CudaErrorStr sCudaDrvErrorString[] = { + /** + * The API call returned with no errors. In the case of query calls, this + * can also mean that the operation being queried is complete (see + * ::cuEventQuery() and ::cuStreamQuery()). + */ + {"CUDA_SUCCESS", 0}, + + /** + * This indicates that one or more of the parameters passed to the API call + * is not within an acceptable range of values. + */ + {"CUDA_ERROR_INVALID_VALUE", 1}, + + /** + * The API call failed because it was unable to allocate enough memory to + * perform the requested operation. + */ + {"CUDA_ERROR_OUT_OF_MEMORY", 2}, + + /** + * This indicates that the CUDA driver has not been initialized with + * ::cuInit() or that initialization has failed. + */ + {"CUDA_ERROR_NOT_INITIALIZED", 3}, + + /** + * This indicates that the CUDA driver is in the process of shutting down. 
+ */ + {"CUDA_ERROR_DEINITIALIZED", 4}, + + /** + * This indicates profiling APIs are called while application is running + * in visual profiler mode. + */ + {"CUDA_ERROR_PROFILER_DISABLED", 5}, + /** + * This indicates profiling has not been initialized for this context. + * Call cuProfilerInitialize() to resolve this. + */ + {"CUDA_ERROR_PROFILER_NOT_INITIALIZED", 6}, + /** + * This indicates profiler has already been started and probably + * cuProfilerStart() is incorrectly called. + */ + {"CUDA_ERROR_PROFILER_ALREADY_STARTED", 7}, + /** + * This indicates profiler has already been stopped and probably + * cuProfilerStop() is incorrectly called. + */ + {"CUDA_ERROR_PROFILER_ALREADY_STOPPED", 8}, + /** + * This indicates that no CUDA-capable devices were detected by the + * installed CUDA driver. + */ + {"CUDA_ERROR_NO_DEVICE (no CUDA-capable devices were detected)", 100}, + + /** + * This indicates that the device ordinal supplied by the user does not + * correspond to a valid CUDA device. + */ + {"CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)", + 101}, + + /** + * This indicates that the device kernel image is invalid. This can also + * indicate an invalid CUDA module. + */ + {"CUDA_ERROR_INVALID_IMAGE", 200}, + + /** + * This most frequently indicates that there is no context bound to the + * current thread. This can also be returned if the context passed to an + * API call is not a valid handle (such as a context that has had + * ::cuCtxDestroy() invoked on it). This can also be returned if a user + * mixes different API versions (i.e. 3010 context with 3020 API calls). + * See ::cuCtxGetApiVersion() for more details. + */ + {"CUDA_ERROR_INVALID_CONTEXT", 201}, + + /** + * This indicated that the context being supplied as a parameter to the + * API call was already the active context. + * \deprecated + * This error return is deprecated as of CUDA 3.2. 
It is no longer an + * error to attempt to push the active context via ::cuCtxPushCurrent(). + */ + {"CUDA_ERROR_CONTEXT_ALREADY_CURRENT", 202}, + + /** + * This indicates that a map or register operation has failed. + */ + {"CUDA_ERROR_MAP_FAILED", 205}, + + /** + * This indicates that an unmap or unregister operation has failed. + */ + {"CUDA_ERROR_UNMAP_FAILED", 206}, + + /** + * This indicates that the specified array is currently mapped and thus + * cannot be destroyed. + */ + {"CUDA_ERROR_ARRAY_IS_MAPPED", 207}, + + /** + * This indicates that the resource is already mapped. + */ + {"CUDA_ERROR_ALREADY_MAPPED", 208}, + + /** + * This indicates that there is no kernel image available that is suitable + * for the device. This can occur when a user specifies code generation + * options for a particular CUDA source file that do not include the + * corresponding device configuration. + */ + {"CUDA_ERROR_NO_BINARY_FOR_GPU", 209}, + + /** + * This indicates that a resource has already been acquired. + */ + {"CUDA_ERROR_ALREADY_ACQUIRED", 210}, + + /** + * This indicates that a resource is not mapped. + */ + {"CUDA_ERROR_NOT_MAPPED", 211}, + + /** + * This indicates that a mapped resource is not available for access as an + * array. + */ + {"CUDA_ERROR_NOT_MAPPED_AS_ARRAY", 212}, + + /** + * This indicates that a mapped resource is not available for access as a + * pointer. + */ + {"CUDA_ERROR_NOT_MAPPED_AS_POINTER", 213}, + + /** + * This indicates that an uncorrectable ECC error was detected during + * execution. + */ + {"CUDA_ERROR_ECC_UNCORRECTABLE", 214}, + + /** + * This indicates that the ::CUlimit passed to the API call is not + * supported by the active device. + */ + {"CUDA_ERROR_UNSUPPORTED_LIMIT", 215}, + + /** + * This indicates that the ::CUcontext passed to the API call can + * only be bound to a single CPU thread at a time but is already + * bound to a CPU thread. 
+ */ + {"CUDA_ERROR_CONTEXT_ALREADY_IN_USE", 216}, + + /** + * This indicates that peer access is not supported across the given + * devices. + */ + {"CUDA_ERROR_PEER_ACCESS_UNSUPPORTED", 217}, + + /** + * This indicates that a PTX JIT compilation failed. + */ + {"CUDA_ERROR_INVALID_PTX", 218}, + + /** + * This indicates an error with OpenGL or DirectX context. + */ + {"CUDA_ERROR_INVALID_GRAPHICS_CONTEXT", 219}, + + /** + * This indicates that an uncorrectable NVLink error was detected during the + * execution. + */ + {"CUDA_ERROR_NVLINK_UNCORRECTABLE", 220}, + + /** + * This indicates that the PTX JIT compiler library was not found. + */ + {"CUDA_ERROR_JIT_COMPILER_NOT_FOUND", 221}, + + /** + * This indicates that the device kernel source is invalid. + */ + {"CUDA_ERROR_INVALID_SOURCE", 300}, + + /** + * This indicates that the file specified was not found. + */ + {"CUDA_ERROR_FILE_NOT_FOUND", 301}, + + /** + * This indicates that a link to a shared object failed to resolve. + */ + {"CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND", 302}, + + /** + * This indicates that initialization of a shared object failed. + */ + {"CUDA_ERROR_SHARED_OBJECT_INIT_FAILED", 303}, + + /** + * This indicates that an OS call failed. + */ + {"CUDA_ERROR_OPERATING_SYSTEM", 304}, + + /** + * This indicates that a resource handle passed to the API call was not + * valid. Resource handles are opaque types like ::CUstream and ::CUevent. + */ + {"CUDA_ERROR_INVALID_HANDLE", 400}, + + /** + * This indicates that a named symbol was not found. Examples of symbols + * are global/constant variable names, texture names }, and surface names. + */ + {"CUDA_ERROR_NOT_FOUND", 500}, + + /** + * This indicates that asynchronous operations issued previously have not + * completed yet. This result is not actually an error, but must be + * indicated differently than ::CUDA_SUCCESS (which indicates completion). + * Calls that may return this value include ::cuEventQuery() and + * ::cuStreamQuery(). 
+ */ + {"CUDA_ERROR_NOT_READY", 600}, + + /** + * While executing a kernel, the device encountered a + * load or store instruction on an invalid memory address. + * This leaves the process in an inconsistent state and any further CUDA + * work will return the same error. To continue using CUDA, the process must + * be terminated and relaunched. + */ + {"CUDA_ERROR_ILLEGAL_ADDRESS", 700}, + + /** + * This indicates that a launch did not occur because it did not have + * appropriate resources. This error usually indicates that the user has + * attempted to pass too many arguments to the device kernel, or the + * kernel launch specifies too many threads for the kernel's register + * count. Passing arguments of the wrong size (i.e. a 64-bit pointer + * when a 32-bit int is expected) is equivalent to passing too many + * arguments and can also result in this error. + */ + {"CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES", 701}, + + /** + * This indicates that the device kernel took too long to execute. This can + * only occur if timeouts are enabled - see the device attribute + * ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The + * context cannot be used (and must be destroyed similar to + * ::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from + * this context are invalid and must be reconstructed if the program is to + * continue using CUDA. + */ + {"CUDA_ERROR_LAUNCH_TIMEOUT", 702}, + + /** + * This error indicates a kernel launch that uses an incompatible texturing + * mode. + */ + {"CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING", 703}, + + /** + * This error indicates that a call to ::cuCtxEnablePeerAccess() is + * trying to re-enable peer access to a context which has already + * had peer access to it enabled. + */ + {"CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED", 704}, + + /** + * This error indicates that ::cuCtxDisablePeerAccess() is + * trying to disable peer access which has not been enabled yet + * via ::cuCtxEnablePeerAccess(). 
+ */ + {"CUDA_ERROR_PEER_ACCESS_NOT_ENABLED", 705}, + + /** + * This error indicates that the primary context for the specified device + * has already been initialized. + */ + {"CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE", 708}, + + /** + * This error indicates that the context current to the calling thread + * has been destroyed using ::cuCtxDestroy }, or is a primary context which + * has not yet been initialized. + */ + {"CUDA_ERROR_CONTEXT_IS_DESTROYED", 709}, + + /** + * A device-side assert triggered during kernel execution. The context + * cannot be used anymore, and must be destroyed. All existing device + * memory allocations from this context are invalid and must be + * reconstructed if the program is to continue using CUDA. + */ + {"CUDA_ERROR_ASSERT", 710}, + + /** + * This error indicates that the hardware resources required to enable + * peer access have been exhausted for one or more of the devices + * passed to ::cuCtxEnablePeerAccess(). + */ + {"CUDA_ERROR_TOO_MANY_PEERS", 711}, + + /** + * This error indicates that the memory range passed to + * ::cuMemHostRegister() has already been registered. + */ + {"CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED", 712}, + + /** + * This error indicates that the pointer passed to ::cuMemHostUnregister() + * does not correspond to any currently registered memory region. + */ + {"CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED", 713}, + + /** + * While executing a kernel, the device encountered a stack error. + * This can be due to stack corruption or exceeding the stack size limit. + * This leaves the process in an inconsistent state and any further CUDA + * work will return the same error. To continue using CUDA, the process must + * be terminated and relaunched. + */ + {"CUDA_ERROR_HARDWARE_STACK_ERROR", 714}, + + /** + * While executing a kernel, the device encountered an illegal instruction. + * This leaves the process in an inconsistent state and any further CUDA + * work will return the same error. 
To continue using CUDA, the process must + * be terminated and relaunched. + */ + {"CUDA_ERROR_ILLEGAL_INSTRUCTION", 715}, + + /** + * While executing a kernel, the device encountered a load or store + * instruction on a memory address which is not aligned. This leaves the + * process in an inconsistent state and any further CUDA work will return + * the same error. To continue using CUDA, the process must be terminated + * and relaunched. + */ + {"CUDA_ERROR_MISALIGNED_ADDRESS", 716}, + + /** + * While executing a kernel, the device encountered an instruction + * which can only operate on memory locations in certain address spaces + * (global, shared, or local), but was supplied a memory address not + * belonging to an allowed address space. + * This leaves the process in an inconsistent state and any further CUDA + * work will return the same error. To continue using CUDA, the process must + * be terminated and relaunched. + */ + {"CUDA_ERROR_INVALID_ADDRESS_SPACE", 717}, + + /** + * While executing a kernel, the device program counter wrapped its address + * space. This leaves the process in an inconsistent state and any further + * CUDA work will return the same error. To continue using CUDA, the process + * must be terminated and relaunched. + */ + {"CUDA_ERROR_INVALID_PC", 718}, + + /** + * An exception occurred on the device while executing a kernel. Common + * causes include dereferencing an invalid device pointer and accessing + * out of bounds shared memory. The context cannot be used }, so it must + * be destroyed (and a new one should be created). All existing device + * memory allocations from this context are invalid and must be + * reconstructed if the program is to continue using CUDA. 
+     */
+    {"CUDA_ERROR_LAUNCH_FAILED", 719},
+
+    /**
+     * This error indicates that the number of blocks launched per grid for a
+     * kernel that was launched via either ::cuLaunchCooperativeKernel or
+     * ::cuLaunchCooperativeKernelMultiDevice exceeds the maximum number of
+     * blocks as allowed by ::cuOccupancyMaxActiveBlocksPerMultiprocessor or
+     * ::cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags times the number
+     * of multiprocessors as specified by the device attribute
+     * ::CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.
+     */
+    {"CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE", 720},
+
+    /**
+     * This error indicates that the attempted operation is not permitted.
+     */
+    {"CUDA_ERROR_NOT_PERMITTED", 800},
+
+    /**
+     * This error indicates that the attempted operation is not supported
+     * on the current system or device.
+     */
+    {"CUDA_ERROR_NOT_SUPPORTED", 801},
+
+    /**
+     * This indicates that an unknown internal error has occurred.
+     */
+    {"CUDA_ERROR_UNKNOWN", 999},
+    {NULL, -1}};
+
+// This is just a linear search through the array, since the error IDs do not
+// always occur consecutively;
+inline const char *getCudaDrvErrorString(int error_id) {
+    int index = 0;
+
+    while (sCudaDrvErrorString[index].error_id != error_id &&
+           sCudaDrvErrorString[index].error_id != -1) {
+        index++;
+    }
+
+    if (sCudaDrvErrorString[index].error_id == error_id)
+        return (const char *)sCudaDrvErrorString[index].error_string;
+    else
+        return (const char *)"CUDA_ERROR not found!";
+}
diff --git a/demos/image_pipeline_local/cuda/image_pipeline.cu b/demos/image_pipeline_local/cuda/image_pipeline.cu
new file mode 100644
index 00000000..10b86e02
--- /dev/null
+++ b/demos/image_pipeline_local/cuda/image_pipeline.cu
@@ -0,0 +1,471 @@
+// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#include "image_pipeline.cuh" + +////////////////////////////// +// GPU kernels /////////////// +////////////////////////////// + +extern "C" __global__ void gaussian_blur(const int *image, float *result, int rows, int cols, const float* kernel, int diameter) { + extern __shared__ float kernel_local[]; + for(int i = threadIdx.x; i < diameter; i += blockDim.x) { + for(int j = threadIdx.y; j < diameter; j += blockDim.y) { + kernel_local[i * diameter + j] = kernel[i * diameter + j]; + } + } + __syncthreads(); + + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum = 0; + int radius = diameter / 2; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + sum += kernel_local[(x + radius) * diameter + (y + radius)] * (float(image[nx * cols + ny]) / 255); + } + } + } + result[i * cols + j] = sum; + } + } +} + +extern "C" __global__ void sobel(float *image, float *result, int rows, int cols) { + // int SOBEL_X[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}}; + // int SOBEL_Y[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}; + __shared__ int SOBEL_X[9]; + __shared__ int SOBEL_Y[9]; + if (threadIdx.x == 0 && threadIdx.y == 0) { + SOBEL_X[0] = -1; + SOBEL_X[1] = -2; + SOBEL_X[2] = -1; + SOBEL_X[3] = 0; + SOBEL_X[4] = 0; + SOBEL_X[5] = 0; + SOBEL_X[6] = 1; + SOBEL_X[7] = 2; + SOBEL_X[8] = 1; + + SOBEL_Y[0] = -1; + SOBEL_Y[1] = 0; + SOBEL_Y[2] = 1; + SOBEL_Y[3] = -2; + SOBEL_Y[4] = 0; + SOBEL_Y[5] = 2; + SOBEL_Y[6] = -1; + SOBEL_Y[7] = 0; + SOBEL_Y[8] = 1; + } + __syncthreads(); + + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum_gradient_x = 0.0, sum_gradient_y = 0.0; + 
int radius = 1; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + float neighbour = image[nx * cols + ny]; + int s = (x + radius) * 3 + y + radius; + sum_gradient_x += SOBEL_X[s] * neighbour; + sum_gradient_y += SOBEL_Y[s] * neighbour; + } + } + } + result[i * cols + j] = sqrt(sum_gradient_x * sum_gradient_x + sum_gradient_y * sum_gradient_y); + } + } +} + +__device__ float atomicMinf(float* address, float val) +{ + int *address_as_int =(int*)address; + int old = *address_as_int, assumed; + while (val < __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, + __float_as_int(val)); + } + return __int_as_float(old); +} + +__device__ float atomicMaxf(float* address, float val) +{ + int *address_as_int = (int*) address; + int old = *address_as_int, assumed; + // If val is smaller than current, don't do anything, else update the current value atomically; + while (val > __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, __float_as_int(val)); + } + return __int_as_float(old); +} + +__inline__ __device__ float warp_reduce_max(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = max(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +__inline__ __device__ float warp_reduce_min(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +extern "C" __global__ void maximum_kernel(float *in, float* out, int N) { + int warp_size = 32; + float maximum = -1000; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + maximum = max(maximum, in[i]); + } + maximum = warp_reduce_max(maximum); // Obtain the max of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) 
// Same as (threadIdx.x % warp_size) == 0 but faster + atomicMaxf(out, maximum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void minimum_kernel(float *in, float* out, int N) { + int warp_size = 32; + float minimum = 1000; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + minimum = min(minimum, in[i]); + } + minimum = warp_reduce_min(minimum); // Obtain the min of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicMinf(out, minimum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void extend(float *x, const float *minimum, const float *maximum, int n, int extend_factor) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = extend_factor * (x[i] - *minimum) / (*maximum - *minimum); + x[i] = res_tmp > 1 ? 1 : res_tmp; + } +} + +extern "C" __global__ void unsharpen(const int *x, const float *y, float *res, float amount, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = (float(x[i]) / 255) * (1 + amount) - y[i] * amount; + res_tmp = res_tmp > 1 ? 1 : res_tmp; + res[i] = res_tmp < 0 ? 
0 : res_tmp; + } +} + +extern "C" __global__ void combine(const float *x, const float *y, const float *mask, float *res, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + res[i] = x[i] * mask[i] + y[i] * (1 - mask[i]); + } +} + +extern "C" __global__ void combine_lut(const float *x, const float *y, const float *mask, int *res, int n, int* lut) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + int res_tmp = min(CDEPTH - 1, int(CDEPTH * (x[i] * mask[i] + y[i] * (1 - mask[i])))); + res[i] = lut[res_tmp]; + } +} + +////////////////////////////// +// GPU functions ///////////// +////////////////////////////// + +void ImagePipeline::alloc() { + err = cudaMallocManaged(&image, sizeof(int) * image_width * image_width); + err = cudaMallocManaged(&image2, sizeof(float) * image_width * image_width); + err = cudaMallocManaged(&image3, sizeof(int) * image_width * image_width); + err = cudaMallocManaged(&image_unsharpen, sizeof(float) * image_width * image_width); + err = cudaMallocManaged(&mask_small, sizeof(float) * image_width * image_width); + err = cudaMallocManaged(&mask_large, sizeof(float) * image_width * image_width); + err = cudaMallocManaged(&blurred_small, sizeof(float) * image_width * image_width); + err = cudaMallocManaged(&blurred_large, sizeof(float) * image_width * image_width); + err = cudaMallocManaged(&blurred_unsharpen, sizeof(float) * image_width * image_width); + + err = cudaMallocManaged(&kernel_small, sizeof(float) * kernel_small_diameter * kernel_small_diameter); + err = cudaMallocManaged(&kernel_large, sizeof(float) * kernel_large_diameter * kernel_large_diameter); + err = cudaMallocManaged(&kernel_unsharpen, sizeof(float) * kernel_unsharpen_diameter * kernel_unsharpen_diameter); + + err = cudaMallocManaged(&lut[0], sizeof(int) * CDEPTH); + err = cudaMallocManaged(&lut[1], sizeof(int) * CDEPTH); + err = cudaMallocManaged(&lut[2], sizeof(int) * CDEPTH); + + 
+    err = cudaMallocManaged(&maximum_1, sizeof(float));
+    err = cudaMallocManaged(&minimum_1, sizeof(float));
+    err = cudaMallocManaged(&maximum_2, sizeof(float));
+    err = cudaMallocManaged(&minimum_2, sizeof(float));
+
+    err = cudaStreamCreate(&s1);
+    err = cudaStreamCreate(&s2);
+    err = cudaStreamCreate(&s3);
+    err = cudaStreamCreate(&s4);
+    err = cudaStreamCreate(&s5);
+}
+
+void ImagePipeline::init(unsigned char* input_image, int channel) {
+    gaussian_kernel(kernel_small, kernel_small_diameter, kernel_small_variance);
+    gaussian_kernel(kernel_large, kernel_large_diameter, kernel_large_variance);
+    gaussian_kernel(kernel_unsharpen, kernel_unsharpen_diameter, kernel_unsharpen_variance);
+    *maximum_1 = 0;
+    *minimum_1 = 0;
+    *maximum_2 = 0;
+    *minimum_2 = 0;
+
+    // Initialize LUTs;
+    lut_b(lut[0]);
+    lut_g(lut[1]);
+    lut_r(lut[2]);
+
+    // Copy input data on the GPU managed memory;
+    for (int i = 0; i < image_width * image_width; i++) {
+        image[i] = int(input_image[black_and_white ? i : (i * 3 + channel)]);
+    }
+    cudaDeviceSynchronize();
+}
+
+void ImagePipeline::execute_sync(int channel) {
+
+    if (pascalGpu && do_prefetch) {
+        cudaMemPrefetchAsync(image, sizeof(int) * image_width * image_width, 0, 0);
+        cudaMemPrefetchAsync(image2, sizeof(float) * image_width * image_width, 0, 0);
+        cudaMemPrefetchAsync(image3, sizeof(int) * image_width * image_width, 0, 0);
+        cudaMemPrefetchAsync(image_unsharpen, sizeof(float) * image_width * image_width, 0, 0);
+        cudaMemPrefetchAsync(mask_small, sizeof(float) * image_width * image_width, 0, 0);
+        cudaMemPrefetchAsync(mask_large, sizeof(float) * image_width * image_width, 0, 0);
+        cudaMemPrefetchAsync(blurred_small, sizeof(float) * image_width * image_width, 0, 0);
+        cudaMemPrefetchAsync(blurred_large, sizeof(float) * image_width * image_width, 0, 0);
+        cudaMemPrefetchAsync(blurred_unsharpen, sizeof(float) * image_width * image_width, 0, 0);
+    }
+    // Blur - Small;
+    gaussian_blur<<<grid_size_2d, block_size_2d, kernel_small_diameter * kernel_small_diameter * sizeof(float)>>>(image, blurred_small, image_width, image_width, kernel_small, kernel_small_diameter);
+    cudaDeviceSynchronize();
+    // Blur - Large;
+    gaussian_blur<<<grid_size_2d, block_size_2d, kernel_large_diameter * kernel_large_diameter * sizeof(float)>>>(image, blurred_large, image_width, image_width, kernel_large, kernel_large_diameter);
+    cudaDeviceSynchronize();
+    // Blur - Unsharpen;
+    gaussian_blur<<<grid_size_2d, block_size_2d, kernel_unsharpen_diameter * kernel_unsharpen_diameter * sizeof(float)>>>(image, blurred_unsharpen, image_width, image_width, kernel_unsharpen, kernel_unsharpen_diameter);
+    cudaDeviceSynchronize();
+    // Sobel filter (edge detection);
+    sobel<<<grid_size_2d, block_size_2d>>>(blurred_small, mask_small, image_width, image_width);
+    cudaDeviceSynchronize();
+    sobel<<<grid_size_2d, block_size_2d>>>(blurred_large, mask_large, image_width, image_width);
+    cudaDeviceSynchronize();
+    // Ensure that the output of Sobel is in [0, 1];
+    maximum_kernel<<<grid_size_1d, block_size_1d>>>(mask_small, maximum_1, image_width * image_width);
+    cudaDeviceSynchronize();
+    minimum_kernel<<<grid_size_1d, block_size_1d>>>(mask_small, minimum_1, image_width * image_width);
+    cudaDeviceSynchronize();
+    extend<<<grid_size_1d, block_size_1d>>>(mask_small, minimum_1, maximum_1, image_width * image_width, 1);
+    cudaDeviceSynchronize();
+    // Extend large edge detection mask, and normalize it;
+    maximum_kernel<<<grid_size_1d, block_size_1d>>>(mask_large, maximum_2, image_width * image_width);
+    cudaDeviceSynchronize();
+    minimum_kernel<<<grid_size_1d, block_size_1d>>>(mask_large, minimum_2, image_width * image_width);
+    cudaDeviceSynchronize();
+    extend<<<grid_size_1d, block_size_1d>>>(mask_large, minimum_2, maximum_2, image_width * image_width, 5);
+    cudaDeviceSynchronize();
+    // Unsharpen;
+    unsharpen<<<grid_size_1d, block_size_1d>>>(image, blurred_unsharpen, image_unsharpen, unsharpen_amount, image_width * image_width);
+    cudaDeviceSynchronize();
+    // Combine results;
+    combine<<<grid_size_1d, block_size_1d>>>(image_unsharpen, blurred_large, mask_large, image2, image_width * image_width);
+    cudaDeviceSynchronize();
+    combine_lut<<<grid_size_1d, block_size_1d>>>(image2, blurred_small, mask_small, image3, image_width * image_width, lut[channel]);
+
+    cudaDeviceSynchronize();
+}
+
+void ImagePipeline::execute_async(int channel) {
+
+    if (!pascalGpu || stream_attach) {
+        cudaStreamAttachMemAsync(s1, blurred_small, 0);
+        cudaStreamAttachMemAsync(s1, mask_small, 0);
+        cudaStreamAttachMemAsync(s2, blurred_large, 0);
+        cudaStreamAttachMemAsync(s2, mask_large, 0);
+        cudaStreamAttachMemAsync(s2, image2, 0);
+        cudaStreamAttachMemAsync(s3, blurred_unsharpen, 0);
+        cudaStreamAttachMemAsync(s3, image_unsharpen, 0);
+        cudaStreamAttachMemAsync(s1, image3, 0);
+    }
+    if (pascalGpu && do_prefetch) {
+        cudaMemPrefetchAsync(image, sizeof(int) * image_width * image_width, 0, s1);
+        cudaMemPrefetchAsync(image2, sizeof(float) * image_width * image_width, 0, s2);
+        cudaMemPrefetchAsync(image3, sizeof(int) * image_width * image_width, 0, s1);
+        cudaMemPrefetchAsync(image_unsharpen, sizeof(float) * image_width * image_width, 0, s3);
+        cudaMemPrefetchAsync(mask_small, sizeof(float) * image_width * image_width, 0, s1);
+        cudaMemPrefetchAsync(mask_large, sizeof(float) * image_width * image_width, 0, s2);
+        cudaMemPrefetchAsync(blurred_small, sizeof(float) * image_width * image_width, 0, s1);
+        cudaMemPrefetchAsync(blurred_large, sizeof(float) * image_width * image_width, 0, s2);
+        cudaMemPrefetchAsync(blurred_unsharpen, sizeof(float) * image_width * image_width, 0, s3);
+    }
+    // Blur - Small;
+    gaussian_blur<<<grid_size_2d, block_size_2d, kernel_small_diameter * kernel_small_diameter * sizeof(float), s1>>>(image, blurred_small, image_width, image_width, kernel_small, kernel_small_diameter);
+    // Blur - Large;
+    gaussian_blur<<<grid_size_2d, block_size_2d, kernel_large_diameter * kernel_large_diameter * sizeof(float), s2>>>(image, blurred_large, image_width, image_width, kernel_large, kernel_large_diameter);
+    // Blur - Unsharpen;
+    gaussian_blur<<<grid_size_2d, block_size_2d, kernel_unsharpen_diameter * kernel_unsharpen_diameter * sizeof(float), s3>>>(image, blurred_unsharpen, image_width, image_width, kernel_unsharpen, kernel_unsharpen_diameter);
+    // Sobel filter (edge detection);
+    sobel<<<grid_size_2d, block_size_2d, 0, s1>>>(blurred_small, mask_small, image_width, image_width);
+    sobel<<<grid_size_2d, block_size_2d, 0, s2>>>(blurred_large, mask_large, image_width, image_width);
+
+    // Max-min + combine to normalize Sobel on small mask;
+    cudaEvent_t e_ss, e_min1;
+    cudaEventCreate(&e_ss);
+    cudaEventCreate(&e_min1);
+    cudaEventRecord(e_ss, s1); // Wait end of Sobel on small mask;
+    maximum_kernel<<<grid_size_1d, block_size_1d, 0, s1>>>(mask_small, maximum_1, image_width * image_width);
+    cudaStreamWaitEvent(s4, e_ss, 0);
+    minimum_kernel<<<grid_size_1d, block_size_1d, 0, s4>>>(mask_small, minimum_1, image_width * image_width);
+    cudaEventRecord(e_min1, s4);
+    cudaStreamWaitEvent(s1, e_min1, 0); // Wait min;
+    extend<<<grid_size_1d, block_size_1d, 0, s1>>>(mask_small, minimum_1, maximum_1, image_width * image_width, 1);
+
+    // Max-min + combine to normalize Sobel on large mask;
+    cudaEvent_t e_sl, e_min2;
+    cudaEventCreate(&e_sl);
+    cudaEventCreate(&e_min2);
+    cudaEventRecord(e_sl, s2);
+    maximum_kernel<<<grid_size_1d, block_size_1d, 0, s2>>>(mask_large, maximum_2, image_width * image_width);
+    cudaStreamWaitEvent(s5, e_sl, 0); // Wait end of Sobel on large mask;
+    minimum_kernel<<<grid_size_1d, block_size_1d, 0, s5>>>(mask_large, minimum_2, image_width * image_width);
+    cudaEventRecord(e_min2, s5);
+    cudaStreamWaitEvent(s2, e_min2, 0); // Wait min;
+    extend<<<grid_size_1d, block_size_1d, 0, s2>>>(mask_large, minimum_2, maximum_2, image_width * image_width, 5);
+
+    // Unsharpen;
+    unsharpen<<<grid_size_1d, block_size_1d, 0, s3>>>(image, blurred_unsharpen, image_unsharpen, unsharpen_amount, image_width * image_width);
+
+    // Combine results;
+    cudaEvent_t e_un, e_co;
+    cudaEventCreate(&e_un);
+    cudaEventCreate(&e_co);
+    cudaEventRecord(e_un, s3);
+    cudaStreamWaitEvent(s2, e_un, 0);
+    combine<<<grid_size_1d, block_size_1d, 0, s2>>>(image_unsharpen, blurred_large, mask_large, image2, image_width * image_width);
+    cudaEventRecord(e_co, s2);
+    cudaStreamWaitEvent(s1, e_co, 0);
+    if (!pascalGpu || stream_attach) {
+        cudaStreamAttachMemAsync(s1, image2, 0);
+    }
+    if (pascalGpu && do_prefetch) {
+        cudaMemPrefetchAsync(image3, image_width * image_width * sizeof(float), 0, s1);
+    }
+    combine_lut<<<grid_size_1d, block_size_1d, 0, s1>>>(image2, blurred_small, mask_small, image3, image_width * image_width, lut[channel]);
+
+    cudaStreamSynchronize(s1);
+}
+
+//////////////////////////////
+// Utility functions /////////
+//////////////////////////////
+
+std::string ImagePipeline::print_result(bool short_form) {
+    if (short_form) {
+        return std::to_string(image3[0]);
+    } else {
+        std::string res = "[";
+        for (int j = 0; j < 10; j++) {
+            res += std::to_string(image3[j]) + ", ";
+        }
+        return res + ", ...]";
+    }
+}
+
+//////////////////////////////
+// Main execution ////////////
+//////////////////////////////
+
+void ImagePipeline::run_inner(unsigned char* input_image, int channel) {
+    auto start_tot = clock_type::now();
+    auto start_tmp = clock_type::now();
+    auto end_tmp = clock_type::now();
+
+    // Allocation;
+    start_tmp = clock_type::now();
+    alloc();
+    end_tmp = clock_type::now();
+    if (debug && err) std::cout << "error=" << err << std::endl;
+    if (debug) std::cout << "allocation time=" << chrono::duration_cast<chrono::microseconds>(end_tmp - start_tmp).count() / 1000 << " ms" << std::endl;
+
+    // Initialization;
+    start_tmp = clock_type::now();
+    init(input_image, channel);
+    end_tmp = clock_type::now();
+    if (debug && err) std::cout << "error=" << err << std::endl;
+    if (debug) std::cout << "initialization time=" << chrono::duration_cast<chrono::microseconds>(end_tmp - start_tmp).count() / 1000 << " ms" << std::endl;
+
+    // Execution;
+    start_tmp = clock_type::now();
+    switch (policy) {
+        case Policy::Sync:
+            execute_sync(channel);
+            break;
+        default:
+            execute_async(channel);
+    }
+    if (debug && err) std::cout << " error=" << err << std::endl;
+    end_tmp = clock_type::now();
+    auto exec_time = chrono::duration_cast<chrono::microseconds>(end_tmp - start_tmp).count();
+    if (debug) {
+        std::cout << " result=" << print_result() << std::endl;
+        std::cout << " execution=" << float(exec_time) / 1000 << " ms" << std::endl;
+    }
+
+    // Copy back data;
+    start_tmp = clock_type::now();
+    for (int i = 0; i < image_width * image_width; i++) {
+        input_image[black_and_white ? i : (i * 3 + channel)] = (unsigned char) (image3[i]);
+    }
+    end_tmp = clock_type::now();
+    auto write_time = chrono::duration_cast<chrono::microseconds>(end_tmp - start_tmp).count();
+    if (debug) std::cout << "writeback time=" << write_time / 1e6 << " sec" << std::endl;
+
+    // End of the computation;
+    auto end_time = chrono::duration_cast<chrono::microseconds>(clock_type::now() - start_tot).count();
+    if (debug) std::cout << "\ntotal processing time=" << end_time / 1e6 << " sec" << std::endl;
+}
+
+void ImagePipeline::run(unsigned char* input_image) {
+
+    int deviceCount = 1;
+    cudaGetDeviceCount(&deviceCount);
+    if (deviceCount >= 4) {
+        cudaSetDevice(3);
+    }
+
+    for (int channel = 0; channel < (black_and_white ? 1 : 3); channel++) {
+        // Access individual channels;
+        run_inner(input_image, channel);
+    }
+}
+
diff --git a/demos/image_pipeline_local/cuda/image_pipeline.cuh b/demos/image_pipeline_local/cuda/image_pipeline.cuh
new file mode 100644
index 00000000..519a8c2a
--- /dev/null
+++ b/demos/image_pipeline_local/cuda/image_pipeline.cuh
@@ -0,0 +1,250 @@
+// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+// * Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+// * Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+// * Neither the name of NECSTLab nor the names of its
+// contributors may be used to endorse or promote products derived
+// from this software without specific prior written permission.
+// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include +#include +#include +#include +#include +#include "options.hpp" +#include "utils.hpp" + +#define CDEPTH 256 + +namespace chrono = std::chrono; +using clock_type = chrono::high_resolution_clock; + +class ImagePipeline { + public: + ImagePipeline(Options &options) : debug(options.debug), + black_and_white(options.black_and_white), + image_width(options.resized_image_width), + do_prefetch(options.prefetch), + stream_attach(options.stream_attach), + policy(options.policy_choice) { + if (debug) { + std::cout << "------------------------------" << std::endl; + std::cout << "- policy=" << options.policy_map[policy] << std::endl; + std::cout << "- block size 1d=" << options.block_size_1d << std::endl; + std::cout << "- block size 2d=" << options.block_size_2d << std::endl; + std::cout << "- num blocks=" << options.num_blocks << std::endl; + std::cout << "------------------------------" << std::endl; + } + grid_size_2d = dim3(options.num_blocks, options.num_blocks); + 
grid_size_1d = dim3(options.num_blocks * 2); + block_size_2d = dim3(options.block_size_2d, options.block_size_2d); + block_size_1d = dim3(options.block_size_1d); + } + std::string print_result(bool short_form = false); + + // Main execution functions; + void run(unsigned char* input_image); + + private: + + // Instance-specific settings; + bool black_and_white = DEFAULT_BLACK_AND_WHITE; // Convert image to black and white; + int image_width = DEFAULT_RESIZED_IMAGE_WIDTH; + + // General configuration settings; + int debug = DEBUG; + bool do_prefetch = DEFAULT_PREFETCH; + bool stream_attach = DEFAULT_STREAM_ATTACH; + int pascalGpu = 1; + Policy policy; + int err = 0; + dim3 grid_size_2d; + dim3 grid_size_1d; + dim3 block_size_2d; + dim3 block_size_1d; + + // Computation-specific settings; + int kernel_small_diameter = 7; + int kernel_large_diameter = 9; + int kernel_unsharpen_diameter = 7; + float kernel_small_variance = 0.1; + float kernel_large_variance = 20; + float kernel_unsharpen_variance = 5; + float unsharpen_amount = 30; + + // GPU data; + int *image, *image3; + float *image2, *image_unsharpen, *mask_small, *mask_large, *blurred_small, *blurred_large, *blurred_unsharpen; + float *kernel_small, *kernel_large, *kernel_unsharpen, *maximum_1, *minimum_1, *maximum_2, *minimum_2; + int *lut[3]; + cudaStream_t s1, s2, s3, s4, s5; + + // Utility functions; + void alloc(); + void init(unsigned char* input_image, int channel); + void execute_sync(int channel); + void execute_async(int channel); + void run_inner(unsigned char* input_image, int channel); + + inline void gaussian_kernel(float *kernel, int diameter, float sigma) { + int mean = diameter / 2; + float sum_tmp = 0; + for (int i = 0; i < diameter; i++) { + for (int j = 0; j < diameter; j++) { + kernel[i * diameter + j] = exp(-0.5 * ((i - mean) * (i - mean) + (j - mean) * (j - mean)) / (sigma * sigma)); + sum_tmp += kernel[i * diameter + j]; + } + } + for (int i = 0; i < diameter; i++) { + for (int j = 0; j < 
diameter; j++) {
+                kernel[i * diameter + j] /= sum_tmp;
+            }
+        }
+    }
+
+    // Bézier curve defined by 3 points.
+    // The input is used to map points of the curve to the output LUT,
+    // and can be used to combine multiple LUTs.
+    // By default, it is just [0, 1, ..., 255];
+    inline void spline3(int *input, int *lut, float P[3]) {
+        for (int i = 0; i < CDEPTH; i++) {
+            float t = float(i) / CDEPTH;
+            float x = powf((1 - t), 2) * P[0] + 2 * t * (1 - t) * P[1] + powf(t, 2) * P[2];
+            lut[i] = input[int(x * CDEPTH)];
+        }
+    }
+
+    // Bézier curve defined by 5 points;
+    inline void spline5(int *input, int *lut, float P[5]) {
+        for (int i = 0; i < CDEPTH; i++) {
+            float t = float(i) / CDEPTH;
+            float x = powf((1 - t), 4) * P[0] + 4 * t * powf((1 - t), 3) * P[1] + 6 * powf(t, 2) * powf((1 - t), 2) * P[2] + 4 * powf(t, 3) * (1 - t) * P[3] + powf(t, 4) * P[4];
+            lut[i] = input[int(x * CDEPTH)];
+        }
+    }
+
+    inline void lut_r(int* lut) {
+        // Create a temporary LUT to swap values;
+        int *lut_tmp = (int*) malloc(sizeof(int) * CDEPTH);
+        // Initialize LUT;
+        for (int i = 0; i < CDEPTH; i++) {
+            lut[i] = i;
+        }
+        // Apply 1st curve;
+        float P[3] = {0.0, 0.2, 1.0};
+        spline3(lut, lut_tmp, P);
+        // Apply 2nd curve;
+        float P2[5] = {0.0, 0.3, 0.5, 0.99, 1};
+        spline5(lut_tmp, lut, P2);
+        free(lut_tmp);
+    }
+
+    inline void lut_g(int* lut) {
+        // Create a temporary LUT to swap values;
+        int *lut_tmp = (int*) malloc(sizeof(int) * CDEPTH);
+        // Initialize LUT;
+        for (int i = 0; i < CDEPTH; i++) {
+            lut[i] = i;
+        }
+        // Apply 1st curve;
+        float P[5] = {0.0, 0.01, 0.5, 0.99, 1};
+        spline5(lut, lut_tmp, P);
+        // Apply 2nd curve;
+        float P2[5] = {0.0, 0.1, 0.5, 0.75, 1};
+        spline5(lut_tmp, lut, P2);
+        free(lut_tmp);
+    }
+
+    inline void lut_b(int* lut) {
+        // Create a temporary LUT to swap values;
+        int *lut_tmp = (int*) malloc(sizeof(int) * CDEPTH);
+        // Initialize LUT;
+        for (int i = 0; i < CDEPTH; i++) {
+            lut[i] = i;
+        }
+        // Apply 1st curve;
+        float P[5] = {0.0, 0.01, 0.5, 0.99, 1};
spline5(lut, lut_tmp, P); + // Apply 2nd curve; + float P2[5] = {0.0, 0.25, 0.5, 0.70, 1}; + spline5(lut_tmp, lut, P2); + free(lut_tmp); + } + +// Outdated LUTs; +// #define FACTOR 0.8 +// inline void lut_r(int* lut) { +// for (int i = 0; i < CDEPTH; i++) { +// float x = float(i) / CDEPTH; +// float y = x; +// // if (i < CDEPTH / 2) { +// // y = 0.8 * (1 / (1 + expf(-x + 0.5) * 7 * FACTOR)) + 0.2; +// // } else { +// y = 1 / (1 + expf((-x + 0.5) * 7 * FACTOR)); +// // } +// lut[i] = std::min(CDEPTH - 1, int(255 * y)); +// } +// } + +// inline void lut_g(int* lut) { +// for (int i = 0; i < CDEPTH; i++) { +// float x = float(i) / CDEPTH; +// float y = x; +// // if (i < CDEPTH / 2) { +// // y = 0.7 * (1 / (1 + expf(-x + 0.5) * 10 * FACTOR)) + 0.3; +// // } else { +// y = 1 / (1 + expf((-x + 0.5) * 10 * FACTOR)); +// // } +// lut[i] = std::min(CDEPTH - 1, int(255 * powf(y, 1.6))); +// } +// } + +// inline void lut_b(int* lut) { +// for (int i = 0; i < CDEPTH; i++) { +// float x = float(i) / CDEPTH; +// float y = x; +// // if (i < CDEPTH / 2) { +// // y = 0.8 * (1 / (1 + expf(-x + 0.5) * 10 * FACTOR)) + 0.2; +// // } else { +// y = 1 / (1 + expf((-x + 0.5) * 9 * FACTOR)); +// // } +// lut[i] = std::min(CDEPTH - 1, int(255 * powf(y, 1.4))); +// } +// } + +// img_out = img.copy() +// lut_b = lambda x: 0.7 * (1 / (1 + np.exp((-x + 0.5) * 10))) + 0.3 if x < 0.5 else 1 / (1 + np.exp((-x + 0.5) * 10)) +// lut_r = lambda x: 0.8 * (1 / (1 + np.exp((-x + 0.5) * 7))) + 0.2 if x < 0.5 else (1 / (1 + np.exp((-x + 0.5) * 7))) +// lut_g = lambda x: 0.8 * (1 / (1 + np.exp((-x + 0.5) * 10))) + 0.2 if x < 0.5 else (1 / (1 + np.exp((-x + 0.5) * 9))) +// lut_g2 = lambda x: x**1.4 +// lut_b2 = lambda x: x**1.6 +// img_out[:, :, 0] = np.vectorize(lut_b)(img[:, :, 0]) +// img_out[:, :, 1] = np.vectorize(lut_g)(img[:, :, 1]) +// img_out[:, :, 2] = np.vectorize(lut_r)(img[:, :, 2]) + +// img_out[:, :, 1] = np.vectorize(lut_g2)(img_out[:, :, 1]) +// img_out[:, :, 0] = 
np.vectorize(lut_b2)(img_out[:, :, 0]) + +}; diff --git a/demos/image_pipeline_local/cuda/main.cpp b/demos/image_pipeline_local/cuda/main.cpp new file mode 100644 index 00000000..25910b34 --- /dev/null +++ b/demos/image_pipeline_local/cuda/main.cpp @@ -0,0 +1,52 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include +#include +#include // For time() +#include "options.hpp" +#include "opencv_interface.hpp" +#include "image_pipeline.cuh" + +// Remember to turn persistence-mode on for GPUs: sudo nvidia-smi -pm 1 +// Install OpenCV following: https://docs.opencv.org/master/d7/d9f/tutorial_linux_install.html + +int main(int argc, char *argv[]) +{ + Options options = Options(argc, argv); + OpenCVInterface interface = OpenCVInterface(options); + ImagePipeline pipeline = ImagePipeline(options); + + // Read the image; + auto* img = interface.read_input(); + // Process the image; + pipeline.run(img); + // Write the image to output; + interface.write_output(img); +} diff --git a/demos/image_pipeline_local/cuda/opencv_interface.cpp b/demos/image_pipeline_local/cuda/opencv_interface.cpp new file mode 100644 index 00000000..613472f6 --- /dev/null +++ b/demos/image_pipeline_local/cuda/opencv_interface.cpp @@ -0,0 +1,99 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. 
+// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include "opencv_interface.hpp" + +uchar* OpenCVInterface::read_input() { + auto start = clock_type::now(); + // Read image; + std::string input_path; + if (!full_input_path.empty()) { + input_path = full_input_path; + } else { + std::stringstream ss; + ss << "../../" << INPUT_IMAGE_FOLDER << "/" << image_name << ".jpg"; + input_path = ss.str(); + } + image_matrix = imread(input_path, black_and_white ? 
cv::IMREAD_GRAYSCALE : cv::IMREAD_COLOR); + + if (debug) std::cout << "loaded image=" << image_name << " of size " << image_matrix.rows << "x" << image_matrix.cols << std::endl; + // Resize image if necessary; + bool resized = false; + if (image_matrix.rows != image_width || image_matrix.cols != image_width) { + cv::resize(image_matrix, resized_image, cv::Size(image_width, image_width)); + if (debug) std::cout << "resized image to " << image_width << "x" << image_width << std::endl; + resized = true; + } + auto end = clock_type::now(); + if (debug) std::cout << "read image time=" << chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000 << " ms" << std::endl; + + if (resized) { + image_array_length = resized_image.total() * resized_image.channels(); + return resized_image.data; + } else { + image_array_length = image_matrix.total() * image_matrix.channels(); + return image_matrix.data; + } +} + +void OpenCVInterface::write_output_inner(std::string kind, int resize_width) { + // Resize image; + cv::Mat image_matrix_out; + cv::resize(output_matrix, image_matrix_out, cv::Size(resize_width, resize_width)); + // Write to file; + std::vector<int> compression_params; + compression_params.push_back(cv::IMWRITE_JPEG_QUALITY); + compression_params.push_back(80); + + std::string output_path; + if (kind == "small" && !full_output_path_small.empty()) { + output_path = full_output_path_small; + } else if (kind == "large" && !full_output_path_large.empty()) { + output_path = full_output_path_large; + } else if (!image_name.empty()) { + std::stringstream ss; + ss << "../../" << OUTPUT_IMAGE_FOLDER << "/" << image_name << "_" << kind << ".jpg"; + output_path = ss.str(); + } else { + if (debug) std::cout << "error: missing output path or image name, cannot write output" << std::endl; + return; + } + imwrite(output_path, image_matrix_out, compression_params); +} + +void OpenCVInterface::write_output(unsigned char* buffer) { + auto start = clock_type::now(); + // Turn buffer into matrix; + 
output_matrix = cv::Mat(image_width, image_width, black_and_white ? CV_8UC1 : CV_8UC3, buffer); + // Write to output; + write_output_inner("large", RESIZED_IMAGE_WIDTH_OUT_LARGE); + write_output_inner("small", RESIZED_IMAGE_WIDTH_OUT_SMALL); + auto end = clock_type::now(); + if (debug) std::cout << "write image time=" << chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000 << " ms" << std::endl; +} \ No newline at end of file diff --git a/demos/image_pipeline_local/cuda/opencv_interface.hpp b/demos/image_pipeline_local/cuda/opencv_interface.hpp new file mode 100644 index 00000000..17b36097 --- /dev/null +++ b/demos/image_pipeline_local/cuda/opencv_interface.hpp @@ -0,0 +1,104 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include +#include +#include +#include +#include "opencv2/core.hpp" +#include "opencv2/imgcodecs.hpp" +#include + +#include "options.hpp" + +namespace chrono = std::chrono; +using clock_type = chrono::high_resolution_clock; + +class OpenCVInterface { + public: + OpenCVInterface(Options &options) : debug(options.debug), + image_name(options.input_image), + full_input_path(options.full_input_path), + full_output_path_small(options.full_output_path_small), + full_output_path_large(options.full_output_path_large), + black_and_white(options.black_and_white), + image_width(options.resized_image_width) { + + // Validate input/output values; + if ((full_input_path.empty() || full_output_path_small.empty() || full_output_path_large.empty()) && image_name.empty()) { + if (debug) std::cout << "error: you must specify the name of an image in " << INPUT_IMAGE_FOLDER << + " (without extension) or specify the full input and output path of the image" << std::endl; + } + + if (debug) { + std::cout << "------------------------------" << std::endl; + if (!options.full_input_path.empty()) { + std::cout << "- input image path=" << options.full_input_path << std::endl; + } else { + std::cout << "- image name=" << options.input_image << std::endl; + } + std::cout << "- image size=" << image_width << "x" << image_width << std::endl; + std::cout << "- black and white? " << (options.black_and_white ? 
"yes" : "no") << std::endl; + if (!options.full_output_path_small.empty()) { + std::cout << "- ouput image path, small=" << options.full_output_path_small << std::endl; + } + if (!options.full_output_path_large.empty()) { + std::cout << "- ouput image path, large=" << options.full_output_path_large << std::endl; + } + std::cout << "------------------------------" << std::endl; + } + } + + // Main execution functions; + uchar* read_input(); + void write_output(unsigned char* buffer); + int image_array_length; + + private: + + // Instance-specific settings; + std::string image_name; // Input image for the benchmark; + std::string full_input_path; // Optional input/output paths to the image; + std::string full_output_path_small; + std::string full_output_path_large; + bool black_and_white = DEFAULT_BLACK_AND_WHITE; // Convert image to black and white; + int image_width = DEFAULT_RESIZED_IMAGE_WIDTH; + + // General configuration settings; + int debug = DEBUG; + + // OpenCV data; + cv::Mat image_matrix; + cv::Mat resized_image; + cv::Mat output_matrix; + + // Utility functions; + void write_output_inner(std::string kind, int resize_width); +}; diff --git a/demos/image_pipeline_local/cuda/options.hpp b/demos/image_pipeline_local/cuda/options.hpp new file mode 100644 index 00000000..3cc9a054 --- /dev/null +++ b/demos/image_pipeline_local/cuda/options.hpp @@ -0,0 +1,172 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. 
+// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#pragma once + +#include + +#include +#include +#include + +#include "utils.hpp" + +////////////////////////////// +////////////////////////////// + +#define DEBUG false +#define DEFAULT_BLOCK_SIZE_1D 32 +#define DEFAULT_BLOCK_SIZE_2D 8 +#define DEFAULT_NUM_BLOCKS 2 +#define DEFAULT_POLICY "default" +#define DEFAULT_PREFETCH false +#define DEFAULT_STREAM_ATTACH false +#define DEFAULT_BLACK_AND_WHITE false +#define DEFAULT_RESIZED_IMAGE_WIDTH 1024 +#define RESIZED_IMAGE_WIDTH_OUT_SMALL 40 +#define RESIZED_IMAGE_WIDTH_OUT_LARGE DEFAULT_RESIZED_IMAGE_WIDTH + +// Input and output folders for images; +#define INPUT_IMAGE_FOLDER "img_in" +#define OUTPUT_IMAGE_FOLDER "img_out" + +////////////////////////////// +////////////////////////////// + +enum Policy { + Sync, + Async, +}; + +////////////////////////////// +////////////////////////////// + +inline Policy get_policy(std::string policy) { + if (policy == "sync") + return Policy::Sync; + else + return Policy::Async; +} + +struct Options { + // Testing options; + int debug = DEBUG; + int block_size_1d = DEFAULT_BLOCK_SIZE_1D; + int block_size_2d = DEFAULT_BLOCK_SIZE_2D; + int num_blocks = DEFAULT_NUM_BLOCKS; + bool prefetch = DEFAULT_PREFETCH; + bool stream_attach = DEFAULT_STREAM_ATTACH; + Policy policy_choice = get_policy(DEFAULT_POLICY); + + // Input image for the benchmark; + std::string input_image; + // Use black and white processing instead of color processing; + bool black_and_white = DEFAULT_BLACK_AND_WHITE; + // Resize input image to this size; + int resized_image_width = DEFAULT_RESIZED_IMAGE_WIDTH; + + // Optional full input/output paths; + std::string full_input_path; + std::string full_output_path_small; + std::string full_output_path_large; + + // Used for printing; + std::map<Policy, std::string> policy_map; + + ////////////////////////////// + ////////////////////////////// + + Options(int argc, char *argv[]) { + map_init(policy_map)(Policy::Sync, "sync")(Policy::Async, "default"); + + int opt; + static struct option 
long_options[] = {{"debug", no_argument, 0, 'd'}, + {"block_size_1d", required_argument, 0, 'b'}, + {"block_size_2d", required_argument, 0, 'c'}, + {"num_blocks", required_argument, 0, 'g'}, + {"policy", required_argument, 0, 'p'}, + {"prefetch", no_argument, 0, 'r'}, + {"attach", no_argument, 0, 'a'}, + {"input", required_argument, 0, 'i'}, + {"bw", no_argument, 0, 'w'}, + {"resized_image_width", required_argument, 0, 'n'}, + {"full_input_path", required_argument, 0, 'f'}, + {"full_output_path_small", required_argument, 0, 's'}, + {"full_output_path_large", required_argument, 0, 'l'}, + {0, 0, 0, 0}}; + // getopt_long stores the option index here; + int option_index = 0; + + while ((opt = getopt_long(argc, argv, "db:c:g:p:rai:wn:f:s:l:", long_options, &option_index)) != EOF) { + switch (opt) { + case 'd': + debug = true; + break; + case 'b': + block_size_1d = atoi(optarg); + break; + case 'c': + block_size_2d = atoi(optarg); + break; + case 'g': + num_blocks = atoi(optarg); + break; + case 'p': + policy_choice = get_policy(optarg); + break; + case 'r': + prefetch = true; + break; + case 'a': + stream_attach = true; + break; + case 'i': + input_image = optarg; + break; + case 'w': + black_and_white = true; + break; + case 'n': + resized_image_width = atoi(optarg); + break; + case 'f': + full_input_path = optarg; + break; + case 's': + full_output_path_small = optarg; + break; + case 'l': + full_output_path_large = optarg; + break; + default: + break; + } + } + } +}; diff --git a/demos/image_pipeline_local/cuda/utils.hpp b/demos/image_pipeline_local/cuda/utils.hpp new file mode 100644 index 00000000..225e41f7 --- /dev/null +++ b/demos/image_pipeline_local/cuda/utils.hpp @@ -0,0 +1,65 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#pragma once + +#include +#include +#include +#include +#include "dvrapi_error_string.h" + +#define checkCudaErrors(err) __checkCudaErrors(err, __FILE__, __LINE__) + +// These are the inline versions for all of the SDK helper functions +inline void __checkCudaErrors(int err, const char *file, const int line) { + if (0 != err) { + fprintf(stderr, + "checkCudaErrors() Driver API error = %04d \"%s\" from file <%s>, " + "line %i.\n", + err, getCudaDrvErrorString(err), file, line); + exit(EXIT_FAILURE); + } +} + +template <typename T> struct map_init_helper +{ + T& data; + map_init_helper(T& d) : data(d) {} + map_init_helper& operator() (typename T::key_type const& key, typename T::mapped_type const& value) + { + data[key] = value; + return *this; + } +}; + +template <typename T> map_init_helper<T> map_init(T& item) +{ + return map_init_helper<T>(item); +} diff --git a/demos/image_pipeline_local/cuda_kernels.js b/demos/image_pipeline_local/cuda_kernels.js new file mode 100644 index 00000000..cdbac178 --- /dev/null +++ b/demos/image_pipeline_local/cuda_kernels.js @@ -0,0 +1,238 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +GAUSSIAN_BLUR = ` +extern "C" __global__ void gaussian_blur(const int *image, float *result, int rows, int cols, const float* kernel, int diameter) { + extern __shared__ float kernel_local[]; + for(int i = threadIdx.x; i < diameter; i += blockDim.x) { + for(int j = threadIdx.y; j < diameter; j += blockDim.y) { + kernel_local[i * diameter + j] = kernel[i * diameter + j]; + } + } + __syncthreads(); + + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum = 0; + int radius = diameter / 2; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + sum += kernel_local[(x + radius) * diameter + (y + radius)] * (float(image[nx * cols + ny]) / 255); + } + } + } + result[i * cols + j] = sum; + } + } +} +` + +SOBEL = ` +extern "C" __global__ 
void sobel(float *image, float *result, int rows, int cols) { + // int SOBEL_X[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}}; + // int SOBEL_Y[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}; + __shared__ int SOBEL_X[9]; + __shared__ int SOBEL_Y[9]; + if (threadIdx.x == 0 && threadIdx.y == 0) { + SOBEL_X[0] = -1; + SOBEL_X[1] = -2; + SOBEL_X[2] = -1; + SOBEL_X[3] = 0; + SOBEL_X[4] = 0; + SOBEL_X[5] = 0; + SOBEL_X[6] = 1; + SOBEL_X[7] = 2; + SOBEL_X[8] = 1; + + SOBEL_Y[0] = -1; + SOBEL_Y[1] = 0; + SOBEL_Y[2] = 1; + SOBEL_Y[3] = -2; + SOBEL_Y[4] = 0; + SOBEL_Y[5] = 2; + SOBEL_Y[6] = -1; + SOBEL_Y[7] = 0; + SOBEL_Y[8] = 1; + } + __syncthreads(); + + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum_gradient_x = 0.0, sum_gradient_y = 0.0; + int radius = 1; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + float neighbour = image[nx * cols + ny]; + int s = (x + radius) * 3 + y + radius; + sum_gradient_x += SOBEL_X[s] * neighbour; + sum_gradient_y += SOBEL_Y[s] * neighbour; + } + } + } + result[i * cols + j] = sqrt(sum_gradient_x * sum_gradient_x + sum_gradient_y * sum_gradient_y); + } + } +} +` + +EXTEND_MASK = ` +__device__ float atomicMinf(float* address, float val) +{ + int *address_as_int =(int*)address; + int old = *address_as_int, assumed; + while (val < __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, + __float_as_int(val)); + } + return __int_as_float(old); +} + +__device__ float atomicMaxf(float* address, float val) +{ + int *address_as_int = (int*) address; + int old = *address_as_int, assumed; + // If val is smaller than current, don't do anything, else update the current value atomically; + while (val > __int_as_float(old)) { + assumed = old; + old = 
atomicCAS(address_as_int, assumed, __float_as_int(val)); + } + return __int_as_float(old); +} + +__inline__ __device__ float warp_reduce_max(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = max(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +__inline__ __device__ float warp_reduce_min(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +extern "C" __global__ void maximum(float *in, float* out, int N) { + int warp_size = 32; + float maximum = -1000; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + maximum = max(maximum, in[i]); + } + maximum = warp_reduce_max(maximum); // Obtain the max of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicMaxf(out, maximum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void minimum(float *in, float* out, int N) { + int warp_size = 32; + float minimum = 1000; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + minimum = min(minimum, in[i]); + } + minimum = warp_reduce_min(minimum); // Obtain the min of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicMinf(out, minimum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void extend(float *x, const float *minimum, const float *maximum, int n, int extend_factor) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = extend_factor * (x[i] - *minimum) / (*maximum - *minimum); + x[i] = res_tmp > 1 ? 
1 : res_tmp; + } +} +` + +UNSHARPEN = ` +extern "C" __global__ void unsharpen(const int *x, const float *y, float *res, float amount, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = (float(x[i]) / 255) * (1 + amount) - y[i] * amount; + res_tmp = res_tmp > 1 ? 1 : res_tmp; + res[i] = res_tmp < 0 ? 0 : res_tmp; + } +} +` + +COMBINE = ` +extern "C" __global__ void combine(const float *x, const float *y, const float *mask, float *res, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + res[i] = x[i] * mask[i] + y[i] * (1 - mask[i]); + } +} +` + +COMBINE_2 = ` +extern "C" __global__ void combine_lut(const float *x, const float *y, const float *mask, int *res, int n, int* lut) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + int res_tmp = min(256 - 1, int(256 * (x[i] * mask[i] + y[i] * (1 - mask[i])))); + res[i] = lut[res_tmp]; + } +} +` + +RESET = ` +extern "C" __global__ void reset(float *x, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + x[i] = 0.0; + } +} +` + +INT_TO_FLOAT = ` +extern "C" __global__ void int_to_float(const int *x, float *y, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = float(x[i]) / 255; + } +} +` + +FLOAT_TO_INT = ` +extern "C" __global__ void float_to_int(const float *x, int *y, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = int(x[i] * 255); + } +} +` + +exports.GAUSSIAN_BLUR = GAUSSIAN_BLUR; +exports.SOBEL = SOBEL; +exports.EXTEND_MASK = EXTEND_MASK; +exports.UNSHARPEN = UNSHARPEN; +exports.COMBINE = COMBINE; +exports.COMBINE_2 = COMBINE_2; +exports.INT_TO_FLOAT = INT_TO_FLOAT; +exports.FLOAT_TO_INT = FLOAT_TO_INT; +exports.RESET = RESET; diff --git a/demos/image_pipeline_local/image_pipeline.js 
b/demos/image_pipeline_local/image_pipeline.js new file mode 100644 index 00000000..916068c6 --- /dev/null +++ b/demos/image_pipeline_local/image_pipeline.js @@ -0,0 +1,399 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +// Use Java System to measure time; +const System = Java.type("java.lang.System"); +// Load OpenCV; +const cv = require("opencv4nodejs"); +// Load function to write to file; +const fs = require("fs"); +// Load GrCUDA; +const cu = Polyglot.eval("grcuda", "CU"); +// Load CUDA kernels; +const ck = require("./cuda_kernels.js"); + +///////////////////////////// +///////////////////////////// + +// Convert images to black and white; +const BW = false; +// Edge width (in pixels) of input images. +// If a loaded image has a lower width than this, it is rescaled; +const RESIZED_IMG_WIDTH = 512; +// Edge width (in pixels) of output images. +// We store processed images in 2 variants: small and large; +const RESIZED_IMG_WIDTH_OUT_SMALL = 40; +const RESIZED_IMG_WIDTH_OUT_LARGE = RESIZED_IMG_WIDTH; + +// Build the CUDA kernels; +const GAUSSIAN_BLUR_KERNEL = cu.buildkernel(ck.GAUSSIAN_BLUR, "gaussian_blur", "const pointer, pointer, sint32, sint32, const pointer, sint32"); +const SOBEL_KERNEL = cu.buildkernel(ck.SOBEL, "sobel", "pointer, pointer, sint32, sint32"); +const EXTEND_KERNEL = cu.buildkernel(ck.EXTEND_MASK, "extend", "pointer, const pointer, const pointer, sint32, sint32"); +const MAXIMUM_KERNEL = cu.buildkernel(ck.EXTEND_MASK, "maximum", "const pointer, pointer, sint32"); +const MINIMUM_KERNEL = cu.buildkernel(ck.EXTEND_MASK, "minimum", "const pointer, pointer, sint32"); +const UNSHARPEN_KERNEL = cu.buildkernel(ck.UNSHARPEN, "unsharpen", "const pointer, const pointer, pointer, float, sint32"); +const COMBINE_KERNEL = cu.buildkernel(ck.COMBINE, "combine", "const pointer, const pointer, const pointer, pointer, sint32"); +const COMBINE_KERNEL_LUT = cu.buildkernel(ck.COMBINE_2, "combine_lut", "const pointer, const pointer, const pointer, pointer, sint32, pointer"); + +// Constant parameters used in the image processing; +const KERNEL_SMALL_DIAMETER = 5; +const KERNEL_SMALL_VARIANCE = 0.1; +const KERNEL_LARGE_DIAMETER = 7; +const KERNEL_LARGE_VARIANCE = 20; +const 
KERNEL_UNSHARPEN_DIAMETER = 5; +const KERNEL_UNSHARPEN_VARIANCE = 5; +const UNSHARPEN_AMOUNT = 30; +const CDEPTH = 256; +// CUDA parameters; +const BLOCKS = 2; +const THREADS_1D = 32; +const THREADS_2D = 8; + +///////////////////////////// +// Utility functions //////// +///////////////////////////// + +function intervalToMs(start, end) { + return (end - start) / 1e6; +} + +function gaussian_kernel(buffer, diameter, sigma) { + const mean = diameter / 2; + let sum_tmp = 0; + for (let x = 0; x < diameter; x++) { + for (let y = 0; y < diameter; y++) { + const val = Math.exp(-0.5 * (Math.pow(x - mean, 2) + Math.pow(y - mean, 2)) / Math.pow(sigma, 2)); + buffer[x][y] = val; + sum_tmp += val; + } + } + // Normalize; + for (let x = 0; x < diameter; x++) { + for (let y = 0; y < diameter; y++) { + buffer[x][y] /= sum_tmp; + } + } +} + +// Outdated LUTs; +// const FACTOR = 0.8 +// function lut_r(lut) { +// for (let i = 0; i < CDEPTH; i++) { +// x = i / CDEPTH; +// if (i < CDEPTH / 2) { +// lut[i] = Math.min(CDEPTH - 1, 255 * (0.8 * (1 / (1 + Math.exp(-x + 0.5) * 7 * FACTOR)) + 0.2)) >> 0; +// } else { +// lut[i] = Math.min(CDEPTH - 1, 255 * (1 / (1 + Math.exp((-x + 0.5) * 7 * FACTOR)))) >> 0; +// } +// } +// } + +// function lut_g(lut) { +// for (let i = 0; i < CDEPTH; i++) { +// x = i / CDEPTH; +// y = 0; +// if (i < CDEPTH / 2) { +// y = 0.8 * (1 / (1 + Math.exp(-x + 0.5) * 10 * FACTOR)) + 0.2; +// } else { +// y = 1 / (1 + Math.exp((-x + 0.5) * 9 * FACTOR)); +// } +// lut[i] = Math.min(CDEPTH - 1, 255 * Math.pow(y, 1.4)) >> 0; +// } +// } + +// function lut_b(lut) { +// for (let i = 0; i < CDEPTH; i++) { +// x = i / CDEPTH; +// y = 0; +// if (i < CDEPTH / 2) { +// y = 0.7 * (1 / (1 + Math.exp(-x + 0.5) * 10 * FACTOR)) + 0.3; +// } else { +// y = 1 / (1 + Math.exp((-x + 0.5) * 10 * FACTOR)); +// } +// lut[i] = Math.min(CDEPTH - 1, 255 * Math.pow(y, 1.6)) >> 0; +// } +// } + +// Bézier curve defined by 3 points. 
+// The input is used to map points of the curve to the output LUT, +// and can be used to combine multiple LUTs. +// By default, it is just [0, 1, ..., 255]; +function spline3(input, lut, P) { + for (let i = 0; i < CDEPTH; i++) { + let t = i / CDEPTH; + let x = Math.pow((1 - t), 2) * P[0] + 2 * t * (1 - t) * P[1] + Math.pow(t, 2) * P[2]; + lut[i] = input[(x * CDEPTH) >> 0]; // >> 0 is an evil hack to cast float to int; + } +} + +// Bézier curve defined by 5 points; +function spline5(input, lut, P) { + for (let i = 0; i < CDEPTH; i++) { + let t = i / CDEPTH; + let x = Math.pow((1 - t), 4) * P[0] + 4 * t * Math.pow((1 - t), 3) * P[1] + 6 * Math.pow(t, 2) * Math.pow((1 - t), 2) * P[2] + 4 * Math.pow(t, 3) * (1 - t) * P[3] + Math.pow(t, 4) * P[4]; + lut[i] = input[(x * CDEPTH) >> 0]; + } +} + +function lut_r(lut) { + // Create a temporary LUT to swap values; + let lut_tmp = new Array(CDEPTH).fill(0); + + // Initialize LUT; + for (let i = 0; i < CDEPTH; i++) { + lut[i] = i; + } + // Apply 1st curve; + const P = [0.0, 0.2, 1.0]; + spline3(lut, lut_tmp, P); + // Apply 2nd curve; + const P2 = [0.0, 0.3, 0.5, 0.99, 1]; + spline5(lut_tmp, lut, P2); +} + +function lut_g(lut) { + // Create a temporary LUT to swap values; + let lut_tmp = new Array(CDEPTH).fill(0); + // Initialize LUT; + for (let i = 0; i < CDEPTH; i++) { + lut[i] = i; + } + // Apply 1st curve; + const P = [0.0, 0.01, 0.5, 0.99, 1]; + spline5(lut, lut_tmp, P); + // Apply 2nd curve; + const P2 = [0.0, 0.1, 0.5, 0.75, 1]; + spline5(lut_tmp, lut, P2); +} + +function lut_b(lut) { + // Create a temporary LUT to swap values; + let lut_tmp = new Array(CDEPTH).fill(0); + // Initialize LUT; + for (let i = 0; i < CDEPTH; i++) { + lut[i] = i; + } + // Apply 1st curve; + const P = [0.0, 0.01, 0.5, 0.99, 1]; + spline5(lut, lut_tmp, P); + // Apply 2nd curve; + const P2 = [0.0, 0.25, 0.5, 0.70, 1]; + spline5(lut_tmp, lut, P2); +} + +// Initialize LUTs; +const LUT = [new Array(CDEPTH).fill(0), new Array(CDEPTH).fill(0), new 
Array(CDEPTH).fill(0)];
+lut_r(LUT[0]);
+lut_g(LUT[1]);
+lut_b(LUT[2]);
+
+async function storeImageInner(img, imgName, resolution, kind) {
+
+    if (kind == "small") {
+        const imgResized = img.resize(resolution, resolution);
+        const buffer = await cv.imencodeAsync('.jpg', imgResized, [cv.IMWRITE_JPEG_QUALITY, 40])
+        fs.writeFileSync("img_out/" + imgName + "_" + kind + ".jpg", buffer);
+    } else {
+        const buffer = await cv.imencodeAsync('.jpg', img, [cv.IMWRITE_JPEG_QUALITY, 40])
+        fs.writeFileSync("img_out/" + imgName + "_" + kind + ".jpg", buffer);
+    }
+
+}
+
+/////////////////////////////
+// Main computations ////////
+/////////////////////////////
+
+// Load and preprocess an image, return it as a matrix;
+async function loadImage(imgName) {
+    return cv.imreadAsync("img_in/" + imgName + ".jpg", BW ? cv.IMREAD_GRAYSCALE : cv.IMREAD_COLOR)
+        .then(img => {
+            // Resize input;
+            return img; // .resize(RESIZED_IMG_WIDTH, RESIZED_IMG_WIDTH);
+        });
+}
+
+function copy_array(x, y) {
+    let i = y.length;
+    while(i--) x[i] = y[i];
+}
+
+// Main processing of the image;
+async function processImage(img, size, channel) {
+    // Allocate image data;
+    const image = cu.DeviceArray("int", size * size);
+    const image2 = cu.DeviceArray("float", size, size);
+    const image3 = cu.DeviceArray("int", size * size);
+
+    const kernel_small = cu.DeviceArray("float", KERNEL_SMALL_DIAMETER, KERNEL_SMALL_DIAMETER);
+    const kernel_large = cu.DeviceArray("float", KERNEL_LARGE_DIAMETER, KERNEL_LARGE_DIAMETER);
+    const kernel_unsharpen = cu.DeviceArray("float", KERNEL_UNSHARPEN_DIAMETER, KERNEL_UNSHARPEN_DIAMETER);
+
+    const maximum_1 = cu.DeviceArray("float", 1);
+    const minimum_1 = cu.DeviceArray("float", 1);
+    const maximum_2 = cu.DeviceArray("float", 1);
+    const minimum_2 = cu.DeviceArray("float", 1);
+
+    const mask_small = cu.DeviceArray("float", size, size);
+    const mask_large = cu.DeviceArray("float", size, size);
+    const image_unsharpen = cu.DeviceArray("float", size, size);
+
+    const 
blurred_small = cu.DeviceArray("float", size, size); + const blurred_large = cu.DeviceArray("float", size, size); + const blurred_unsharpen = cu.DeviceArray("float", size, size); + + const lut = cu.DeviceArray("int", CDEPTH); + + const s0 = System.nanoTime(); + // Initialize the LUT; + copy_array(LUT[channel], lut); + const e0 = System.nanoTime(); + console.log("--lut=" + intervalToMs(s0, e0) + " ms"); + + // Fill the image data; + const s1 = System.nanoTime(); + // image.copyFrom(img, size * size); + copy_array(image, img); + const e1 = System.nanoTime(); + console.log("--img to device array=" + intervalToMs(s1, e1) + " ms"); + + const start = System.nanoTime(); + + // Create Gaussian kernels; + gaussian_kernel(kernel_small, KERNEL_SMALL_DIAMETER, KERNEL_SMALL_VARIANCE); + gaussian_kernel(kernel_large, KERNEL_LARGE_DIAMETER, KERNEL_LARGE_VARIANCE); + gaussian_kernel(kernel_unsharpen, KERNEL_UNSHARPEN_DIAMETER, KERNEL_UNSHARPEN_VARIANCE); + + // Main GPU computation; + // Blur - Small; + GAUSSIAN_BLUR_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D], 4 * KERNEL_SMALL_DIAMETER * KERNEL_SMALL_DIAMETER)( + image, blurred_small, size, size, kernel_small, KERNEL_SMALL_DIAMETER); + // Blur - Large; + GAUSSIAN_BLUR_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D], 4 * KERNEL_LARGE_DIAMETER * KERNEL_LARGE_DIAMETER)( + image, blurred_large, size, size, kernel_large, KERNEL_LARGE_DIAMETER); + // Blur - Unsharpen; + GAUSSIAN_BLUR_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D], 4 * KERNEL_UNSHARPEN_DIAMETER * KERNEL_UNSHARPEN_DIAMETER)( + image, blurred_unsharpen, size, size, kernel_unsharpen, KERNEL_UNSHARPEN_DIAMETER); + // Sobel filter (edge detection); + SOBEL_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D])( + blurred_small, mask_small, size, size); + SOBEL_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D])( + blurred_large, mask_large, size, size); + // Ensure that the output of Sobel is in [0, 1]; + MAXIMUM_KERNEL(BLOCKS * 2, THREADS_1D)(mask_small, maximum_1, 
size * size); + MINIMUM_KERNEL(BLOCKS * 2, THREADS_1D)(mask_small, minimum_1, size * size); + EXTEND_KERNEL(BLOCKS * 2, THREADS_1D)(mask_small, minimum_1, maximum_1, size * size, 1); + // Extend large edge detection mask, and normalize it; + MAXIMUM_KERNEL(BLOCKS * 2, THREADS_1D)(mask_large, maximum_2, size * size); + MINIMUM_KERNEL(BLOCKS * 2, THREADS_1D)(mask_large, minimum_2, size * size); + EXTEND_KERNEL(BLOCKS * 2, THREADS_1D)(mask_large, minimum_2, maximum_2, size * size, 5); + // Unsharpen; + UNSHARPEN_KERNEL(BLOCKS * 2, THREADS_1D)( + image, blurred_unsharpen, image_unsharpen, UNSHARPEN_AMOUNT, size * size); + // Combine results; + COMBINE_KERNEL(BLOCKS * 2, THREADS_1D)( + image_unsharpen, blurred_large, mask_large, image2, size * size); + COMBINE_KERNEL_LUT(BLOCKS * 2, THREADS_1D)( + image2, blurred_small, mask_small, image3, size * size, lut); + + // Store the image data. + const tmp = image3[0]; // Used only to "sync" the GPU computation and obtain the GPU computation time; + const end = System.nanoTime(); + console.log("--cuda time=" + intervalToMs(start, end) + " ms"); + const s2 = System.nanoTime(); + // copy_array(img, image3); + + img.set(image3); + const e2 = System.nanoTime(); + console.log("--device array to image=" + intervalToMs(s2, e2) + " ms"); + return img; +} + +async function processImageBW(img) { + return new cv.Mat(Buffer.from(await processImage(img.getData(), img.rows, 0)), img.rows, img.cols, cv.CV_8UC1); +} + +async function processImageColor(img) { + // Possibly not the most efficient way to do this, + // we should process the 3 channels concurrently, and avoid creation of temporary cv.Mat; + let channels = img.splitChannels(); + + const b = await Promise.all([ + processImage(channels[0].getData(), img.rows, 0), + processImage(channels[1].getData(), img.rows, 1), + processImage(channels[2].getData(), img.rows, 2) + ]); + + channels = b.map(buffer => new cv.Mat(buffer, img.rows, img.cols, cv.CV_8UC1)); + + return new cv.Mat(channels); 
+}
+
+// Store the output of the image processing into 2 images,
+// with low and high resolution;
+async function storeImage(img, imgName) {
+    await storeImageInner(img, imgName, RESIZED_IMG_WIDTH_OUT_LARGE, "large");
+    await storeImageInner(img, imgName, RESIZED_IMG_WIDTH_OUT_SMALL, "small");
+}
+
+// Main function: it loads an image, processes it with our pipeline, and writes it to a file;
+async function imagePipeline(imgName, count) {
+    try {
+        // Load image;
+        const start = System.nanoTime();
+        let img = await loadImage(imgName);
+        const endLoad = System.nanoTime();
+        // Process image;
+        if (BW) img = await processImageBW(img);
+        else img = await processImageColor(img);
+        const endProcess = System.nanoTime();
+        // Store image;
+        await storeImage(img, imgName + "_" + count);
+        const endStore = System.nanoTime();
+        console.log("- total time=" + intervalToMs(start, endStore) + ", load=" + intervalToMs(start, endLoad) + ", processing=" + intervalToMs(endLoad, endProcess) + ", store=" + intervalToMs(endProcess, endStore));
+    } catch (err) {
+        console.error(err);
+    }
+}
+
+async function main() {
+    // This will be some kind of server endpoint;
+    for (let i = 0; i < 20; i++) {
+        // Use await for serial execution, otherwise it processes multiple images in parallel.
+        // Performance looks identical though;
+        await imagePipeline(i < 10 ? "lena" : "astro1", i);
+    }
+}
+
+/////////////////////////////
+/////////////////////////////
+
+// Begin the computation;
+main();
+
 diff --git a/demos/image_pipeline_local/image_pipeline.py b/demos/image_pipeline_local/image_pipeline.py new file mode 100755 index 00000000..1a67cbe5 --- /dev/null +++ b/demos/image_pipeline_local/image_pipeline.py @@ -0,0 +1,283 @@ +# Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# -*- coding: utf-8 -*- +""" +Created on Tue Jul 20 08:39:27 2021 + +Implement the image processing pipeline using Python and OpenCV, +and Python implementations of the CUDA kernels. +Used for debugging, and to visualize intermediate results. 
+ +@author: albyr +""" + +from skimage.io import imread, imsave +from skimage.filters import gaussian, sobel, unsharp_mask +from skimage.color import rgb2gray +from skimage import data, img_as_float + +import matplotlib.pyplot as plt +import numpy as np +from typing import Callable +import time + +BW = False +KERNEL_SMALL = 0.1 +KERNEL_LARGE = 2 +KERNEL_UNSHARPEN = 0.7 + +KERNEL_SMALL_DIAMETER = 3 +KERNEL_SMALL_VARIANCE = 0.1 +KERNEL_LARGE_DIAMETER = 5 +KERNEL_LARGE_VARIANCE = 10 +KERNEL_UNSHARPEN_DIAMETER = 3 +KERNEL_UNSHARPEN_VARIANCE = 5 + +SOBEL_FILTER_DIAMETER = 3 +SOBEL_FILTER_X = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]]) +SOBEL_FILTER_Y = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) + + +def time_function(name: str=None) -> Callable: + """ + Decorator that simplifies timing a function call; + :param name: name of the function or computation to measure + :return: the output of the wrapped function + """ + def inner_func(func) -> Callable: + def func_call(self, *args, **kwargs) -> object: + start = time.time() + result = func(self, *args, **kwargs) + end = time.time() + print(f"{name if name is not None else func.__name__} took {end - start} sec") + return result + return func_call + return inner_func + + +def gaussian_kernel(diameter, sigma): + kernel = np.zeros((diameter, diameter)) + mean = diameter / 2 + sum_tmp = 0 + for x in range(diameter): + for y in range(diameter): + kernel[x, y] = np.exp(-0.5 * ((x - mean) ** 2 + (y - mean) ** 2) / sigma ** 2) + sum_tmp += kernel[x, y] + for x in range(diameter): + for y in range(diameter): + kernel[x, y] /= sum_tmp + return kernel + + +def gaussian_blur_py(image, kernel): + out = np.zeros(image.shape) + rows, cols = image.shape + + # Blur radius; + diameter = kernel.shape[0] + radius = diameter // 2 + + # Flatten image and kernel; + image_1d = image.reshape(-1) + kernel_1d = kernel.reshape(-1) + + for i in range(rows): + for j in range(cols): + sum_tmp = 0 + for x in range(-radius, radius + 1): + for y in 
range(-radius, radius + 1): + nx = x + i + ny = y + j + if (nx >= 0 and ny >= 0 and nx < rows and ny < cols): + sum_tmp += kernel_1d[(x + radius) * diameter + (y + radius)] * image_1d[nx * cols + ny] + out[i, j] = sum_tmp + return out + + +def sobel_filter_py(image): + out = np.zeros(image.shape) + rows, cols = image.shape + radius = SOBEL_FILTER_DIAMETER // 2 + + for i in range(rows): + for j in range(cols): + sum_gradient_x = 0 + sum_gradient_y = 0 + for x in range(-radius, radius + 1): + for y in range(-radius, radius + 1): + nx = x + i + ny = y + j + if (nx >= 0 and ny >= 0 and nx < rows and ny < cols): + gray_value_neigh = image[nx, ny] + gradient_x = SOBEL_FILTER_X[x + radius][y + radius] + gradient_y = SOBEL_FILTER_Y[x + radius][y + radius] + sum_gradient_x += gray_value_neigh * gradient_x + sum_gradient_y += gray_value_neigh * gradient_y + out[i, j] = np.sqrt(sum_gradient_x ** 2 + sum_gradient_y ** 2) + return out + + +def normalize(image): + return (image - np.min(image)) / (np.max(image) - np.min(image)) + + +def truncate(image, minimum=0, maximum=1): + out = image.copy() + out[out < minimum] = minimum + out[out > maximum] = maximum + return out + + +def scurve(img): + img_out = img.copy() + lut_b = lambda x: 0.7 * (1 / (1 + np.exp((-x + 0.5) * 10))) + 0.3 if x < 0.5 else 1 / (1 + np.exp((-x + 0.5) * 10)) + lut_r = lambda x: 0.8 * (1 / (1 + np.exp((-x + 0.5) * 7))) + 0.2 if x < 0.5 else (1 / (1 + np.exp((-x + 0.5) * 7))) + lut_g = lambda x: 0.8 * (1 / (1 + np.exp((-x + 0.5) * 10))) + 0.2 if x < 0.5 else (1 / (1 + np.exp((-x + 0.5) * 9))) + lut_g2 = lambda x: x**1.4 + lut_b2 = lambda x: x**1.6 + img_out[:, :, 0] = np.vectorize(lut_b)(img[:, :, 0]) + img_out[:, :, 1] = np.vectorize(lut_g)(img[:, :, 1]) + img_out[:, :, 2] = np.vectorize(lut_r)(img[:, :, 2]) + + img_out[:, :, 1] = np.vectorize(lut_g2)(img_out[:, :, 1]) + img_out[:, :, 0] = np.vectorize(lut_b2)(img_out[:, :, 0]) + + return img_out +# plt.plot(np.linspace(0,1,255), scurve(np.linspace(0,1,255))) 
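The JavaScript pipeline builds its color LUTs by sampling Bézier curves (`spline3`/`spline5`) and using them to remap an existing LUT, whereas this Python debug script uses the `scurve` sigmoids directly. As a sanity check, the quadratic case can be ported to Python in a few lines; `spline3_py` below is a hypothetical helper written for this explanation, not part of the repository:

```python
import numpy as np

CDEPTH = 256

def spline3_py(input_lut, P):
    # Quadratic Bezier curve through control values P[0], P[1], P[2]:
    # x(t) = (1-t)^2 * P0 + 2t(1-t) * P1 + t^2 * P2, sampled at CDEPTH points;
    out = np.zeros(CDEPTH, dtype=int)
    for i in range(CDEPTH):
        t = i / CDEPTH
        x = (1 - t) ** 2 * P[0] + 2 * t * (1 - t) * P[1] + t ** 2 * P[2]
        # Truncation mirrors the ">> 0" float-to-int hack in the JS code;
        out[i] = input_lut[int(x * CDEPTH)]
    return out

# Remap the identity LUT with the same control points the JS code
# uses for the first curve of the red channel;
identity = np.arange(CDEPTH)
lut = spline3_py(identity, [0.0, 0.2, 1.0])
```

Since the control points are increasing, the sampled curve is monotonic, so the remapped LUT stays a valid, non-decreasing tone curve.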
+#%% +@time_function() +def pipeline_golden(img): + multichannel = not BW + # Part 1: Small blur on medium frequencies; + blurred_small = gaussian(img, sigma=(KERNEL_SMALL, KERNEL_SMALL), multichannel=multichannel) + edges_small = normalize(sobel(blurred_small)) + + # Part 2: High blur on low frequencies; + blurred_large = gaussian(img, sigma=(KERNEL_LARGE, KERNEL_LARGE), multichannel=multichannel) + edges_large = sobel(blurred_large) + # Extend mask to cover a larger area; + edges_large = truncate(normalize(edges_large) * 5) + + # Part 3: Sharpen image; + amount = 10 + sharpened = unsharp_mask(img, radius=KERNEL_UNSHARPEN, amount=amount, multichannel=multichannel) + + # Part 4: Merge sharpened image and low frequencies; + image2 = normalize(sharpened * edges_large + blurred_large * (1 - edges_large)) + + # Part 5: Merge image and medium frequencies; + result = image2 * edges_small + blurred_small * (1 - edges_small) + + # Part 6: Apply LUT; + result_lut = scurve(result) + + return result_lut, [blurred_small, edges_small, blurred_large, edges_large, sharpened, image2, result] + + +@time_function() +def pipeline_bw(img): + + # Create kernels for blur; + kernel_small_cpu = gaussian_kernel(KERNEL_SMALL_DIAMETER, KERNEL_SMALL_VARIANCE) + kernel_large_cpu = gaussian_kernel(KERNEL_LARGE_DIAMETER, KERNEL_LARGE_VARIANCE) + kernel_unsharpen_cpu = gaussian_kernel(KERNEL_UNSHARPEN_DIAMETER, KERNEL_UNSHARPEN_VARIANCE) + + # Part 1: Small blur on medium frequencies; + blurred_small = gaussian_blur_py(img, kernel_small_cpu) + edges_small = normalize(sobel_filter_py(blurred_small)) + + # Part 2: High blur on low frequencies; + blurred_large = gaussian_blur_py(img, kernel_large_cpu) + edges_large = sobel_filter_py(blurred_large) + # Extend mask to cover a larger area; + edges_large = truncate(normalize(edges_large) * 5) + + # Part 3: Sharpen image; + unsharpen = gaussian_blur_py(img, kernel_unsharpen_cpu) + amount = 8 + sharpened = truncate(img * (1 + amount) - unsharpen * amount) 
+ + # Part 4: Merge sharpened image and low frequencies; + image2 = normalize(sharpened * edges_large + blurred_large * (1 - edges_large)) + + # Part 5: Merge image and medium frequencies; + result = image2 * edges_small + blurred_small * (1 - edges_small) + return result, [blurred_small, edges_small, blurred_large, edges_large, sharpened, image2] + + +if __name__ == "__main__": + + # img = imread("puppy.jpg") + img = img_as_float(data.astronaut()) + if BW: + img = rgb2gray(img) # Output is a [0,1] matrix; + + # Golden pipeline; + result, other = pipeline_golden(img) + + fig, axes = plt.subplots(4, 2, figsize=(6, 6)) + ax = axes.ravel() + + cmap = plt.cm.gray if BW else None + ax[0].imshow(img, cmap=cmap) + ax[1].imshow(other[0], cmap=cmap) + ax[2].imshow(np.dot(other[1][...,:3], [0.33, 0.33, 0.33]), cmap='gray') + ax[3].imshow(other[2], cmap=cmap) + ax[4].imshow(np.dot(other[3][...,:3], [0.33, 0.33, 0.33]), cmap='gray') + ax[5].imshow(other[4], cmap=cmap) + ax[6].imshow(other[5], cmap=cmap) + ax[7].imshow(result, cmap=cmap) + for i in ax: + i.axis("off") + fig.tight_layout() + plt.show() + fig.savefig("astronaut_g.jpg") + + # Custom BW pipeline; + result2 = np.zeros(img.shape) + other2 = [np.zeros(img.shape) for i in range(len(other))] + for i in range(img.shape[-1]): + result2[:, :, i], tmp = pipeline_bw(img[:, :, i]) + for j, x in enumerate(tmp): + other2[j][:, :, i] = x + + fig, axes = plt.subplots(2, 2, figsize=(6, 6)) + ax = axes.ravel() + + cmap = plt.cm.gray if BW else None + ax[0].imshow(img, cmap=cmap) + ax[1].imshow(other2[2], cmap=cmap) + ax[2].imshow(other2[3], cmap=cmap) + ax[3].imshow(result2, cmap=cmap) + for i in ax: + i.axis("off") + fig.tight_layout() + plt.show() + fig.savefig("astronaut_py.jpg") diff --git a/demos/image_pipeline_local/img_in/astro1.jpg b/demos/image_pipeline_local/img_in/astro1.jpg new file mode 100755 index 00000000..03ba90f2 Binary files /dev/null and b/demos/image_pipeline_local/img_in/astro1.jpg differ diff --git 
a/demos/image_pipeline_local/img_in/lena.jpg b/demos/image_pipeline_local/img_in/lena.jpg new file mode 100644 index 00000000..f06aa74a Binary files /dev/null and b/demos/image_pipeline_local/img_in/lena.jpg differ diff --git a/demos/image_pipeline_local/img_in/lena_bw.jpg b/demos/image_pipeline_local/img_in/lena_bw.jpg new file mode 100644 index 00000000..f81885b7 Binary files /dev/null and b/demos/image_pipeline_local/img_in/lena_bw.jpg differ diff --git a/demos/image_pipeline_local/package-lock.json b/demos/image_pipeline_local/package-lock.json new file mode 100644 index 00000000..768aefaa --- /dev/null +++ b/demos/image_pipeline_local/package-lock.json @@ -0,0 +1,197 @@ +{ + "requires": true, + "lockfileVersion": 1, + "dependencies": { + "@types/node": { + "version": "16.7.10", + "resolved": "https://registry.npmjs.org/@types/node/-/node-16.7.10.tgz", + "integrity": "sha512-S63Dlv4zIPb8x6MMTgDq5WWRJQe56iBEY0O3SOFA9JrRienkOVDXSXBjjJw6HTNQYSE2JI6GMCR6LVbIMHJVvA==", + "optional": true + }, + "ansi-regex": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-2.1.1.tgz", + "integrity": "sha1-w7M6te42DYbg5ijwRorn7yfWVN8=" + }, + "aproba": { + "version": "1.2.0", + "resolved": "https://registry.npmjs.org/aproba/-/aproba-1.2.0.tgz", + "integrity": "sha512-Y9J6ZjXtoYh8RnXVCMOU/ttDmk1aBjunq9vO0ta5x85WDQiQfUF9sIPBITdbiiIVcBo03Hi3jMxigBtsddlXRw==" + }, + "are-we-there-yet": { + "version": "1.1.6", + "resolved": "https://registry.npmjs.org/are-we-there-yet/-/are-we-there-yet-1.1.6.tgz", + "integrity": "sha512-+1byPnimWdGcKFRS48zG73nxM08kamPFReUYvEmRXI3E8E4YhF4voMRDaGlfGD1UeRHEgs4NhQCE28KI8JVj1A==", + "requires": { + "delegates": "^1.0.0", + "readable-stream": "^3.6.0" + } + }, + "code-point-at": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/code-point-at/-/code-point-at-1.1.0.tgz", + "integrity": "sha1-DQcLTQQ6W+ozovGkDi7bPZpMz3c=" + }, + "console-control-strings": { + "version": "1.1.0", + "resolved": 
"https://registry.npmjs.org/console-control-strings/-/console-control-strings-1.1.0.tgz", + "integrity": "sha1-PXz0Rk22RG6mRL9LOVB/mFEAjo4=" + }, + "delegates": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/delegates/-/delegates-1.0.0.tgz", + "integrity": "sha1-hMbhWbgZBP3KWaDvRM2HDTElD5o=" + }, + "gauge": { + "version": "2.7.4", + "resolved": "https://registry.npmjs.org/gauge/-/gauge-2.7.4.tgz", + "integrity": "sha1-LANAXHU4w51+s3sxcCLjJfsBi/c=", + "requires": { + "aproba": "^1.0.3", + "console-control-strings": "^1.0.0", + "has-unicode": "^2.0.0", + "object-assign": "^4.1.0", + "signal-exit": "^3.0.0", + "string-width": "^1.0.1", + "strip-ansi": "^3.0.1", + "wide-align": "^1.1.0" + } + }, + "has-unicode": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/has-unicode/-/has-unicode-2.0.1.tgz", + "integrity": "sha1-4Ob+aijPUROIVeCG0Wkedx3iqLk=" + }, + "inherits": { + "version": "2.0.4", + "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.4.tgz", + "integrity": "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ==" + }, + "is-fullwidth-code-point": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/is-fullwidth-code-point/-/is-fullwidth-code-point-1.0.0.tgz", + "integrity": "sha1-754xOG8DGn8NZDr4L95QxFfvAMs=", + "requires": { + "number-is-nan": "^1.0.0" + } + }, + "nan": { + "version": "2.15.0", + "resolved": "https://registry.npmjs.org/nan/-/nan-2.15.0.tgz", + "integrity": "sha512-8ZtvEnA2c5aYCZYd1cvgdnU6cqwixRoYg70xPLWUws5ORTa/lnw+u4amixRS/Ac5U5mQVgp9pnlSUnbNWFaWZQ==" + }, + "native-node-utils": { + "version": "0.2.7", + "resolved": "https://registry.npmjs.org/native-node-utils/-/native-node-utils-0.2.7.tgz", + "integrity": "sha512-61v0G3uVxWlXHppSZGwZi+ZEIgGUKI8QvEkEJLb1GVePI7P8SBe+G747z+QMXSt4TxfgbVZP0DyobbRKYVIjdw==", + "requires": { + "nan": "^2.13.2" + } + }, + "npmlog": { + "version": "4.1.2", + "resolved": 
"https://registry.npmjs.org/npmlog/-/npmlog-4.1.2.tgz", + "integrity": "sha512-2uUqazuKlTaSI/dC8AzicUck7+IrEaOnN/e0jd3Xtt1KcGpwx30v50mL7oPyr/h9bL3E4aZccVwpwP+5W9Vjkg==", + "requires": { + "are-we-there-yet": "~1.1.2", + "console-control-strings": "~1.1.0", + "gauge": "~2.7.3", + "set-blocking": "~2.0.0" + } + }, + "number-is-nan": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/number-is-nan/-/number-is-nan-1.0.1.tgz", + "integrity": "sha1-CXtgK1NCKlIsGvuHkDGDNpQaAR0=" + }, + "object-assign": { + "version": "4.1.1", + "resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz", + "integrity": "sha1-IQmtx5ZYh8/AXLvUQsrIv7s2CGM=" + }, + "opencv-build": { + "version": "0.1.9", + "resolved": "https://registry.npmjs.org/opencv-build/-/opencv-build-0.1.9.tgz", + "integrity": "sha512-tgT/bnJAcYROen9yaPynfK98IMl62mPSgMLmTx41911m5bczlq21xtE5r+UWLB/xEo/0hKk6tl5zHyxV/JS5Rg==", + "requires": { + "npmlog": "^4.1.2" + } + }, + "opencv4nodejs": { + "version": "5.6.0", + "resolved": "https://registry.npmjs.org/opencv4nodejs/-/opencv4nodejs-5.6.0.tgz", + "integrity": "sha512-JvcT1hb2JUCdntcVABgD9Gprr+gkXBe+jhHKvrr0Ug51y087K4ybm0vHBQVzI2ei1aJxEc9tNknPL9rpyx5Xuw==", + "requires": { + "@types/node": ">6", + "nan": "^2.14.0", + "native-node-utils": "^0.2.7", + "npmlog": "^4.1.2", + "opencv-build": "^0.1.9" + } + }, + "readable-stream": { + "version": "3.6.0", + "resolved": "https://registry.npmjs.org/readable-stream/-/readable-stream-3.6.0.tgz", + "integrity": "sha512-BViHy7LKeTz4oNnkcLJ+lVSL6vpiFeX6/d3oSH8zCW7UxP2onchk+vTGB143xuFjHS3deTgkKoXXymXqymiIdA==", + "requires": { + "inherits": "^2.0.3", + "string_decoder": "^1.1.1", + "util-deprecate": "^1.0.1" + } + }, + "safe-buffer": { + "version": "5.2.1", + "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.2.1.tgz", + "integrity": "sha512-rp3So07KcdmmKbGvgaNxQSJr7bGVSVk5S9Eq1F+ppbRo70+YeaDxkw5Dd8NPN+GD6bjnYm2VuPuCXmpuYvmCXQ==" + }, + "set-blocking": { + "version": "2.0.0", + 
"resolved": "https://registry.npmjs.org/set-blocking/-/set-blocking-2.0.0.tgz", + "integrity": "sha1-BF+XgtARrppoA93TgrJDkrPYkPc=" + }, + "signal-exit": { + "version": "3.0.3", + "resolved": "https://registry.npmjs.org/signal-exit/-/signal-exit-3.0.3.tgz", + "integrity": "sha512-VUJ49FC8U1OxwZLxIbTTrDvLnf/6TDgxZcK8wxR8zs13xpx7xbG60ndBlhNrFi2EMuFRoeDoJO7wthSLq42EjA==" + }, + "string-width": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/string-width/-/string-width-1.0.2.tgz", + "integrity": "sha1-EYvfW4zcUaKn5w0hHgfisLmxB9M=", + "requires": { + "code-point-at": "^1.0.0", + "is-fullwidth-code-point": "^1.0.0", + "strip-ansi": "^3.0.0" + } + }, + "string_decoder": { + "version": "1.3.0", + "resolved": "https://registry.npmjs.org/string_decoder/-/string_decoder-1.3.0.tgz", + "integrity": "sha512-hkRX8U1WjJFd8LsDJ2yQ/wWWxaopEsABU1XfkM8A+j0+85JAGppt16cr1Whg6KIbb4okU6Mql6BOj+uup/wKeA==", + "requires": { + "safe-buffer": "~5.2.0" + } + }, + "strip-ansi": { + "version": "3.0.1", + "resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-3.0.1.tgz", + "integrity": "sha1-ajhfuIU9lS1f8F0Oiq+UJ43GPc8=", + "requires": { + "ansi-regex": "^2.0.0" + } + }, + "util-deprecate": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/util-deprecate/-/util-deprecate-1.0.2.tgz", + "integrity": "sha1-RQ1Nyfpw3nMnYvvS1KKJgUGaDM8=" + }, + "wide-align": { + "version": "1.1.3", + "resolved": "https://registry.npmjs.org/wide-align/-/wide-align-1.1.3.tgz", + "integrity": "sha512-QGkOQc8XL6Bt5PwnsExKBPuMKBxnGxWWW3fU55Xt4feHozMUhdUMaBCk290qpm/wG5u/RSKzwdAC4i51YigihA==", + "requires": { + "string-width": "^1.0.2 || 2" + } + } + } +} diff --git a/demos/image_pipeline_web/.gitignore b/demos/image_pipeline_web/.gitignore new file mode 100644 index 00000000..f04da027 --- /dev/null +++ b/demos/image_pipeline_web/.gitignore @@ -0,0 +1,256 @@ + +# Backend and Frontend related stuff + +# Created by https://www.toptal.com/developers/gitignore/api/webstorm,node,yarn +# 
Edit at https://www.toptal.com/developers/gitignore?templates=webstorm,node,yarn + +### Node ### +# Logs +logs +*.log +npm-debug.log* +yarn-debug.log* +yarn-error.log* +lerna-debug.log* +.pnpm-debug.log* + +# Diagnostic reports (https://nodejs.org/api/report.html) +report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json + +# Runtime data +pids +*.pid +*.seed +*.pid.lock + +# Directory for instrumented libs generated by jscoverage/JSCover +lib-cov + +# Coverage directory used by tools like istanbul +coverage +*.lcov + +# nyc test coverage +.nyc_output + +# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files) +.grunt + +# Bower dependency directory (https://bower.io/) +bower_components + +# node-waf configuration +.lock-wscript + +# Compiled binary addons (https://nodejs.org/api/addons.html) +build/Release + +# Dependency directories +node_modules/ +jspm_packages/ + +# Snowpack dependency directory (https://snowpack.dev/) +web_modules/ + +# TypeScript cache +*.tsbuildinfo + +# Optional npm cache directory +.npm + +# Optional eslint cache +.eslintcache + +# Microbundle cache +.rpt2_cache/ +.rts2_cache_cjs/ +.rts2_cache_es/ +.rts2_cache_umd/ + +# Optional REPL history +.node_repl_history + +# Output of 'npm pack' +*.tgz + +# Yarn Integrity file +.yarn-integrity + +# dotenv environment variables file +.env +.env.test +.env.production + +# parcel-bundler cache (https://parceljs.org/) +.cache +.parcel-cache + +# Next.js build output +.next +out + +# Nuxt.js build / generate output +.nuxt +dist + +# Gatsby files +.cache/ +# Comment in the public line in if your project uses Gatsby and not Next.js +# https://nextjs.org/blog/next-9-1#public-directory-support +# public + +# vuepress build output +.vuepress/dist + +# Serverless directories +.serverless/ + +# FuseBox cache +.fusebox/ + +# DynamoDB Local files +.dynamodb/ + +# TernJS port file +.tern-port + +# Stores VSCode versions used for testing VSCode extensions +.vscode-test + +# yarn v2 +.yarn/cache 
+.yarn/unplugged +.yarn/build-state.yml +.yarn/install-state.gz +.pnp.* + +### WebStorm ### +# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider +# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839 + +# User-specific stuff +.idea/**/workspace.xml +.idea/**/tasks.xml +.idea/**/usage.statistics.xml +.idea/**/dictionaries +.idea/**/shelf + +# AWS User-specific +.idea/**/aws.xml + +# Generated files +.idea/**/contentModel.xml + +# Sensitive or high-churn files +.idea/**/dataSources/ +.idea/**/dataSources.ids +.idea/**/dataSources.local.xml +.idea/**/sqlDataSources.xml +.idea/**/dynamic.xml +.idea/**/uiDesigner.xml +.idea/**/dbnavigator.xml + +# Gradle +.idea/**/gradle.xml +.idea/**/libraries + +# Gradle and Maven with auto-import +# When using Gradle or Maven with auto-import, you should exclude module files, +# since they will be recreated, and may cause churn. Uncomment if using +# auto-import. +# .idea/artifacts +# .idea/compiler.xml +# .idea/jarRepositories.xml +# .idea/modules.xml +# .idea/*.iml +# .idea/modules +# *.iml +# *.ipr + +# CMake +cmake-build-*/ + +# Mongo Explorer plugin +.idea/**/mongoSettings.xml + +# File-based project format +*.iws + +# IntelliJ +out/ + +# mpeltonen/sbt-idea plugin +.idea_modules/ + +# JIRA plugin +atlassian-ide-plugin.xml + +# Cursive Clojure plugin +.idea/replstate.xml + +# Crashlytics plugin (for Android Studio and IntelliJ) +com_crashlytics_export_strings.xml +crashlytics.properties +crashlytics-build.properties +fabric.properties + +# Editor-based Rest Client +.idea/httpRequests + +# Android studio 3.1+ serialized cache file +.idea/caches/build_file_checksums.ser + +### WebStorm Patch ### +# Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721 + +# *.iml +# modules.xml +# .idea/misc.xml +# *.ipr + +# Sonarlint plugin +# https://plugins.jetbrains.com/plugin/7973-sonarlint +.idea/**/sonarlint/ + +# 
SonarQube Plugin
+# https://plugins.jetbrains.com/plugin/7238-sonarqube-community-plugin
+.idea/**/sonarIssues.xml
+
+# Markdown Navigator plugin
+# https://plugins.jetbrains.com/plugin/7896-markdown-navigator-enhanced
+.idea/**/markdown-navigator.xml
+.idea/**/markdown-navigator-enh.xml
+.idea/**/markdown-navigator/
+
+# Cache file creation bug
+# See https://youtrack.jetbrains.com/issue/JBR-2257
+.idea/$CACHE_FILE$
+
+# CodeStream plugin
+# https://plugins.jetbrains.com/plugin/12206-codestream
+.idea/codestream.xml
+
+### yarn ###
+# https://yarnpkg.com/advanced/qa#which-files-should-be-gitignored
+
+.yarn/*
+!.yarn/releases
+!.yarn/plugins
+!.yarn/sdks
+!.yarn/versions
+
+# if you are NOT using Zero-installs, then:
+# comment the following lines
+!.yarn/cache
+
+# and uncomment the following lines
+# .pnp.*
+
+# End of https://www.toptal.com/developers/gitignore/api/webstorm,node,yarn
+
+# Images
+*.jpg
+*.png
+
+# MacOS generated directories
+.DS_Store diff --git a/demos/image_pipeline_web/README.md b/demos/image_pipeline_web/README.md new file mode 100644 index 00000000..e800b30b --- /dev/null +++ b/demos/image_pipeline_web/README.md @@ -0,0 +1,43 @@ +# Web Demo for SeptembeRSE
+The goal of this demo is to showcase an image processing pipeline in _GrCUDA_.
+
+* **Abstract from the demo**
+  ```
+  GPUs are readily available in cloud computing and personal devices, but their use for data processing acceleration has been slowed down by their limited integration with common programming languages such as Python or Java.
+  Moreover, very few ninja programmers have the expert knowledge of asynchronous programming required to use GPUs at their best.
+  The GrCUDA polyglot API is a significant step forward in the never-ending quest of making GPU programming more accessible.
+  GrCUDA exposes the CUDA API to all the high-level languages supported by GraalVM, such as JavaScript, R, Python and Scala, drastically lowering the integration effort with these languages. 
+ But that's not all: we have recently improved GrCUDA to transparently provide asynchronous execution, hardware space-sharing, and transfer-computation overlap without requiring in advance any information about the program dependency structure.
+ We achieve an average of 44% speedup against synchronous execution, with no code change whatsoever!
+
+ In this tutorial, we show the strengths of GrCUDA by showcasing a complex image processing application.
+ But no fear: you will see how easy it is to accelerate JavaScript using GPUs, achieving the same performance as the low-level C++ CUDA API, with drastically simpler code!
+ ```
+
+## Installation and setup
+
+1. `./setup_demo.sh` compiles GrCUDA, downloads the image dataset, installs dependencies for the demo, builds the backend (including the native CUDA implementation), and starts the demo
+2. The `./setup_demo.sh` script launches an HTTP server, accessible at `localhost:8085` from your web browser
+3. If you need to change some of the ports (e.g. because they are already in use), modify `backend/package.json` and `frontend/index.json`, then rebuild the demo
+4. If running the demo on a remote machine, you need to set up port forwarding before connecting. From the terminal, `ssh -f -L LOCAL_PORT:DESTINATION_IP:DESTINATION_PORT user@DESTINATION_IP`. If you are running the demo inside Visual Studio Code, open the `PORTS` tab (right of the `TERMINAL` tab), and type the ports you have to forward (e.g. 8080, 8082, 8083)
+5. `./run_demo.sh` simply starts the demo, without building it first
+
+## Backend
+The backend is in charge of receiving a signal (via `websockets`) associated with the beginning of the computation and the computation mode (either `sync`, `async` or `cuda-native`) from the frontend, and of initiating the actual computation using the specified mode.
+For each processed image, the backend notifies the frontend of the current progress and of which images (in batch) are ready to be displayed to the final user. 
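The exact wire format is defined by the backend sources; purely as an illustration of the exchange described above (the message and field names here are hypothetical, not the backend's actual schema), the start signal and the per-batch progress notification could be modeled as small JSON messages:

```python
import json

# The three computation modes supported by the backend;
VALID_MODES = ("sync", "async", "cuda-native")

def start_message(mode: str) -> str:
    # Frontend -> backend: signal the beginning of the computation
    # and the computation mode (hypothetical field names);
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    return json.dumps({"type": "start", "mode": mode})

def parse_progress(raw: str):
    # Backend -> frontend: current progress and which images (in batch)
    # are ready to be displayed; returns None for other message types;
    msg = json.loads(raw)
    if msg.get("type") != "progress":
        return None
    return msg["processed"], msg["ready_images"]

# Example round trip;
done, ready = parse_progress(
    '{"type": "progress", "processed": 3, "ready_images": ["lena_0"]}')
```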
+ +### Install dependencies and run +To install the dependencies, run `npm install` in the `backend` directory and compile the CUDA binary in the `../image_pipeline` directory using `cmake`. +During development, it is advisable to run `npm run devall`, which recompiles and restarts the servers on each code save. +In production, first compile the TypeScript files using the TypeScript compiler (`tsc`) or the `npm run build` command. The compiled files can be found in the `dist` directory and can be executed by running `npm run runall`. + +## Frontend +The frontend signals the beginning of the computation to the backend, shows the progress and, when the computation is finished, displays a grid of the computed images. Clicking on any thumbnail in the grid displays the full-resolution image to the user. + +### Install dependencies and run +Open the `index.html` file; it requires the backend to already be running on the local server (`localhost`) on ports 8080 (sync), 8083 (async), and 8082 (cuda-native).
+ +If running on a remote machine, remember to set up port forwarding using `ssh`: +``` +ssh -f -L LOCAL_PORT:DESTINATION_IP:DESTINATION_PORT user@DESTINATION_IP +``` \ No newline at end of file diff --git a/demos/image_pipeline_web/backend/package-lock.json b/demos/image_pipeline_web/backend/package-lock.json new file mode 100644 index 00000000..3fdcfd8c --- /dev/null +++ b/demos/image_pipeline_web/backend/package-lock.json @@ -0,0 +1,1658 @@ +{ + "name": "SeptembeRSE-demo-backend", + "version": "1.0.0", + "lockfileVersion": 1, + "requires": true, + "dependencies": { + "@sindresorhus/is": { + "version": "0.14.0", + "resolved": "https://registry.npmjs.org/@sindresorhus/is/-/is-0.14.0.tgz", + "integrity": "sha512-9NET910DNaIPngYnLLPeg+Ogzqsi9uM4mSboU5y6p8S5DzMTVEsJZrawi+BoDNUVBa2DhJqQYUFvMDfgU062LQ==", + "dev": true + }, + "@szmarczak/http-timer": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/@szmarczak/http-timer/-/http-timer-1.1.2.tgz", + "integrity": "sha512-XIB2XbzHTN6ieIjfIMV9hlVcfPU26s2vafYWQcZHWXHOxiaRZYEDKEwdl129Zyg50+foYV2jCgtrqSA6qNuNSA==", + "dev": true, + "requires": { + "defer-to-connect": "^1.0.1" + } + }, + "@types/body-parser": { + "version": "1.19.1", + "resolved": "https://registry.npmjs.org/@types/body-parser/-/body-parser-1.19.1.tgz", + "integrity": "sha512-a6bTJ21vFOGIkwM0kzh9Yr89ziVxq4vYH2fQ6N8AeipEzai/cFK6aGMArIkUeIdRIgpwQa+2bXiLuUJCpSf2Cg==", + "dev": true, + "requires": { + "@types/connect": "*", + "@types/node": "*" + } + }, + "@types/connect": { + "version": "3.4.35", + "resolved": "https://registry.npmjs.org/@types/connect/-/connect-3.4.35.tgz", + "integrity": "sha512-cdeYyv4KWoEgpBISTxWvqYsVy444DOqehiF3fM3ne10AmJ62RSyNkUnxMJXHQWRQQX2eR94m5y1IZyDwBjV9FQ==", + "dev": true, + "requires": { + "@types/node": "*" + } + }, + "@types/express": { + "version": "4.17.13", + "resolved": "https://registry.npmjs.org/@types/express/-/express-4.17.13.tgz", + "integrity":
"sha512-6bSZTPaTIACxn48l50SR+axgrqm6qXFIxrdAKaG6PaJk3+zuUr35hBlgT7vOmJcum+OEaIBLtHV/qloEAFITeA==", + "dev": true, + "requires": { + "@types/body-parser": "*", + "@types/express-serve-static-core": "^4.17.18", + "@types/qs": "*", + "@types/serve-static": "*" + } + }, + "@types/express-serve-static-core": { + "version": "4.17.24", + "resolved": "https://registry.npmjs.org/@types/express-serve-static-core/-/express-serve-static-core-4.17.24.tgz", + "integrity": "sha512-3UJuW+Qxhzwjq3xhwXm2onQcFHn76frIYVbTu+kn24LFxI+dEhdfISDFovPB8VpEgW8oQCTpRuCe+0zJxB7NEA==", + "dev": true, + "requires": { + "@types/node": "*", + "@types/qs": "*", + "@types/range-parser": "*" + } + }, + "@types/mime": { + "version": "1.3.2", + "resolved": "https://registry.npmjs.org/@types/mime/-/mime-1.3.2.tgz", + "integrity": "sha512-YATxVxgRqNH6nHEIsvg6k2Boc1JHI9ZbH5iWFFv/MTkchz3b1ieGDa5T0a9RznNdI0KhVbdbWSN+KWWrQZRxTw==", + "dev": true + }, + "@types/node": { + "version": "16.7.10", + "resolved": "https://registry.npmjs.org/@types/node/-/node-16.7.10.tgz", + "integrity": "sha512-S63Dlv4zIPb8x6MMTgDq5WWRJQe56iBEY0O3SOFA9JrRienkOVDXSXBjjJw6HTNQYSE2JI6GMCR6LVbIMHJVvA==" + }, + "@types/qs": { + "version": "6.9.7", + "resolved": "https://registry.npmjs.org/@types/qs/-/qs-6.9.7.tgz", + "integrity": "sha512-FGa1F62FT09qcrueBA6qYTrJPVDzah9a+493+o2PCXsesWHIn27G98TsSMs3WPNbZIEj4+VJf6saSFpvD+3Zsw==", + "dev": true + }, + "@types/range-parser": { + "version": "1.2.4", + "resolved": "https://registry.npmjs.org/@types/range-parser/-/range-parser-1.2.4.tgz", + "integrity": "sha512-EEhsLsD6UsDM1yFhAvy0Cjr6VwmpMWqFBCb9w07wVugF7w9nfajxLuVmngTIpgS6svCnm6Vaw+MZhoDCKnOfsw==", + "dev": true + }, + "@types/serve-static": { + "version": "1.13.10", + "resolved": "https://registry.npmjs.org/@types/serve-static/-/serve-static-1.13.10.tgz", + "integrity": "sha512-nCkHGI4w7ZgAdNkrEu0bv+4xNV/XDqW+DydknebMOQwkpDGx8G+HTlj7R7ABI8i8nKxVw0wtKPi1D+lPOkh4YQ==", + "dev": true, + "requires": { + "@types/mime": "^1", + "@types/node": "*" 
+ } + }, + "@types/ws": { + "version": "7.4.7", + "resolved": "https://registry.npmjs.org/@types/ws/-/ws-7.4.7.tgz", + "integrity": "sha512-JQbbmxZTZehdc2iszGKs5oC3NFnjeay7mtAWrdt7qNtAVK0g19muApzAy4bm9byz79xa2ZnO/BOBC2R8RC5Lww==", + "dev": true, + "requires": { + "@types/node": "*" + } + }, + "abbrev": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/abbrev/-/abbrev-1.1.1.tgz", + "integrity": "sha512-nne9/IiQ/hzIhY6pdDnbBtz7DjPTKrY00P/zvPSm5pOFkl6xuGrGnXn/VtTNNfNtAfZ9/1RtehkszU9qcTii0Q==", + "dev": true + }, + "accepts": { + "version": "1.3.7", + "resolved": "https://registry.npmjs.org/accepts/-/accepts-1.3.7.tgz", + "integrity": "sha512-Il80Qs2WjYlJIBNzNkK6KYqlVMTbZLXgHx2oT0pU/fjRHyEp+PEfEPY0R3WCwAGVOtauxh1hOxNgIf5bv7dQpA==", + "requires": { + "mime-types": "~2.1.24", + "negotiator": "0.6.2" + } + }, + "ansi-align": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/ansi-align/-/ansi-align-3.0.0.tgz", + "integrity": "sha512-ZpClVKqXN3RGBmKibdfWzqCY4lnjEuoNzU5T0oEFpfd/z5qJHVarukridD4juLO2FXMiwUQxr9WqQtaYa8XRYw==", + "dev": true, + "requires": { + "string-width": "^3.0.0" + }, + "dependencies": { + "ansi-regex": { + "version": "4.1.0", + "resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.1.0.tgz", + "integrity": "sha512-1apePfXM1UOSqw0o9IiFAovVz9M5S1Dg+4TrDwfMewQ6p/rmMueb7tWZjQ1rx4Loy1ArBggoqGpfqqdI4rondg==", + "dev": true + }, + "is-fullwidth-code-point": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/is-fullwidth-code-point/-/is-fullwidth-code-point-2.0.0.tgz", + "integrity": "sha1-o7MKXE8ZkYMWeqq5O+764937ZU8=", + "dev": true + }, + "string-width": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/string-width/-/string-width-3.1.0.tgz", + "integrity": "sha512-vafcv6KjVZKSgz06oM/H6GDBrAtz8vdhQakGjFIvNrHA6y3HCF1CInLy+QLq8dTJPQ1b+KDUqDFctkdRW44e1w==", + "dev": true, + "requires": { + "emoji-regex": "^7.0.1", + "is-fullwidth-code-point": "^2.0.0", + "strip-ansi": "^5.1.0" + } + }, + 
"strip-ansi": { + "version": "5.2.0", + "resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.2.0.tgz", + "integrity": "sha512-DuRs1gKbBqsMKIZlrffwlug8MHkcnpjs5VPmL1PAh+mA30U0DTotfDZ0d2UUsXpPmPmMMJ6W773MaA3J+lbiWA==", + "dev": true, + "requires": { + "ansi-regex": "^4.1.0" + } + } + } + }, + "ansi-regex": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-2.1.1.tgz", + "integrity": "sha1-w7M6te42DYbg5ijwRorn7yfWVN8=" + }, + "ansi-styles": { + "version": "4.3.0", + "resolved": "https://registry.npmjs.org/ansi-styles/-/ansi-styles-4.3.0.tgz", + "integrity": "sha512-zbB9rCJAT1rbjiVDb2hqKFHNYLxgtk8NURxZ3IZwD3F6NtxbXZQCnnSi1Lkx+IDohdPlFp222wVALIheZJQSEg==", + "dev": true, + "requires": { + "color-convert": "^2.0.1" + } + }, + "anymatch": { + "version": "3.1.2", + "resolved": "https://registry.npmjs.org/anymatch/-/anymatch-3.1.2.tgz", + "integrity": "sha512-P43ePfOAIupkguHUycrc4qJ9kz8ZiuOUijaETwX7THt0Y/GNK7v0aa8rY816xWjZ7rJdA5XdMcpVFTKMq+RvWg==", + "dev": true, + "requires": { + "normalize-path": "^3.0.0", + "picomatch": "^2.0.4" + } + }, + "aproba": { + "version": "1.2.0", + "resolved": "https://registry.npmjs.org/aproba/-/aproba-1.2.0.tgz", + "integrity": "sha512-Y9J6ZjXtoYh8RnXVCMOU/ttDmk1aBjunq9vO0ta5x85WDQiQfUF9sIPBITdbiiIVcBo03Hi3jMxigBtsddlXRw==" + }, + "are-we-there-yet": { + "version": "1.1.6", + "resolved": "https://registry.npmjs.org/are-we-there-yet/-/are-we-there-yet-1.1.6.tgz", + "integrity": "sha512-+1byPnimWdGcKFRS48zG73nxM08kamPFReUYvEmRXI3E8E4YhF4voMRDaGlfGD1UeRHEgs4NhQCE28KI8JVj1A==", + "requires": { + "delegates": "^1.0.0", + "readable-stream": "^3.6.0" + } + }, + "array-flatten": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/array-flatten/-/array-flatten-1.1.1.tgz", + "integrity": "sha1-ml9pkFGx5wczKPKgCJaLZOopVdI=" + }, + "balanced-match": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/balanced-match/-/balanced-match-1.0.2.tgz", + "integrity": 
"sha512-3oSeUO0TMV67hN1AmbXsK4yaqU7tjiHlbxRDZOpH0KW9+CeX4bRAaX0Anxt0tx2MrpRpWwQaPwIlISEJhYU5Pw==", + "dev": true + }, + "binary-extensions": { + "version": "2.2.0", + "resolved": "https://registry.npmjs.org/binary-extensions/-/binary-extensions-2.2.0.tgz", + "integrity": "sha512-jDctJ/IVQbZoJykoeHbhXpOlNBqGNcwXJKJog42E5HDPUwQTSdjCHdihjj0DlnheQ7blbT6dHOafNAiS8ooQKA==", + "dev": true + }, + "body-parser": { + "version": "1.19.0", + "resolved": "https://registry.npmjs.org/body-parser/-/body-parser-1.19.0.tgz", + "integrity": "sha512-dhEPs72UPbDnAQJ9ZKMNTP6ptJaionhP5cBb541nXPlW60Jepo9RV/a4fX4XWW9CuFNK22krhrj1+rgzifNCsw==", + "requires": { + "bytes": "3.1.0", + "content-type": "~1.0.4", + "debug": "2.6.9", + "depd": "~1.1.2", + "http-errors": "1.7.2", + "iconv-lite": "0.4.24", + "on-finished": "~2.3.0", + "qs": "6.7.0", + "raw-body": "2.4.0", + "type-is": "~1.6.17" + } + }, + "boxen": { + "version": "4.2.0", + "resolved": "https://registry.npmjs.org/boxen/-/boxen-4.2.0.tgz", + "integrity": "sha512-eB4uT9RGzg2odpER62bBwSLvUeGC+WbRjjyyFhGsKnc8wp/m0+hQsMUvUe3H2V0D5vw0nBdO1hCJoZo5mKeuIQ==", + "dev": true, + "requires": { + "ansi-align": "^3.0.0", + "camelcase": "^5.3.1", + "chalk": "^3.0.0", + "cli-boxes": "^2.2.0", + "string-width": "^4.1.0", + "term-size": "^2.1.0", + "type-fest": "^0.8.1", + "widest-line": "^3.1.0" + }, + "dependencies": { + "ansi-regex": { + "version": "5.0.0", + "resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-5.0.0.tgz", + "integrity": "sha512-bY6fj56OUQ0hU1KjFNDQuJFezqKdrAyFdIevADiqrWHwSlbmBNMHp5ak2f40Pm8JTFyM2mqxkG6ngkHO11f/lg==", + "dev": true + }, + "emoji-regex": { + "version": "8.0.0", + "resolved": "https://registry.npmjs.org/emoji-regex/-/emoji-regex-8.0.0.tgz", + "integrity": "sha512-MSjYzcWNOA0ewAHpz0MxpYFvwg6yjy1NG3xteoqz644VCo/RPgnr1/GGt+ic3iJTzQ8Eu3TdM14SawnVUmGE6A==", + "dev": true + }, + "is-fullwidth-code-point": { + "version": "3.0.0", + "resolved": 
"https://registry.npmjs.org/is-fullwidth-code-point/-/is-fullwidth-code-point-3.0.0.tgz", + "integrity": "sha512-zymm5+u+sCsSWyD9qNaejV3DFvhCKclKdizYaJUuHA83RLjb7nSuGnddCHGv0hk+KY7BMAlsWeK4Ueg6EV6XQg==", + "dev": true + }, + "string-width": { + "version": "4.2.2", + "resolved": "https://registry.npmjs.org/string-width/-/string-width-4.2.2.tgz", + "integrity": "sha512-XBJbT3N4JhVumXE0eoLU9DCjcaF92KLNqTmFCnG1pf8duUxFGwtP6AD6nkjw9a3IdiRtL3E2w3JDiE/xi3vOeA==", + "dev": true, + "requires": { + "emoji-regex": "^8.0.0", + "is-fullwidth-code-point": "^3.0.0", + "strip-ansi": "^6.0.0" + } + }, + "strip-ansi": { + "version": "6.0.0", + "resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-6.0.0.tgz", + "integrity": "sha512-AuvKTrTfQNYNIctbR1K/YGTR1756GycPsg7b9bdV9Duqur4gv6aKqHXah67Z8ImS7WEz5QVcOtlfW2rZEugt6w==", + "dev": true, + "requires": { + "ansi-regex": "^5.0.0" + } + } + } + }, + "brace-expansion": { + "version": "1.1.11", + "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.11.tgz", + "integrity": "sha512-iCuPHDFgrHX7H2vEI/5xpz07zSHB00TpugqhmYtVmMO6518mCuRMoOYFldEBl0g187ufozdaHgWKcYFb61qGiA==", + "dev": true, + "requires": { + "balanced-match": "^1.0.0", + "concat-map": "0.0.1" + } + }, + "braces": { + "version": "3.0.2", + "resolved": "https://registry.npmjs.org/braces/-/braces-3.0.2.tgz", + "integrity": "sha512-b8um+L1RzM3WDSzvhm6gIz1yfTbBt6YTlcEKAvsmqCZZFw46z626lVj9j1yEPW33H5H+lBQpZMP1k8l+78Ha0A==", + "dev": true, + "requires": { + "fill-range": "^7.0.1" + } + }, + "bytes": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/bytes/-/bytes-3.1.0.tgz", + "integrity": "sha512-zauLjrfCG+xvoyaqLoV8bLVXXNGC4JqlxFCutSDWA6fJrTo2ZuvLYTqZ7aHBLZSMOopbzwv8f+wZcVzfVTI2Dg==" + }, + "cacheable-request": { + "version": "6.1.0", + "resolved": "https://registry.npmjs.org/cacheable-request/-/cacheable-request-6.1.0.tgz", + "integrity": "sha512-Oj3cAGPCqOZX7Rz64Uny2GYAZNliQSqfbePrgAQ1wKAihYmCUnraBtJtKcGR4xz7wF+LoJC+ssFZvv5BgF9Igg==", 
+ "dev": true, + "requires": { + "clone-response": "^1.0.2", + "get-stream": "^5.1.0", + "http-cache-semantics": "^4.0.0", + "keyv": "^3.0.0", + "lowercase-keys": "^2.0.0", + "normalize-url": "^4.1.0", + "responselike": "^1.0.2" + }, + "dependencies": { + "get-stream": { + "version": "5.2.0", + "resolved": "https://registry.npmjs.org/get-stream/-/get-stream-5.2.0.tgz", + "integrity": "sha512-nBF+F1rAZVCu/p7rjzgA+Yb4lfYXrpl7a6VmJrU8wF9I1CKvP/QwPNZHnOlwbTkY6dvtFIzFMSyQXbLoTQPRpA==", + "dev": true, + "requires": { + "pump": "^3.0.0" + } + }, + "lowercase-keys": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/lowercase-keys/-/lowercase-keys-2.0.0.tgz", + "integrity": "sha512-tqNXrS78oMOE73NMxK4EMLQsQowWf8jKooH9g7xPavRT706R6bkQJ6DY2Te7QukaZsulxa30wQ7bk0pm4XiHmA==", + "dev": true + } + } + }, + "camelcase": { + "version": "5.3.1", + "resolved": "https://registry.npmjs.org/camelcase/-/camelcase-5.3.1.tgz", + "integrity": "sha512-L28STB170nwWS63UjtlEOE3dldQApaJXZkOI1uMFfzf3rRuPegHaHesyee+YxQ+W6SvRDQV6UrdOdRiR153wJg==", + "dev": true + }, + "chalk": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/chalk/-/chalk-3.0.0.tgz", + "integrity": "sha512-4D3B6Wf41KOYRFdszmDqMCGq5VV/uMAB273JILmO+3jAlh8X4qDtdtgCR3fxtbLEMzSx22QdhnDcJvu2u1fVwg==", + "dev": true, + "requires": { + "ansi-styles": "^4.1.0", + "supports-color": "^7.1.0" + }, + "dependencies": { + "has-flag": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/has-flag/-/has-flag-4.0.0.tgz", + "integrity": "sha512-EykJT/Q1KjTWctppgIAgfSO0tKVuZUjhgMr17kqTumMl6Afv3EISleU7qZUzoXDFTAHTDC4NOoG/ZxU3EvlMPQ==", + "dev": true + }, + "supports-color": { + "version": "7.2.0", + "resolved": "https://registry.npmjs.org/supports-color/-/supports-color-7.2.0.tgz", + "integrity": "sha512-qpCAvRl9stuOHveKsn7HncJRvv501qIacKzQlO/+Lwxc9+0q2wLyv4Dfvt80/DPn2pqOBsJdDiogXGR9+OvwRw==", + "dev": true, + "requires": { + "has-flag": "^4.0.0" + } + } + } + }, + "chokidar": { + "version": "3.5.2", + 
"resolved": "https://registry.npmjs.org/chokidar/-/chokidar-3.5.2.tgz", + "integrity": "sha512-ekGhOnNVPgT77r4K/U3GDhu+FQ2S8TnK/s2KbIGXi0SZWuwkZ2QNyfWdZW+TVfn84DpEP7rLeCt2UI6bJ8GwbQ==", + "dev": true, + "requires": { + "anymatch": "~3.1.2", + "braces": "~3.0.2", + "fsevents": "~2.3.2", + "glob-parent": "~5.1.2", + "is-binary-path": "~2.1.0", + "is-glob": "~4.0.1", + "normalize-path": "~3.0.0", + "readdirp": "~3.6.0" + } + }, + "ci-info": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/ci-info/-/ci-info-2.0.0.tgz", + "integrity": "sha512-5tK7EtrZ0N+OLFMthtqOj4fI2Jeb88C4CAZPu25LDVUgXJ0A3Js4PMGqrn0JU1W0Mh1/Z8wZzYPxqUrXeBboCQ==", + "dev": true + }, + "cli-boxes": { + "version": "2.2.1", + "resolved": "https://registry.npmjs.org/cli-boxes/-/cli-boxes-2.2.1.tgz", + "integrity": "sha512-y4coMcylgSCdVinjiDBuR8PCC2bLjyGTwEmPb9NHR/QaNU6EUOXcTY/s6VjGMD6ENSEaeQYHCY0GNGS5jfMwPw==", + "dev": true + }, + "clone-response": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/clone-response/-/clone-response-1.0.2.tgz", + "integrity": "sha1-0dyXOSAxTfZ/vrlCI7TuNQI56Ws=", + "dev": true, + "requires": { + "mimic-response": "^1.0.0" + } + }, + "code-point-at": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/code-point-at/-/code-point-at-1.1.0.tgz", + "integrity": "sha1-DQcLTQQ6W+ozovGkDi7bPZpMz3c=" + }, + "color-convert": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/color-convert/-/color-convert-2.0.1.tgz", + "integrity": "sha512-RRECPsj7iu/xb5oKYcsFHSppFNnsj/52OVTRKb4zP5onXwVF3zVmmToNcOfGC+CRDpfK/U584fMg38ZHCaElKQ==", + "dev": true, + "requires": { + "color-name": "~1.1.4" + } + }, + "color-name": { + "version": "1.1.4", + "resolved": "https://registry.npmjs.org/color-name/-/color-name-1.1.4.tgz", + "integrity": "sha512-dOy+3AuW3a2wNbZHIuMZpTcgjGuLU/uBL/ubcZF9OXbDo8ff4O8yVp5Bf0efS8uEoYo5q4Fx7dY9OgQGXgAsQA==", + "dev": true + }, + "concat-map": { + "version": "0.0.1", + "resolved": 
"https://registry.npmjs.org/concat-map/-/concat-map-0.0.1.tgz", + "integrity": "sha1-2Klr13/Wjfd5OnMDajug1UBdR3s=", + "dev": true + }, + "configstore": { + "version": "5.0.1", + "resolved": "https://registry.npmjs.org/configstore/-/configstore-5.0.1.tgz", + "integrity": "sha512-aMKprgk5YhBNyH25hj8wGt2+D52Sw1DRRIzqBwLp2Ya9mFmY8KPvvtvmna8SxVR9JMZ4kzMD68N22vlaRpkeFA==", + "dev": true, + "requires": { + "dot-prop": "^5.2.0", + "graceful-fs": "^4.1.2", + "make-dir": "^3.0.0", + "unique-string": "^2.0.0", + "write-file-atomic": "^3.0.0", + "xdg-basedir": "^4.0.0" + } + }, + "console-control-strings": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/console-control-strings/-/console-control-strings-1.1.0.tgz", + "integrity": "sha1-PXz0Rk22RG6mRL9LOVB/mFEAjo4=" + }, + "content-disposition": { + "version": "0.5.3", + "resolved": "https://registry.npmjs.org/content-disposition/-/content-disposition-0.5.3.tgz", + "integrity": "sha512-ExO0774ikEObIAEV9kDo50o+79VCUdEB6n6lzKgGwupcVeRlhrj3qGAfwq8G6uBJjkqLrhT0qEYFcWng8z1z0g==", + "requires": { + "safe-buffer": "5.1.2" + } + }, + "content-type": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/content-type/-/content-type-1.0.4.tgz", + "integrity": "sha512-hIP3EEPs8tB9AT1L+NUqtwOAps4mk2Zob89MWXMHjHWg9milF/j4osnnQLXBCBFBk/tvIG/tUc9mOUJiPBhPXA==" + }, + "cookie": { + "version": "0.4.0", + "resolved": "https://registry.npmjs.org/cookie/-/cookie-0.4.0.tgz", + "integrity": "sha512-+Hp8fLp57wnUSt0tY0tHEXh4voZRDnoIrZPqlo3DPiI4y9lwg/jqx+1Om94/W6ZaPDOUbnjOt/99w66zk+l1Xg==" + }, + "cookie-signature": { + "version": "1.0.6", + "resolved": "https://registry.npmjs.org/cookie-signature/-/cookie-signature-1.0.6.tgz", + "integrity": "sha1-4wOogrNCzD7oylE6eZmXNNqzriw=" + }, + "crypto-random-string": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/crypto-random-string/-/crypto-random-string-2.0.0.tgz", + "integrity": 
"sha512-v1plID3y9r/lPhviJ1wrXpLeyUIGAZ2SHNYTEapm7/8A9nLPoyvVp3RK/EPFqn5kEznyWgYZNsRtYYIWbuG8KA==", + "dev": true + }, + "debug": { + "version": "2.6.9", + "resolved": "https://registry.npmjs.org/debug/-/debug-2.6.9.tgz", + "integrity": "sha512-bC7ElrdJaJnPbAP+1EotYvqZsb3ecl5wi6Bfi6BJTUcNowp6cvspg0jXznRTKDjm/E7AdgFBVeAPVMNcKGsHMA==", + "requires": { + "ms": "2.0.0" + } + }, + "decompress-response": { + "version": "3.3.0", + "resolved": "https://registry.npmjs.org/decompress-response/-/decompress-response-3.3.0.tgz", + "integrity": "sha1-gKTdMjdIOEv6JICDYirt7Jgq3/M=", + "dev": true, + "requires": { + "mimic-response": "^1.0.0" + } + }, + "deep-extend": { + "version": "0.6.0", + "resolved": "https://registry.npmjs.org/deep-extend/-/deep-extend-0.6.0.tgz", + "integrity": "sha512-LOHxIOaPYdHlJRtCQfDIVZtfw/ufM8+rVj649RIHzcm/vGwQRXFt6OPqIFWsm2XEMrNIEtWR64sY1LEKD2vAOA==", + "dev": true + }, + "defer-to-connect": { + "version": "1.1.3", + "resolved": "https://registry.npmjs.org/defer-to-connect/-/defer-to-connect-1.1.3.tgz", + "integrity": "sha512-0ISdNousHvZT2EiFlZeZAHBUvSxmKswVCEf8hW7KWgG4a8MVEu/3Vb6uWYozkjylyCxe0JBIiRB1jV45S70WVQ==", + "dev": true + }, + "delegates": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/delegates/-/delegates-1.0.0.tgz", + "integrity": "sha1-hMbhWbgZBP3KWaDvRM2HDTElD5o=" + }, + "depd": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/depd/-/depd-1.1.2.tgz", + "integrity": "sha1-m81S4UwJd2PnSbJ0xDRu0uVgtak=" + }, + "destroy": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/destroy/-/destroy-1.0.4.tgz", + "integrity": "sha1-l4hXRCxEdJ5CBmE+N5RiBYJqvYA=" + }, + "dot-prop": { + "version": "5.3.0", + "resolved": "https://registry.npmjs.org/dot-prop/-/dot-prop-5.3.0.tgz", + "integrity": "sha512-QM8q3zDe58hqUqjraQOmzZ1LIH9SWQJTlEKCH4kJ2oQvLZk7RbQXvtDM2XEq3fwkV9CCvvH4LA0AV+ogFsBM2Q==", + "dev": true, + "requires": { + "is-obj": "^2.0.0" + } + }, + "duplexer3": { + "version": "0.1.4", + "resolved": 
"https://registry.npmjs.org/duplexer3/-/duplexer3-0.1.4.tgz", + "integrity": "sha1-7gHdHKwO08vH/b6jfcCo8c4ALOI=", + "dev": true + }, + "ee-first": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/ee-first/-/ee-first-1.1.1.tgz", + "integrity": "sha1-WQxhFWsK4vTwJVcyoViyZrxWsh0=" + }, + "emoji-regex": { + "version": "7.0.3", + "resolved": "https://registry.npmjs.org/emoji-regex/-/emoji-regex-7.0.3.tgz", + "integrity": "sha512-CwBLREIQ7LvYFB0WyRvwhq5N5qPhc6PMjD6bYggFlI5YyDgl+0vxq5VHbMOFqLg7hfWzmu8T5Z1QofhmTIhItA==", + "dev": true + }, + "encodeurl": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/encodeurl/-/encodeurl-1.0.2.tgz", + "integrity": "sha1-rT/0yG7C0CkyL1oCw6mmBslbP1k=" + }, + "end-of-stream": { + "version": "1.4.4", + "resolved": "https://registry.npmjs.org/end-of-stream/-/end-of-stream-1.4.4.tgz", + "integrity": "sha512-+uw1inIHVPQoaVuHzRyXd21icM+cnt4CzD5rW+NC1wjOUSTOs+Te7FOv7AhN7vS9x/oIyhLP5PR1H+phQAHu5Q==", + "dev": true, + "requires": { + "once": "^1.4.0" + } + }, + "escape-goat": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/escape-goat/-/escape-goat-2.1.1.tgz", + "integrity": "sha512-8/uIhbG12Csjy2JEW7D9pHbreaVaS/OpN3ycnyvElTdwM5n6GY6W6e2IPemfvGZeUMqZ9A/3GqIZMgKnBhAw/Q==", + "dev": true + }, + "escape-html": { + "version": "1.0.3", + "resolved": "https://registry.npmjs.org/escape-html/-/escape-html-1.0.3.tgz", + "integrity": "sha1-Aljq5NPQwJdN4cFpGI7wBR0dGYg=" + }, + "etag": { + "version": "1.8.1", + "resolved": "https://registry.npmjs.org/etag/-/etag-1.8.1.tgz", + "integrity": "sha1-Qa4u62XvpiJorr/qg6x9eSmbCIc=" + }, + "express": { + "version": "4.17.1", + "resolved": "https://registry.npmjs.org/express/-/express-4.17.1.tgz", + "integrity": "sha512-mHJ9O79RqluphRrcw2X/GTh3k9tVv8YcoyY4Kkh4WDMUYKRZUq0h1o0w2rrrxBqM7VoeUVqgb27xlEMXTnYt4g==", + "requires": { + "accepts": "~1.3.7", + "array-flatten": "1.1.1", + "body-parser": "1.19.0", + "content-disposition": "0.5.3", + "content-type": "~1.0.4", + 
"cookie": "0.4.0", + "cookie-signature": "1.0.6", + "debug": "2.6.9", + "depd": "~1.1.2", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "etag": "~1.8.1", + "finalhandler": "~1.1.2", + "fresh": "0.5.2", + "merge-descriptors": "1.0.1", + "methods": "~1.1.2", + "on-finished": "~2.3.0", + "parseurl": "~1.3.3", + "path-to-regexp": "0.1.7", + "proxy-addr": "~2.0.5", + "qs": "6.7.0", + "range-parser": "~1.2.1", + "safe-buffer": "5.1.2", + "send": "0.17.1", + "serve-static": "1.14.1", + "setprototypeof": "1.1.1", + "statuses": "~1.5.0", + "type-is": "~1.6.18", + "utils-merge": "1.0.1", + "vary": "~1.1.2" + } + }, + "fill-range": { + "version": "7.0.1", + "resolved": "https://registry.npmjs.org/fill-range/-/fill-range-7.0.1.tgz", + "integrity": "sha512-qOo9F+dMUmC2Lcb4BbVvnKJxTPjCm+RRpe4gDuGrzkL7mEVl/djYSu2OdQ2Pa302N4oqkSg9ir6jaLWJ2USVpQ==", + "dev": true, + "requires": { + "to-regex-range": "^5.0.1" + } + }, + "finalhandler": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/finalhandler/-/finalhandler-1.1.2.tgz", + "integrity": "sha512-aAWcW57uxVNrQZqFXjITpW3sIUQmHGG3qSb9mUah9MgMC4NeWhNOlNjXEYq3HjRAvL6arUviZGGJsBg6z0zsWA==", + "requires": { + "debug": "2.6.9", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "on-finished": "~2.3.0", + "parseurl": "~1.3.3", + "statuses": "~1.5.0", + "unpipe": "~1.0.0" + } + }, + "forwarded": { + "version": "0.2.0", + "resolved": "https://registry.npmjs.org/forwarded/-/forwarded-0.2.0.tgz", + "integrity": "sha512-buRG0fpBtRHSTCOASe6hD258tEubFoRLb4ZNA6NxMVHNw2gOcwHo9wyablzMzOA5z9xA9L1KNjk/Nt6MT9aYow==" + }, + "fresh": { + "version": "0.5.2", + "resolved": "https://registry.npmjs.org/fresh/-/fresh-0.5.2.tgz", + "integrity": "sha1-PYyt2Q2XZWn6g1qx+OSyOhBWBac=" + }, + "fsevents": { + "version": "2.3.2", + "resolved": "https://registry.npmjs.org/fsevents/-/fsevents-2.3.2.tgz", + "integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==", + "dev": true, + "optional": 
true + }, + "gauge": { + "version": "2.7.4", + "resolved": "https://registry.npmjs.org/gauge/-/gauge-2.7.4.tgz", + "integrity": "sha1-LANAXHU4w51+s3sxcCLjJfsBi/c=", + "requires": { + "aproba": "^1.0.3", + "console-control-strings": "^1.0.0", + "has-unicode": "^2.0.0", + "object-assign": "^4.1.0", + "signal-exit": "^3.0.0", + "string-width": "^1.0.1", + "strip-ansi": "^3.0.1", + "wide-align": "^1.1.0" + } + }, + "get-stream": { + "version": "4.1.0", + "resolved": "https://registry.npmjs.org/get-stream/-/get-stream-4.1.0.tgz", + "integrity": "sha512-GMat4EJ5161kIy2HevLlr4luNjBgvmj413KaQA7jt4V8B4RDsfpHk7WQ9GVqfYyyx8OS/L66Kox+rJRNklLK7w==", + "dev": true, + "requires": { + "pump": "^3.0.0" + } + }, + "glob-parent": { + "version": "5.1.2", + "resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-5.1.2.tgz", + "integrity": "sha512-AOIgSQCepiJYwP3ARnGx+5VnTu2HBYdzbGP45eLw1vr3zB3vZLeyed1sC9hnbcOc9/SrMyM5RPQrkGz4aS9Zow==", + "dev": true, + "requires": { + "is-glob": "^4.0.1" + } + }, + "global-dirs": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/global-dirs/-/global-dirs-2.1.0.tgz", + "integrity": "sha512-MG6kdOUh/xBnyo9cJFeIKkLEc1AyFq42QTU4XiX51i2NEdxLxLWXIjEjmqKeSuKR7pAZjTqUVoT2b2huxVLgYQ==", + "dev": true, + "requires": { + "ini": "1.3.7" + } + }, + "got": { + "version": "9.6.0", + "resolved": "https://registry.npmjs.org/got/-/got-9.6.0.tgz", + "integrity": "sha512-R7eWptXuGYxwijs0eV+v3o6+XH1IqVK8dJOEecQfTmkncw9AV4dcw/Dhxi8MdlqPthxxpZyizMzyg8RTmEsG+Q==", + "dev": true, + "requires": { + "@sindresorhus/is": "^0.14.0", + "@szmarczak/http-timer": "^1.1.2", + "cacheable-request": "^6.0.0", + "decompress-response": "^3.3.0", + "duplexer3": "^0.1.4", + "get-stream": "^4.1.0", + "lowercase-keys": "^1.0.1", + "mimic-response": "^1.0.1", + "p-cancelable": "^1.0.0", + "to-readable-stream": "^1.0.0", + "url-parse-lax": "^3.0.0" + } + }, + "graceful-fs": { + "version": "4.2.8", + "resolved": 
"https://registry.npmjs.org/graceful-fs/-/graceful-fs-4.2.8.tgz", + "integrity": "sha512-qkIilPUYcNhJpd33n0GBXTB1MMPp14TxEsEs0pTrsSVucApsYzW5V+Q8Qxhik6KU3evy+qkAAowTByymK0avdg==", + "dev": true + }, + "has-flag": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/has-flag/-/has-flag-3.0.0.tgz", + "integrity": "sha1-tdRU3CGZriJWmfNGfloH87lVuv0=", + "dev": true + }, + "has-unicode": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/has-unicode/-/has-unicode-2.0.1.tgz", + "integrity": "sha1-4Ob+aijPUROIVeCG0Wkedx3iqLk=" + }, + "has-yarn": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/has-yarn/-/has-yarn-2.1.0.tgz", + "integrity": "sha512-UqBRqi4ju7T+TqGNdqAO0PaSVGsDGJUBQvk9eUWNGRY1CFGDzYhLWoM7JQEemnlvVcv/YEmc2wNW8BC24EnUsw==", + "dev": true + }, + "http-cache-semantics": { + "version": "4.1.0", + "resolved": "https://registry.npmjs.org/http-cache-semantics/-/http-cache-semantics-4.1.0.tgz", + "integrity": "sha512-carPklcUh7ROWRK7Cv27RPtdhYhUsela/ue5/jKzjegVvXDqM2ILE9Q2BGn9JZJh1g87cp56su/FgQSzcWS8cQ==", + "dev": true + }, + "http-errors": { + "version": "1.7.2", + "resolved": "https://registry.npmjs.org/http-errors/-/http-errors-1.7.2.tgz", + "integrity": "sha512-uUQBt3H/cSIVfch6i1EuPNy/YsRSOUBXTVfZ+yR7Zjez3qjBz6i9+i4zjNaoqcoFVI4lQJ5plg63TvGfRSDCRg==", + "requires": { + "depd": "~1.1.2", + "inherits": "2.0.3", + "setprototypeof": "1.1.1", + "statuses": ">= 1.5.0 < 2", + "toidentifier": "1.0.0" + } + }, + "iconv-lite": { + "version": "0.4.24", + "resolved": "https://registry.npmjs.org/iconv-lite/-/iconv-lite-0.4.24.tgz", + "integrity": "sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA==", + "requires": { + "safer-buffer": ">= 2.1.2 < 3" + } + }, + "ignore-by-default": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/ignore-by-default/-/ignore-by-default-1.0.1.tgz", + "integrity": "sha1-SMptcvbGo68Aqa1K5odr44ieKwk=", + "dev": true + }, + "import-lazy": { + 
"version": "2.1.0", + "resolved": "https://registry.npmjs.org/import-lazy/-/import-lazy-2.1.0.tgz", + "integrity": "sha1-BWmOPUXIjo1+nZLLBYTnfwlvPkM=", + "dev": true + }, + "imurmurhash": { + "version": "0.1.4", + "resolved": "https://registry.npmjs.org/imurmurhash/-/imurmurhash-0.1.4.tgz", + "integrity": "sha1-khi5srkoojixPcT7a21XbyMUU+o=", + "dev": true + }, + "inherits": { + "version": "2.0.3", + "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.3.tgz", + "integrity": "sha1-Yzwsg+PaQqUC9SRmAiSA9CCCYd4=" + }, + "ini": { + "version": "1.3.7", + "resolved": "https://registry.npmjs.org/ini/-/ini-1.3.7.tgz", + "integrity": "sha512-iKpRpXP+CrP2jyrxvg1kMUpXDyRUFDWurxbnVT1vQPx+Wz9uCYsMIqYuSBLV+PAaZG/d7kRLKRFc9oDMsH+mFQ==", + "dev": true + }, + "ipaddr.js": { + "version": "1.9.1", + "resolved": "https://registry.npmjs.org/ipaddr.js/-/ipaddr.js-1.9.1.tgz", + "integrity": "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g==" + }, + "is-binary-path": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/is-binary-path/-/is-binary-path-2.1.0.tgz", + "integrity": "sha512-ZMERYes6pDydyuGidse7OsHxtbI7WVeUEozgR/g7rd0xUimYNlvZRE/K2MgZTjWy725IfelLeVcEM97mmtRGXw==", + "dev": true, + "requires": { + "binary-extensions": "^2.0.0" + } + }, + "is-ci": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/is-ci/-/is-ci-2.0.0.tgz", + "integrity": "sha512-YfJT7rkpQB0updsdHLGWrvhBJfcfzNNawYDNIyQXJz0IViGf75O8EBPKSdvw2rF+LGCsX4FZ8tcr3b19LcZq4w==", + "dev": true, + "requires": { + "ci-info": "^2.0.0" + } + }, + "is-extglob": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-2.1.1.tgz", + "integrity": "sha1-qIwCU1eR8C7TfHahueqXc8gz+MI=", + "dev": true + }, + "is-fullwidth-code-point": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/is-fullwidth-code-point/-/is-fullwidth-code-point-1.0.0.tgz", + "integrity": "sha1-754xOG8DGn8NZDr4L95QxFfvAMs=", + 
"requires": { + "number-is-nan": "^1.0.0" + } + }, + "is-glob": { + "version": "4.0.1", + "resolved": "https://registry.npmjs.org/is-glob/-/is-glob-4.0.1.tgz", + "integrity": "sha512-5G0tKtBTFImOqDnLB2hG6Bp2qcKEFduo4tZu9MT/H6NQv/ghhy30o55ufafxJ/LdH79LLs2Kfrn85TLKyA7BUg==", + "dev": true, + "requires": { + "is-extglob": "^2.1.1" + } + }, + "is-installed-globally": { + "version": "0.3.2", + "resolved": "https://registry.npmjs.org/is-installed-globally/-/is-installed-globally-0.3.2.tgz", + "integrity": "sha512-wZ8x1js7Ia0kecP/CHM/3ABkAmujX7WPvQk6uu3Fly/Mk44pySulQpnHG46OMjHGXApINnV4QhY3SWnECO2z5g==", + "dev": true, + "requires": { + "global-dirs": "^2.0.1", + "is-path-inside": "^3.0.1" + } + }, + "is-npm": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/is-npm/-/is-npm-4.0.0.tgz", + "integrity": "sha512-96ECIfh9xtDDlPylNPXhzjsykHsMJZ18ASpaWzQyBr4YRTcVjUvzaHayDAES2oU/3KpljhHUjtSRNiDwi0F0ig==", + "dev": true + }, + "is-number": { + "version": "7.0.0", + "resolved": "https://registry.npmjs.org/is-number/-/is-number-7.0.0.tgz", + "integrity": "sha512-41Cifkg6e8TylSpdtTpeLVMqvSBEVzTttHvERD741+pnZ8ANv0004MRL43QKPDlK9cGvNp6NZWZUBlbGXYxxng==", + "dev": true + }, + "is-obj": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/is-obj/-/is-obj-2.0.0.tgz", + "integrity": "sha512-drqDG3cbczxxEJRoOXcOjtdp1J/lyp1mNn0xaznRs8+muBhgQcrnbspox5X5fOw0HnMnbfDzvnEMEtqDEJEo8w==", + "dev": true + }, + "is-path-inside": { + "version": "3.0.3", + "resolved": "https://registry.npmjs.org/is-path-inside/-/is-path-inside-3.0.3.tgz", + "integrity": "sha512-Fd4gABb+ycGAmKou8eMftCupSir5lRxqf4aD/vd0cD2qc4HL07OjCeuHMr8Ro4CoMaeCKDB0/ECBOVWjTwUvPQ==", + "dev": true + }, + "is-typedarray": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/is-typedarray/-/is-typedarray-1.0.0.tgz", + "integrity": "sha1-5HnICFjfDBsR3dppQPlgEfzaSpo=", + "dev": true + }, + "is-yarn-global": { + "version": "0.3.0", + "resolved": 
"https://registry.npmjs.org/is-yarn-global/-/is-yarn-global-0.3.0.tgz", + "integrity": "sha512-VjSeb/lHmkoyd8ryPVIKvOCn4D1koMqY+vqyjjUfc3xyKtP4dYOxM44sZrnqQSzSds3xyOrUTLTC9LVCVgLngw==", + "dev": true + }, + "json-buffer": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/json-buffer/-/json-buffer-3.0.0.tgz", + "integrity": "sha1-Wx85evx11ne96Lz8Dkfh+aPZqJg=", + "dev": true + }, + "keyv": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/keyv/-/keyv-3.1.0.tgz", + "integrity": "sha512-9ykJ/46SN/9KPM/sichzQ7OvXyGDYKGTaDlKMGCAlg2UK8KRy4jb0d8sFc+0Tt0YYnThq8X2RZgCg74RPxgcVA==", + "dev": true, + "requires": { + "json-buffer": "3.0.0" + } + }, + "latest-version": { + "version": "5.1.0", + "resolved": "https://registry.npmjs.org/latest-version/-/latest-version-5.1.0.tgz", + "integrity": "sha512-weT+r0kTkRQdCdYCNtkMwWXQTMEswKrFBkm4ckQOMVhhqhIMI1UT2hMj+1iigIhgSZm5gTmrRXBNoGUgaTY1xA==", + "dev": true, + "requires": { + "package-json": "^6.3.0" + } + }, + "lowercase-keys": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/lowercase-keys/-/lowercase-keys-1.0.1.tgz", + "integrity": "sha512-G2Lj61tXDnVFFOi8VZds+SoQjtQC3dgokKdDG2mTm1tx4m50NUHBOZSBwQQHyy0V12A0JTG4icfZQH+xPyh8VA==", + "dev": true + }, + "make-dir": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/make-dir/-/make-dir-3.1.0.tgz", + "integrity": "sha512-g3FeP20LNwhALb/6Cz6Dd4F2ngze0jz7tbzrD2wAV+o9FeNHe4rL+yK2md0J/fiSf1sa1ADhXqi5+oVwOM/eGw==", + "dev": true, + "requires": { + "semver": "^6.0.0" + }, + "dependencies": { + "semver": { + "version": "6.3.0", + "resolved": "https://registry.npmjs.org/semver/-/semver-6.3.0.tgz", + "integrity": "sha512-b39TBaTSfV6yBrapU89p5fKekE2m/NwnDocOVruQFS1/veMgdzuPcnOM34M6CwxW8jH/lxEa5rBoDeUwu5HHTw==", + "dev": true + } + } + }, + "media-typer": { + "version": "0.3.0", + "resolved": "https://registry.npmjs.org/media-typer/-/media-typer-0.3.0.tgz", + "integrity": "sha1-hxDXrwqmJvj/+hzgAWhUUmMlV0g=" + }, + "merge-descriptors": 
{ + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-1.0.1.tgz", + "integrity": "sha1-sAqqVW3YtEVoFQ7J0blT8/kMu2E=" + }, + "methods": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/methods/-/methods-1.1.2.tgz", + "integrity": "sha1-VSmk1nZUE07cxSZmVoNbD4Ua/O4=" + }, + "mime": { + "version": "1.6.0", + "resolved": "https://registry.npmjs.org/mime/-/mime-1.6.0.tgz", + "integrity": "sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==" + }, + "mime-db": { + "version": "1.49.0", + "resolved": "https://registry.npmjs.org/mime-db/-/mime-db-1.49.0.tgz", + "integrity": "sha512-CIc8j9URtOVApSFCQIF+VBkX1RwXp/oMMOrqdyXSBXq5RWNEsRfyj1kiRnQgmNXmHxPoFIxOroKA3zcU9P+nAA==" + }, + "mime-types": { + "version": "2.1.32", + "resolved": "https://registry.npmjs.org/mime-types/-/mime-types-2.1.32.tgz", + "integrity": "sha512-hJGaVS4G4c9TSMYh2n6SQAGrC4RnfU+daP8G7cSCmaqNjiOoUY0VHCMS42pxnQmVF1GWwFhbHWn3RIxCqTmZ9A==", + "requires": { + "mime-db": "1.49.0" + } + }, + "mimic-response": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/mimic-response/-/mimic-response-1.0.1.tgz", + "integrity": "sha512-j5EctnkH7amfV/q5Hgmoal1g2QHFJRraOtmx0JpIqkxhBhI/lJSl1nMpQ45hVarwNETOoWEimndZ4QK0RHxuxQ==", + "dev": true + }, + "minimatch": { + "version": "3.0.4", + "resolved": "https://registry.npmjs.org/minimatch/-/minimatch-3.0.4.tgz", + "integrity": "sha512-yJHVQEhyqPLUTgt9B83PXu6W3rx4MvvHvSUvToogpwoGDOUQ+yDrR0HRot+yOCdCO7u4hX3pWft6kWBBcqh0UA==", + "dev": true, + "requires": { + "brace-expansion": "^1.1.7" + } + }, + "minimist": { + "version": "1.2.5", + "resolved": "https://registry.npmjs.org/minimist/-/minimist-1.2.5.tgz", + "integrity": "sha512-FM9nNUYrRBAELZQT3xeZQ7fmMOBg6nWNmJKTcgsJeaLstP/UODVpGsr5OhXhhXg6f+qtJ8uiZ+PUxkDWcgIXLw==", + "dev": true + }, + "ms": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.0.0.tgz", + "integrity": 
"sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g=" + }, + "nan": { + "version": "2.15.0", + "resolved": "https://registry.npmjs.org/nan/-/nan-2.15.0.tgz", + "integrity": "sha512-8ZtvEnA2c5aYCZYd1cvgdnU6cqwixRoYg70xPLWUws5ORTa/lnw+u4amixRS/Ac5U5mQVgp9pnlSUnbNWFaWZQ==" + }, + "native-node-utils": { + "version": "0.2.7", + "resolved": "https://registry.npmjs.org/native-node-utils/-/native-node-utils-0.2.7.tgz", + "integrity": "sha512-61v0G3uVxWlXHppSZGwZi+ZEIgGUKI8QvEkEJLb1GVePI7P8SBe+G747z+QMXSt4TxfgbVZP0DyobbRKYVIjdw==", + "requires": { + "nan": "^2.13.2" + } + }, + "negotiator": { + "version": "0.6.2", + "resolved": "https://registry.npmjs.org/negotiator/-/negotiator-0.6.2.tgz", + "integrity": "sha512-hZXc7K2e+PgeI1eDBe/10Ard4ekbfrrqG8Ep+8Jmf4JID2bNg7NvCPOZN+kfF574pFQI7mum2AUqDidoKqcTOw==" + }, + "nodemon": { + "version": "2.0.12", + "resolved": "https://registry.npmjs.org/nodemon/-/nodemon-2.0.12.tgz", + "integrity": "sha512-egCTmNZdObdBxUBw6ZNwvZ/xzk24CKRs5K6d+5zbmrMr7rOpPmfPeF6OxM3DDpaRx331CQRFEktn+wrFFfBSOA==", + "dev": true, + "requires": { + "chokidar": "^3.2.2", + "debug": "^3.2.6", + "ignore-by-default": "^1.0.1", + "minimatch": "^3.0.4", + "pstree.remy": "^1.1.7", + "semver": "^5.7.1", + "supports-color": "^5.5.0", + "touch": "^3.1.0", + "undefsafe": "^2.0.3", + "update-notifier": "^4.1.0" + }, + "dependencies": { + "debug": { + "version": "3.2.7", + "resolved": "https://registry.npmjs.org/debug/-/debug-3.2.7.tgz", + "integrity": "sha512-CFjzYYAi4ThfiQvizrFQevTTXHtnCqWfe7x1AhgEscTz6ZbLbfoLRLPugTQyBth6f8ZERVUSyWHFD/7Wu4t1XQ==", + "dev": true, + "requires": { + "ms": "^2.1.1" + } + }, + "ms": { + "version": "2.1.3", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.3.tgz", + "integrity": "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA==", + "dev": true + } + } + }, + "nopt": { + "version": "1.0.10", + "resolved": "https://registry.npmjs.org/nopt/-/nopt-1.0.10.tgz", + "integrity": "sha1-bd0hvSoxQXuScn3Vhfim83YI6+4=", + 
"dev": true, + "requires": { + "abbrev": "1" + } + }, + "normalize-path": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/normalize-path/-/normalize-path-3.0.0.tgz", + "integrity": "sha512-6eZs5Ls3WtCisHWp9S2GUy8dqkpGi4BVSz3GaqiE6ezub0512ESztXUwUB6C6IKbQkY2Pnb/mD4WYojCRwcwLA==", + "dev": true + }, + "normalize-url": { + "version": "4.5.1", + "resolved": "https://registry.npmjs.org/normalize-url/-/normalize-url-4.5.1.tgz", + "integrity": "sha512-9UZCFRHQdNrfTpGg8+1INIg93B6zE0aXMVFkw1WFwvO4SlZywU6aLg5Of0Ap/PgcbSw4LNxvMWXMeugwMCX0AA==", + "dev": true + }, + "npmlog": { + "version": "4.1.2", + "resolved": "https://registry.npmjs.org/npmlog/-/npmlog-4.1.2.tgz", + "integrity": "sha512-2uUqazuKlTaSI/dC8AzicUck7+IrEaOnN/e0jd3Xtt1KcGpwx30v50mL7oPyr/h9bL3E4aZccVwpwP+5W9Vjkg==", + "requires": { + "are-we-there-yet": "~1.1.2", + "console-control-strings": "~1.1.0", + "gauge": "~2.7.3", + "set-blocking": "~2.0.0" + } + }, + "number-is-nan": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/number-is-nan/-/number-is-nan-1.0.1.tgz", + "integrity": "sha1-CXtgK1NCKlIsGvuHkDGDNpQaAR0=" + }, + "object-assign": { + "version": "4.1.1", + "resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz", + "integrity": "sha1-IQmtx5ZYh8/AXLvUQsrIv7s2CGM=" + }, + "on-finished": { + "version": "2.3.0", + "resolved": "https://registry.npmjs.org/on-finished/-/on-finished-2.3.0.tgz", + "integrity": "sha1-IPEzZIGwg811M3mSoWlxqi2QaUc=", + "requires": { + "ee-first": "1.1.1" + } + }, + "once": { + "version": "1.4.0", + "resolved": "https://registry.npmjs.org/once/-/once-1.4.0.tgz", + "integrity": "sha1-WDsap3WWHUsROsF9nFC6753Xa9E=", + "dev": true, + "requires": { + "wrappy": "1" + } + }, + "opencv-build": { + "version": "0.1.9", + "resolved": "https://registry.npmjs.org/opencv-build/-/opencv-build-0.1.9.tgz", + "integrity": "sha512-tgT/bnJAcYROen9yaPynfK98IMl62mPSgMLmTx41911m5bczlq21xtE5r+UWLB/xEo/0hKk6tl5zHyxV/JS5Rg==", + "requires": { + 
"npmlog": "^4.1.2" + } + }, + "opencv4nodejs": { + "version": "5.6.0", + "resolved": "https://registry.npmjs.org/opencv4nodejs/-/opencv4nodejs-5.6.0.tgz", + "integrity": "sha512-JvcT1hb2JUCdntcVABgD9Gprr+gkXBe+jhHKvrr0Ug51y087K4ybm0vHBQVzI2ei1aJxEc9tNknPL9rpyx5Xuw==", + "requires": { + "@types/node": ">6", + "nan": "^2.14.0", + "native-node-utils": "^0.2.7", + "npmlog": "^4.1.2", + "opencv-build": "^0.1.9" + } + }, + "p-cancelable": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/p-cancelable/-/p-cancelable-1.1.0.tgz", + "integrity": "sha512-s73XxOZ4zpt1edZYZzvhqFa6uvQc1vwUa0K0BdtIZgQMAJj9IbebH+JkgKZc9h+B05PKHLOTl4ajG1BmNrVZlw==", + "dev": true + }, + "package-json": { + "version": "6.5.0", + "resolved": "https://registry.npmjs.org/package-json/-/package-json-6.5.0.tgz", + "integrity": "sha512-k3bdm2n25tkyxcjSKzB5x8kfVxlMdgsbPr0GkZcwHsLpba6cBjqCt1KlcChKEvxHIcTB1FVMuwoijZ26xex5MQ==", + "dev": true, + "requires": { + "got": "^9.6.0", + "registry-auth-token": "^4.0.0", + "registry-url": "^5.0.0", + "semver": "^6.2.0" + }, + "dependencies": { + "semver": { + "version": "6.3.0", + "resolved": "https://registry.npmjs.org/semver/-/semver-6.3.0.tgz", + "integrity": "sha512-b39TBaTSfV6yBrapU89p5fKekE2m/NwnDocOVruQFS1/veMgdzuPcnOM34M6CwxW8jH/lxEa5rBoDeUwu5HHTw==", + "dev": true + } + } + }, + "parseurl": { + "version": "1.3.3", + "resolved": "https://registry.npmjs.org/parseurl/-/parseurl-1.3.3.tgz", + "integrity": "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ==" + }, + "path-to-regexp": { + "version": "0.1.7", + "resolved": "https://registry.npmjs.org/path-to-regexp/-/path-to-regexp-0.1.7.tgz", + "integrity": "sha1-32BBeABfUi8V60SQ5yR6G/qmf4w=" + }, + "picomatch": { + "version": "2.3.0", + "resolved": "https://registry.npmjs.org/picomatch/-/picomatch-2.3.0.tgz", + "integrity": "sha512-lY1Q/PiJGC2zOv/z391WOTD+Z02bCgsFfvxoXXf6h7kv9o+WmsmzYqrAwY63sNgOxE4xEdq0WyUnXfKeBrSvYw==", + "dev": true + }, + 
"prepend-http": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/prepend-http/-/prepend-http-2.0.0.tgz", + "integrity": "sha1-6SQ0v6XqjBn0HN/UAddBo8gZ2Jc=", + "dev": true + }, + "proxy-addr": { + "version": "2.0.7", + "resolved": "https://registry.npmjs.org/proxy-addr/-/proxy-addr-2.0.7.tgz", + "integrity": "sha512-llQsMLSUDUPT44jdrU/O37qlnifitDP+ZwrmmZcoSKyLKvtZxpyV0n2/bD/N4tBAAZ/gJEdZU7KMraoK1+XYAg==", + "requires": { + "forwarded": "0.2.0", + "ipaddr.js": "1.9.1" + } + }, + "pstree.remy": { + "version": "1.1.8", + "resolved": "https://registry.npmjs.org/pstree.remy/-/pstree.remy-1.1.8.tgz", + "integrity": "sha512-77DZwxQmxKnu3aR542U+X8FypNzbfJ+C5XQDk3uWjWxn6151aIMGthWYRXTqT1E5oJvg+ljaa2OJi+VfvCOQ8w==", + "dev": true + }, + "pump": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/pump/-/pump-3.0.0.tgz", + "integrity": "sha512-LwZy+p3SFs1Pytd/jYct4wpv49HiYCqd9Rlc5ZVdk0V+8Yzv6jR5Blk3TRmPL1ft69TxP0IMZGJ+WPFU2BFhww==", + "dev": true, + "requires": { + "end-of-stream": "^1.1.0", + "once": "^1.3.1" + } + }, + "pupa": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/pupa/-/pupa-2.1.1.tgz", + "integrity": "sha512-l1jNAspIBSFqbT+y+5FosojNpVpF94nlI+wDUpqP9enwOTfHx9f0gh5nB96vl+6yTpsJsypeNrwfzPrKuHB41A==", + "dev": true, + "requires": { + "escape-goat": "^2.0.0" + } + }, + "qs": { + "version": "6.7.0", + "resolved": "https://registry.npmjs.org/qs/-/qs-6.7.0.tgz", + "integrity": "sha512-VCdBRNFTX1fyE7Nb6FYoURo/SPe62QCaAyzJvUjwRaIsc+NePBEniHlvxFmmX56+HZphIGtV0XeCirBtpDrTyQ==" + }, + "range-parser": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/range-parser/-/range-parser-1.2.1.tgz", + "integrity": "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg==" + }, + "raw-body": { + "version": "2.4.0", + "resolved": "https://registry.npmjs.org/raw-body/-/raw-body-2.4.0.tgz", + "integrity": 
"sha512-4Oz8DUIwdvoa5qMJelxipzi/iJIi40O5cGV1wNYp5hvZP8ZN0T+jiNkL0QepXs+EsQ9XJ8ipEDoiH70ySUJP3Q==", + "requires": { + "bytes": "3.1.0", + "http-errors": "1.7.2", + "iconv-lite": "0.4.24", + "unpipe": "1.0.0" + } + }, + "rc": { + "version": "1.2.8", + "resolved": "https://registry.npmjs.org/rc/-/rc-1.2.8.tgz", + "integrity": "sha512-y3bGgqKj3QBdxLbLkomlohkvsA8gdAiUQlSBJnBhfn+BPxg4bc62d8TcBW15wavDfgexCgccckhcZvywyQYPOw==", + "dev": true, + "requires": { + "deep-extend": "^0.6.0", + "ini": "~1.3.0", + "minimist": "^1.2.0", + "strip-json-comments": "~2.0.1" + } + }, + "readable-stream": { + "version": "3.6.0", + "resolved": "https://registry.npmjs.org/readable-stream/-/readable-stream-3.6.0.tgz", + "integrity": "sha512-BViHy7LKeTz4oNnkcLJ+lVSL6vpiFeX6/d3oSH8zCW7UxP2onchk+vTGB143xuFjHS3deTgkKoXXymXqymiIdA==", + "requires": { + "inherits": "^2.0.3", + "string_decoder": "^1.1.1", + "util-deprecate": "^1.0.1" + } + }, + "readdirp": { + "version": "3.6.0", + "resolved": "https://registry.npmjs.org/readdirp/-/readdirp-3.6.0.tgz", + "integrity": "sha512-hOS089on8RduqdbhvQ5Z37A0ESjsqz6qnRcffsMU3495FuTdqSm+7bhJ29JvIOsBDEEnan5DPu9t3To9VRlMzA==", + "dev": true, + "requires": { + "picomatch": "^2.2.1" + } + }, + "registry-auth-token": { + "version": "4.2.1", + "resolved": "https://registry.npmjs.org/registry-auth-token/-/registry-auth-token-4.2.1.tgz", + "integrity": "sha512-6gkSb4U6aWJB4SF2ZvLb76yCBjcvufXBqvvEx1HbmKPkutswjW1xNVRY0+daljIYRbogN7O0etYSlbiaEQyMyw==", + "dev": true, + "requires": { + "rc": "^1.2.8" + } + }, + "registry-url": { + "version": "5.1.0", + "resolved": "https://registry.npmjs.org/registry-url/-/registry-url-5.1.0.tgz", + "integrity": "sha512-8acYXXTI0AkQv6RAOjE3vOaIXZkT9wo4LOFbBKYQEEnnMNBpKqdUrI6S4NT0KPIo/WVvJ5tE/X5LF/TQUf0ekw==", + "dev": true, + "requires": { + "rc": "^1.2.8" + } + }, + "responselike": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/responselike/-/responselike-1.0.2.tgz", + "integrity": "sha1-kYcg7ztjHFZCvgaPFa3lpG9Loec=", 
+ "dev": true, + "requires": { + "lowercase-keys": "^1.0.0" + } + }, + "safe-buffer": { + "version": "5.1.2", + "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", + "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" + }, + "safer-buffer": { + "version": "2.1.2", + "resolved": "https://registry.npmjs.org/safer-buffer/-/safer-buffer-2.1.2.tgz", + "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==" + }, + "semver": { + "version": "5.7.1", + "resolved": "https://registry.npmjs.org/semver/-/semver-5.7.1.tgz", + "integrity": "sha512-sauaDf/PZdVgrLTNYHRtpXa1iRiKcaebiKQ1BJdpQlWH2lCvexQdX55snPFyK7QzpudqbCI0qXFfOasHdyNDGQ==", + "dev": true + }, + "semver-diff": { + "version": "3.1.1", + "resolved": "https://registry.npmjs.org/semver-diff/-/semver-diff-3.1.1.tgz", + "integrity": "sha512-GX0Ix/CJcHyB8c4ykpHGIAvLyOwOobtM/8d+TQkAd81/bEjgPHrfba41Vpesr7jX/t8Uh+R3EX9eAS5be+jQYg==", + "dev": true, + "requires": { + "semver": "^6.3.0" + }, + "dependencies": { + "semver": { + "version": "6.3.0", + "resolved": "https://registry.npmjs.org/semver/-/semver-6.3.0.tgz", + "integrity": "sha512-b39TBaTSfV6yBrapU89p5fKekE2m/NwnDocOVruQFS1/veMgdzuPcnOM34M6CwxW8jH/lxEa5rBoDeUwu5HHTw==", + "dev": true + } + } + }, + "send": { + "version": "0.17.1", + "resolved": "https://registry.npmjs.org/send/-/send-0.17.1.tgz", + "integrity": "sha512-BsVKsiGcQMFwT8UxypobUKyv7irCNRHk1T0G680vk88yf6LBByGcZJOTJCrTP2xVN6yI+XjPJcNuE3V4fT9sAg==", + "requires": { + "debug": "2.6.9", + "depd": "~1.1.2", + "destroy": "~1.0.4", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "etag": "~1.8.1", + "fresh": "0.5.2", + "http-errors": "~1.7.2", + "mime": "1.6.0", + "ms": "2.1.1", + "on-finished": "~2.3.0", + "range-parser": "~1.2.1", + "statuses": "~1.5.0" + }, + "dependencies": { + "ms": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.1.tgz", + 
"integrity": "sha512-tgp+dl5cGk28utYktBsrFqA7HKgrhgPsg6Z/EfhWI4gl1Hwq8B/GmY/0oXZ6nF8hDVesS/FpnYaD/kOWhYQvyg==" + } + } + }, + "serve-static": { + "version": "1.14.1", + "resolved": "https://registry.npmjs.org/serve-static/-/serve-static-1.14.1.tgz", + "integrity": "sha512-JMrvUwE54emCYWlTI+hGrGv5I8dEwmco/00EvkzIIsR7MqrHonbD9pO2MOfFnpFntl7ecpZs+3mW+XbQZu9QCg==", + "requires": { + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "parseurl": "~1.3.3", + "send": "0.17.1" + } + }, + "set-blocking": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/set-blocking/-/set-blocking-2.0.0.tgz", + "integrity": "sha1-BF+XgtARrppoA93TgrJDkrPYkPc=" + }, + "setprototypeof": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/setprototypeof/-/setprototypeof-1.1.1.tgz", + "integrity": "sha512-JvdAWfbXeIGaZ9cILp38HntZSFSo3mWg6xGcJJsd+d4aRMOqauag1C63dJfDw7OaMYwEbHMOxEZ1lqVRYP2OAw==" + }, + "signal-exit": { + "version": "3.0.3", + "resolved": "https://registry.npmjs.org/signal-exit/-/signal-exit-3.0.3.tgz", + "integrity": "sha512-VUJ49FC8U1OxwZLxIbTTrDvLnf/6TDgxZcK8wxR8zs13xpx7xbG60ndBlhNrFi2EMuFRoeDoJO7wthSLq42EjA==" + }, + "statuses": { + "version": "1.5.0", + "resolved": "https://registry.npmjs.org/statuses/-/statuses-1.5.0.tgz", + "integrity": "sha1-Fhx9rBd2Wf2YEfQ3cfqZOBR4Yow=" + }, + "string-width": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/string-width/-/string-width-1.0.2.tgz", + "integrity": "sha1-EYvfW4zcUaKn5w0hHgfisLmxB9M=", + "requires": { + "code-point-at": "^1.0.0", + "is-fullwidth-code-point": "^1.0.0", + "strip-ansi": "^3.0.0" + } + }, + "string_decoder": { + "version": "1.3.0", + "resolved": "https://registry.npmjs.org/string_decoder/-/string_decoder-1.3.0.tgz", + "integrity": "sha512-hkRX8U1WjJFd8LsDJ2yQ/wWWxaopEsABU1XfkM8A+j0+85JAGppt16cr1Whg6KIbb4okU6Mql6BOj+uup/wKeA==", + "requires": { + "safe-buffer": "~5.2.0" + }, + "dependencies": { + "safe-buffer": { + "version": "5.2.1", + "resolved": 
"https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.2.1.tgz", + "integrity": "sha512-rp3So07KcdmmKbGvgaNxQSJr7bGVSVk5S9Eq1F+ppbRo70+YeaDxkw5Dd8NPN+GD6bjnYm2VuPuCXmpuYvmCXQ==" + } + } + }, + "strip-ansi": { + "version": "3.0.1", + "resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-3.0.1.tgz", + "integrity": "sha1-ajhfuIU9lS1f8F0Oiq+UJ43GPc8=", + "requires": { + "ansi-regex": "^2.0.0" + } + }, + "strip-json-comments": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/strip-json-comments/-/strip-json-comments-2.0.1.tgz", + "integrity": "sha1-PFMZQukIwml8DsNEhYwobHygpgo=", + "dev": true + }, + "supports-color": { + "version": "5.5.0", + "resolved": "https://registry.npmjs.org/supports-color/-/supports-color-5.5.0.tgz", + "integrity": "sha512-QjVjwdXIt408MIiAqCX4oUKsgU2EqAGzs2Ppkm4aQYbjm+ZEWEcW4SfFNTr4uMNZma0ey4f5lgLrkB0aX0QMow==", + "dev": true, + "requires": { + "has-flag": "^3.0.0" + } + }, + "term-size": { + "version": "2.2.1", + "resolved": "https://registry.npmjs.org/term-size/-/term-size-2.2.1.tgz", + "integrity": "sha512-wK0Ri4fOGjv/XPy8SBHZChl8CM7uMc5VML7SqiQ0zG7+J5Vr+RMQDoHa2CNT6KHUnTGIXH34UDMkPzAUyapBZg==", + "dev": true + }, + "to-readable-stream": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/to-readable-stream/-/to-readable-stream-1.0.0.tgz", + "integrity": "sha512-Iq25XBt6zD5npPhlLVXGFN3/gyR2/qODcKNNyTMd4vbm39HUaOiAM4PMq0eMVC/Tkxz+Zjdsc55g9yyz+Yq00Q==", + "dev": true + }, + "to-regex-range": { + "version": "5.0.1", + "resolved": "https://registry.npmjs.org/to-regex-range/-/to-regex-range-5.0.1.tgz", + "integrity": "sha512-65P7iz6X5yEr1cwcgvQxbbIw7Uk3gOy5dIdtZ4rDveLqhrdJP+Li/Hx6tyK0NEb+2GCyneCMJiGqrADCSNk8sQ==", + "dev": true, + "requires": { + "is-number": "^7.0.0" + } + }, + "toidentifier": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/toidentifier/-/toidentifier-1.0.0.tgz", + "integrity": 
"sha512-yaOH/Pk/VEhBWWTlhI+qXxDFXlejDGcQipMlyxda9nthulaxLZUNcUqFxokp0vcYnvteJln5FNQDRrxj3YcbVw==" + }, + "touch": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/touch/-/touch-3.1.0.tgz", + "integrity": "sha512-WBx8Uy5TLtOSRtIq+M03/sKDrXCLHxwDcquSP2c43Le03/9serjQBIztjRz6FkJez9D/hleyAXTBGLwwZUw9lA==", + "dev": true, + "requires": { + "nopt": "~1.0.10" + } + }, + "type-fest": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/type-fest/-/type-fest-0.8.1.tgz", + "integrity": "sha512-4dbzIzqvjtgiM5rw1k5rEHtBANKmdudhGyBEajN01fEyhaAIhsoKNy6y7+IN93IfpFtwY9iqi7kD+xwKhQsNJA==", + "dev": true + }, + "type-is": { + "version": "1.6.18", + "resolved": "https://registry.npmjs.org/type-is/-/type-is-1.6.18.tgz", + "integrity": "sha512-TkRKr9sUTxEH8MdfuCSP7VizJyzRNMjj2J2do2Jr3Kym598JVdEksuzPQCnlFPW4ky9Q+iA+ma9BGm06XQBy8g==", + "requires": { + "media-typer": "0.3.0", + "mime-types": "~2.1.24" + } + }, + "typedarray-to-buffer": { + "version": "3.1.5", + "resolved": "https://registry.npmjs.org/typedarray-to-buffer/-/typedarray-to-buffer-3.1.5.tgz", + "integrity": "sha512-zdu8XMNEDepKKR+XYOXAVPtWui0ly0NtohUscw+UmaHiAWT8hrV1rr//H6V+0DvJ3OQ19S979M0laLfX8rm82Q==", + "dev": true, + "requires": { + "is-typedarray": "^1.0.0" + } + }, + "typescript": { + "version": "4.4.2", + "resolved": "https://registry.npmjs.org/typescript/-/typescript-4.4.2.tgz", + "integrity": "sha512-gzP+t5W4hdy4c+68bfcv0t400HVJMMd2+H9B7gae1nQlBzCqvrXX+6GL/b3GAgyTH966pzrZ70/fRjwAtZksSQ==", + "dev": true + }, + "undefsafe": { + "version": "2.0.3", + "resolved": "https://registry.npmjs.org/undefsafe/-/undefsafe-2.0.3.tgz", + "integrity": "sha512-nrXZwwXrD/T/JXeygJqdCO6NZZ1L66HrxM/Z7mIq2oPanoN0F1nLx3lwJMu6AwJY69hdixaFQOuoYsMjE5/C2A==", + "dev": true, + "requires": { + "debug": "^2.2.0" + } + }, + "unique-string": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/unique-string/-/unique-string-2.0.0.tgz", + "integrity": 
"sha512-uNaeirEPvpZWSgzwsPGtU2zVSTrn/8L5q/IexZmH0eH6SA73CmAA5U4GwORTxQAZs95TAXLNqeLoPPNO5gZfWg==", + "dev": true, + "requires": { + "crypto-random-string": "^2.0.0" + } + }, + "unpipe": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/unpipe/-/unpipe-1.0.0.tgz", + "integrity": "sha1-sr9O6FFKrmFltIF4KdIbLvSZBOw=" + }, + "update-notifier": { + "version": "4.1.3", + "resolved": "https://registry.npmjs.org/update-notifier/-/update-notifier-4.1.3.tgz", + "integrity": "sha512-Yld6Z0RyCYGB6ckIjffGOSOmHXj1gMeE7aROz4MG+XMkmixBX4jUngrGXNYz7wPKBmtoD4MnBa2Anu7RSKht/A==", + "dev": true, + "requires": { + "boxen": "^4.2.0", + "chalk": "^3.0.0", + "configstore": "^5.0.1", + "has-yarn": "^2.1.0", + "import-lazy": "^2.1.0", + "is-ci": "^2.0.0", + "is-installed-globally": "^0.3.1", + "is-npm": "^4.0.0", + "is-yarn-global": "^0.3.0", + "latest-version": "^5.0.0", + "pupa": "^2.0.1", + "semver-diff": "^3.1.1", + "xdg-basedir": "^4.0.0" + } + }, + "url-parse-lax": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/url-parse-lax/-/url-parse-lax-3.0.0.tgz", + "integrity": "sha1-FrXK/Afb42dsGxmZF3gj1lA6yww=", + "dev": true, + "requires": { + "prepend-http": "^2.0.0" + } + }, + "util-deprecate": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/util-deprecate/-/util-deprecate-1.0.2.tgz", + "integrity": "sha1-RQ1Nyfpw3nMnYvvS1KKJgUGaDM8=" + }, + "utils-merge": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/utils-merge/-/utils-merge-1.0.1.tgz", + "integrity": "sha1-n5VxD1CiZ5R7LMwSR0HBAoQn5xM=" + }, + "vary": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/vary/-/vary-1.1.2.tgz", + "integrity": "sha1-IpnwLG3tMNSllhsLn3RSShj2NPw=" + }, + "wide-align": { + "version": "1.1.3", + "resolved": "https://registry.npmjs.org/wide-align/-/wide-align-1.1.3.tgz", + "integrity": "sha512-QGkOQc8XL6Bt5PwnsExKBPuMKBxnGxWWW3fU55Xt4feHozMUhdUMaBCk290qpm/wG5u/RSKzwdAC4i51YigihA==", + "requires": { + "string-width": "^1.0.2 || 
2" + } + }, + "widest-line": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/widest-line/-/widest-line-3.1.0.tgz", + "integrity": "sha512-NsmoXalsWVDMGupxZ5R08ka9flZjjiLvHVAWYOKtiKM8ujtZWr9cRffak+uSE48+Ob8ObalXpwyeUiyDD6QFgg==", + "dev": true, + "requires": { + "string-width": "^4.0.0" + }, + "dependencies": { + "ansi-regex": { + "version": "5.0.0", + "resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-5.0.0.tgz", + "integrity": "sha512-bY6fj56OUQ0hU1KjFNDQuJFezqKdrAyFdIevADiqrWHwSlbmBNMHp5ak2f40Pm8JTFyM2mqxkG6ngkHO11f/lg==", + "dev": true + }, + "emoji-regex": { + "version": "8.0.0", + "resolved": "https://registry.npmjs.org/emoji-regex/-/emoji-regex-8.0.0.tgz", + "integrity": "sha512-MSjYzcWNOA0ewAHpz0MxpYFvwg6yjy1NG3xteoqz644VCo/RPgnr1/GGt+ic3iJTzQ8Eu3TdM14SawnVUmGE6A==", + "dev": true + }, + "is-fullwidth-code-point": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/is-fullwidth-code-point/-/is-fullwidth-code-point-3.0.0.tgz", + "integrity": "sha512-zymm5+u+sCsSWyD9qNaejV3DFvhCKclKdizYaJUuHA83RLjb7nSuGnddCHGv0hk+KY7BMAlsWeK4Ueg6EV6XQg==", + "dev": true + }, + "string-width": { + "version": "4.2.2", + "resolved": "https://registry.npmjs.org/string-width/-/string-width-4.2.2.tgz", + "integrity": "sha512-XBJbT3N4JhVumXE0eoLU9DCjcaF92KLNqTmFCnG1pf8duUxFGwtP6AD6nkjw9a3IdiRtL3E2w3JDiE/xi3vOeA==", + "dev": true, + "requires": { + "emoji-regex": "^8.0.0", + "is-fullwidth-code-point": "^3.0.0", + "strip-ansi": "^6.0.0" + } + }, + "strip-ansi": { + "version": "6.0.0", + "resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-6.0.0.tgz", + "integrity": "sha512-AuvKTrTfQNYNIctbR1K/YGTR1756GycPsg7b9bdV9Duqur4gv6aKqHXah67Z8ImS7WEz5QVcOtlfW2rZEugt6w==", + "dev": true, + "requires": { + "ansi-regex": "^5.0.0" + } + } + } + }, + "wrappy": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/wrappy/-/wrappy-1.0.2.tgz", + "integrity": "sha1-tSQ9jz7BqjXxNkYFvA0QNuMKtp8=", + "dev": true + }, + "write-file-atomic": 
{ + "version": "3.0.3", + "resolved": "https://registry.npmjs.org/write-file-atomic/-/write-file-atomic-3.0.3.tgz", + "integrity": "sha512-AvHcyZ5JnSfq3ioSyjrBkH9yW4m7Ayk8/9My/DD9onKeu/94fwrMocemO2QAJFAlnnDN+ZDS+ZjAR5ua1/PV/Q==", + "dev": true, + "requires": { + "imurmurhash": "^0.1.4", + "is-typedarray": "^1.0.0", + "signal-exit": "^3.0.2", + "typedarray-to-buffer": "^3.1.5" + } + }, + "ws": { + "version": "8.2.1", + "resolved": "https://registry.npmjs.org/ws/-/ws-8.2.1.tgz", + "integrity": "sha512-XkgWpJU3sHU7gX8f13NqTn6KQ85bd1WU7noBHTT8fSohx7OS1TPY8k+cyRPCzFkia7C4mM229yeHr1qK9sM4JQ==" + }, + "xdg-basedir": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/xdg-basedir/-/xdg-basedir-4.0.0.tgz", + "integrity": "sha512-PSNhEJDejZYV7h50BohL09Er9VaIefr2LMAf3OEmpCkjOi34eYyQYAXUTjEQtZJTKcF0E2UKTh+osDLsgNim9Q==", + "dev": true + } + } +} diff --git a/demos/image_pipeline_web/backend/package.json b/demos/image_pipeline_web/backend/package.json new file mode 100644 index 00000000..12509b13 --- /dev/null +++ b/demos/image_pipeline_web/backend/package.json @@ -0,0 +1,26 @@ +{ + "name": "SeptembeRSE-demo-backend", + "version": "1.0.0", + "description": "Backend for the SeptembeRSE demo", + "main": "dist/index.js", + "author": "Francesco Sgherzi", + "license": "MIT", + "scripts": { + "build": "tsc", + "start": "tsc & node .", + "dev": "tsc -w & nodemon --polyglot --jvm --vm.Dtruffle.class.path.append=$GRAAL_HOME/languages/grcuda/grcuda.jar --experimental-options --grcuda.ExecutionPolicy=async dist/index.js 8080", + "runall": "node --polyglot --jvm --grcuda.NumberOfGPUs=4 --vm.Dtruffle.class.path.append=$GRAAL_HOME/languages/grcuda/grcuda.jar --experimental-options --grcuda.ExecutionPolicy=sync dist/index.js 8080 2 & node --polyglot --jvm --vm.Dtruffle.class.path.append=$GRAAL_HOME/languages/grcuda/grcuda.jar --experimental-options --grcuda.NumberOfGPUs=4 --grcuda.ExecutionPolicy=async --grcuda.ForceStreamAttach --grcuda.RetrieveNewStreamPolicy=always-new 
--grcuda.DependencyPolicy=with-const --grcuda.RetrieveParentStreamPolicy=disjoint --grcuda.InputPrefetch dist/index.js 8083 0 & node --polyglot --jvm --vm.Dtruffle.class.path.append=$GRAAL_HOME/languages/grcuda/grcuda.jar --experimental-options --grcuda.NumberOfGPUs=4 --grcuda.ExecutionPolicy=sync dist/index.js 8082 3" + }, + "dependencies": { + "express": "^4.17.1", + "opencv4nodejs": "^5.6.0", + "ws": "^8.0.0" + }, + "devDependencies": { + "@types/express": "^4.17.13", + "@types/node": "^16.4.6", + "@types/ws": "^7.4.7", + "nodemon": "^2.0.12", + "typescript": "^4.3.5" + } +} diff --git a/demos/image_pipeline_web/backend/src/CudaKernels.ts b/demos/image_pipeline_web/backend/src/CudaKernels.ts new file mode 100644 index 00000000..44f08227 --- /dev/null +++ b/demos/image_pipeline_web/backend/src/CudaKernels.ts @@ -0,0 +1,227 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+ +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +export const GAUSSIAN_BLUR = ` +extern "C" __global__ void gaussian_blur(const int *image, float *result, int rows, int cols, const float* kernel, int diameter) { + extern __shared__ float kernel_local[]; + for(int i = threadIdx.x; i < diameter; i += blockDim.x) { + for(int j = threadIdx.y; j < diameter; j += blockDim.y) { + kernel_local[i * diameter + j] = kernel[i * diameter + j]; + } + } + __syncthreads(); + + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum = 0; + int radius = diameter / 2; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + sum += kernel_local[(x + radius) * diameter + (y + radius)] * (float(image[nx * cols + ny]) / 255); + } + } + } + result[i * cols + j] = sum; + } + } +} +` + +export const SOBEL = ` +extern "C" __global__ void sobel(float *image, float *result, int rows, int cols) { + // int SOBEL_X[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}}; + // int SOBEL_Y[3][3] = {{-1, 0, 1}, {-2, 0, 2}, 
{-1, 0, 1}}; + __shared__ int SOBEL_X[9]; + __shared__ int SOBEL_Y[9]; + if (threadIdx.x == 0 && threadIdx.y == 0) { + SOBEL_X[0] = -1; + SOBEL_X[1] = -2; + SOBEL_X[2] = -1; + SOBEL_X[3] = 0; + SOBEL_X[4] = 0; + SOBEL_X[5] = 0; + SOBEL_X[6] = 1; + SOBEL_X[7] = 2; + SOBEL_X[8] = 1; + + SOBEL_Y[0] = -1; + SOBEL_Y[1] = 0; + SOBEL_Y[2] = 1; + SOBEL_Y[3] = -2; + SOBEL_Y[4] = 0; + SOBEL_Y[5] = 2; + SOBEL_Y[6] = -1; + SOBEL_Y[7] = 0; + SOBEL_Y[8] = 1; + } + __syncthreads(); + + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum_gradient_x = 0.0, sum_gradient_y = 0.0; + int radius = 1; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + float neighbour = image[nx * cols + ny]; + int s = (x + radius) * 3 + y + radius; + sum_gradient_x += SOBEL_X[s] * neighbour; + sum_gradient_y += SOBEL_Y[s] * neighbour; + } + } + } + result[i * cols + j] = sqrt(sum_gradient_x * sum_gradient_x + sum_gradient_y * sum_gradient_y); + } + } +} +` + +export const EXTEND_MASK = ` +__device__ float atomicMinf(float* address, float val) +{ + int *address_as_int =(int*)address; + int old = *address_as_int, assumed; + while (val < __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, + __float_as_int(val)); + } + return __int_as_float(old); +} + +__device__ float atomicMaxf(float* address, float val) +{ + int *address_as_int = (int*) address; + int old = *address_as_int, assumed; + // If val is smaller than current, don't do anything, else update the current value atomically; + while (val > __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, __float_as_int(val)); + } + return __int_as_float(old); +} + +__inline__ __device__ float warp_reduce_max(float val) { + int 
warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = max(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +__inline__ __device__ float warp_reduce_min(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +extern "C" __global__ void maximum(float *in, float* out, int N) { + int warp_size = 32; + float maximum = -1000; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + maximum = max(maximum, in[i]); + } + maximum = warp_reduce_max(maximum); // Obtain the max of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicMaxf(out, maximum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void minimum(float *in, float* out, int N) { + int warp_size = 32; + float minimum = 1000; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + minimum = min(minimum, in[i]); + } + minimum = warp_reduce_min(minimum); // Obtain the min of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicMinf(out, minimum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void extend(float *x, const float *minimum, const float *maximum, int n, int extend_factor) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = extend_factor * (x[i] - *minimum) / (*maximum - *minimum); + x[i] = res_tmp > 1 ? 
1 : res_tmp;
+    }
+}
+`
+
+export const UNSHARPEN = `
+extern "C" __global__ void unsharpen(const int *x, const float *y, float *res, float amount, int n) {
+    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
+        float res_tmp = (float(x[i]) / 255) * (1 + amount) - y[i] * amount;
+        res_tmp = res_tmp > 1 ? 1 : res_tmp;
+        res[i] = res_tmp < 0 ? 0 : res_tmp;
+    }
+}
+`
+
+export const COMBINE = `
+extern "C" __global__ void combine(const float *x, const float *y, const float *mask, float *res, int n) {
+    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
+        res[i] = x[i] * mask[i] + y[i] * (1 - mask[i]);
+    }
+}
+`
+
+export const COMBINE_2 = `
+extern "C" __global__ void combine_lut(const float *x, const float *y, const float *mask, int *res, int n, int* lut) {
+    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
+        res[i] = lut[min(256 - 1, int(256 * (x[i] * mask[i] + y[i] * (1 - mask[i]))))];
+    }
+}
+`
+
+export const RESET = `
+extern "C" __global__ void reset(float *x, int n) {
+    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
+        x[i] = 0.0;
+    }
+}
+`
+
+export const INT_TO_FLOAT = `
+extern "C" __global__ void int_to_float(const int *x, float *y, int n) {
+    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
+        y[i] = float(x[i]) / 255;
+    }
+}
+`
+
+export const FLOAT_TO_INT = `
+extern "C" __global__ void float_to_int(const float *x, int *y, int n) {
+    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
+        y[i] = int(x[i] * 255);
+    }
+}
+`
\ No newline at end of file
diff --git a/demos/image_pipeline_web/backend/src/GrCUDAProxy.ts b/demos/image_pipeline_web/backend/src/GrCUDAProxy.ts
new file mode 100644
index 00000000..4cc0787e
--- /dev/null
+++ b/demos/image_pipeline_web/backend/src/GrCUDAProxy.ts
@@ -0,0 +1,408 @@
+// Copyright (c) 2021, 
NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import WebSocket from 'ws' +import { + _sleep, + _getDelayJitter, + _intervalToMs, + _gaussianKernel, + loadImage, + storeImage, + LUT, + copyFrom +} from './utils' + +import { + KERNEL_LARGE_DIAMETER, + KERNEL_LARGE_VARIANCE, + KERNEL_SMALL_DIAMETER, + KERNEL_SMALL_VARIANCE, + KERNEL_UNSHARPEN_DIAMETER, + KERNEL_UNSHARPEN_VARIANCE, + UNSHARPEN_AMOUNT, + NUM_BLOCKS as BLOCKS, + THREADS_1D, + THREADS_2D, + DEBUG, + RESIZED_IMG_WIDTH, + MOCK_OPTIONS, + COMPUTATION_MODES, + CONFIG_OPTIONS, + BW, + CUDA_NATIVE_EXEC_FILE, + CUDA_NATIVE_IMAGE_OUT_BIG_DIRECTORY, + CUDA_NATIVE_IMAGE_OUT_SMALL_DIRECTORY, + CUDA_NATIVE_IMAGE_IN_DIRECTORY, + CDEPTH +} from './options'; + + +import * as ck from "./CudaKernels" + +// Load OpenCV; +import cv from "opencv4nodejs"; + +import { + execSync +} from "child_process" + +// Load GrCUDA; +//@ts-ignore +const cu = Polyglot.eval("grcuda", "CU") + +//@ts-ignore +//const cudaSetDevice = Polyglot.eval("grcuda", "cudaSetDevice") + + +// Use Java System to measure time; +//@ts-ignore +const System = Java.type("java.lang.System"); + +// Build the CUDA kernels; +const GAUSSIAN_BLUR_KERNEL = cu.buildkernel(ck.GAUSSIAN_BLUR, "gaussian_blur", "const pointer, pointer, sint32, sint32, const pointer, sint32"); +const SOBEL_KERNEL = cu.buildkernel(ck.SOBEL, "sobel", "pointer, pointer, sint32, sint32"); +const EXTEND_KERNEL = cu.buildkernel(ck.EXTEND_MASK, "extend", "pointer, const pointer, const pointer, sint32, sint32"); +const MAXIMUM_KERNEL = cu.buildkernel(ck.EXTEND_MASK, "maximum", "const pointer, pointer, sint32"); +const MINIMUM_KERNEL = cu.buildkernel(ck.EXTEND_MASK, "minimum", "const pointer, pointer, sint32"); +const UNSHARPEN_KERNEL = cu.buildkernel(ck.UNSHARPEN, "unsharpen", "const pointer, const pointer, pointer, float, sint32"); +const COMBINE_KERNEL = cu.buildkernel(ck.COMBINE, "combine", "const pointer, const pointer, const pointer, pointer, sint32"); +const COMBINE_KERNEL_LUT = cu.buildkernel(ck.COMBINE_2, "combine_lut", "const 
pointer, const pointer, const pointer, pointer, sint32, pointer");
+
+export class GrCUDAProxy {
+  private ws: WebSocket
+  private computationType: string
+  private imagesToSend: { [id: string]: Array<string> } = {}
+  private totalTime: number = 0
+  constructor(ws: WebSocket) {
+    this.ws = ws
+  }
+
+  /*
+   * Begins the computation using the mode specified
+   * by `computationType`
+   * @param computationType {string}
+   * @returns `void`
+   */
+  public async beginComputation(computationType: string) {
+    this.computationType = computationType
+    console.log("beginning computation ", computationType)
+
+    COMPUTATION_MODES.forEach(cm => this.imagesToSend[cm] = [])
+
+    if (computationType == "sync" || computationType == "race-sync" ||
+        computationType == "async" || computationType == "race-async") {
+      await this.runGrCUDA(computationType)
+      return
+    }
+    if (computationType == "cuda-native" || computationType == "race-cuda-native") {
+      await this.runNative(computationType)
+      return
+    }
+
+    throw new Error(`Could not recognize computation type: ${computationType}`)
+  }
+
+  private communicateAll(imageId: number, computationType: string) {
+
+    const {
+      MAX_PHOTOS,
+    } = CONFIG_OPTIONS
+
+    this.communicateProgress(imageId / MAX_PHOTOS * 100, computationType)
+    this.communicateImageProcessed(imageId, computationType)
+  }
+
+  async processImageBW(img: cv.Mat) {
+    return new cv.Mat(Buffer.from(await this.processImage(img.getData(), img.rows, 0)), img.rows, img.cols, cv.CV_8UC1);
+  }
+
+  async processImageColor(img: cv.Mat) {
+    let channels = img.splitChannels()
+
+    const buffers = await Promise.all([
+      this.processImage(channels[0].getData(), img.rows, 0),
+      this.processImage(channels[1].getData(), img.rows, 1),
+      this.processImage(channels[2].getData(), img.rows, 2)
+    ])
+
+    channels = buffers.map(buffer => new cv.Mat(buffer, img.rows, img.cols, cv.CV_8UC1))
+
+    return new 
cv.Mat(channels); + } + + private async processImage(img: Buffer, size: number, channel: number, debug: boolean = DEBUG) { + // Allocate image data; + const image = cu.DeviceArray("int", size * size); + const image2 = cu.DeviceArray("float", size, size); + const image3 = cu.DeviceArray("int", size * size); + + const kernel_small = cu.DeviceArray("float", KERNEL_SMALL_DIAMETER, KERNEL_SMALL_DIAMETER); + const kernel_large = cu.DeviceArray("float", KERNEL_LARGE_DIAMETER, KERNEL_LARGE_DIAMETER); + const kernel_unsharpen = cu.DeviceArray("float", KERNEL_UNSHARPEN_DIAMETER, KERNEL_UNSHARPEN_DIAMETER); + + const maximum_1 = cu.DeviceArray("float", 1); + const minimum_1 = cu.DeviceArray("float", 1); + const maximum_2 = cu.DeviceArray("float", 1); + const minimum_2 = cu.DeviceArray("float", 1); + + const mask_small = cu.DeviceArray("float", size, size); + const mask_large = cu.DeviceArray("float", size, size); + const image_unsharpen = cu.DeviceArray("float", size, size); + + const blurred_small = cu.DeviceArray("float", size, size); + const blurred_large = cu.DeviceArray("float", size, size); + const blurred_unsharpen = cu.DeviceArray("float", size, size); + + const lut = cu.DeviceArray("int", CDEPTH); + + // Load the right LUT; + copyFrom(LUT[channel], lut); + // Fill the image data; + const s1 = System.nanoTime(); + copyFrom(img, image); + const e1 = System.nanoTime(); + if (debug) console.log("--img to device array=" + _intervalToMs(s1, e1) + " ms"); + + const start = System.nanoTime(); + + // Create Gaussian kernels; + _gaussianKernel(kernel_small, KERNEL_SMALL_DIAMETER, KERNEL_SMALL_VARIANCE); + _gaussianKernel(kernel_large, KERNEL_LARGE_DIAMETER, KERNEL_LARGE_VARIANCE); + _gaussianKernel(kernel_unsharpen, KERNEL_UNSHARPEN_DIAMETER, KERNEL_UNSHARPEN_VARIANCE); + + // Main GPU computation; + // Blur - Small; + GAUSSIAN_BLUR_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D], 4 * KERNEL_SMALL_DIAMETER * KERNEL_SMALL_DIAMETER)( + image, blurred_small, size, size, 
kernel_small, KERNEL_SMALL_DIAMETER); + // Blur - Large; + GAUSSIAN_BLUR_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D], 4 * KERNEL_LARGE_DIAMETER * KERNEL_LARGE_DIAMETER)( + image, blurred_large, size, size, kernel_large, KERNEL_LARGE_DIAMETER); + // Blur - Unsharpen; + GAUSSIAN_BLUR_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D], 4 * KERNEL_UNSHARPEN_DIAMETER * KERNEL_UNSHARPEN_DIAMETER)( + image, blurred_unsharpen, size, size, kernel_unsharpen, KERNEL_UNSHARPEN_DIAMETER); + // Sobel filter (edge detection); + SOBEL_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D])( + blurred_small, mask_small, size, size); + SOBEL_KERNEL([BLOCKS, BLOCKS], [THREADS_2D, THREADS_2D])( + blurred_large, mask_large, size, size); + // Ensure that the output of Sobel is in [0, 1]; + MAXIMUM_KERNEL(BLOCKS * 2, THREADS_1D)(mask_small, maximum_1, size * size); + MINIMUM_KERNEL(BLOCKS * 2, THREADS_1D)(mask_small, minimum_1, size * size); + EXTEND_KERNEL(BLOCKS * 2, THREADS_1D)(mask_small, minimum_1, maximum_1, size * size, 1); + // Extend large edge detection mask, and normalize it; + MAXIMUM_KERNEL(BLOCKS * 2, THREADS_1D)(mask_large, maximum_2, size * size); + MINIMUM_KERNEL(BLOCKS * 2, THREADS_1D)(mask_large, minimum_2, size * size); + EXTEND_KERNEL(BLOCKS * 2, THREADS_1D)(mask_large, minimum_2, maximum_2, size * size, 5); + // Unsharpen; + UNSHARPEN_KERNEL(BLOCKS * 2, THREADS_1D)( + image, blurred_unsharpen, image_unsharpen, UNSHARPEN_AMOUNT, size * size); + // Combine results; + COMBINE_KERNEL(BLOCKS * 2, THREADS_1D)( + image_unsharpen, blurred_large, mask_large, image2, size * size); + COMBINE_KERNEL_LUT(BLOCKS * 2, THREADS_1D)( + image2, blurred_small, mask_small, image3, size * size, lut); + + const tmp = image3[0]; // Required only to "sync" the GPU computation and obtain the precise GPU execution time; + const end = System.nanoTime(); + if (debug) console.log("--cuda time=" + _intervalToMs(start, end) + " ms"); + const s2 = System.nanoTime(); + img.set(image3); + const e2 = 
System.nanoTime();
+    if (debug) console.log("--device array to image=" + _intervalToMs(s2, e2) + " ms");
+
+    image.free()
+    image2.free()
+    image3.free()
+    kernel_small.free()
+    kernel_large.free()
+    kernel_unsharpen.free()
+    maximum_1.free()
+    minimum_1.free()
+    maximum_2.free()
+    minimum_2.free()
+    mask_small.free()
+    mask_large.free()
+    image_unsharpen.free()
+    blurred_small.free()
+    blurred_large.free()
+    blurred_unsharpen.free()
+    lut.free()
+
+    return img;
+  }
+
+  private async runGrCUDAInner(imageName: string, computationType: string, imageId: number, debug: boolean = DEBUG) {
+    const image = await loadImage(imageName)
+    const processedImage = BW ? await this.processImageBW(image) : await this.processImageColor(image)
+    await storeImage(processedImage, imageName)
+    this.communicateAll(imageId, computationType)
+  }
+
+  /*
+   * Compute the GrCUDA kernels.
+   * Execution mode (sync or async) depends on the options
+   * passed to Node.js
+   * @returns `void`
+   */
+  private async runGrCUDA(computationType: string, debug: boolean = DEBUG) {
+    console.log(`Computing using mode ${computationType}`)
+
+    const {
+      MAX_PHOTOS,
+    } = CONFIG_OPTIONS
+
+    const beginComputeAllImages = System.nanoTime()
+
+    for (let imageId = 0; imageId < MAX_PHOTOS + 1; ++imageId) {
+      try {
+        const imageName = ("0000" + imageId).slice(-4)
+        const begin = System.nanoTime();
+        await this.runGrCUDAInner(imageName, computationType, imageId)
+        const end = System.nanoTime();
+        if (debug) {
+          console.log(`One image took ${_intervalToMs(begin, end)} ms`)
+        }
+      } catch (e) {
+        console.log(e)
+        continue
+      }
+    }
+
+    const endComputeAllImages = System.nanoTime()
+    const executionTime = _intervalToMs(beginComputeAllImages, endComputeAllImages)
+    this.communicateExecTime(executionTime, computationType)
+    console.log(`[${this.computationType}] Whole computation took ${executionTime} ms`)
+  }
+
+  /*
+   * Compute the image pipeline using native
+   * CUDA code, by `exec`ing the compiled binary via
+   * a shell
+   * @returns `void`
+   */
+  private async runNative(computationType: string, debug: boolean = DEBUG) {
+    
console.log(`Computing using mode ${computationType}`) + + const { + MAX_PHOTOS, + } = CONFIG_OPTIONS + + const beginComputeAllImages = System.nanoTime() + + for (let imageId = 0; imageId < MAX_PHOTOS + 1; ++imageId) { + try { + const imageName = ("0000" + imageId).slice(-4) + const blackAndWhite = BW ? "-w" : "" + const begin = System.nanoTime(); + execSync( + `${CUDA_NATIVE_EXEC_FILE} -d ${blackAndWhite} -r -f ${CUDA_NATIVE_IMAGE_IN_DIRECTORY}/${imageName}.jpg -s ${CUDA_NATIVE_IMAGE_OUT_SMALL_DIRECTORY}/${imageName}.jpg -l ${CUDA_NATIVE_IMAGE_OUT_BIG_DIRECTORY}/${imageName}.jpg -n ${RESIZED_IMG_WIDTH} -g ${BLOCKS}` + ) + this.communicateAll(imageId, computationType) + const end = System.nanoTime(); + if (debug) { + console.log(`One image took ${_intervalToMs(begin, end)}`) + } + } catch (e) { + console.log(e) + continue + } + } + + const endComputeAllImages = System.nanoTime() + + this.communicateAll(MAX_PHOTOS, computationType) + const executionTime = _intervalToMs(beginComputeAllImages, endComputeAllImages) + this.communicateExecTime(executionTime, computationType) + + console.log(`[${this.computationType}] Whole computation took ${executionTime}`) + } + + private async communicateExecTime(executionTime: number, computationType: string) { + + this.ws.send(JSON.stringify({ + type: "time", + data: executionTime, + computationType + })) + + } + + /* Mock the computation of the kernels + * inside GrCUDA. 
+
+   * Sends a `progress` message every time an image is computed,
+   * and an `image` message every time SEND_BATCH_SIZE images have been computed
+   */
+  private async mockCompute(computationType: string) {
+
+    const {
+      DELAY
+    } = MOCK_OPTIONS
+
+    const {
+      MAX_PHOTOS,
+    } = CONFIG_OPTIONS
+
+    let delay_jitter = _getDelayJitter(computationType)
+
+    for (let imageId = 0; imageId < MAX_PHOTOS + 1; ++imageId) {
+      // This mocks the actual computation that happens
+      // in the CUDA realm;
+      await _sleep(DELAY + Math.random() * delay_jitter)
+      this.communicateAll(imageId, computationType)
+    }
+  }
+
+  private communicateProgress(data: number, computationType: string) {
+    this.ws.send(JSON.stringify({
+      type: "progress",
+      data: data,
+      computationType
+    }))
+  }
+
+  private communicateImageProcessed(imageId: number, computationType: string) {
+    const {
+      SEND_BATCH_SIZE,
+      MAX_PHOTOS
+    } = CONFIG_OPTIONS
+
+    this.imagesToSend[computationType].push(`./images/thumb/${("0000" + imageId).slice(-4)}.jpg`)
+
+    if ((imageId !== 0 && imageId % SEND_BATCH_SIZE === 0) || imageId === MAX_PHOTOS - 1) {
+
+      this.ws.send(JSON.stringify({
+        type: "image",
+        images: this.imagesToSend[computationType],
+        computationType
+      }))
+
+      this.imagesToSend[computationType] = []
+    }
+  }
+}
+
diff --git a/demos/image_pipeline_web/backend/src/index.ts b/demos/image_pipeline_web/backend/src/index.ts
new file mode 100644
index 00000000..524b0a1b
--- /dev/null
+++ b/demos/image_pipeline_web/backend/src/index.ts
@@ -0,0 +1,65 @@
+// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+// * Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer. 
+// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+
+import express from 'express'
+import WebSocket from 'ws'
+import http from 'http'
+import { GrCUDAProxy } from './GrCUDAProxy'
+
+
+const app = express()
+const server = http.createServer(app)
+const PORT = parseInt(process.argv[2])
+let deviceNumber = parseInt(process.argv[3])
+//@ts-ignore
+const cu = Polyglot.eval("grcuda", `CU`)
+
+const numDevices = cu.cudaGetDeviceCount()
+if (isNaN(deviceNumber) || deviceNumber >= numDevices) {
+  console.log("warning: device number (" + process.argv[3] + ") is missing or exceeds the number of GPUs (" + numDevices + "), using GPU 0 instead");
+  deviceNumber = 0;
+}
+cu.cudaSetDevice(deviceNumber);
+
+const wss = new WebSocket.Server({ server })
+
+wss.on('connection', (ws: WebSocket) => {
+  console.log(`[${PORT}] A new client connected`)
+  const grCUDAProxy = new GrCUDAProxy(ws)
+
+  ws.on('message', async (message: string) => {
+    await grCUDAProxy.beginComputation(message)
+  })
+})
+
+app.get('/', (req: any, res: any) => {
+  res.send("Everything is working properly")
+})
+
+server.listen(PORT, () => console.log(`Running on port ${PORT} - Using GPU ${deviceNumber}`))
diff --git a/demos/image_pipeline_web/backend/src/options.ts b/demos/image_pipeline_web/backend/src/options.ts
new file mode 100644
index 00000000..968f709b
--- /dev/null
+++ b/demos/image_pipeline_web/backend/src/options.ts
@@ -0,0 +1,81 @@
+// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+// * Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+// * Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+// * Neither the name of NECSTLab nor the names of its
+// contributors may be used to endorse or promote products derived
+// from this software without specific prior written permission.
+// * Neither the name of Politecnico di Milano nor the names of its
+// contributors may be used to endorse or promote products derived
+// from this software without specific prior written permission.
+
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+export const MOCK_OPTIONS = {
+  DELAY: 10, //ms
+  DELAY_JITTER_SYNC: 30, //ms
+  DELAY_JITTER_ASYNC: 0, //ms
+  DELAY_JITTER_NATIVE: 50 //ms
+}
+
+export const CONFIG_OPTIONS = {
+  MAX_PHOTOS: 30,
+  SEND_BATCH_SIZE: 1
+}
+
+export const COMPUTATION_MODES: Array<string> = ["sync", "async", "cuda-native", "race-sync", "race-async", "race-cuda-native"]
+
+// Convert images to black and white
+export const BW = false
+
+// Edge width (in pixel) of input images.
+// If a loaded image has a lower width than this, it is rescaled
+export const RESIZED_IMG_WIDTH = 1024
+// Edge width (in pixel) of output images. 
+// We store processed images in 2 variants: small and large +export const RESIZED_IMG_WIDTH_OUT_SMALL = 40 +export const RESIZED_IMG_WIDTH_OUT_LARGE = RESIZED_IMG_WIDTH + + +// Constant parameters used in the image processing +export const KERNEL_SMALL_DIAMETER = 7 +export const KERNEL_SMALL_VARIANCE = 0.1 +export const KERNEL_LARGE_DIAMETER = 9 +export const KERNEL_LARGE_VARIANCE = 20 +export const KERNEL_UNSHARPEN_DIAMETER = 7 +export const KERNEL_UNSHARPEN_VARIANCE = 5 +export const UNSHARPEN_AMOUNT = 30 +export const CDEPTH = 256 +export const FACTOR = 0.8 + +// CUDA parameters +export const NUM_BLOCKS = 2 +export const THREADS_1D = 32 +export const THREADS_2D = 8 +export const IMAGE_IN_DIRECTORY = `../frontend/images/dataset${RESIZED_IMG_WIDTH}` +export const IMAGE_OUT_SMALL_DIRECTORY = "../frontend/images/thumb" +export const IMAGE_OUT_BIG_DIRECTORY = "../frontend/images/full_res" + +export const CUDA_NATIVE_IMAGE_IN_DIRECTORY = IMAGE_IN_DIRECTORY +export const CUDA_NATIVE_EXEC_FILE = "../../image_pipeline_local/cuda/build/image_pipeline" +export const CUDA_NATIVE_IMAGE_OUT_SMALL_DIRECTORY = IMAGE_OUT_SMALL_DIRECTORY // "$HOME/grcuda/projects/demos/web_demo/frontend/images/thumb/" +export const CUDA_NATIVE_IMAGE_OUT_BIG_DIRECTORY = IMAGE_OUT_BIG_DIRECTORY //"$HOME/grcuda/projects/demos/web_demo/frontend/images/full_res/" + +export const DEBUG = false + diff --git a/demos/image_pipeline_web/backend/src/utils.ts b/demos/image_pipeline_web/backend/src/utils.ts new file mode 100644 index 00000000..a0178d6d --- /dev/null +++ b/demos/image_pipeline_web/backend/src/utils.ts @@ -0,0 +1,202 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. 
+// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+
+import cv from "opencv4nodejs"
+import fs from "fs"
+
+import {
+  RESIZED_IMG_WIDTH,
+  BW,
+  RESIZED_IMG_WIDTH_OUT_LARGE,
+  RESIZED_IMG_WIDTH_OUT_SMALL,
+  IMAGE_IN_DIRECTORY,
+  IMAGE_OUT_BIG_DIRECTORY,
+  IMAGE_OUT_SMALL_DIRECTORY,
+  MOCK_OPTIONS,
+  CDEPTH
+} from "./options"
+
+
+export const _sleep = (ms: number) => {
+  return new Promise((resolve) => {
+    setTimeout(resolve, ms);
+  });
+}
+
+export const _gaussianKernel = (buffer: any, diameter: number, sigma: number) => {
+  const mean = diameter / 2;
+  let sum_tmp = 0;
+  for (let x = 0; x < diameter; x++) {
+    for (let y = 0; y < diameter; y++) {
+      const val = Math.exp(-0.5 * (Math.pow(x - mean, 2) + Math.pow(y - mean, 2)) / Math.pow(sigma, 2));
+      buffer[x][y] = val;
+      sum_tmp += val;
+    }
+  }
+  // Normalize;
+  for (let x = 0; x < diameter; x++) {
+    for (let y = 0; y < diameter; y++) {
+      buffer[x][y] /= sum_tmp;
+    }
+  }
+}
+
+export const _getDelayJitter = (computationType: string) => {
+
+  const {
+    DELAY_JITTER_ASYNC,
+    DELAY_JITTER_SYNC,
+    DELAY_JITTER_NATIVE
+  } = MOCK_OPTIONS
+
+  switch (computationType) {
+    case "sync":
+    case "race-sync":
+      return DELAY_JITTER_SYNC
+    case "async":
+    case "race-async":
+      return DELAY_JITTER_ASYNC
+    case "cuda-native":
+    case "race-cuda-native":
+      return DELAY_JITTER_NATIVE
+    default:
+      // Unknown computation types get no jitter;
+      return 0
+  }
+}
+
+export async function loadImage(imgName: string | number, resizeWidth = RESIZED_IMG_WIDTH, resizeHeight = RESIZED_IMG_WIDTH, fileFormat = ".jpg") {
+  const imagePath = `${IMAGE_IN_DIRECTORY}/${imgName}${fileFormat}`
+  const image = await cv.imreadAsync(imagePath, BW ? cv.IMREAD_GRAYSCALE : cv.IMREAD_COLOR)
+  return image
+}
+
+export async function storeImageInner(img: cv.Mat, imgName: string | number, resolution: number, kind: string, imgFormat: string = ".jpg", blackAndWhite: boolean = BW) {
+  const imgResized = img.resize(resolution, resolution);
+  const buffer = await cv.imencodeAsync(imgFormat, imgResized, [cv.IMWRITE_JPEG_QUALITY, 80])
+  const writeDirectory = kind === "full_res" ? IMAGE_OUT_BIG_DIRECTORY : IMAGE_OUT_SMALL_DIRECTORY
+  fs.writeFileSync(`${writeDirectory}/${imgName}${imgFormat}`, buffer);
+}
+
+// Store the output of the image processing into 2 images,
+// with low and high resolution;
+export async function storeImage(img: cv.Mat, imgName: string | number, resizedImageWidthLarge = RESIZED_IMG_WIDTH_OUT_LARGE, resizedImageWidthSmall = RESIZED_IMG_WIDTH_OUT_SMALL, blackAndWhite: boolean = BW) {
+  await storeImageInner(img, imgName, resizedImageWidthLarge, "full_res", ".jpg", blackAndWhite);
+  await storeImageInner(img, imgName, resizedImageWidthSmall, "thumb", ".jpg", blackAndWhite);
+}
+
+export function _intervalToMs(start: number, end: number) {
+  return (end - start) / 1e6;
+}
+
+export const copyFrom = (arrayFrom: any, arrayTo: any) => {
+  for (let i = 0; i < arrayTo.length; ++i) {
+    arrayTo[i] = arrayFrom[i]
+  }
+}
+
+// Bézier curve defined by 3 points.
+// The input is used to map points of the curve to the output LUT,
+// and can be used to combine multiple LUTs.
+// By default, it is just [0, 1, ..., 255];
+function spline3(input: any, lut: any, P: any) {
+  for (let i = 0; i < CDEPTH; i++) {
+    const t = i / CDEPTH;
+    const x = Math.pow((1 - t), 2) * P[0] + 2 * t * (1 - t) * P[1] + Math.pow(t, 2) * P[2];
+    lut[i] = input[(x * CDEPTH) >> 0]; // >> 0 is an evil hack to cast float to int;
+  }
+}
+
+// Bézier curve defined by 5 points;
+function spline5(input: any, lut: any, P: any) {
+  for (let i = 0; i < CDEPTH; i++) {
+    const t = i / CDEPTH;
+    const x = Math.pow((1 - t), 4) * P[0] + 4 * t * Math.pow((1 - t), 3) * P[1] + 6 * Math.pow(t, 2) * Math.pow((1 - t), 2) * P[2] + 4 * Math.pow(t, 3) * (1 - t) * P[3] + Math.pow(t, 4) * P[4];
+    lut[i] = input[(x * CDEPTH) >> 0];
+  }
+}
+
+function lut_r(lut: any) {
+  // Create a temporary LUT to swap values;
+  let lut_tmp = new Array(CDEPTH).fill(0);
+
+  // Initialize LUT;
+  for (let i = 0; i < CDEPTH; i++) {
+    lut[i] = i;
+  }
+  // Apply 1st curve;
+  const P = [0.0, 0.2, 1.0];
+  spline3(lut, lut_tmp, P);
+  // Apply 2nd curve;
+  const P2 = [0.0, 0.3, 0.5, 0.99, 1];
+  spline5(lut_tmp, lut, P2);
+}
+
+function lut_g(lut: any) {
+  // Create a temporary LUT to swap values;
+  let lut_tmp = new Array(CDEPTH).fill(0);
+  // Initialize LUT;
+  for (let i = 0; i < CDEPTH; i++) {
+    lut[i] = i;
+  }
+  // Apply 1st curve;
+  const P = [0.0, 0.01, 0.5, 0.99, 1];
+  spline5(lut, lut_tmp, P);
+  // Apply 2nd curve;
+  const P2 = [0.0, 0.1, 0.5, 0.75, 1];
+  spline5(lut_tmp, lut, P2);
+}
+
+function lut_b(lut: any) {
+  // Create a temporary LUT to swap values;
+  let lut_tmp = new Array(CDEPTH).fill(0);
+  // Initialize LUT;
+  for (let i = 0; i < CDEPTH; i++) {
+    lut[i] = i;
+  }
+  // Apply 1st curve;
+  const P = [0.0, 0.01, 0.5, 0.99, 1];
+  spline5(lut, lut_tmp, P);
+  // Apply 2nd curve;
+  const P2 = [0.0, 0.25, 0.5, 0.70, 1];
+  spline5(lut_tmp, lut, P2);
+}
+
+// Initialize LUTs;
+export const LUT = [new Array(CDEPTH).fill(0), new Array(CDEPTH).fill(0), new Array(CDEPTH).fill(0)];
+lut_r(LUT[0]);
+lut_g(LUT[1]); 
+lut_b(LUT[2]); diff --git a/demos/image_pipeline_web/backend/tsconfig.json b/demos/image_pipeline_web/backend/tsconfig.json new file mode 100644 index 00000000..35cc4e49 --- /dev/null +++ b/demos/image_pipeline_web/backend/tsconfig.json @@ -0,0 +1,16 @@ +{ + "compilerOptions": { + "module": "commonjs", + "esModuleInterop": true, + "target": "es6", + "noImplicitAny": true, + "moduleResolution": "node", + "sourceMap": true, + "outDir": "dist", + "baseUrl": ".", + "paths": { + "*": ["node_modules/*"] + } + }, + "include": ["src/**/*"] +} diff --git a/demos/image_pipeline_web/frontend/css/style.css b/demos/image_pipeline_web/frontend/css/style.css new file mode 100644 index 00000000..84a172f6 --- /dev/null +++ b/demos/image_pipeline_web/frontend/css/style.css @@ -0,0 +1,34 @@ +.blur { + filter: blur(10px); +} + +#overlay { + position: fixed; + display: none; + top: 50%; + left: 50%; + margin-left: -512px; + margin-top: -512px; + background: rgba(255,255,255,0); + z-index: 999; + scale: 0.5; +} + +.image-pad { + padding: 10px; + margin: 5px +} + +.image { + transition: all .2s ease-out; +} + +.image:hover { + transform: scale(1.5); +} + +#close-lightbox { + position: absolute; + top: 100%; + right: 0%; +} diff --git a/demos/image_pipeline_web/frontend/images/dataset1024 b/demos/image_pipeline_web/frontend/images/dataset1024 new file mode 120000 index 00000000..0f241993 --- /dev/null +++ b/demos/image_pipeline_web/frontend/images/dataset1024 @@ -0,0 +1 @@ +../../../../grcuda-data/datasets/web_demo/images/dataset1024 \ No newline at end of file diff --git a/demos/image_pipeline_web/frontend/images/dataset512 b/demos/image_pipeline_web/frontend/images/dataset512 new file mode 120000 index 00000000..025a9f64 --- /dev/null +++ b/demos/image_pipeline_web/frontend/images/dataset512 @@ -0,0 +1 @@ +../../../../grcuda-data/datasets/web_demo/images/dataset512 \ No newline at end of file diff --git a/demos/image_pipeline_web/frontend/images/description/async/1.png 
b/demos/image_pipeline_web/frontend/images/description/async/1.png new file mode 100644 index 00000000..509163f5 Binary files /dev/null and b/demos/image_pipeline_web/frontend/images/description/async/1.png differ diff --git a/demos/image_pipeline_web/frontend/images/description/async/2.png b/demos/image_pipeline_web/frontend/images/description/async/2.png new file mode 100644 index 00000000..90d3e956 Binary files /dev/null and b/demos/image_pipeline_web/frontend/images/description/async/2.png differ diff --git a/demos/image_pipeline_web/frontend/images/description/async/3.png b/demos/image_pipeline_web/frontend/images/description/async/3.png new file mode 100644 index 00000000..62f9f5d0 Binary files /dev/null and b/demos/image_pipeline_web/frontend/images/description/async/3.png differ diff --git a/demos/image_pipeline_web/frontend/images/description/async/blurred.png b/demos/image_pipeline_web/frontend/images/description/async/blurred.png new file mode 100644 index 00000000..4d560787 Binary files /dev/null and b/demos/image_pipeline_web/frontend/images/description/async/blurred.png differ diff --git a/demos/image_pipeline_web/frontend/images/description/async/combined.png b/demos/image_pipeline_web/frontend/images/description/async/combined.png new file mode 100644 index 00000000..afad7f32 Binary files /dev/null and b/demos/image_pipeline_web/frontend/images/description/async/combined.png differ diff --git a/demos/image_pipeline_web/frontend/images/description/async/pipeline-async.png b/demos/image_pipeline_web/frontend/images/description/async/pipeline-async.png new file mode 100644 index 00000000..e7b9c51a Binary files /dev/null and b/demos/image_pipeline_web/frontend/images/description/async/pipeline-async.png differ diff --git a/demos/image_pipeline_web/frontend/images/description/async/sharpened.png b/demos/image_pipeline_web/frontend/images/description/async/sharpened.png new file mode 100644 index 00000000..26246794 Binary files /dev/null and 
b/demos/image_pipeline_web/frontend/images/description/async/sharpened.png differ diff --git a/demos/image_pipeline_web/frontend/images/description/cuda-native/pipeline-sync.png b/demos/image_pipeline_web/frontend/images/description/cuda-native/pipeline-sync.png new file mode 100644 index 00000000..dc868f40 Binary files /dev/null and b/demos/image_pipeline_web/frontend/images/description/cuda-native/pipeline-sync.png differ diff --git a/demos/image_pipeline_web/frontend/images/description/sync/pipeline-sync.png b/demos/image_pipeline_web/frontend/images/description/sync/pipeline-sync.png new file mode 100644 index 00000000..dc4517e3 Binary files /dev/null and b/demos/image_pipeline_web/frontend/images/description/sync/pipeline-sync.png differ diff --git a/demos/image_pipeline_web/frontend/index.html b/demos/image_pipeline_web/frontend/index.html new file mode 100644 index 00000000..fe9b9d4d --- /dev/null +++ b/demos/image_pipeline_web/frontend/index.html @@ -0,0 +1,157 @@ + + + + + + + + GrCUDA@SeptemberRSE + + + + + + +
+
+
+
+

SeptemberRSE

+

Image Processing pipeline

+
+
+
+ + +
+
+
+ +
+
+
+
+
+
+

Computation Mode: Sync

+

In this demo, we bring your photo collection back in time and give it a nice vintage look that everybody loves!

+

But there's a lot going on behind the scenes. + First of all, we make the subject pop! Through a complex pipeline of Gaussian blur, edge-detection and sharpening filters, we can identify the subject contour and make it sharper, while slightly blurring the background and other smooth textures. + Then, we apply a retro touch to the pictures, with a custom vintage LUT.

+
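The "custom vintage LUT" mentioned above is built in the backend by remapping each color channel through Bézier curves sampled over the 256 intensity levels. A minimal runnable sketch of the quadratic case (plain JavaScript; `CDEPTH` and the control points mirror the backend sources, everything else is illustrative):

```javascript
// Build a 256-entry LUT by remapping an input LUT through a quadratic
// Bezier curve with control points P (each component in [0, 1]);
const CDEPTH = 256;

function spline3(input, lut, P) {
    for (let i = 0; i < CDEPTH; i++) {
        const t = i / CDEPTH;
        // Quadratic Bezier: (1-t)^2 * P0 + 2t(1-t) * P1 + t^2 * P2;
        const x = (1 - t) ** 2 * P[0] + 2 * t * (1 - t) * P[1] + t ** 2 * P[2];
        lut[i] = input[(x * CDEPTH) >> 0]; // truncate the float index to int;
    }
}

const identity = Array.from({ length: CDEPTH }, (_, i) => i);
const lut = new Array(CDEPTH).fill(0);
spline3(identity, lut, [0.0, 0.2, 1.0]); // control points of the red channel's first curve;
console.log(lut[0], lut[128], lut[255]); // → 0 89 254
```

Because the curve is monotonically increasing for these control points, the resulting LUT stays non-decreasing, which avoids banding artifacts when it is applied to pixel values.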
+ + +
+ Responsive image +
+
+ +
+
+

In the Sync pipeline, we adopt the original GrCUDA implementation.

+

In this version, every computation is executed on the default CUDA stream, meaning that we don't see any overlap between computations and data transfer, or even between multiple image processing requests. + As a result, a lot of performance is left on the table and most GPU resources are wasted. +

+
+ + +
+ Responsive image +
+
+ +
+
+

In the Async pipeline, we show you the power of our new GrCUDA scheduler.

+

On the surface, the code you write is 100% identical to the SYNC pipeline. + However, all computations happen asynchronously: requests are overlapped, and so are GPU computations and data transfer. + Moreover, we transparently perform many other optimizations, such as prefetching data to the GPU to be even faster. + Just by making better use of wasted resources, we get a large 30% speedup with no code change whatsoever. Pretty impressive, isn't it?

+
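The gap between the two schedules can be approximated with a toy two-stage pipeline model (plain JavaScript; the numbers are illustrative, not GrCUDA measurements): on the default stream each image's transfer and compute phases run back-to-back, while with async streams the transfer of one image overlaps the compute of the previous one.

```javascript
// Toy cost model: n images, each needing `transfer` + `compute` time units;
function syncTime(n, transfer, compute) {
    // Default stream: no overlap, every phase runs back-to-back;
    return n * (transfer + compute);
}

function asyncTime(n, transfer, compute) {
    // Two-stage pipeline: fill/drain time plus a steady state
    // limited by the slower of the two stages;
    return transfer + compute + (n - 1) * Math.max(transfer, compute);
}

const n = 10, transfer = 2, compute = 3;
console.log(syncTime(n, transfer, compute));  // → 50
console.log(asyncTime(n, transfer, compute)); // → 32
```

With these made-up phase costs the overlapped schedule is roughly a third faster, which is in the same ballpark as the speedup quoted above.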
+ + +
+ Responsive image +
+
+ +
+
+

But how does GrCUDA fare against code written by a ninja programmer in C++, with direct access to the CUDA API?

+

In the Native pipeline, we build an entirely separate CUDA application to load and process images, and call it from JavaScript. + It is significantly more complex, with a lot of programming overhead (e.g. to handle input options). + Is it worth having direct access to all the lowest-level CUDA APIs? It turns out that GrCUDA provides the same performance, with much simpler code!

+
+ + +
+ Responsive image +
+
+
+ + +
+
+ +
+
+ +
+ +
+
+
+ +
+
+
+
+
+ +
+ +
+
+
+
+ +
+ +
+
+
+
+ +
+ +
+
+
+ +
+
+
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/image_pipeline_web/frontend/js/index.js b/demos/image_pipeline_web/frontend/js/index.js new file mode 100644 index 00000000..8f53737f --- /dev/null +++ b/demos/image_pipeline_web/frontend/js/index.js @@ -0,0 +1,269 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +const websockets = { + "sync": new WebSocket("ws://localhost:8080"), + "async": new WebSocket("ws://localhost:8083"), + "cuda-native": new WebSocket("ws://localhost:8082"), +} + +const sendWSMessage = document.getElementById("btn-send-msg-ws") +const progressBar = document.getElementById("progress-bar") +const imageGallery = document.getElementById("images") +const selectElement = document.getElementById("computation-type") +const containerInfo = document.getElementById("container-info") +const raceModeContainer = document.getElementById("race-mode") +const progressBarSync = document.getElementById("progress-bar-sync") +const progressBarAsync = document.getElementById("progress-bar-async") +const progressBarCudaNative = document.getElementById("progress-bar-cuda-native") + +const imageGallerySync = document.getElementById("image-gallery-sync") +const imageGalleryAsync = document.getElementById("image-gallery-async") +const imageGalleryCudaNative = document.getElementById("image-gallery-cuda-native") + + +const COMPUTATION_MODES = ["race-sync", "race-async", "race-cuda-native"] + +const progressBarsRace = { + "race-sync": progressBarSync, + "race-async": progressBarAsync, + "race-cuda-native": progressBarCudaNative +} + +const imageGalleriesRace = { + "race-sync": imageGallerySync, + "race-async": imageGalleryAsync, + "race-cuda-native": imageGalleryCudaNative +} + +const imageGalleriesRaceContent = { + "race-sync": "", + 
"race-async": "", + "race-cuda-native": "" +} + +const progressBarRaceColor = { + "race-sync": "progress-bar-striped", + "race-async": "progress-bar-striped", + "race-cuda-native": "progress-bar-striped" +} + +const labelMap = { + "race-sync": "Sync", + "race-async": "Async", + "race-cuda-native": "Cuda Native", + "sync": "Sync", + "async": "Async", + "cuda-native": "Cuda Native" +} + +let progressSync = 0 +let progressAsync = 0 +let progressNative = 0 + +let lastProgress = 0 + +const progressBarsCompletionAmount = { + +} + +let imageGalleryContent = "" + +for(const wsKey of Object.keys(websockets)) { + console.log(wsKey) + websockets[wsKey].addEventListener("open", (evt) => { + console.log(`Connection to websocket for computation mode ${wsKey}`) + }) + + websockets[wsKey].addEventListener("message", (evt) => { + const data = JSON.parse(evt.data) + + if (data.type === "progress") { + processProgressMessage(evt) + } + + if (data.type === "image") { + processImageMessage(evt) + } + + if (data.type === "time"){ + processExecutionTimeMessage(evt) + } + }) +} + +sendWSMessage.onclick = () => { + clearAll() + const { value: computationType } = document.getElementById("computation-type") + console.log(`Beginning computation on ${computationType}`) + + lastProgress = 0 + Object.keys(progressBarsCompletionAmount).forEach(k => progressBarsCompletionAmount[k] = 0) + + if(computationType !== "race-mode") { + websockets[computationType].send(computationType) + } else { + Object.keys(websockets).forEach(ct => websockets[ct].send(`race-${ct}`)) + } + + progressBar.innerHTML = window.getProgressBarTemplate(0, false) + + containerInfo.innerHTML = "" +} + +const clearAll = () => { + progressBar.innerHTML = "" + imageGallery.innerHTML = "" + imageGalleryContent = "" + + Object.keys(imageGalleriesRaceContent) + .forEach(key => imageGalleriesRaceContent[key] = "") + + Object.keys(imageGalleriesRace) + .forEach(key => imageGalleriesRace[key].innerHTML = "") + + COMPUTATION_MODES + 
.forEach(cm => progressBarsRace[cm].innerHTML = "") +} + +selectElement.onchange = () => { + const { value: computationType } = document.getElementById("computation-type") + + // Remove progressbar if present + clearAll() + + console.log(`Value changed to ${computationType}`) + + switch (computationType) { + case "sync": { + containerInfo.innerHTML = window.getSyncTemplate() + break + } + case "async": { + containerInfo.innerHTML = window.getAsyncTemplate() + break + } + case "cuda-native": { + containerInfo.innerHTML = window.getCudaNativeTemplate() + break + } + case "race-mode": { + containerInfo.innerHTML = window.getRaceModeTemplate() + break + } + } +} + + +openLightBox = (imageId) => { + const mainContainer = document.getElementById("main-container") + const overlayImage = document.getElementById('overlay'); + const paddedImageId = ("0000" + imageId).slice(-4) + const imageElement = window.getImageLightBoxTemplate(paddedImageId, imageId) + + overlayImage.innerHTML = imageElement + overlayImage.style.display = 'block'; + mainContainer.setAttribute('class', 'blur'); + const currentImage = document.getElementById(`${imageId}-full-res`) + currentImage.onclick = () => { + const mainContainer = document.getElementById("main-container") + const overlayImage = document.getElementById('overlay'); + overlayImage.style.display = 'none'; + mainContainer.removeAttribute('class', 'blur'); + } +} + +const processExecutionTimeMessage = (evt) => { + const data = JSON.parse(evt.data) + const { data: executionTime, computationType } = data + console.log(`${computationType} took: ${executionTime / 1000}s`) + const formattedExecutionTime = executionTime / 1000 + document.getElementById(`${labelMap[computationType]}-execution-time`).innerHTML = ` + took ${formattedExecutionTime.toFixed(2)}s + ` +} + +const processImageMessage = (evt) => { + const data = JSON.parse(evt.data) + const { images, computationType } = data + + console.log(`Received: ${images}`) + + if 
(!computationType.includes("race")) { + + for (const image of images) { + const imageId = image.split("/").pop().replace(".jpg", "") + imageGalleryContent += window.getGalleryImageContentTemplate(image, imageId) + } + + imageGallery.innerHTML = imageGalleryContent + + } else { + + imageGalleriesRaceContent[computationType] = images.reduce((rest, image) => { + const imageId = image.split("/").pop().replace(".jpg", "") + const imgContent = window.getGalleryImageContentTemplate(image, imageId) + return rest + imgContent + }, imageGalleriesRaceContent[computationType]) + + imageGalleriesRace[computationType].innerHTML = imageGalleriesRaceContent[computationType] + } +} + +const processProgressMessage = (evt) => { + const data = JSON.parse(evt.data) + const { data: progressData, computationType } = data + + if (!computationType.includes("race")) { + if(lastProgress > 99.99) return + lastProgress = Math.max(progressData, lastProgress) + + if (progressData < 99.99) { + progressBar.innerHTML = window.getProgressBarTemplate(lastProgress, false) + } else { + progressBar.innerHTML = window.getProgressBarTemplate(lastProgress, true) + } + + } else { + + progressBar.innerHTML = "" + if(progressBarsCompletionAmount[computationType] > 99.99) return + progressBarsCompletionAmount[computationType] = Math.max(progressData, progressBarsCompletionAmount[computationType] || 1) + if (progressData > 99.99) { + progressBarRaceColor[computationType] = "bg-success" + } else { + progressBarRaceColor[computationType] = "progress-bar-striped" + } + + const label = labelMap[computationType] + progressBarsRace[computationType].innerHTML = window.getProgressBarWithWrapperTemplate(label, progressBarsCompletionAmount, progressBarRaceColor, computationType) + } +} + +console.log("JS is loaded.") \ No newline at end of file diff --git a/demos/image_pipeline_web/frontend/js/templates.js b/demos/image_pipeline_web/frontend/js/templates.js new file mode 100644 index 00000000..8dd2fa00 --- /dev/null +++ 
b/demos/image_pipeline_web/frontend/js/templates.js @@ -0,0 +1,263 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +window.getHeader = (computationMode) => ` +

Computation Mode: ${computationMode}

+` + +window.getSyncTemplate = () => ` +
+
+ ${window.getHeader("Sync")} +

In this demo, we bring your photo collection back in time and give it a nice vintage look that everybody loves!

+

But there's a lot going on behind the scenes. + First of all, we make the subject pop! Through a complex pipeline of Gaussian blur, edge-detection and sharpening filters, we can identify the subject contour and make it sharper, while slightly blurring the background and other smooth textures. + Then, we apply a retro touch to the pictures, with a custom vintage LUT.

+
+ + +
+Responsive image +
+
+ +
+
+

In the Sync pipeline, we adopt the original GrCUDA implementation.

+

In this version, every computation is executed on the default CUDA stream, meaning that we don't see any overlap between computations and data transfer, or even between multiple image processing requests. + As a result, a lot of performance is left on the table and most GPU resources are wasted. +

+
+ + +
+Responsive image +
+
+ +
+
+

In the Async pipeline, we show you the power of our new GrCUDA scheduler.

+

On the surface, the code you write is 100% identical to the SYNC pipeline. + However, all computations happen asynchronously: requests are overlapped, and so are GPU computations and data transfer. + Moreover, we transparently perform many other optimizations, such as prefetching data to the GPU to be even faster. + Just by making better use of wasted resources, we get a large 30% speedup with no code change whatsoever. Pretty impressive, isn't it?

+
+ + +
+Responsive image +
+
+ +
+
+

But how does GrCUDA fare against code written by a ninja programmer in C++, with direct access to the CUDA API?

+

In the Native pipeline, we build an entirely separate CUDA application to load and process images, and call it from JavaScript. + It is significantly more complex, with a lot of programming overhead (e.g. to handle input options). + Is it worth having direct access to all the lowest-level CUDA APIs? It turns out that GrCUDA provides the same performance, with much simpler code!

+
+ + +
+Responsive image +
+
+` +window.getAsyncTemplate = () => ` +
+
+ ${window.getHeader("Async")} +

In this demo, we bring your photo collection back in time and give it a nice vintage look that everybody loves!

+

But there's a lot going on behind the scenes. + First of all, we make the subject pop! Through a complex pipeline of Gaussian blur, edge-detection and sharpening filters, we can identify the subject contour and make it sharper, while slightly blurring the background and other smooth textures. + Then, we apply a retro touch to the pictures, with a custom vintage LUT.

+
+ + +
+ Responsive image +
+
+ +
+
+

In the Sync pipeline, we adopt the original GrCUDA implementation.

+

In this version, every computation is executed on the default CUDA stream, meaning that we don't see any overlap between computations and data transfer, or even between multiple image processing requests. + As a result, a lot of performance is left on the table and most GPU resources are wasted. +

+
+ +
+ Responsive image +
+
+ +
+
+

In the Async pipeline, we show you the power of our new GrCUDA scheduler.

+

On the surface, the code you write is 100% identical to the SYNC pipeline. + However, all computations happen asynchronously: requests are overlapped, and so are GPU computations and data transfer. + Moreover, we transparently perform many other optimizations, such as prefetching data to the GPU to be even faster. + Just by making better use of wasted resources, we get a large 30% speedup with no code change whatsoever. Pretty impressive, isn't it?

+
+ + +
+ Responsive image +
+
+ +
+
+

But how does GrCUDA fare against code written by a ninja programmer in C++, with direct access to the CUDA API?

+

In the Native pipeline, we build an entirely separate CUDA application to load and process images, and call it from JavaScript. + It is significantly more complex, with a lot of programming overhead (e.g. to handle input options). + Is it worth having direct access to all the lowest-level CUDA APIs? It turns out that GrCUDA provides the same performance, with much simpler code!

+
+ + +
+ Responsive image +
+
+` +window.getCudaNativeTemplate = () => ` +
+
+ ${window.getHeader("Cuda Native")} +

In this demo, we bring your photo collection back in time and give it a nice vintage look that everybody loves!

+

But there's a lot going on behind the scenes. + First of all, we make the subject pop! Through a complex pipeline of Gaussian blur, edge-detection and sharpening filters, we can identify the subject contour and make it sharper, while slightly blurring the background and other smooth textures. + Then, we apply a retro touch to the pictures, with a custom vintage LUT.

+
+ + +
+Responsive image +
+
+ +
+
+

In the Sync pipeline, we adopt the original GrCUDA implementation.

+

In this version, every computation is executed on the default CUDA stream, meaning that we don't see any overlap between computations and data transfer, or even between multiple image processing requests. + As a result, a lot of performance is left on the table and most GPU resources are wasted. +

+
+ + +
+Responsive image +
+
+ +
+
+

In the Async pipeline, we show you the power of our new GrCUDA scheduler.

+

On the surface, the code you write is 100% identical to the SYNC pipeline. + However, all computations happen asynchronously: requests are overlapped, and so are GPU computations and data transfer. + Moreover, we transparently perform many other optimizations, such as prefetching data to the GPU to be even faster. + Just by making better use of wasted resources, we get a large 30% speedup with no code change whatsoever. Pretty impressive, isn't it?

+
+ + +
+Responsive image +
+
+ +
+
+

But how does GrCUDA fare against code written by a ninja programmer in C++, with direct access to the CUDA API?

+

In the Native pipeline, we build an entirely separate CUDA application to load and process images, and call it from JavaScript. + It is significantly more complex, with a lot of programming overhead (e.g. to handle input options). + Is it worth having direct access to all the lowest-level CUDA APIs? It turns out that GrCUDA provides the same performance, with much simpler code!

+
+ + +
+Responsive image +
+
+` + +window.getRaceModeTemplate = () => ` +
+
+
+
+
+

Race Mode

+

In this mode we run the three pipelines in parallel to see which one processes images the fastest.

+

The three pipelines are handled by three separate backends and execute on three different GPUs to independently evaluate their execution times.

+
+
+
+
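Once each backend reports its execution time (the "time" WebSocket message handled in index.js), declaring a winner is a plain comparison. A small illustrative helper (`raceWinner` is not part of the demo code, and the times below are made up):

```javascript
// Given each pipeline's measured time in ms, return the fastest one and
// its speedup over the slowest;
function raceWinner(timesMs) {
    const sorted = Object.entries(timesMs).sort((a, b) => a[1] - b[1]);
    const [winner, best] = sorted[0];
    const worst = sorted[sorted.length - 1][1];
    return { winner, speedup: worst / best };
}

console.log(raceWinner({ "sync": 12.0, "async": 8.0, "cuda-native": 8.2 }));
// → { winner: 'async', speedup: 1.5 }
```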
+` + + +window.getImageLightBoxTemplate = (paddedImageId, imageId) => `` +window.getGalleryImageContentTemplate = (image, imageId) => `` + +window.getProgressBarTemplate = (progressData, completed) => { + if (!completed) { + return `
+
${Math.round(progressData)}%
+
` + } else { + return `
+
100%
+
` + } +} +window.getProgressBarWithWrapperTemplate = ( + label, + progressBarsCompletionAmount, + progressBarRaceColor, + computationType +) => ` +
+
+
+ Compute Mode: ${label} + +
+
+
+
+
+
${Math.round(progressBarsCompletionAmount[computationType])}%
+
+
+
+
+ ` + diff --git a/demos/image_pipeline_web/run_demo.sh b/demos/image_pipeline_web/run_demo.sh new file mode 100755 index 00000000..2ae9911a --- /dev/null +++ b/demos/image_pipeline_web/run_demo.sh @@ -0,0 +1,13 @@ +#!/bin/bash + +# Build backend +echo "Starting backend" +cd backend +npm run runall & + +# Run frontend +echo "Starting frontend" +python3 -m http.server 8085 --directory ../frontend + + + diff --git a/demos/image_pipeline_web/setup_demo.sh b/demos/image_pipeline_web/setup_demo.sh new file mode 100755 index 00000000..f5a9f0e1 --- /dev/null +++ b/demos/image_pipeline_web/setup_demo.sh @@ -0,0 +1,47 @@ +#!/bin/bash + +## This script assumes that the current machine was set up using the +## setup_machine_from_scratch.sh script + +echo "Building GrCUDA" +cd ../../ +./install.sh + +cd - + +# Base Dependency Install +sudo apt-get install cmake libopencv-dev -y + +# For grcuda-data repo +echo "Initializing and downloading GrCUDA Data store repo" +git submodule init +cd ../../grcuda-data +git submodule update --remote + +cd - + +# Create symbolic link for the images +echo "Creating symbolic link for the images" +cd frontend/images +ln -s ../../../../grcuda-data/datasets/web_demo/images/dataset512 dataset512 +ln -s ../../../../grcuda-data/datasets/web_demo/images/dataset1024 dataset1024 +mkdir full_res +mkdir thumb +cd - + +# Compile cuda binary +echo "Compiling CUDA binary" +mkdir ../image_pipeline_local/cuda/build +cd ../image_pipeline_local/cuda/build +cmake .. 
+make + +cd - + +# Build backend +echo "Building and running backend" +cd backend +npm i +npm run build + + diff --git a/examples/mandelbrot/README.md b/demos/mandelbrot/README.md similarity index 83% rename from examples/mandelbrot/README.md rename to demos/mandelbrot/README.md index e0b7421d..683d530e 100644 --- a/examples/mandelbrot/README.md +++ b/demos/mandelbrot/README.md @@ -1,10 +1,10 @@ # Mandelbrot Web Application with Express -This example demonstrates how grCUDA can be used in a Node.js +This example demonstrates how GrCUDA can be used in a Node.js web application with the Express framework. The code is described in the -[NVIDIA Developer Blog on grCUDA](https://devblogs.nvidia.com/grcuda-a-polyglot-language-binding-for-cuda-in-graalvm/). +[NVIDIA Developer Blog on GrCUDA](https://devblogs.nvidia.com/grcuda-a-polyglot-language-binding-for-cuda-in-graalvm/). For more details, see the blog. ## How to run the Example diff --git a/examples/mandelbrot/app.js b/demos/mandelbrot/app.js similarity index 100% rename from examples/mandelbrot/app.js rename to demos/mandelbrot/app.js diff --git a/examples/mandelbrot/license.txt b/demos/mandelbrot/license.txt similarity index 100% rename from examples/mandelbrot/license.txt rename to demos/mandelbrot/license.txt diff --git a/examples/mandelbrot/package.json b/demos/mandelbrot/package.json similarity index 100% rename from examples/mandelbrot/package.json rename to demos/mandelbrot/package.json diff --git a/examples/tensorrt/README.md b/demos/tensorrt/README.md similarity index 97% rename from examples/tensorrt/README.md rename to demos/tensorrt/README.md index 1df5b63f..97ea03da 100644 --- a/examples/tensorrt/README.md +++ b/demos/tensorrt/README.md @@ -1,8 +1,8 @@ -# grCUDA and TensorRT +# GrCUDA and TensorRT This example uses TensorFlow 1.x to train a LeNet5 model on the MNIST dataset. An inference engine is then created from the trained module using TensorRT through its Python API. 
This engine is serialized to a file. -The engine is subsequently instantiated from a Node.js application using grCUDA. +The engine is subsequently instantiated from a Node.js application using GrCUDA. The serialization of the frozen model to a Protobuf file is only supported in TensorFlow 1.x. As per Tensor-RT 7.0, its provided end-to-end examples are only working under TensorFlow 1.x. @@ -141,7 +141,7 @@ Now, we show how to instantiate the serialized engine in a native C++ application (`cpp/load_and_sample.cc`) using the C++ TensorRT inference library. This is an optional step provided here only for completeness. It -does not use grCUDA C-wrapper library `libtrt.so` nor grCUDA. +does not use the GrCUDA C-wrapper library `libtrt.so` nor GrCUDA. ```console $ cd cpp @@ -202,12 +202,12 @@ output tensor: 10 elements ``` -## Instantiation of Inference Engine from grCUDA +## Instantiation of Inference Engine from GrCUDA -First, build that the grCUDA wrapper library `libtrt` for TensorRT. +First, build the GrCUDA wrapper library `libtrt` for TensorRT. ```console -$ cd ../tensorrt +$ cd ../tensorrt $ mkdir build $ cd build $ cmake ..
-DTENSORRT_DIR=/usr/local/TensorRT-7.0.0.11/ diff --git a/examples/tensorrt/cpp/CMakeLists.txt b/demos/tensorrt/cpp/CMakeLists.txt similarity index 100% rename from examples/tensorrt/cpp/CMakeLists.txt rename to demos/tensorrt/cpp/CMakeLists.txt diff --git a/examples/tensorrt/cpp/cmake/FindTensorRT.cmake b/demos/tensorrt/cpp/cmake/FindTensorRT.cmake similarity index 100% rename from examples/tensorrt/cpp/cmake/FindTensorRT.cmake rename to demos/tensorrt/cpp/cmake/FindTensorRT.cmake diff --git a/examples/tensorrt/cpp/load_and_sample.cc b/demos/tensorrt/cpp/load_and_sample.cc similarity index 100% rename from examples/tensorrt/cpp/load_and_sample.cc rename to demos/tensorrt/cpp/load_and_sample.cc diff --git a/examples/tensorrt/data/download_mnist_test_digits.py b/demos/tensorrt/data/download_mnist_test_digits.py similarity index 100% rename from examples/tensorrt/data/download_mnist_test_digits.py rename to demos/tensorrt/data/download_mnist_test_digits.py diff --git a/examples/tensorrt/models/.gitkeep b/demos/tensorrt/models/.gitkeep similarity index 100% rename from examples/tensorrt/models/.gitkeep rename to demos/tensorrt/models/.gitkeep diff --git a/examples/tensorrt/python/build_engine.py b/demos/tensorrt/python/build_engine.py similarity index 100% rename from examples/tensorrt/python/build_engine.py rename to demos/tensorrt/python/build_engine.py diff --git a/examples/tensorrt/python/load_and_sample.py b/demos/tensorrt/python/load_and_sample.py similarity index 100% rename from examples/tensorrt/python/load_and_sample.py rename to demos/tensorrt/python/load_and_sample.py diff --git a/examples/tensorrt/python/model.py b/demos/tensorrt/python/model.py similarity index 100% rename from examples/tensorrt/python/model.py rename to demos/tensorrt/python/model.py diff --git a/examples/tensorrt/tensorrt_example.js b/demos/tensorrt/tensorrt_example.js similarity index 99% rename from examples/tensorrt/tensorrt_example.js rename to demos/tensorrt/tensorrt_example.js 
index a9ba6e2a..1876b8a4 100644 --- a/examples/tensorrt/tensorrt_example.js +++ b/demos/tensorrt/tensorrt_example.js @@ -25,7 +25,7 @@ const fs = require('fs') -// get grCUDA root namespace and trt namespace object +// get GrCUDA root namespace and trt namespace object const cu = Polyglot.eval('grcuda', 'CU') const trt = Polyglot.eval('grcuda', 'CU::TRT') diff --git a/docs/bindings.md b/docs/bindings.md index 3fc85939..b023e6e8 100644 --- a/docs/bindings.md +++ b/docs/bindings.md @@ -3,7 +3,7 @@ GPU kernels and host function can be executed as function calls. The corresponding functions are callable objects that are bound to the respective kernel or host functions. -grCUDA provides different ways to define these bindings: +GrCUDA provides different ways to define these bindings: - `bind(shareLibraryFile, functionNameAndSignature)` returns a callable object to the specified host function defined in the shared library (.so file). @@ -11,20 +11,20 @@ grCUDA provides different ways to define these bindings: to specified kernel function defined in PTX or cubin file. - `bindall(targetNamespace, fileName, nidlFileName)` registers all functions listed in the NIDL (Native Interface Definition Language) for the - specified binary file into the target namespace of grCUDA. + specified binary file into the target namespace of GrCUDA. The first two approaches are useful to implement the one-off binding to a native function, be it a kernel or a host function. `bindall()` is use to bind multiple functions from the same binary or PTX file. This tutorial shows how to call existing native host functions and kernels from GraalVM languages -through grCUDA. +through GrCUDA. ## Binding and Invoking prebuilt Host Functions Host functions can be bound from existing shared libraries by `bind()` or `bindall()`. The former returns one single native function as a callable object whereas later binds can be used to bind multiple functions into a specified -namespace within grCUDA. 
+namespace within GrCUDA. This simple example shows how to call two host functions from a shared library. One function is defined a C++ namespace. The other function is defined as @@ -73,7 +73,7 @@ Build the shared library (Linux). nvcc -o libincrement.so -std=c++11 --shared -Xcompiler -fPIC increment.cu ``` -`bind()` can be used to "import" a single function into grCUDA as shown in +`bind()` can be used to "import" a single function into GrCUDA as shown in the following NodeJS/JavaScript example. ```javascript @@ -107,7 +107,7 @@ for (const el of deviceArray) { `bind()` takes the name (or path) of the shared library. The second argument specifies the signature in NIDL format. Add the keyword `cxx` for C++ style functions. The C++ namespace can be specified using `::`. Without `cxx` -grCUDA assumes C linkage of the function and does not apply any name mangling. +GrCUDA assumes C linkage of the function and does not apply any name mangling. `bind()` returns the function objects as callables, i.e., `TruffleObject` instances for which `isExecutable()` is `true`. @@ -191,7 +191,7 @@ nvcc -cubin -gencode=arch=compute_75,code=sm_75 \ -o increment.cubin increment_kernels.cu ``` -`bindkernel()` "imports" a single kernel function into grCUDA. `bindkernel()` +`bindkernel()` "imports" a single kernel function into GrCUDA. `bindkernel()` returns the kernel as a callable object. It can be called like a function. The parameters are the kernel grid size and as optional the amount dynamic shared memory. This is analogous to the kernel launch configuration in CUDA that is @@ -275,7 +275,7 @@ If the a kernel function is not declared with the `extern "C"` `nvcc` generates C++ symbols for kernel functions. Such kernels can be enclosed in a `kernels` scope in the NIDL file and subsequently bound in one step. As in `hostfuncs` for C++ host functions, a C++ namespace can also be -specified in `kernels`. grCUDA the searches all functions within the scope +specified in `kernels`. 
GrCUDA then searches all functions within the scope in this namespace. Kernel function defined with `extern "C"` can bound in a `ckernels` scope. @@ -288,7 +288,7 @@ e.g., `increment`. ## Runtime-compilation of GPU Kernels from CUDA C/C++ -grCUDA can also compile GPU kernels directly from CUDA C/C++ +GrCUDA can also compile GPU kernels directly from CUDA C/C++ source code passed as a host-string argument to `buildkernel(..)`. The signature of the function is: @@ -339,7 +339,7 @@ print(device_array) ## Launching Kernels Once a kernel function is bound to a callable host-object or registered as -a function within grCUDA, it can be launched like a function with two argument lists (for exceptions in Ruby and Java and Ruby see the examples below). +a function within GrCUDA, it can be launched like a function with two argument lists (for exceptions in Java and Ruby, see the examples below). ```test kernel(num_blocks, block_size)(arg1, ..., argN) @@ -355,9 +355,7 @@ The first argument list corresponds to the launch configuration, i.e., the kernel grid (number of blocks) and the block sizes (number of threads per block). -grCUDA currently only supports synchronous kernel launches, -i.e., there is an implicit `cudaDeviceSynchronize()` after every -launch. +GrCUDA now also supports asynchronous kernel launches, thanks to a computation DAG that allows scheduling parallel computations on different streams and avoiding synchronization when not necessary. __Examples:__ @@ -372,8 +370,8 @@ configured_kernel = kernel(num_blocks, block_size) configured_kernel(out_arr, in_ar, num_elements) ``` -grCUDA also supports 2D and 3D kernel grids that are specified -with the `dim3` in CUDA C/C++. +GrCUDA also supports 2D and 3D kernel grids that are specified +with the `dim3` in CUDA C/C++. 
In GrCUDA `num_blocks` and `block_size` can be integers for 1-dimensional kernels or host language sequences of length 1, 2, or 3 (Lists or Tuples in Python, Arrays in JavaScript and Ruby, and vectors in R) diff --git a/docs/grcuda-scheduler-architecture.md b/docs/grcuda-scheduler-architecture.md new file mode 100644 index 00000000..fc2f6b06 --- /dev/null +++ b/docs/grcuda-scheduler-architecture.md @@ -0,0 +1,155 @@ +# Extending GrCUDA with a dynamic computational DAG + +This is an ever-changing design document that tracks the state of the asynchronous GrCUDA scheduler, as published in [DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime](https://ieeexplore.ieee.org/abstract/document/9460491). +We do our best to keep this document updated and reflect the latest changes to GrCUDA. +If you find any inconsistency, please report it as a GitHub issue. + +The main idea is to **represent GrCUDA computations as vertices of a DAG**, connected using their dependencies (e.g. the output of a kernel is used as input in another one). + * The DAG allows scheduling parallel computations on different streams and avoid synchronization when not necessary + * See `projects/resources/python/examples` and `projects/resources/python/benchmark/bench` for simple examples of how this technique can be useful + +**Differences w.r.t. existing techniques** (e.g. TensorFlow or [CUDA Graphs](https://devblogs.nvidia.com/cuda-graphs/)): + 1. The DAG creation is automatic, instead of being built by the user + 2. The DAG is built at runtime, not at compile time or eagerly. + This means that we don't have to worry about the control flow of the host program, but only about data dependencies, + as we dynamically add and schedule new vertices/computations as the user provides them. + We can also collect profiling information and adjust the DAG creation based on that (e.g. 
how many CUDA streams we need, or how large each GPU block should be) + +**How it works, in a few words** + * The class `GrCUDAExecutionContext` tracks declarations and invocations of GPU computational elements (e.g. `kernels`) + * When a new computation is created, or when it is called, it notifies `GrCUDAExecutionContext` so that it updates the `DAG` by computing the data dependencies of the new computation + * `GrCUDAExecutionContext` uses the DAG to understand if the new computation can start immediately, or if it must wait for other computations to finish + * Different computations are overlapped using different CUDA streams, assigned by the `GrCUDAStreamManager` based on dependencies and free resources + * Computations on the GPU are asynchronous and are scheduled on streams without explicit synchronization points, as CUDA guarantees that computations are stream-ordered + * Synchronization between streams happens with CUDA events, without blocking the host CPU thread + * If a computation is done by the CPU (e.g. an array read), we synchronize only the necessary streams, and the host is blocked until the required data is available + +## What's already there + +* The DAG supports kernel invocations and array accesses (both `DeviceArray` and `MultiDimDeviceArray`) + * Kernels are executed in parallel, on different streams, whenever possible +* **Main classes used by the scheduler** + 1. `GrCUDAExecutionContext`: takes care of scheduling and executing computations; it is the director of the orchestration and manages the DAG + 2. `GrCUDAComputationalElement`: abstract class that wraps GrCUDA computations, e.g. kernel executions and array accesses. It provides `GrCUDAExecutionContext` with functions used to compute dependencies or decide if the computation must be done synchronously (e.g. array accesses) + 3. `ExecutionDAG`: the DAG representing the dependencies between computations; it is composed of vertices that wrap each `GrCUDAComputationalElement` + 4. 
`GrCUDAStreamManager`: class that handles the creation and the assignment of streams to kernels, and the synchronization between different streams or the host thread + 5. `GrCUDADevicesManager`: class that encapsulates the status of the multi-GPU system. + 6. `DeviceSelectionPolicy`: class that encapsulates new scheduling heuristics to select the best device for each new computation, using information such as data locality and the current load of the device. GrCUDA currently supports 5 scheduling heuristics with increasing complexity: + * `ROUND_ROBIN`: simply rotate the scheduling between GPUs. Used as the initialization strategy of other policies; + * `STREAM_AWARE`: assign the computation to the device with the fewest busy streams, i.e. the device with fewer ongoing computations; + * `MIN_TRANSFER_SIZE`: select the device that requires the fewest bytes to be transferred, maximizing data locality; + * `MINMIN_TRANSFER_TIME`: select the device for which the minimum total transfer time would be minimum; + * `MINMAX_TRANSFER_TIME`: select the device for which the maximum total transfer time would be minimum. + +* **Basic execution flow** + 1. The host language (i.e. the user) calls an `InteropLibrary` object that can be associated with a `GrCUDAComputationalElement`, e.g. a kernel execution or an array access + 2. A new `GrCUDAComputationalElement` is created and registered to the `GrCUDAExecutionContext`, to represent the computation + 3. `GrCUDAExecutionContext` adds the computation to the DAG and computes its dependencies + 4. Based on the dependencies, the `GrCUDAExecutionContext` associates a stream to the computation through `GrCUDAStreamManager`. If using multiple GPUs, the choice of the right device on which to execute a given computation is done by the `DeviceSelectionPolicy`, leveraging info from the DAG and the `GrCUDADevicesManager`. + 5. 
`GrCUDAExecutionContext` executes the computation on the chosen stream, performing synchronization if necessary + * GPU computations do not require synchronization w.r.t. previous computations on the stream where they executed, as CUDA guarantees stream-ordered execution. CUDA streams are synchronized with (asynchronous) CUDA events, without blocking the host. CPU computations that require a GPU result are synchronized with `cudaStreamSynchronize` only on the necessary streams + 6. In case of subsequent array accesses, we skip the scheduling part as accesses are synchronous, to minimize overheads + 7. From the point of view of GrCUDA, asynchronous GPU computations are considered **active** until the CPU requires the result of them or of their children. Active computations are used in dependency computations (unless all their parameters have been *covered* by children) and to determine which streams are free. +* The CUDA stream interface has been added to GrCUDA, and is accessible by the users (not recommended, but possible) + * Users can create/destroy streams, and assign streams to kernels + * The CUDA events API is also available + * The `cudaStreamAttachMemAsync` function is also exposed, to exclusively associate a managed memory array to a given stream. This is used, on pre-Pascal GPUs, to access arrays on the CPU while a kernel is using other arrays on the GPU +* Most of the new code is unit-tested and integration-tested, and there is a Python benchmarking suite to measure execution time with different settings + * For example, the file `projects/resources/python/benchmark/bench/bench_8` is a fairly complex image processing pipeline that automatically manages up to 4 different streams. +* **Streams** are managed internally by the GrCUDA runtime: we keep track of existing streams that are currently empty, and schedule computations on them in a FIFO order. 
+ New streams are created only if no existing stream is available +* **Read-only** input arguments can be specified with the `const` keyword; they will be ignored in the dependency computations if possible: + for example, if there are 2 kernels that use the same read-only input array, they will be executed concurrently. + +## Open questions + +### Questions on API design (i.e. how do we provide the best user experience) + +1. How do we track scalar values in library function outputs? ([API Design, point 5](#api-design)) + * It is not clear if such a library exists; for now, we have not seen such a situation. +2. How can the user specify options cleanly? ([API Design, point 2](#api-design)) + * Using only context startup options is limiting, but it simplifies the problem (we don't have to worry about changing how the DAG is built at runtime) + * If we want to provide more flexibility, we can add functions to the DSL, but that's not very clean + +*** + +## Detailed development notes + +### API Design + +Dependencies are inferred automatically, instead of being manually specified by the user using handles + 1. Automatic dependency inference is more interesting from a research perspective, and *cleaner* for end-users + * One option is to perform synchronization *white-listing*: have explicit sync points after every kernel call, and remove dependencies if possible. **Pro:** it should be better for guaranteeing correctness. **Cons:** finding if we *do not* have a dependency is more complex than finding if we have one + * The other option is *black-listing*, i.e. do not have any sync point and add them if a dependency is found. This is the option currently used: it is simpler, faster, and provides identical results to the other approach + 2. The API needs ways to modify the scheduling policy, if desired (e.g. go back to fully synchronized execution) + * Context startup option? Easy, but cannot be modified + * Expose a function in the GrCUDA DSL? 
More flexibility, but changing options using the DSL is not very clean + 3. How do we identify if a **parameter is read-only**? If two kernels use the same parameter but only read from it, they can execute in parallel + * This is not trivial: LLVM can understand, for example, if a scalar value is read-only, but doing that with an array is not always possible + * Users might have to specify which parameters are read-only in the kernel signature, which is still better than using explicit handles + * For now, we let programmers manually specify read-only array arguments using the `const` keyword, as done in `CUDA` + 4. How do we handle scalar values? We could also have dependencies due to scalar values (e.g. a computation is started only if the error in the next iteration is above a threshold) + * Currently, only reads from `DeviceArray` (and similar) return scalar values, and they must be done synchronously, as the result is immediately exposed to the guest language. + * Array reads (and writes) are done synchronously by the host, and we guarantee that no kernel that uses the affected array is running + * Kernels do not return scalar values, and scalar outputs are stored in a size-1 array (which we can treat as any other array) + * Then the programmer can pass the size-1 array to another computation (handled like any array), or extract the value with an array read that triggers synchronization + * Scalar values are only problematic when considering library functions that return them + * One idea could be to *box* scalar values with Truffle nodes and store the actual value using a `Future`. If the user reads or writes the value, we wait for the GPU computation to end. Then the scalar value can be unboxed to avoid further overheads. + * But running library functions on streams is problematic, so this solution might not be required + 6. Library functions: library functions are more complex to handle as they could also have code running on the host side. 
+ * They also do not expose streams, so it could be difficult to pipeline them + * In some cases they might expose streams in the signature; we can probably find them by parsing the signature + * They can also return scalars + * If we run them on threads, we parallelize at least the CPU side + +### What is a computational element in GrCUDA? + +`bindkernel`, `buildkernel` functions create a `Kernel` object that contains information about the signature and code + * `Kernel` is an executable `InteropLibrary` class that creates a `ConfiguredKernel` that contains information about the number of blocks, shared memory size, etc... + * Kernel arguments are provided to `ConfiguredKernel` when it's executed, although they are also passed to the corresponding `Kernel` + +`DeviceArray` accesses can be done at any point, and are not represented as kernels (as they happen on the CPU side, using managed memory) + * If a `DeviceArray` access happens between two kernels, we must keep the sync point + * Similar considerations for `MultiDimDeviceArray`. We don't need to deal with the outer dimensions, as only the innermost level accesses managed memory + * Accesses to managed memory are added to the DAG only if they require synchronization. If an access follows another access (e.g. 
when initializing an array) we can skip the scheduling and execute it immediately, without scheduling overhead + +Pre-registered libraries, such as RAPIDS, can be called like `dbscan_func = polyglot.eval(language="grcuda", string="ML::cumlDpDbscanFit")` + * They are added to a namespace just like `buildkernel` + * They are retrieved directly using a `CallNode`, so we need to observe that too + * They are called by accessing the `CUMLRegistry` and other registries, as they aren't kernels, but `ExternalFunctions` + * `ExternalFunctions` are callable, and arguments are passed directly to them + +Library functions (non-kernels) can also be loaded, using `BindFunction` + * This loads the function using NFI, and returns a callable object + * They can also return scalar values (see [API Design, point 6](#api-design)) + +Invocations of computational elements are wrapped in classes that extend a generic `GrCUDAComputationalElement`. +`GrCUDAComputationalElement` is used to build the vertices of the DAG and exposes interfaces to compute data dependencies with other `GrCUDAComputationalElements` and to schedule the computation + +### Other notes on the internal GrCUDA architecture + +These notes relate to the structure of the original GrCUDA repository. You can skip them if you are already familiar with it! + +The `nodes` package contains the basic Truffle nodes that define the language + * Not relevant at the moment, as we can deal with already-parsed functions (e.g. `buildkernel`) and `InteropLibrary` objects + * But it might be required to add nodes to handle scalar values + +The `function` package contains functions that can be invoked through the DSL, such as `buildkernel` + * We might want to add some function to enable the user to change the runtime behaviour + +`Namespace` handling: the `Namespace` class maintains a tree of functions (e.g. 
`ML`) + * When `CallNode` is executed, we look for a function whose name matches the identifier of the `CallNode` + * If a function has a namespace, like `ML::cumlDpDbscanFit`, it is decomposed in multiple pieces (`ML` and `cumlDpDbscanFit`), and it is retrieved with a tree visit in the namespace + * Additional namespaces are created in `GrCUDAContext`, and a registry like `CUMLRegistry` adds functions to the namespace + * Each function in the registry is added to the namespace as an `ExternalFunction` + + diff --git a/docs/grcuda.md b/docs/grcuda.md index 42c99801..ac4648bc 100644 --- a/docs/grcuda.md +++ b/docs/grcuda.md @@ -1,12 +1,9 @@ -# grCUDA +# GrCUDA -grCUDA exposes existing GPU kernel and host functions that accept device -objects as parameters to all GraalVM languages through the Truffle Interop Library. -grCUDA represents device objects as flat arrays of primitive types and makes them -accessible as array-like TruffleObjects to other Truffle languages. +GrCUDA exposes existing GPU kernels and host functions that accept device objects as parameters to all GraalVM languages through the Truffle Interop Library. +GrCUDA represents device objects as flat arrays of primitive types and makes them accessible as array-like TruffleObjects to other Truffle languages. -grCUDA itself is an expression-oriented language. Most expressions in grCUDA, -however, simply evaluate to function objects that can then be used in the host languages. +GrCUDA itself is an expression-oriented language. Most expressions in GrCUDA, however, simply evaluate to function objects that can then be used in the host languages. Contents: @@ -19,46 +16,40 @@ Contents: ## Device Arrays -Device arrays are flat multi-dimensional arrays of primitive types. The arrays are -can be accessed from GPU kernels and native host functions that accept GPU pointers -as parameter. The device arrays can also accessed from host languages through the -array constructs available in these host languages. 
The memory for the device -array is CUDA-managed memory. Device arrays can be allocated -through *array allocation expressions* and through the built-in `DeviceArray` +Device arrays are flat multi-dimensional arrays of primitive types. +The arrays can be accessed from GPU kernels and native host functions that accept GPU pointers as parameters. +The device arrays can also be accessed from host languages through the array constructs available in these host languages. +The memory for the device array is CUDA-managed memory. +Device arrays can be allocated through *array allocation expressions* and through the built-in `DeviceArray` constructor function (see [built-in functions](#built-in-functions)). ### Lifetime Considerations for the current implementation -Device arrays are tied to garbage collector of the VM. Their underlying memory -is allocated off-heap through the CUDA Runtime. Only a stub object containing the -pointer to the off-heap memory, type, and size information etc, is kept on-heap. +Device arrays are tied to the garbage collector of the VM. +Their underlying memory is allocated off-heap through the CUDA Runtime. +Only a stub object containing the pointer to the off-heap memory, type, and size information, etc., is kept on-heap. Therefore, the heap utilization does not reflect the off-heap utilization. -A large device array does not have a large on-heap footprint. The garbage -collection pass may not be initiated even though the memory utilization (off-heap) -is high. - -Just as for other native bindings, grCUDA is not able to prevent GPU kernels or -host functions from capturing pointers to device array objects -or their elements. Because device arrays are managed by the garbage collector, -capturing of references in native code can potentially lead to danging references. - -The CUDA-managed memory of the device array can be **freed explicitly** by calling -the `free()` method of `DeviceArray`. 
This will release the allocated memory -through `cudaFree()`. Once the underlying memory is freed, the `DeviceArray` -enters a defunct state. All subsequent accesses with throw an exception. The -Boolean property `isMemoryFreed` of `DeviceArray` can be checked whether the device -array's memory buffer has already be freed. +A large device array does not have a large on-heap footprint. +The garbage collection pass may not be initiated even though the memory utilization (off-heap) is high. + +Just as for other native bindings, GrCUDA is not able to prevent GPU kernels or host functions from capturing pointers to device array objects or their elements. +Because device arrays are managed by the garbage collector, +capturing of references in native code can potentially lead to dangling references. + +The CUDA-managed memory of the device array can be **freed explicitly** by calling the `free()` method of `DeviceArray`. +This will release the allocated memory through `cudaFree()`. +Once the underlying memory is freed, the `DeviceArray` enters a defunct state. +All subsequent accesses will throw an exception. +The Boolean property `isMemoryFreed` of `DeviceArray` can be checked to see whether the device array's memory buffer has already been freed. ### Array Allocation Expressions -Device arrays are allocated using a syntax that is similar to arrays C/C++. Multi-dimensional -arrays use row-major (C style) layout. Use `DeviceArray(type, size, 'F')` -to create arrays with column-major (Fortran style) order. +Device arrays are allocated using a syntax that is similar to arrays in C/C++. +Multi-dimensional arrays use row-major (C style) layout. +Use `DeviceArray(type, size, 'F')` to create arrays with column-major (Fortran style) order. -The array sizes must be compile-time constants, i.e., either an integer -literal or a constant expression. 
One way to define the array size in array -allocation expressions from host languages is through string interpolation -(e.g., template strings in JavaScript). +The array sizes must be compile-time constants, i.e., either an integer literal or a constant expression. +One way to define the array size in array allocation expressions from host languages is through string interpolation (e.g., template strings in JavaScript). ```C double[100] // array of 100 double-precision elements @@ -68,7 +59,7 @@ float[1000][28][28] // array of 1000 x 28 x 28 float elements The supported data types are: -| grCUDA Type | Truffle (Java) Type | Compatible C++ Types +| GrCUDA Type | Truffle (Java) Type | Compatible C++ Types |--------------|---------------------|---------------------- | `boolean` | `boolean` | `bool`, `unsigned char` | `char` | `byte` | `char`, `signed char` @@ -78,8 +69,7 @@ The supported data types are: | `float` | `float` | `float` | `double` | `double` | `double` -The polyglot expression returns a `DeviceArray` object for -one-dimensional arrays or a `MultDimDeviceArray` for a multi-dimensional array to the host. +The polyglot expression returns a `DeviceArray` object for one-dimensional arrays or a `MultiDimDeviceArray` for a multi-dimensional array to the host. **Example in Python:** @@ -133,12 +123,9 @@ matrix.getArrayElement(4).setArrayElement(3, 42.0); ## Function Invocations -Function invocations are evaluated inside grCUDA the return values are passed back -to the host language. The argument expressions that can be used in grCUDA language -strings are currently limited to literals and constant integer expressions. For -general function invocation, first create a grCUDA expression that returns a -[callable](#callables) back to the host language and then invoke the callable from -the host language. +Function invocations are evaluated inside GrCUDA, then the return values are passed back to the host language. 
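As an aside, this evaluate-then-return flow can be mocked in plain Python without a GraalVM installation. In the sketch below, `polyglot_eval_mock` is a hypothetical stand-in for the real `polyglot.eval`; the returned list merely imitates a device array, so the names and behavior are illustrative only:

```python
# Plain-Python sketch of the GrCUDA invocation pattern: the polyglot runtime
# evaluates a GrCUDA expression and hands a callable back to the host, which
# then invokes it. The runtime is mocked here; this is NOT the real API.
def polyglot_eval_mock(language: str, string: str):
    """Hypothetical stand-in for polyglot.eval(language='grcuda', string=...)."""
    if language == "grcuda" and string.startswith("DeviceArray"):
        def device_array(dtype: str, size: int):
            # Mock: a real DeviceArray lives in CUDA-managed memory.
            return [0] * size
        return device_array
    raise ValueError(f"unknown GrCUDA expression: {string}")

# Host-side usage: first obtain the callable, then invoke it.
ctor = polyglot_eval_mock("grcuda", "DeviceArray")
arr = ctor("float", 1000)
```

Under GraalVM, the same two-step pattern applies, except that `polyglot.eval` returns real Truffle callables.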
+The argument expressions that can be used in GrCUDA language strings are currently limited to literals and constant integer expressions. +For general function invocation, first create a GrCUDA expression that returns a [callable](#callables) back to the host language and then invoke the callable from the host language. **Example in JavaScript:** @@ -154,12 +141,10 @@ const deviceArray = Polyglot.eval('grcuda', 'DeviceArray("float", 1000)') ## Callables Callables are Truffle objects that can be invoked from the host language. A common usage pattern is to submit polyglot expression that return callables. -Identifiers are inside a namespace. CUDA Functions reside in the root namespace -(see [built-in functions](#built-in-functions)). +Identifiers are inside a namespace. +CUDA Functions reside in the root namespace (see [built-in functions](#built-in-functions)). -For device arrays and GPU pointers that are passed as arguments -to callables, grCUDA automatically passes the underlying -pointers to the native host or kernel functions. +For device arrays and GPU pointers that are passed as arguments to callables, GrCUDA automatically passes the underlying pointers to the native host or kernel functions. **Example in Python:** @@ -190,11 +175,12 @@ inc_kernel(160, 128)(out_ints, in_ints, 100) ## Function Namespaces -grCUDA organizes functions in a hierarchical namespace, i.e., namespaces can be nested. -The user can register additional kernel and host function into this namespace. The -registration is not persisted across instantiations of the grCUDA language context. +GrCUDA organizes functions in a hierarchical namespace, i.e. namespaces can be nested. +The user can register additional kernel and host functions into this namespace. +The registration is not persisted across instantiations of the GrCUDA language context. -The root namespace is called `CU`. It can be received through a polyglot expression. +The root namespace is called `CU`. 
+It can be received through a polyglot expression. **Example in Python:** @@ -206,26 +192,24 @@ cu = polyglot.eval(language='grcuda', string='CU') ## Kernel and Host Function Signatures -grCUDA needs to know the signature of native kernel functions and native host -functions to correctly pass arguments. The signature is expressed in -*NIDL (Native Interface Definition Language)* syntax. -The NIDL specification is used in `bind()` for host functions and in `bindkernel()` -for kernel functions. Multiple host functions or multiple kernel function specifications -can be combined into one single `.nidl` file, which can then be passed to `bindall()`. This call -registers the bindings to all listed functions in the grCUDA namespace. - -The NIDL signature contains the name of the function and the parameters with their -types. Host functions also have a return type. Since GPU kernel functions always return `void`, -the return type is omitted for kernel functions in the NIDL syntax. The parameters are primitive -types passed by-value or are pointers to primitive values. The parameter names can be chosen -arbitrarily. The parameter names are used to improve error messages. They do not affect the -execution. - -A complete list of all supported NIDL types and their mapping to Truffle and C++ types can be found -in the [NIDL type mapping document](typemapping.md). - -Pointers are used to refer to device arrays. Device array objects passed as arguments to kernels -or host functions automatically decay into pointers. +GrCUDA needs to know the signature of native kernel functions and native host functions to correctly pass arguments. +The signature is expressed in *NIDL (Native Interface Definition Language)* syntax. +The NIDL specification is used in `bind()` for host functions and in `bindkernel()` for kernel functions. +Multiple host functions or multiple kernel function specifications can be combined into one single `.nidl` file, which can then be passed to `bindall()`. 
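For illustration only, a hypothetical `kernels.nidl` file declaring two `extern "C"` kernels in a `ckernels` scope might look as follows (the file name, kernel names, and exact scope grammar are assumptions; consult the bindings tutorial for the authoritative NIDL syntax):

```text
ckernels {
  increment(arr: inout pointer float, num_elements: sint32)
  scale(arr: inout pointer float, factor: float, num_elements: sint32)
}
```

Such a file could then be registered in one step with a call along the lines of `bindall(CU, "kernels.cubin", "kernels.nidl")`, following the `bindall(targetNamespace, fileName, nidlFileName)` signature.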
+This call registers the bindings to all listed functions in the GrCUDA namespace. + +The NIDL signature contains the name of the function and the parameters with their types. +Host functions also have a return type. +Since GPU kernel functions always return `void`, the return type is omitted for kernel functions in the NIDL syntax. +The parameters are primitive types passed by-value or are pointers to primitive values. +The parameter names can be chosen arbitrarily. +They are only used to improve error messages. +They do not affect the execution. + +A complete list of all supported NIDL types and their mapping to Truffle and C++ types can be found in the [NIDL type mapping document](typemapping.md). + +Pointers are used to refer to device arrays. +Device array objects passed as arguments to kernels or host functions automatically decay into pointers. Pointers are typed and have a direction: `direction pointer type`. Allowed keywords for pointer directions are `in`, `out`, and `inout`. @@ -235,6 +219,11 @@ NIDL | C++ `out pointer T` | `T*` `inout pointer T` | `T*` +With the `async` execution policy and the `with-const` DependencyPolicy, using `const` or `in` signals that the input will not be modified by the kernel. +As such, the scheduler will optimize its execution, for example by overlapping it with other kernels that access the same data but do not modify them. +GrCUDA does not currently check that the arrays are actually unmodified! +It is the responsibility of the user to use `const`/`in` correctly. + +**Example Host Function Signatures:** ```text @@ -242,9 +231,9 @@ saxpy(n: sint32, alpha: float, xs: in pointer float, ys: inout pointer float): v ``` The signature declares a host function `saxpy` that takes a signed 32-bit int `n`, a -single precision floating point value `alpha`, wa device memory pointer `xs` to +single precision floating point value `alpha`, a device memory pointer `xs` to constant float, a device memory pointer `ys` to float.
The function does not return a value and, -therefore, has the return type `void`. grCUDA looks for a C-style function definition +therefore, has the return type `void`. GrCUDA looks for a C-style function definition in the shared library, i.e., it searches for the symbol `saxpy`. A matching function would be defined in C++ code as: @@ -256,13 +245,13 @@ void saxpy(int n, float alpha, const float* xs, float *ys) { ... } If the function is implemented in C++ without C-style export, its symbol name is mangled. In this case, prefix the function definition with the keyword `cxx` which will instruct -grCUDA to search for a C++ mangled symbol name. +GrCUDA to search for a C++ mangled symbol name. ```text cxx predict(samples: in pointer double, labels: out pointer sint32, num_samples: uint32, weights: in pointer float, num_weights: uint32): sint32 ``` -This function returns a signed 32-bit integer. Due to the `cxx` keyword, grCUDA looks for the +This function returns a signed 32-bit integer. Due to the `cxx` keyword, GrCUDA looks for the symbol with the mangled name `_Z7predictPKdPijPKfj` in the shared library. C++ namespaces are also supported. The namespaces can be specified (using `::`). @@ -293,22 +282,22 @@ update_kernel(gradient: in pointer float, weights: inout pointer float, num_weig This signature declaration refers to a kernel that takes a pointer `gradient` to constant float, a pointer `weights` to float, and a value `num_weights` as an unsigned 32-bit int. Note that the return type specification is missing because CUDA kernel functions do not return -any value. grCUDA looks for the function symbol `update_kernel` in the cubin or PTX file. +any value. GrCUDA looks for the function symbol `update_kernel` in the cubin or PTX file. `nvcc` uses C++ name mangling by default. The `update_kernel` function would therefore have to be defined as `extern "C"` in the CUDA source code. -grCUDA can be instructed search for the C++ mangled name by adding `cxx` keyword.
+GrCUDA can be instructed to search for the C++ mangled name by adding the `cxx` keyword. ```text cxx predict_kernel(samples: in pointer double, labels: out pointer sint32, num_samples: uint32, weights: in pointer float, num_weights: uint32) ``` -grCUDA then searches for the symbol `_Z14predict_kernelPKdPijPKfj`. +GrCUDA then searches for the symbol `_Z14predict_kernelPKdPijPKfj`. ### Syntax NIDL Specification Files Multiple declarations of host and kernel functions can be specified in a NIDL file and -bound into grCUDA namespace in one single step. +bound into the GrCUDA namespace in a single step. The functions are collected in binding groups within `{ }`. The following binding groups exist: @@ -414,7 +403,7 @@ cu.myfoo.c_incr_inplace_host(deviceArray, deviceArray.length) ## Built-in Functions -The built-in functions are located in the root namespace of grCUDA. +The built-in functions are located in the root namespace of GrCUDA. The functions are accessible directly in polyglot expressions or, alternatively, through the `CU` root namespace object. @@ -449,13 +438,13 @@ A complete example is given in the [bindings tutorial](docs/bindings.md). Multiple host and kernel functions can be grouped and imported into GrCUDA in a single step. The signatures of host and kernel functions are specified in NIDL files using [NIDL syntax](#kernel-and-host-function-signatures). All listed functions are registered in -the grCUDA `targetNamespace`. +the GrCUDA `targetNamespace`. ```text bindall(targetNamespace, fileName, nidlFileName) ``` -`targetNamespace`: grCUDA namespace into which all functions listed in the NIDL file that +`targetNamespace`: GrCUDA namespace into which all functions listed in the NIDL file that are imported from `fileName` are registered.
`nidlFileName`: name of the NIDL that contains specification for the host or kernel @@ -701,7 +690,7 @@ warpSize A subset of the functions of the [CUDA Runtime API](https://docs.nvidia.com/cuda/cuda-runtime-api/index.html) -are exported into the root namespace of grCUDA i.e., the `CU` object. +are exported into the root namespace of GrCUDA, i.e., the `CU` object. ```python opaquePointerToDeviceMemory = polyglot.eval(language='grcuda', string='cudaMalloc(1000)') @@ -809,7 +798,7 @@ The letter S, D, C, Z designate the data type: Exposed functions from RAPIDS cuML (`libcuml.so`) are preregistered in the namespace `ML`. The `cumlHandle_t` argument is implicitly provided by -grCUDA and, thus, must be omitted in the polyglot callable. +GrCUDA and, thus, must be omitted in the polyglot callable. The cuML function registry can be disabled by setting `--grcuda.CuMLEnabled=false`. The absolute path to the `libcuml.so` shared library must be specified in `--grcuda.CuMLLibrary=`. @@ -819,7 +808,7 @@ Current set of **preregistered** cuML functions: - `void ML::cumlSpDbscanFit(DeviceArray input, int num_rows, int num_cols, float eps, int min_samples, DeviceArray labels, size_t max_bytes_per_chunk, int verbose)` - `void ML::cumlDbDbscanFit(DeviceArray input, int num_rows, int num_cols, double eps, int min_samples, DeviceArray labels, size_t max_bytes_per_chunk, int verbose)` -## Grammar of grCUDA +## Grammar of GrCUDA ```text expr ::= arrayExpression | funcCall | callable diff --git a/docs/logging.md b/docs/logging.md new file mode 100644 index 00000000..0fb95d75 --- /dev/null +++ b/docs/logging.md @@ -0,0 +1,104 @@ +# Logging in GrCUDA + +Support for logging in Truffle languages and instruments is provided by the `TruffleLogger` class. +Different logging levels are provided by the `Level` class, to differentiate the severity of errors and warnings.
This makes it possible to decide the minimum severity at which messages are printed, either to `stdout` or to a log file. + +Using the logger from another language (e.g. Python): +```bash +graalpython --jvm --polyglot --log.grcuda.com.nvidia.grcuda.level=ALL my_script.py +``` + +## Logging Levels + +The logging `Level` objects are ordered and are specified by integer values. Enabling logging at a given level also enables logging at all higher levels. +The levels in descending order are: +- **SEVERE** (highest value) +- **WARNING** +- **INFO** +- **CONFIG** +- **FINE** +- **FINER** +- **FINEST** (lowest value) + +In addition, there is a level **OFF** that can be used to turn off logging, and a level **ALL** that can be used to enable logging of all messages. +The [TruffleLogger](https://www.graalvm.org/truffle/javadoc/com/oracle/truffle/api/TruffleLogger.html) and [Level](https://docs.oracle.com/javase/7/docs/api/java/util/logging/Level.html) classes are part of the Truffle and Java APIs; follow the links for further information. + +## Available Loggers + +GrCUDA exposes different loggers, each with its own functionality. The GrCUDALogger class provides access to the loggers of interest when specific features are needed.
+The main loggers defined in GrCUDALogger are the following: +- **GRCUDA_LOGGER**: the main logger, to which all logging in GrCUDA can refer; + +```java +public static final String GRCUDA_LOGGER = "com.nvidia.grcuda"; +``` + +- **RUNTIME_LOGGER**: logger for the runtime component of GrCUDA; + +```java +public static final String RUNTIME_LOGGER = "com.nvidia.grcuda.runtime"; +``` + +- **EXECUTIONCONTEXT_LOGGER**: logger for the execution context component of the runtime; + +```java +public static final String EXECUTIONCONTEXT_LOGGER = "com.nvidia.grcuda.runtime.executioncontext"; +``` + +If further loggers are needed, they can easily be added to the GrCUDALogger class; be sure to respect the naming convention, as in the following example. + +```java +public static final String NEW_LOGGER = "com.nvidia.grcuda.new"; +``` + +### Using available loggers + +To use the available loggers in the code, follow the instructions below: +1. create the logger in the project's class as a `TruffleLogger` object: + +```java +public static final TruffleLogger LOGGER_NAME = GrCUDALogger.getLogger(GrCUDALogger.LOGGER_NAME); +``` +2. log the message at the desired level (severe, warning, info, ...): + +```java +LOGGER_NAME.logger_level("message"); +``` + +As an alternative to step 2, it is also possible to associate the logging level directly with the message, using the following form: + +```java +GrCUDALogger.getLogger(GrCUDALogger.LOGGER_NAME).logger_level("*message*"); +``` + +## Loggers Configuration + +All loggers are set to level INFO by default. +The level of each logger can be modified with Graal options from the command line. +In particular, it is possible to specify a single output file for all logger messages. +Set the *path_to_file* (see the examples below).
+ +```bash +--log.file=path_to_file +``` +It is also possible to specify the *logger_level* for each logger (see all possible levels above). + +```bash +--log.grcuda.com.nvidia.grcuda.chosen_logger.level=logger_level +``` + +In the following we provide some examples of logging, using the benchmark b1 (all its options are set to their default values): + +- set all the loggers of GrCUDA to ALL, printed to `stdout`: +```bash +graalpython --jvm --polyglot --log.grcuda.com.nvidia.grcuda.level=ALL benchmark_main.py -d -b b1 +``` + +- set all the loggers of GrCUDA to ALL, saved to the file b1.log in the folder from which the command is launched: +```bash +graalpython --jvm --polyglot --log.grcuda.com.nvidia.grcuda.level=ALL --log.file=./b1.log benchmark_main.py -d -b b1 +``` + +- set all the loggers of grcuda.runtime to ALL and all the other loggers of GrCUDA to OFF, saved to the file b1.log in the root folder of GrCUDA *GRCUDA_HOME*: +```bash +graalpython --jvm --polyglot --log.grcuda.com.nvidia.grcuda.level=OFF --log.grcuda.com.nvidia.grcuda.runtime.level=ALL --log.file=$GRCUDA_HOME/b1.log benchmark_main.py -d -b b1 +``` diff --git a/docs/typemapping.md b/docs/typemapping.md index f69b5e2b..2ae17e65 100644 --- a/docs/typemapping.md +++ b/docs/typemapping.md @@ -1,6 +1,6 @@ # Type Mapping for C++ -grCUDA needs to interact with tree different type systems: +GrCUDA needs to interact with three different type systems: - The Java types of the Truffle interop protocol (`boolean`, `byte`, `short`, `int`, `long`, `float`, `double`, `String`) as described in [InteropLibrary](https://www.graalvm.org/truffle/javadoc/com/oracle/truffle/api/interop/InteropLibrary.html). @@ -10,7 +10,7 @@ grCUDA needs to interact with tree different type systems: - C interface types from TruffleNFI (`void`, `sint8`, `uint8`, `sint16`, `uint16`, `sint32`, `uint32`, `sint64`, `uint64`, `float`, `double`, `pointer`, `object`, `string`).
[NativeSimpleType.java](https://github.com/oracle/graal/blob/master/truffle/src/com.oracle.truffle.nfi.spi/src/com/oracle/truffle/nfi/spi/types/NativeSimpleType.java) - defines these types. grCUDA uses TruffleNFI to invoke native host functions. While TruffleNFI is designed for C APIs, the C++ types are necessary create the mangled symbols names to invoke with TruffleNFI. + defines these types. GrCUDA uses TruffleNFI to invoke native host functions. While TruffleNFI is designed for C APIs, the C++ types are necessary to create the mangled symbol names to invoke with TruffleNFI. Types for clang/gcc under Linux (LP64) @@ -44,7 +44,7 @@ void * | 8 | Pv | out pointer void | pointer | n/a const void * | 8 | PKv | in pointer void | pointer | n/a const char * | 8 | PKc | string | string | String -(*) not supported in grCUDA or NFI +(*) not supported in GrCUDA or NFI (+) does not support all values (NFI limitation) ## Synonymous NIDL types diff --git a/grcuda-data b/grcuda-data new file mode 160000 index 00000000..b8081260 --- /dev/null +++ b/grcuda-data @@ -0,0 +1 @@ +Subproject commit b80812609e829930a2bfc894ee0e22b99f6a1297 diff --git a/install.sh b/install.sh new file mode 100755 index 00000000..4ce6b793 --- /dev/null +++ b/install.sh @@ -0,0 +1,12 @@ +#!/bin/sh +mx build; + +# Install for Java 8+; +mkdir -p $GRAAL_HOME/languages/grcuda; +cp $GRCUDA_HOME/mxbuild/dists/jdk1.8/grcuda.jar $GRAAL_HOME/languages/grcuda/.; +cp $GRCUDA_HOME/mxbuild/dists/grcuda.jar $GRAAL_HOME/languages/grcuda/.; + +# Compute interconnection graph (connection_graph.csv) +cd $GRCUDA_HOME/projects/resources/connection_graph +./run.sh +cd $GRCUDA_HOME diff --git a/mx.grcuda/suite.py b/mx.grcuda/suite.py index 3c79d822..e5e4bf83 100644 --- a/mx.grcuda/suite.py +++ b/mx.grcuda/suite.py @@ -1,5 +1,5 @@ -# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. -# +# Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: @@ -8,10 +8,13 @@ # * Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. -# + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR @@ -39,8 +42,8 @@ "groupId": "com.nvidia.grcuda", "developer": { - "name": "grCUDA Developers", - "organization": "grCUDA Developers", + "name": "GrCUDA Developers", + "organization": "GrCUDA Developers", }, @@ -53,7 +56,7 @@ "suites": [ { "name": "truffle", - "version": "c541f641249fb5d615aa8e375ddc950d3b5b3715", + "version": "84541b16ae8a8726a0e7d76c7179d94a57ed84ee", "subdir": True, "urls": [ {"url": "https://github.com/oracle/graal", "kind": "git"}, @@ -109,7 +112,7 @@ "subDir": "projects", "license": ["BSD-3"], "sourceDirs": ["src"], - "javaCompliance": "1.8", + "javaCompliance": "8+", "annotationProcessors": ["truffle:TRUFFLE_DSL_PROCESSOR"], "dependencies": [ "truffle:TRUFFLE_API", @@ -125,10 +128,10 @@ "dependencies": [ "com.nvidia.grcuda", "mx:JUNIT", - "truffle:TRUFFLE_TEST" + "truffle:TRUFFLE_TEST", ], "checkstyle": "com.nvidia.grcuda", - "javaCompliance": "1.8", + "javaCompliance": "8+", "annotationProcessors": 
["truffle:TRUFFLE_DSL_PROCESSOR"], "workingSets": "Truffle,CUDA", "testProject": True, @@ -157,11 +160,12 @@ "sdk:GRAAL_SDK", ], "sourcesPath": "grcuda.src.zip", - "description": "grCUDA", + "description": "GrCUDA", + "javaCompliance": "8+", }, "GRCUDA_UNIT_TESTS": { - "description": "grCUDA unit tests", + "description": "GrCUDA unit tests", "dependencies": [ "com.nvidia.grcuda.test", ], @@ -172,6 +176,7 @@ ], "sourcesPath": "grcuda.tests.src.zip", "testDistribution": True, + "javaCompliance": "8+", }, }, } diff --git a/oci_setup/.gitignore b/oci_setup/.gitignore new file mode 100644 index 00000000..78669407 --- /dev/null +++ b/oci_setup/.gitignore @@ -0,0 +1,2 @@ +tmp_oci_setup/ +*.json \ No newline at end of file diff --git a/oci_setup/oci_connect.sh b/oci_setup/oci_connect.sh new file mode 100755 index 00000000..11f0055f --- /dev/null +++ b/oci_setup/oci_connect.sh @@ -0,0 +1,77 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+ +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/bin/bash + +# Script used to connect to OCI instances. +# As we have to create new instances very often, +# the host key of new instances changes, +# requiring us to delete the previous one we have. +# This script simplifies the process, +# so you can connect as "./oci_connect.sh" + +# IP address of the OCI instance; +OCI_IP=152.67.254.100 # Some default IP, change it to whatever you have; +# Path to the .ssh folder; +SSH_FOLDER=~/.ssh +# Path to the private SSH key used to connect to OCI, +# relative to ${SSH_FOLDER}; +PRIVATE_SSH_KEY_PATH=id_rsa + +# Flags used to set debug (print commands), +# OCI IP, SSH folder and SSH private key path; +for arg in "$@" +do + case $arg in + -d|--debug) # Debug flag; + set -x + shift + ;; + -i=*|--ip=*) # OCI IP address; + OCI_IP="${arg#*=}" + shift + ;; + -s=*|--ssh_folder=*) # SSH folder; + SSH_FOLDER="${arg#*=}" + shift + ;; + -k=*|--ssh_key=*) # SSH key; + PRIVATE_SSH_KEY_PATH="${arg#*=}" + shift + ;; + *) # Ignore other flags; + shift + ;; + esac +done + +# Remove the outdated host key of the OCI instance; +ssh-keygen -f ${SSH_FOLDER}/known_hosts -R ${OCI_IP} +# Connect to the OCI instance (assuming a default Ubuntu installation); +ssh -i
${SSH_FOLDER}/${PRIVATE_SSH_KEY_PATH} -o StrictHostKeyChecking=no ubuntu@${OCI_IP} \ No newline at end of file diff --git a/oci_setup/oci_setup_instance.py b/oci_setup/oci_setup_instance.py new file mode 100644 index 00000000..ae7c4461 --- /dev/null +++ b/oci_setup/oci_setup_instance.py @@ -0,0 +1,275 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Script used to simplify the creation of GPU instances on OCI. +The script must be used inside your OCI console. +You can specify an existing boot volume, an existing public IP to use, +and a public key in your possession to use for login. +Settings can be specified through a separate JSON file, +whose keys are the same as specified in the CONFIG section of this script; + +Created on Mon Jan 25 11:43:34 2021 +@author: aparravi +""" + +import json +import os +import argparse +from datetime import datetime +from pathlib import Path +import subprocess +import shutil + +############################## +# CONFIG ##################### +############################## + +# Here you can specify some default values, +# if you don't want to specify them through a separate file; +CONFIG = dict( + REGION = "rNHZ:US-SANJOSE-1-AD-1", + VM = "VM.Standard.E2.1", + NUM_GPUS = 0, + PUBLIC_IP = "152.67.254.100", + # OCID of the Compartment; + COMPARTMENT = "ocid1.compartment.oc1.your.comparment.ocid", + # OCID of the Public Subnet; + SUBNET = "ocid1.subnet.oc1.us-sanjose-1.your.subnet.ocid", + # OCID of the Boot Volume; + BOOT_VOLUME = "ocid1.bootvolume.oc1.us-sanjose-1.your.bootvolume.ocid", + # Public key employed when creating the instance the first time (with a "fresh" Boot Volume). 
+ # If you use a Boot Volume created by another user, make sure to add your public key to ~/.ssh/authorized_keys + SSH_KEY = "ssh-rsa your-key", +) +############################## +# SETUP ###################### +############################## + +DEBUG = False + +# Map GPU number to default instance shapes; +NUM_GPU_TO_SHAPE = { + 0: CONFIG["VM"], + 1: "VM.GPU3.1", + 2: "VM.GPU3.2", + 4: "VM.GPU3.4", + 8: "BM.GPU3.8", +} + +DEFAULT_SETUP_JSON = """ +{{ + "compartmentId": "{}", + "sourceBootVolumeId": "{}", + "sshAuthorizedKeys": "{}", + "subnetId": "{}", + "assignPublicIp": false +}} +""" + +# Temporary directory where data are stored; +DEFAULT_TEMP_DIR = "tmp_oci_setup" + +# OCI commands; +OCI_LAUNCH_INSTANCE = "oci compute instance launch --from-json file://{} --wait-for-state RUNNING" +OCI_OBTAIN_VNIC = "oci compute instance list-vnics --limit 1 --instance-id {}" +OCI_OBTAIN_PRIVATE_IP = "oci network private-ip list --vnic-id {}" +OCI_OBTAIN_PUBLIC_IP = "oci network public-ip get --public-ip-address {}" +OCI_UPDATE_PUBLIC_IP = "oci network public-ip update --public-ip-id {} --private-ip-id {}" + +############################## +############################## + +def log_message(message: str) -> None: + date = datetime.now() + date_str = date.strftime("%Y-%m-%d-%H-%M-%S-%f") + print(f"[{date_str} oci-setup] {message}") + + +def parse_shape_name(shape: str) -> str: + if shape == CONFIG["VM"]: + return "cpu-default" + elif "gpu" in shape.lower(): + gpu_count = shape.split(".")[-1] + return f"gpu-{gpu_count}" + else: + return shape.replace(".", "-") + + +def create_instance_launch_dict(shape: str, json_setup: str, debug: bool=DEBUG) -> dict: + instance_launch_dict = json.loads(json_setup) + # Add region; + instance_launch_dict["availabilityDomain"] = CONFIG["REGION"] + # Add shape; + instance_launch_dict["shape"] = shape + # Create hostname and display name; + hostname = display_name = parse_shape_name(shape) + instance_launch_dict["hostname"] = f"grcuda-{hostname}" 
+ instance_launch_dict["displayName"] = f"grcuda-{display_name}" + if debug: + log_message(instance_launch_dict) + return instance_launch_dict + + +def run_oci_command(command_template: str, *command_format_args, debug: bool=DEBUG) -> dict: + # Setup the OCI command; + oci_command = command_template.format(*command_format_args) + if debug: + log_message(f"launching OCI command: {oci_command}") + # Launch the OCI command; + try: + result = subprocess.run(oci_command, shell=True, env=os.environ, check=True, + stdout=subprocess.PIPE, stderr=subprocess.STDOUT) + except subprocess.CalledProcessError as e: + if debug: + log_message(f"caught exception {e.output} during OCI command") + exit(-1) + if result.stderr: + if debug: + log_message("OCI command completed with error") + log_message(result.stderr) + exit(-1) + # Everything is ok, we extract the result as a dictionary. + # There might be other stuff printed along with the JSON, so we have to remove it; + res_tmp = result.stdout.decode("utf-8") + res_tmp = res_tmp[res_tmp.index("{"):] # Delete everything until the first "{"; + res_tmp = res_tmp[:-res_tmp[::-1].index("}")] # Delete everything after the last "}"; + return json.loads(res_tmp) + + +def launch_instance(instance_launch_dict: dict, debug: bool=DEBUG) -> str: + # We have to store the dictionary to a temporary file; + launch_json_file_name = os.path.join(DEFAULT_TEMP_DIR, instance_launch_dict["displayName"] + ".json") + if debug: + log_message(f"storing temporary launch JSON into {launch_json_file_name}") + # Create temporary folder; + Path(DEFAULT_TEMP_DIR).mkdir(parents=True, exist_ok=True) + # Store dictionary to JSON; + with open(launch_json_file_name, "w") as f: + json.dump(instance_launch_dict, f) + # Setup the launch command; + result = run_oci_command(OCI_LAUNCH_INSTANCE, launch_json_file_name, debug=debug) + # Extract the instance OCID for later use; + instance_ocid = result["data"]["id"] + if debug: + log_message(f"created instance with 
OCID={instance_ocid}") + + # Remove the temporary configuration file; + os.remove(launch_json_file_name) + if len(os.listdir(DEFAULT_TEMP_DIR)) == 0: + shutil.rmtree(DEFAULT_TEMP_DIR) # Remove the folder if it is empty; + + return instance_ocid + + +def attach_reserved_public_ip(instance_ocid: str, debug: bool=DEBUG) -> None: + # We have to obtain the VNIC attached to the instance (assume only 1 VNIC is available); + result = run_oci_command(OCI_OBTAIN_VNIC, instance_ocid, debug=debug) + # Extract the VNIC OCID; + vnic_ocid = result["data"][0]["id"] + if debug: + log_message(f"obtained VNIC with OCID={vnic_ocid}") + # Obtain the private address OCID associated to the VNIC; + result = run_oci_command(OCI_OBTAIN_PRIVATE_IP, vnic_ocid, debug=debug) + # Extract the private IP OCID; + private_ip_ocid = result["data"][0]["id"] + if debug: + log_message(f"obtained private IP with OCID={private_ip_ocid}") + # Obtain the public IP OCID; + result = run_oci_command(OCI_OBTAIN_PUBLIC_IP, CONFIG["PUBLIC_IP"], debug=debug) + # Extract the VNIC OCID; + public_ip_ocid = result["data"]["id"] + if debug: + log_message(f"obtained public IP with OCID={public_ip_ocid}") + # Assign the reserved public IP; + run_oci_command(OCI_UPDATE_PUBLIC_IP, public_ip_ocid, private_ip_ocid, debug=debug) + if debug: + log_message(f"assigned public IP {CONFIG['PUBLIC_IP']}") + + +def update_config_with_json(json_path: str, debug: bool=DEBUG) -> None: + try: + if debug: + log_message(f"loading configuration file {json_path}") + with open(json_path) as f: + json_config = json.load(f) + for k in CONFIG.keys(): + if k in json_config: + CONFIG[k] = json_config[k] + except Exception as e: + log_message(f"warning: failed to load configuration file {json_path}, using default values") + log_message(f" encountered exception: {e}") + +############################## +############################## + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Setup OCI instances from the command 
line") + parser.add_argument("-d", "--debug", action="store_true", help="If present, print debug messages", default=DEBUG) + parser.add_argument("-g", "--num_gpus", metavar="N", type=int, default=CONFIG["NUM_GPUS"], help="Number of GPUs present in the instance") + parser.add_argument("-p", "--print_config", action="store_true", help="If present, print the configuration options provided as input to the setup", default=False) + parser.add_argument("-j", "--json", type=str, help="Load the configuration provided in the following JSON") + + # 1. Parse the input arguments; + args = parser.parse_args() + debug = args.debug + json_path = args.json + if json_path: + update_config_with_json(json_path, debug) # Try loading the JSON configuration, if specified; + num_gpus = args.num_gpus + + if debug and args.print_config: + log_message(f"provided input configuration:") + for k, v in CONFIG.items(): + log_message(f"  {k}" + ("\t\t" if len(k) < 7 else "\t") + f"= {v}") + + # 2. Select shape; + NUM_GPU_TO_SHAPE[0] = CONFIG["VM"] + if num_gpus in NUM_GPU_TO_SHAPE: + shape = NUM_GPU_TO_SHAPE[num_gpus] + else: + shape = NUM_GPU_TO_SHAPE[0] + if debug: + log_message(f"using {num_gpus} GPUs") + log_message(f"selected shape {shape}") + json_setup = DEFAULT_SETUP_JSON.format(CONFIG["COMPARTMENT"], CONFIG["BOOT_VOLUME"], CONFIG["SSH_KEY"], CONFIG["SUBNET"]) + + # 3. Obtain configuration dictionary; + instance_launch_dict = create_instance_launch_dict(shape, json_setup, debug) + + # 4. Launch the instance; + instance_id = launch_instance(instance_launch_dict, debug) + + # 5.
Attach the reserved public IP to the instance; + attach_reserved_public_ip(instance_id, debug) + + if debug: + log_message("setup completed successfully!") diff --git a/oci_setup/oci_terminate_instance.sh b/oci_setup/oci_terminate_instance.sh new file mode 100755 index 00000000..ee81be56 --- /dev/null +++ b/oci_setup/oci_terminate_instance.sh @@ -0,0 +1,64 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/bin/bash + +# This script must be executed from the OCI console to terminate a running instance, +# while preserving the boot volume for later use. +# The script comes in handy as GPU instances are billed even when shut down, +# so it is common to terminate them after use, while preserving the boot volume so it can be reused; + +# The name of the instance to terminate is passed as the first argument of the script; +DISPLAY_NAME=$1 + +# OCID of the compartment where the instance is, substitute with OCID of your compartment. +# You can specify it as an optional parameter, otherwise a default value is used; +if [ -z "$2" ] +then + COMPARTMENT_ID=ocid1.compartment.oc1.your.compartment.ocid +else + COMPARTMENT_ID=$2 +fi + +# Get instance id; +INSTANCE_ID=$(oci compute instance list -c $COMPARTMENT_ID --lifecycle-state RUNNING --display-name $DISPLAY_NAME --query data[0].id --raw-output) + +if [ -z "$INSTANCE_ID" ] +then + echo "INSTANCE_ID not found using DISPLAY_NAME=${DISPLAY_NAME} and COMPARTMENT_OCID=${COMPARTMENT_ID}"; exit -1; +fi + +# Print info (name, id) about the instance to terminate; +echo display-name=${DISPLAY_NAME} +echo compartment-ocid=${COMPARTMENT_ID} +echo id=$INSTANCE_ID + +# Terminate instance (the terminate command automatically asks for confirmation).
+# Set --preserve-boot-volume to false if you want to permanently erase the boot volume attached to the instance; +oci compute instance terminate --instance-id $INSTANCE_ID --preserve-boot-volume true --wait-for-state TERMINATED \ No newline at end of file diff --git a/oci_setup/setup_graalvm.sh b/oci_setup/setup_graalvm.sh new file mode 100644 index 00000000..134e8215 --- /dev/null +++ b/oci_setup/setup_graalvm.sh @@ -0,0 +1,55 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/bin/bash + +# Use this script to set up GraalVM for GrCUDA. GraalVM must already be +# downloaded: this script is meant to be executed after +# setup_machine_from_scratch.sh to finalize and complete the GrCUDA setup. + +ACTIVATE_GRAALPYTHON_ENV=true + +# set up GraalVM; +gu install native-image +gu install llvm-toolchain +gu install python +gu install nodejs +gu rebuild-images polyglot + +# create environment for Graalpython and set it up; +graalpython -m venv ~/graalpython_venv +source ~/graalpython_venv/bin/activate +graalpython -m ginstall install setuptools +graalpython -m ginstall install Cython +graalpython -m ginstall install numpy + +if [ "$ACTIVATE_GRAALPYTHON_ENV" = true ] ; then + echo 'source ~/graalpython_venv/bin/activate' >> ~/.bashrc + source ~/.bashrc +fi diff --git a/oci_setup/setup_machine_from_scratch.sh b/oci_setup/setup_machine_from_scratch.sh new file mode 100755 index 00000000..86cd1ba5 --- /dev/null +++ b/oci_setup/setup_machine_from_scratch.sh @@ -0,0 +1,188 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/bin/bash + +# You can use this script to setup a clean machine with Ubuntu 20.04 to use GrCUDA. +# We install GraalVM, Nvidia's drivers, CUDA, conda, and download GrCUDA. +# To install GrCUDA, run `cd $GRCUDA_HOME; ./install.sh` after running this script. 
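Since the CUDA repository URLs used further down in this script are pinned to `ubuntu2004`, a release guard near the top would let it fail fast on other Ubuntu versions. The helper below is a hypothetical sketch, not part of the original script (function name and messages are invented):

```shell
# Hypothetical guard: compare the Ubuntu release this script targets
# against the one it is actually running on, before installing packages.
check_release() {
    # $1: expected VERSION_ID, $2: detected VERSION_ID
    if [ "$1" = "$2" ]; then
        echo "ok"
    else
        echo "mismatch"
    fi
}

# In a real script, the detected value would come from /etc/os-release:
# DETECTED=$(. /etc/os-release && echo "$VERSION_ID")
check_release "20.04" "20.04"   # → ok
check_release "20.04" "22.04"   # → mismatch
```

On a mismatch, the script could abort before `apt` starts pulling packages built for the wrong release.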
+ +# Installation flags (change them to customize your installation); +INSTALL_CUML=false +INSTALL_RECENT_CMAKE=false +INSTALL_ON_NVSWITCH_SYSTEM=false + +# basic update on a newly created machine; +sudo apt update +sudo apt upgrade -y +# library needed later to run: gu rebuild-images polyglot and setting up graalpython; +sudo apt install build-essential -y +sudo apt install lib32z1-dev -y +sudo apt install unzip -y +sudo apt install -y python-ctypes +sudo apt install -y curl + +# clone repositories (GraalVM, MX). +# We use the freely available GraalVM CE. +# At the bottom of this guide, it is explained how to install EE; +git clone https://github.com/oracle/graal.git +git clone https://github.com/graalvm/mx.git + +# checkout commit of GraalVM corresponding to the release (21.3); +cd graal +git checkout 84541b16ae8a8726a0e7d76c7179d94a57ed84ee +cd .. + +# checkout commit of mx compatible with versions of other tools; +cd mx +git checkout 722b86b8ef87fbb297f7e33ee6014bbbd3f4a3a8 +cd .. + +# download the GraalVM release build (22.1.0) and the corresponding JVM; +wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-22.1.0/graalvm-ce-java11-linux-amd64-22.1.0.tar.gz +wget https://github.com/graalvm/labs-openjdk-11/releases/download/jvmci-22.1-b01/labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz +# extract them; +tar xfz graalvm-ce-java11-linux-amd64-22.1.0.tar.gz +tar xfz labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz +# remove temporary files; +rm graalvm-ce-java11-linux-amd64-22.1.0.tar.gz +rm labsjdk-ce-11.0.15+2-jvmci-22.1-b01-linux-amd64.tar.gz + +# install CUDA and Nvidia drivers; +# -> option 1 (more automatic, but possibly outdated); +# sudo apt install nvidia-cuda-toolkit -y +# sudo apt install ubuntu-drivers-common -y +# sudo ubuntu-drivers autoinstall +# -> option 2 (from Nvidia's website).
+wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin +sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600 +sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub +sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" +sudo apt-get update +sudo apt-get -y install cuda + +# Systems with NVSwitch require Nvidia's fabric manager. +# See https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf +if [ "$INSTALL_ON_NVSWITCH_SYSTEM" = true ] ; then + sudo apt-get install cuda-drivers-fabricmanager-470 + sudo systemctl start nvidia-fabricmanager + sudo systemctl enable nvidia-fabricmanager +fi + +# symlink for python (use it with care! some system tools require Python 2.7); +# sudo ln -s /usr/bin/python3 /usr/bin/python + +# update ~/.bashrc with new variables; +echo '' >> ~/.bashrc +echo '########## GrCUDA Configuration ##########' >> ~/.bashrc +echo '' >> ~/.bashrc +echo '# CUDA;' >> ~/.bashrc +echo 'export CUDA_DIR=/usr/local/cuda' >> ~/.bashrc +echo 'export PATH=$CUDA_DIR/bin:$PATH' >> ~/.bashrc +echo '# GraalVM and GrCUDA;' >> ~/.bashrc +echo 'export PATH=~/mx:$PATH' >> ~/.bashrc +echo 'export JAVA_HOME=~/labsjdk-ce-11.0.15-jvmci-22.1-b01' >> ~/.bashrc +echo 'export GRAAL_HOME=~/graalvm-ce-java11-22.1.0' >> ~/.bashrc +echo 'export PATH=$GRAAL_HOME/bin:$PATH' >> ~/.bashrc +echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc +echo 'export GRCUDA_HOME=~/grcuda' >> ~/.bashrc +echo '' >> ~/.bashrc +echo '##########################################' >> ~/.bashrc +# reload ~/.bashrc; +source ~/.bashrc + +# install miniconda (Python is required to build with mx); +wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh +chmod 777 Miniconda3-latest-Linux-x86_64.sh +./Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda +$HOME/miniconda/bin/conda init + +# 
optional: install cuML; +if [ "$INSTALL_CUML" = true ] ; then + $HOME/miniconda/bin/conda create -n rapids-21.08 -c rapidsai -c nvidia -c conda-forge cuml=21.08 python=3.8 cudatoolkit=11.2 -y + echo 'export LIBCUML_DIR=/home/ubuntu/miniconda/envs/rapids-21.08/lib/' >> ~/.bashrc + source ~/.bashrc +fi + +# optional: install TensorRT - Currently not supported, it does not work with CUDA 11.7; + +# Install a recent version of CMake, following https://askubuntu.com/questions/355565/how-do-i-install-the-latest-version-of-cmake-from-the-command-line; +if [ "$INSTALL_RECENT_CMAKE" = true ] ; then + sudo apt remove --purge --auto-remove cmake -y + sudo apt update && sudo apt install -y software-properties-common lsb-release && sudo apt clean all + wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null + sudo apt-add-repository "deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main" + sudo apt update -y + sudo apt install kitware-archive-keyring -y + sudo rm /etc/apt/trusted.gpg.d/kitware.gpg + sudo apt update + sudo apt install cmake -y +fi + +# Installing TensorRT cannot be automated. +# Go to https://developer.nvidia.com/tensorrt, click "Get Started", login to an Nvidia account, download it to a local machine and upload it here; +# sudo dpkg -i nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.1.6-ga-20210626_1-1_amd64.deb +# sudo apt-key add /var/nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.1.6-ga-20210626/7fa2af80.pub +# sudo apt-get update +# sudo apt-get install tensorrt + +# reboot the machine to load the Nvidia drivers; +sudo reboot + +########################################## +########################################## + +# # alternative: install GraalVM EE. +# # 1. go to https://www.oracle.com/downloads/graalvm-downloads.html?selected_tab=1 +# # 2. download Oracle GraalVM Enterprise Edition Core for Java 11. An Oracle account is required +# # 3. 
transfer the tar.gz to your machine, and extract it with +# tar -xzf graalvm-ee-java11-linux-amd64-21.3.0.tar.gz +# rm graalvm-ee-java11-linux-amd64-21.3.0.tar.gz +# # Install Labs-JDK to build GrCUDA; +# wget https://github.com/graalvm/labs-openjdk-11/releases/download/jvmci-21.3-b05/labsjdk-ce-11.0.13+7-jvmci-21.3-b05-linux-amd64.tar.gz +# tar -xzf labsjdk-ce-11.0.13+7-jvmci-21.3-b05-linux-amd64.tar.gz +# rm labsjdk-ce-11.0.13+7-jvmci-21.3-b05-linux-amd64.tar.gz +# cd graal +# git checkout 84541b16ae8a8726a0e7d76c7179d94a57ed84ee +# cd .. +# # checkout commit of mx compatible with versions of other tools; +# cd mx +# git checkout 722b86b8ef87fbb297f7e33ee6014bbbd3f4a3a8 +# cd .. +# echo '# GraalVM EE' >> ~/.bashrc +# echo 'export JAVA_HOME=~/labsjdk-ce-11.0.13-jvmci-21.3-b05' >> ~/.bashrc +# echo 'export GRAAL_HOME=~/graalvm-ee-java11-21.3.0' >> ~/.bashrc +# echo 'export PATH=$GRAAL_HOME/bin:$PATH' >> ~/.bashrc +# echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc +# # Install components. Note: this requires user interaction, and an email address associated to an Oracle account +# gu install native-image +# gu install llvm-toolchain +# gu install python +# gu install nodejs +# gu rebuild-images polyglot diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BindKernelTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BindKernelTest.java index baeabbc6..a92db996 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BindKernelTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BindKernelTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -33,9 +40,12 @@ import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintWriter; + import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertTrue; + +import com.nvidia.grcuda.test.util.GrCUDATestUtil; import org.graalvm.polyglot.Context; import org.graalvm.polyglot.Value; import org.junit.BeforeClass; @@ -86,7 +96,7 @@ public static void setupUpClass() throws IOException, InterruptedException { void testWithSignature(String... bindArgs) { // Build inc_kernel symbol, launch it, and check results. - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); Value bindkernel = context.eval("grcuda", "bindkernel"); Value incKernel = bindArgs.length > 1 ? 
bindkernel.execute(BindKernelTest.ptxFileName, bindArgs[0], bindArgs[1]) diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BindTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BindTest.java index 0c9bcc2f..8a1a0987 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BindTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BindTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -36,6 +43,8 @@ import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertEquals; + +import com.nvidia.grcuda.test.util.GrCUDATestUtil; import org.graalvm.polyglot.Context; import org.graalvm.polyglot.Value; import org.junit.BeforeClass; @@ -106,7 +115,7 @@ public static void setupUpClass() throws IOException, InterruptedException { } public void callWithInAndOutArguments(String... 
bindArgs) { - try (Context polyglot = Context.newBuilder().allowAllAccess(true).build()) { + try (Context polyglot = GrCUDATestUtil.buildTestContext().build()) { Value cu = polyglot.eval("grcuda", "CU"); Value inDeviceArray = cu.getMember("DeviceArray").execute("int", numElements); Value outDeviceArray = cu.getMember("DeviceArray").execute("float", numElements); @@ -134,7 +143,7 @@ public void callWithInAndOutArguments(String... bindArgs) { } public void callWithInoutArgument(String... bindArgs) { - try (Context polyglot = Context.newBuilder().allowAllAccess(true).build()) { + try (Context polyglot = GrCUDATestUtil.buildTestContext().build()) { Value cu = polyglot.eval("grcuda", "CU"); Value inoutDeviceArray = cu.getMember("DeviceArray").execute("int", numElements); for (int i = 0; i < numElements; i++) { diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BuildKernelTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BuildKernelTest.java index 80d0091b..73700ea3 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BuildKernelTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/BuildKernelTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -27,13 +34,16 @@ */ package com.nvidia.grcuda.test; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; + import java.util.Random; + import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertTrue; -import org.graalvm.polyglot.Context; -import org.graalvm.polyglot.Value; -import org.junit.Test; public class BuildKernelTest { @@ -55,7 +65,7 @@ public class BuildKernelTest { @Test public void testBuildKernelwithNFILegacytSignature() { // See if inc_kernel can be built - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value buildkernel = context.eval("grcuda", "buildkernel"); Value incrKernel = buildkernel.execute(INCREMENT_KERNEL_SOURCE, "inc_kernel", INCREMENT_KERNEL_NFI_LEGACY_SIGNATURE); assertNotNull(incrKernel); @@ -68,7 +78,7 @@ public void testBuildKernelwithNFILegacytSignature() { @Test public void testBuildKernelwithNIDLSignature() { // See if inc_kernel can be built - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value buildkernel = context.eval("grcuda", "buildkernel"); Value incrKernel = buildkernel.execute(INCREMENT_KERNEL_SOURCE, INCREMENT_KERNEL_NIDL_SIGNATURE); assertNotNull(incrKernel); @@ -81,7 +91,7 @@ public void testBuildKernelwithNIDLSignature() { @Test public void testBuild1DKernelAndLaunch() { // Build inc_kernel, launch it, and check results. 
- try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { final int numElements = 1000; Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); Value buildkernel = context.eval("grcuda", "buildkernel"); @@ -163,17 +173,17 @@ public void testBuild1DKernelAndLaunch() { @Test public void testBuild2DKernelAndLaunch() { // build matmult kernel, launch it on 2D grid, and check results - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value buildkernel = context.eval("grcuda", "buildkernel"); Value matmultKernel = buildkernel.execute(MATMULT_KERNEL_SOURCE, MATMULT_KERNEL_SIGNATURE); assertNotNull(matmultKernel); assertTrue(matmultKernel.canExecute()); - assertEquals(0, matmultKernel.getMember("launchCount").asInt()); +// assertEquals(0, matmultKernel.getMember("launchCount").asInt()); assertNotNull(matmultKernel.getMember("ptx").asString()); // generate matrices - final int numARows = 256; - final int numACols = 192; + final int numARows = 128; + final int numACols = 128; final int numBRows = numACols; final int numBCols = 128; final int blockSize = 32; diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/CUBLASTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/CUBLASTest.java deleted file mode 100644 index 47f21797..00000000 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/CUBLASTest.java +++ /dev/null @@ -1,251 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. 
- * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of NVIDIA CORPORATION nor the names of its - * contributors may be used to endorse or promote products derived - * from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY - * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR - * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR - * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, - * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, - * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR - * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY - * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- */ -package com.nvidia.grcuda.test; - -import static org.junit.Assert.assertEquals; - -import java.util.Arrays; -import java.util.Collection; -import java.util.function.Function; - -import org.junit.BeforeClass; -import org.junit.Test; -import org.junit.runner.RunWith; -import org.junit.runners.Parameterized; -import org.junit.runners.Parameterized.Parameters; -import org.graalvm.polyglot.Context; -import org.graalvm.polyglot.Value; - -@RunWith(Parameterized.class) -public class CUBLASTest { - - @Parameters - public static Collection data() { - return Arrays.asList(new Object[][]{ - {'S'}, - {'D'}, - {'C'}, - {'Z'}, - }); - } - - @BeforeClass - public static void setup() { - polyglot = Context.newBuilder().allowAllAccess(true).build(); - cu = polyglot.eval("grcuda", "CU"); - } - - private static Context polyglot; - private static Value cu; - - private final char typeChar; - - public CUBLASTest(char typeChar) { - this.typeChar = typeChar; - } - - /** - * BLAS Level-1 Test. - */ - @Test - public void testTaxpy() { - // x = (0, 1, 2, ..., numDim-1) - // y = (0, -2, -4, ..., -2*numDim-2) - // y := -1 * x + y - // y = (0, 1, 2, ..., numDim-1) - boolean isComplex = (typeChar == 'C') || (typeChar == 'Z'); - String cudaType = ((typeChar == 'D') || (typeChar == 'Z')) ? "double" : "float"; - int numDim = 1000; - int numElements = isComplex ? numDim * 2 : numDim; - Value alpha = cu.invokeMember("DeviceArray", cudaType, isComplex ? 
2 : 1); - alpha.setArrayElement(0, -1); - if (isComplex) { - alpha.setArrayElement(1, 0); - } - Value x = cu.invokeMember("DeviceArray", cudaType, numElements); - Value y = cu.invokeMember("DeviceArray", cudaType, numElements); - assertEquals(numElements, x.getArraySize()); - assertEquals(numElements, y.getArraySize()); - - for (int i = 0; i < numElements; ++i) { - x.setArrayElement(i, i); - y.setArrayElement(i, 2 * i); - } - Value taxpy = polyglot.eval("grcuda", "BLAS::cublas" + typeChar + "axpy"); - taxpy.execute(numDim, alpha, x, 1, y, 1); - assertOutputVectorIsCorrect(numElements, y, (Integer i) -> i); - } - - /** - * BLAS Level-2 Test. - */ - @Test - public void testTgemv() { - int numDim = 10; - boolean isComplex = (typeChar == 'C') || (typeChar == 'Z'); - String cudaType = ((typeChar == 'D') || (typeChar == 'Z')) ? "double" : "float"; - int numElements = isComplex ? numDim * 2 : numDim; - Value alpha = cu.invokeMember("DeviceArray", cudaType, isComplex ? 2 : 1); - Value beta = cu.invokeMember("DeviceArray", cudaType, isComplex ? 2 : 1); - alpha.setArrayElement(0, -1); - beta.setArrayElement(0, 2); - if (isComplex) { - alpha.setArrayElement(1, 0); - beta.setArrayElement(1, 0); - } - - // complex types require two elements along 1st dimension (since column-major order) - Value matrixA = cu.invokeMember("DeviceArray", cudaType, numElements, numDim, "F"); - Value x = cu.invokeMember("DeviceArray", cudaType, numElements); - Value y = cu.invokeMember("DeviceArray", cudaType, numElements); - - // set matrix - // A: identity matrix - for (int j = 0; j < numDim; j++) { - for (int i = 0; i < numElements; i++) { - // complex types require two elements along 1st dimension (since column-major order) - Value row = matrixA.getArrayElement(i); - row.setArrayElement(j, ((!isComplex & (i == j)) || (isComplex && (i == (2 * j)))) ? 
1.0 : 0.0); - } - } - - // set vectors - // x = (1, 2, ..., numDim) - // y = (1, 2, ..., numDim) - for (int i = 0; i < numElements; i++) { - x.setArrayElement(i, i); - y.setArrayElement(i, i); - } - Value tgemv = polyglot.eval("grcuda", "BLAS::cublas" + typeChar + "gemv"); - final int cublasOpN = 0; - tgemv.execute(cublasOpN, numDim, numDim, - alpha, - matrixA, numDim, - x, 1, - beta, - y, 1); - assertOutputVectorIsCorrect(numElements, y, (Integer i) -> i); - } - - /** - * BLAS Level-3 Test. - */ - @Test - public void testTgemm() { - int numDim = 10; - boolean isComplex = (typeChar == 'C') || (typeChar == 'Z'); - String cudaType = ((typeChar == 'D') || (typeChar == 'Z')) ? "double" : "float"; - int numElements = isComplex ? numDim * 2 : numDim; - Value alpha = cu.invokeMember("DeviceArray", cudaType, isComplex ? 2 : 1); - Value beta = cu.invokeMember("DeviceArray", cudaType, isComplex ? 2 : 1); - alpha.setArrayElement(0, -1); - beta.setArrayElement(0, 2); - if (isComplex) { - alpha.setArrayElement(1, 0); - beta.setArrayElement(1, 0); - } - - // complex types require two elements along 1st dimension (since column-major order) - Value matrixA = cu.invokeMember("DeviceArray", cudaType, numElements, numDim, "F"); - Value matrixB = cu.invokeMember("DeviceArray", cudaType, numElements, numDim, "F"); - Value matrixC = cu.invokeMember("DeviceArray", cudaType, numElements, numDim, "F"); - - // set matrix - // A: identity matrix - for (int j = 0; j < numDim; j++) { - for (int i = 0; i < numElements; i++) { - // complex types require two elements along 1st dimension (since column-major order) - Value row = matrixA.getArrayElement(i); - row.setArrayElement(j, ((!isComplex & (i == j)) || (isComplex && (i == (2 * j)))) ? 
1.0 : 0.0); - } - } - // B == C - for (int j = 0; j < numDim; j++) { - for (int i = 0; i < numElements; i++) { - Value row = matrixB.getArrayElement(i); - row.setArrayElement(j, i + numElements * j); - } - } - for (int j = 0; j < numDim; j++) { - for (int i = 0; i < numElements; i++) { - Value row = matrixC.getArrayElement(i); - row.setArrayElement(j, i + numElements * j); - } - } - Value tgemm = polyglot.eval("grcuda", "BLAS::cublas" + typeChar + "gemm"); - final int cublasOpN = 0; - tgemm.execute(cublasOpN, cublasOpN, numDim, numDim, numDim, - alpha, - matrixA, numDim, - matrixB, numDim, - beta, - matrixC, numDim); - assertOutputMatrixIsCorrect(numDim, numElements, matrixC, (Integer i) -> i); - } - - /** - * Validation function for vectors. - */ - private void assertOutputVectorIsCorrect(int len, Value deviceArray, - Function outFunc) { - boolean hasDouble = (typeChar == 'D') || (typeChar == 'Z'); - for (int i = 0; i < len; i++) { - if (hasDouble) { - double expected = outFunc.apply(i); - double actual = deviceArray.getArrayElement(i).asDouble(); - assertEquals(expected, actual, 1e-5); - } else { - float expected = outFunc.apply(i); - float actual = deviceArray.getArrayElement(i).asFloat(); - assertEquals(expected, actual, 1e-5f); - } - } - } - - /** - * Validation function for matrix. 
- */ - private void assertOutputMatrixIsCorrect(int numDim, int numElements, Value matrix, - Function outFunc) { - boolean hasDouble = (typeChar == 'D') || (typeChar == 'Z'); - for (int j = 0; j < numDim; j++) { - for (int i = 0; i < numElements; i++) { - int idx = i + numElements * j; - if (hasDouble) { - double expected = outFunc.apply(idx); - double actual = matrix.getArrayElement(i).getArrayElement(j).asDouble(); - assertEquals(expected, actual, 1e-5); - } else { - float expected = outFunc.apply(idx); - float actual = matrix.getArrayElement(i).getArrayElement(j).asFloat(); - assertEquals(expected, actual, 1e-5f); - } - } - } - } -} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/CUDAEventTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/CUDAEventTest.java new file mode 100644 index 00000000..61fa0fa3 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/CUDAEventTest.java @@ -0,0 +1,215 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test; + +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; + +import java.util.HashSet; +import java.util.Set; +import java.util.stream.IntStream; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertTrue; + +public class CUDAEventTest { + + /** + * Simply check if we can create a CUDA event without blowing things up! 
+ */ + @Test + public void createEventSimpleTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createEvent = context.eval("grcuda", "cudaEventCreate"); + Value event = createEvent.execute(); + assertNotNull(event); + assertTrue(event.isNativePointer()); + } + } + + /** + * Check that we can create many different events; + */ + @Test + public void createManyEventsTest() { + int numEvents = 8; + Set<Long> eventSet = new HashSet<>(); + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createEvent = context.eval("grcuda", "cudaEventCreate"); + IntStream.range(0, numEvents).forEach(i -> { + Value event = createEvent.execute(); + eventSet.add(event.asNativePointer()); + assertNotNull(event); + assertTrue(event.isNativePointer()); + }); + assertEquals(numEvents, eventSet.size()); + } + } + + @Test + public void eventDestroyTest() { + int numEvents = 8; + Set<Value> eventSet = new HashSet<>(); + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createEvent = context.eval("grcuda", "cudaEventCreate"); + Value destroyEvent = context.eval("grcuda", "cudaEventDestroy"); + IntStream.range(0, numEvents).forEach(i -> { + Value event = createEvent.execute(); + eventSet.add(event); + assertNotNull(event); + assertTrue(event.isNativePointer()); + }); + assertEquals(numEvents, eventSet.size()); + eventSet.forEach(destroyEvent::execute); + } + } + + private static final int NUM_THREADS_PER_BLOCK = 32; + + private static final String SQUARE_KERNEL = + "extern \"C\" __global__ void square(float* x, float *y, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " y[idx] = x[idx] * x[idx];\n" + + " }" + + "}\n"; + + private static final String SUM_KERNEL = + "extern \"C\" __global__ void square(float* x, float* y, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " x[idx] = x[idx] + y[idx];\n" + + " }" + + "}\n"; + + /** + 
* Execute two simple kernels sequentially on non-default streams, and synchronize the execution using events; + */ + @Test + public void syncStreamsTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream1 = createStream.execute(); + Value stream2 = createStream.execute(); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, pointer, sint32"); + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + // Set the custom streams; + Value configuredSquareKernel1 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream1); + Value configuredSquareKernel2 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream2); + + Value createEvent = context.eval("grcuda", "cudaEventCreate"); + Value eventRecord = context.eval("grcuda", "cudaEventRecord"); + Value streamEventWait = context.eval("grcuda", "cudaStreamWaitEvent"); + + configuredSquareKernel1.execute(x, y, numElements); + + // Create an event to ensure that kernel 2 executes after kernel 1 is completed; + Value event = createEvent.execute(); + eventRecord.execute(event, stream1); + streamEventWait.execute(stream2, event); + + configuredSquareKernel2.execute(y, x, numElements); + + Value syncStream = context.eval("grcuda", "cudaStreamSynchronize"); + syncStream.execute(stream2); + + for (int i = 0; i < numElements; i++) { + assertEquals(16.0, x.getArrayElement(i).asFloat(), 0.01); + assertEquals(4.0, y.getArrayElement(i).asFloat(), 0.01); + } + } + } + + /** + * Execute two kernels 
on non-default streams, and synchronize them with events before running a third kernel; + * K1(Y = X^2) -> K3(Y += Z) + * K2(Z = X^2) / + */ + @Test + public void joinComputationsTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream1 = createStream.execute(); + Value stream2 = createStream.execute(); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, pointer, sint32"); + Value sumKernel = buildkernel.execute(SUM_KERNEL, "square", "pointer, pointer, sint32"); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + // Set the custom streams; + Value configuredSquareKernel1 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream1); + Value configuredSquareKernel2 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream2); + Value configuredSquareKernel3 = sumKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream1); + + Value createEvent = context.eval("grcuda", "cudaEventCreate"); + Value eventRecord = context.eval("grcuda", "cudaEventRecord"); + Value streamEventWait = context.eval("grcuda", "cudaStreamWaitEvent"); + + configuredSquareKernel1.execute(x, y, numElements); + configuredSquareKernel2.execute(x, z, numElements); + + // Create an event to ensure that kernel 3 executes only after kernel 2 is completed; + Value event = createEvent.execute(); + eventRecord.execute(event, stream2); + streamEventWait.execute(stream1, event); + + 
configuredSquareKernel3.execute(y, z, numElements); + + Value syncStream = context.eval("grcuda", "cudaStreamSynchronize"); + syncStream.execute(stream1); + + for (int i = 0; i < numElements; i++) { + assertEquals(8, y.getArrayElement(i).asFloat(), 0.01); + } + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayCopyFunctionTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayCopyFunctionTest.java deleted file mode 100644 index e7d7abee..00000000 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayCopyFunctionTest.java +++ /dev/null @@ -1,138 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of NVIDIA CORPORATION nor the names of its - * contributors may be used to endorse or promote products derived - * from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY - * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR - * PURPOSE ARE DISCLAIMED. 
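The assertions in syncStreamsTest and joinComputationsTest above encode fixed arithmetic (x initialized to 2, kernels squaring and summing in a fixed order). The following GPU-free, plain-Java sketch reproduces that arithmetic on the host; the helper names `refSquare` and `refSum` are illustrative only and not part of GrCUDA or this patch.

```java
// Host-side reference for the values asserted in the CUDA event tests.
// refSquare mirrors SQUARE_KERNEL (out[i] = in[i]^2); refSum mirrors
// SUM_KERNEL (acc[i] += addend[i]). Names are hypothetical, for illustration.
public class EventTestReference {

    // Mirrors SQUARE_KERNEL: out[i] = in[i] * in[i]
    static void refSquare(float[] in, float[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * in[i];
        }
    }

    // Mirrors SUM_KERNEL: acc[i] = acc[i] + addend[i]
    static void refSum(float[] acc, float[] addend) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] = acc[i] + addend[i];
        }
    }

    public static void main(String[] args) {
        int n = 100;
        float[] x = new float[n], y = new float[n], z = new float[n];

        // syncStreamsTest: K1 on stream1 computes y = x^2; the event recorded
        // on stream1 gates K2 on stream2, which then computes x = y^2.
        java.util.Arrays.fill(x, 2.0f);
        refSquare(x, y);                        // y = 4
        refSquare(y, x);                        // x = 16
        System.out.println(x[0] + " " + y[0]);  // prints "16.0 4.0"

        // joinComputationsTest: K1 (y = x^2) and K2 (z = x^2) run on separate
        // streams; the event recorded on stream2 gates K3 (y += z) on stream1.
        java.util.Arrays.fill(x, 2.0f);
        refSquare(x, y);                        // y = 4
        refSquare(x, z);                        // z = 4
        refSum(y, z);                           // y = 8
        System.out.println(y[0]);               // prints "8.0"
    }
}
```

Without the `cudaEventRecord`/`cudaStreamWaitEvent` pair, the second kernel on the other stream could read its input before the first kernel finished writing it, so the asserted values would not be guaranteed.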
IN NO EVENT SHALL THE COPYRIGHT OWNER OR - * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, - * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, - * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR - * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY - * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - */ -package com.nvidia.grcuda.test; - -import static org.junit.Assert.assertEquals; -import org.graalvm.polyglot.Context; -import org.graalvm.polyglot.Value; -import org.junit.Test; -import com.nvidia.grcuda.gpu.LittleEndianNativeArrayView; -import com.nvidia.grcuda.gpu.OffheapMemory; - -public class DeviceArrayCopyFunctionTest { - - @Test - public void testDeviceArrayCopyFromOffheapMemory() { - final int numElements = 1000; - final int numBytesPerInt = 4; - final int numBytes = numElements * numBytesPerInt; - try (OffheapMemory hostMemory = new OffheapMemory(numBytes)) { - // create off-heap host memory of integers: [1, 2, 3, 4, ..., 1000] - LittleEndianNativeArrayView hostArray = hostMemory.getLittleEndianView(); - for (int i = 0; i < numElements; ++i) { - hostArray.setInt(i, i + 1); - } - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { - // create DeviceArray and copy content from off-heap host memory into it - Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); - Value deviceArray = createDeviceArray.execute("int", numElements); - deviceArray.invokeMember("copyFrom", hostMemory.getPointer(), numElements); - - // Verify content of device array - for (int i = 0; i < numElements; ++i) { - assertEquals(i + 1, deviceArray.getArrayElement(i).asInt()); - } - } - } - } - - @Test - public void testDeviceArrayCopyToOffheapMemory() { - final int numElements = 1000; - final int numBytesPerInt = 4; - final 
int numBytes = numElements * numBytesPerInt; - try (OffheapMemory hostMemory = new OffheapMemory(numBytes)) { - // create off-heap host memory of integers and initialize all elements to zero. - LittleEndianNativeArrayView hostArray = hostMemory.getLittleEndianView(); - for (int i = 0; i < numElements; ++i) { - hostArray.setInt(i, i); - } - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { - // create DeviceArray and set its content [1, 2, 3, 4, ..., 1000] - Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); - Value deviceArray = createDeviceArray.execute("int", numElements); - for (int i = 0; i < numElements; ++i) { - deviceArray.setArrayElement(i, i + 1); - } - // copy content of device array to off-heap host memory - deviceArray.invokeMember("copyTo", hostMemory.getPointer(), numElements); - - // Verify content of device array - for (int i = 0; i < numElements; ++i) { - assertEquals(i + 1, hostArray.getInt(i)); - } - } - } - } - - @Test - public void testDeviceArrayCopyFromDeviceArray() { - final int numElements = 1000; - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { - Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); - // create device array initialize its elements. - Value sourceDeviceArray = createDeviceArray.execute("int", numElements); - for (int i = 0; i < numElements; ++i) { - sourceDeviceArray.setArrayElement(i, i + 1); - } - // create destination device array initialize its elements to zero. 
- Value destinationDeviceArray = createDeviceArray.execute("int", numElements); - for (int i = 0; i < numElements; ++i) { - destinationDeviceArray.setArrayElement(i, 0); - } - destinationDeviceArray.invokeMember("copyFrom", sourceDeviceArray, numElements); - // Verify content of device array - for (int i = 0; i < numElements; ++i) { - assertEquals(i + 1, destinationDeviceArray.getArrayElement(i).asInt()); - } - } - } - - @Test - public void testDeviceArrayCopyToDeviceArray() { - final int numElements = 1000; - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { - Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); - // create device array initialize its elements. - Value sourceDeviceArray = createDeviceArray.execute("int", numElements); - for (int i = 0; i < numElements; ++i) { - sourceDeviceArray.setArrayElement(i, i + 1); - } - // create destination device array initialize its elements to zero. - Value destinationDeviceArray = createDeviceArray.execute("int", numElements); - for (int i = 0; i < numElements; ++i) { - destinationDeviceArray.setArrayElement(i, 0); - } - sourceDeviceArray.invokeMember("copyTo", destinationDeviceArray, numElements); - // Verify content of device array - for (int i = 0; i < numElements; ++i) { - assertEquals(i + 1, destinationDeviceArray.getArrayElement(i).asInt()); - } - } - } -} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceTest.java index 10a6174b..0b7b82e6 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -30,6 +37,8 @@ import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertFalse; import static org.junit.Assert.assertTrue; + +import com.nvidia.grcuda.test.util.GrCUDATestUtil; import org.graalvm.polyglot.Context; import org.graalvm.polyglot.Value; import org.junit.Test; @@ -38,7 +47,7 @@ public class DeviceTest { @Test public void testDeviceCount() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { Value deviceCount = ctx.eval("grcuda", "cudaGetDeviceCount()"); assertTrue(deviceCount.isNumber()); assertTrue(deviceCount.asInt() > 0); @@ -47,7 +56,7 @@ public void testDeviceCount() { @Test public void testGetDevicesLengthsMatchesDeviceCount() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { Value deviceCount = ctx.eval("grcuda", "cudaGetDeviceCount()"); assertTrue(deviceCount.isNumber()); assertTrue(deviceCount.asInt() > 0); @@ -58,7 +67,7 @@ public void 
testGetDevicesLengthsMatchesDeviceCount() { @Test public void testGetDevicesMatchesAllGetDevice() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { Value devices = ctx.eval("grcuda", "getdevices()"); Value getDevice = ctx.eval("grcuda", "getdevice"); for (int i = 0; i < devices.getArraySize(); ++i) { @@ -72,7 +81,7 @@ public void testGetDevicesMatchesAllGetDevice() { @Test public void testCanReadSomeDeviceProperties() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { Value devices = ctx.eval("grcuda", "getdevices()"); for (int i = 0; i < devices.getArraySize(); ++i) { Value device = devices.getArrayElement(i); @@ -95,7 +104,7 @@ public void testCanReadSomeDeviceProperties() { @Test public void testCanSelectDevice() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { Value devices = ctx.eval("grcuda", "getdevices()"); if (devices.getArraySize() > 1) { Value firstDevice = devices.getArrayElement(0); @@ -118,7 +127,7 @@ public void testCanSelectDevice() { @Test public void testDeviceMemoryAllocationReducesReportedFreeMemory() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { Value device = ctx.eval("grcuda", "getdevice(0)"); Value props = device.getMember("properties"); device.invokeMember("setCurrent"); diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/GrCUDAOptionMapTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/GrCUDAOptionMapTest.java new file mode 100644 index 00000000..4625221f --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/GrCUDAOptionMapTest.java @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di 
Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.GrCUDAOptions; +import com.nvidia.grcuda.cudalibraries.cublas.CUBLASRegistry; +import com.nvidia.grcuda.cudalibraries.cuml.CUMLRegistry; +import com.nvidia.grcuda.cudalibraries.tensorrt.TensorRTRegistry; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionPolicyEnum; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import com.nvidia.grcuda.test.util.mock.OptionValuesMock; +import com.oracle.truffle.api.interop.InvalidArrayIndexException; +import com.oracle.truffle.api.interop.StopIterationException; +import com.oracle.truffle.api.interop.UnknownKeyException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import org.graalvm.options.OptionKey; +import org.graalvm.options.OptionValues; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; + +import java.util.HashMap; + +public class GrCUDAOptionMapTest { + + private GrCUDAOptionMap optionMap; + private OptionValues optionValues; + public void initializeDefault() { + optionValues = new OptionValuesMock(); + setOption(GrCUDAOptions.CuBLASEnabled, true); + setOption(GrCUDAOptions.CuMLEnabled, true); + setOption(GrCUDAOptions.ForceStreamAttach, GrCUDAOptionMap.DEFAULT_FORCE_STREAM_ATTACH); + setOption(GrCUDAOptions.InputPrefetch, false); + setOption(GrCUDAOptions.TensorRTEnabled, false); + setOption(GrCUDAOptions.CuBLASLibrary, CUBLASRegistry.DEFAULT_LIBRARY); + setOption(GrCUDAOptions.CuMLLibrary, CUMLRegistry.DEFAULT_LIBRARY); + setOption(GrCUDAOptions.ExecutionPolicy, GrCUDAOptionMap.DEFAULT_EXECUTION_POLICY.toString()); + setOption(GrCUDAOptions.DependencyPolicy, GrCUDAOptionMap.DEFAULT_DEPENDENCY_POLICY.toString()); + setOption(GrCUDAOptions.RetrieveNewStreamPolicy, 
GrCUDAOptionMap.DEFAULT_RETRIEVE_STREAM_POLICY.toString()); + setOption(GrCUDAOptions.RetrieveParentStreamPolicy, GrCUDAOptionMap.DEFAULT_PARENT_STREAM_POLICY.toString()); + setOption(GrCUDAOptions.TensorRTLibrary, TensorRTRegistry.DEFAULT_LIBRARY); + optionMap = new GrCUDAOptionMap(optionValues); + } + + public void initializeNull() { + optionValues = new OptionValuesMock(); + setOption(GrCUDAOptions.ExecutionPolicy, GrCUDAOptionMap.DEFAULT_EXECUTION_POLICY.toString()); + setOption(GrCUDAOptions.DependencyPolicy, GrCUDAOptionMap.DEFAULT_DEPENDENCY_POLICY.toString()); + setOption(GrCUDAOptions.RetrieveNewStreamPolicy, GrCUDAOptionMap.DEFAULT_RETRIEVE_STREAM_POLICY.toString()); + setOption(GrCUDAOptions.RetrieveParentStreamPolicy, GrCUDAOptionMap.DEFAULT_PARENT_STREAM_POLICY.toString()); + setOption(GrCUDAOptions.TensorRTLibrary, null); + optionMap = new GrCUDAOptionMap(optionValues); + } + + private <T> void setOption(OptionKey<T> key, T value) { + optionValues.set(key, value); + } + + @Test + public void testGetOption() { + initializeDefault(); + assertEquals(optionMap.isCuBLASEnabled(), true); + assertEquals(optionMap.isForceStreamAttach(), false); + assertEquals(optionMap.getCuBLASLibrary(), CUBLASRegistry.DEFAULT_LIBRARY); + assertEquals(optionMap.getDependencyPolicy(), GrCUDAOptionMap.DEFAULT_DEPENDENCY_POLICY); + } + + @Test(expected = UnknownKeyException.class) + public void testReadUnknownKey() throws UnsupportedMessageException, UnknownKeyException { + initializeDefault(); + optionMap.readHashValue("NotPresent"); + } + + @Test(expected = UnsupportedMessageException.class) + public void testReadUnsupportedMessage() throws UnsupportedMessageException, UnknownKeyException { + initializeDefault(); + optionMap.readHashValue(null); + } + + @Test + public void testGetHashEntriesIterator() { + initializeDefault(); + GrCUDAOptionMap.EntriesIterator hashIterator = (GrCUDAOptionMap.EntriesIterator) optionMap.getHashEntriesIterator(); + optionMap.getOptions().forEach((key, 
value) -> { + assertTrue(hashIterator.hasIteratorNextElement()); + try { + GrCUDAOptionMap.GrCUDAOptionTuple elem = hashIterator.getIteratorNextElement(); + assertEquals(key, elem.readArrayElement(0)); + assertEquals(value.toString(), elem.readArrayElement(1)); + } catch (StopIterationException | InvalidArrayIndexException e) { + e.printStackTrace(); + } + }); + } + + @Test(expected = StopIterationException.class) + public void testGetStopIteration() throws StopIterationException { + initializeDefault(); + GrCUDAOptionMap.EntriesIterator hashIterator = (GrCUDAOptionMap.EntriesIterator) optionMap.getHashEntriesIterator(); + do { + try { + hashIterator.getIteratorNextElement(); + } catch (StopIterationException e) { + e.printStackTrace(); + } + } while (hashIterator.hasIteratorNextElement()); + hashIterator.getIteratorNextElement(); + } + + @Test(expected = NullPointerException.class) + public void testGetNullPointerExceptionWhenRetrievingValue() throws NullPointerException, StopIterationException { + initializeNull(); + GrCUDAOptionMap.EntriesIterator hashIterator = (GrCUDAOptionMap.EntriesIterator) optionMap.getHashEntriesIterator(); + do { + hashIterator.getIteratorNextElement(); + } while (hashIterator.hasIteratorNextElement()); + } + + @Test(expected = InvalidArrayIndexException.class) + public void testGetInvalidIndex() throws InvalidArrayIndexException { + initializeDefault(); + GrCUDAOptionMap.EntriesIterator hashIterator = (GrCUDAOptionMap.EntriesIterator) optionMap.getHashEntriesIterator(); + try { + GrCUDAOptionMap.GrCUDAOptionTuple elem = hashIterator.getIteratorNextElement(); + assertEquals(2, elem.getArraySize()); + assertFalse(elem.isArrayElementReadable(2)); + elem.readArrayElement(2); + } catch (StopIterationException e) { + e.printStackTrace(); + } + } + + @Test + public void testGetOptionsFunction() { + try (Context ctx = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", 
ExecutionPolicyEnum.ASYNC.toString()).option("grcuda.EnableComputationTimers", "true").build()) { + // Obtain the options map; + Value options = ctx.eval("grcuda", "getoptions").execute(); + // Check that we have a map; + assertTrue(options.hasHashEntries()); + + // Obtain some options; + assertEquals(ExecutionPolicyEnum.ASYNC.toString(), options.getHashValue("grcuda.ExecutionPolicy").asString()); + assertTrue(Boolean.parseBoolean(options.getHashValue("grcuda.EnableComputationTimers").asString())); + assertFalse(Boolean.parseBoolean(options.getHashValue("grcuda.ForceStreamAttach").asString())); + assertEquals(GrCUDAOptionMap.DEFAULT_NUMBER_OF_GPUs, Integer.valueOf(options.getHashValue("grcuda.NumberOfGPUs").asString())); + } + } + + @Test + public void testGetOptionsFunctionIterator() { + try (Context ctx = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", ExecutionPolicyEnum.ASYNC.toString()).build()) { + // Obtain the options map; + Value options = ctx.eval("grcuda", "getoptions").execute(); + // Get the iterator; + Value iterator = options.getHashEntriesIterator(); + int optionCount = 0; + // Check that we can find a specific option key and value; + String optionKeyToFind = "grcuda.ExecutionPolicy"; + String optionValueToFind = ExecutionPolicyEnum.ASYNC.toString(); + boolean optionFound = false; + while (iterator.hasIteratorNextElement()) { + Value option = iterator.getIteratorNextElement(); + assertEquals(2, option.getArraySize()); + if (option.getArrayElement(0).asString().equals(optionKeyToFind)) { + optionFound = true; + assertEquals(optionValueToFind, option.getArrayElement(1).asString()); + } + optionCount++; + } + assertTrue(optionFound); + assertTrue(iterator.isIterator()); + assertFalse(iterator.hasIteratorNextElement()); + assertEquals(options.getHashSize(), optionCount); + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/ManglingTest.java 
b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/ManglingTest.java index 809212e4..d7e947f6 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/ManglingTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/ManglingTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -80,7 +87,7 @@ public void testMangledNames() throws Exception { } sourceGen.extractMangledNames(stdout); - // Check grCUDA-generated mangled names with compiler output + // Check GrCUDA-generated mangled names with compiler output int idx = 0; for (Binding binding : bindings) { String expectedMangledName = sourceGen.getMangledName(idx); diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/ParserTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/ParserTest.java index c378cfda..9dab41d1 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/ParserTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/ParserTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUBLASTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUBLASTest.java new file mode 100644 index 00000000..1b60e359 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUBLASTest.java @@ -0,0 +1,298 @@ +/* + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.cudalibraries; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assume.assumeNoException; +import static org.junit.Assume.assumeTrue; + +import java.util.Arrays; +import java.util.Collection; +import java.util.function.Function; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionPolicyEnum; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.junit.Before; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +@RunWith(Parameterized.class) +public class CUBLASTest { + + @Parameterized.Parameters + public static Collection data() { + + return GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {ExecutionPolicyEnum.SYNC.toString(), ExecutionPolicyEnum.ASYNC.toString()}, + {true, false}, + {'S', 'D', 'C', 'Z'} + })); + } + + private final String policy; + private final boolean inputPrefetch; + private final char typeChar; + + public CUBLASTest(String policy, boolean inputPrefetch, char typeChar) { + this.policy = policy; + this.inputPrefetch = inputPrefetch; + this.typeChar = typeChar; + 
 }
+
+    /**
+     * Set to false if we discover that cuBLAS is not available.
+     */
+    private static boolean cuBLASAvailable = true;
+
+    @Before
+    public void skipIfcuBLASNotAvailable() {
+        assumeTrue(cuBLASAvailable);
+    }
+
+    /**
+     * BLAS Level-1 Test.
+     */
+    @Test
+    public void testTaxpy() {
+        // x = (0, 1, 2, ..., numElements-1)
+        // y = (0, 2, 4, ..., 2*(numElements-1))
+        // y := -1 * x + y
+        // y = (0, 1, 2, ..., numElements-1)
+        try (Context polyglot = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", this.policy).option("grcuda.InputPrefetch", String.valueOf(this.inputPrefetch)).allowAllAccess(
+                        true).build()) {
+            Value cu = polyglot.eval("grcuda", "CU");
+            boolean isComplex = (typeChar == 'C') || (typeChar == 'Z');
+            String cudaType = ((typeChar == 'D') || (typeChar == 'Z')) ? "double" : "float";
+            int numDim = 1000;
+            int numElements = isComplex ? numDim * 2 : numDim;
+            Value alpha = cu.invokeMember("DeviceArray", cudaType, isComplex ? 2 : 1);
+            alpha.setArrayElement(0, -1);
+            if (isComplex) {
+                alpha.setArrayElement(1, 0);
+            }
+            Value x = cu.invokeMember("DeviceArray", cudaType, numElements);
+            Value y = cu.invokeMember("DeviceArray", cudaType, numElements);
+            assertEquals(numElements, x.getArraySize());
+            assertEquals(numElements, y.getArraySize());
+
+            for (int i = 0; i < numElements; ++i) {
+                x.setArrayElement(i, i);
+                y.setArrayElement(i, 2 * i);
+            }
+            Value taxpy = polyglot.eval("grcuda", "BLAS::cublas" + typeChar + "axpy");
+            taxpy.execute(numDim, alpha, x, 1, y, 1);
+            assertOutputVectorIsCorrect(numElements, y, (Integer i) -> i);
+        } catch (Exception e) {
+            System.out.println("warning: cuBLAS not enabled, skipping test");
+            cuBLASAvailable = false;
+            assumeNoException(e);
+        }
+    }
+
+    /**
+     * BLAS Level-2 Test.
+ */ + @Test + public void testTgemv() { + try (Context polyglot = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", this.policy).option("grcuda.InputPrefetch", String.valueOf(this.inputPrefetch)).allowAllAccess( + true).build()) { + Value cu = polyglot.eval("grcuda", "CU"); + int numDim = 10; + boolean isComplex = (typeChar == 'C') || (typeChar == 'Z'); + String cudaType = ((typeChar == 'D') || (typeChar == 'Z')) ? "double" : "float"; + int numElements = isComplex ? numDim * 2 : numDim; + Value alpha = cu.invokeMember("DeviceArray", cudaType, isComplex ? 2 : 1); + Value beta = cu.invokeMember("DeviceArray", cudaType, isComplex ? 2 : 1); + alpha.setArrayElement(0, -1); + beta.setArrayElement(0, 2); + if (isComplex) { + alpha.setArrayElement(1, 0); + beta.setArrayElement(1, 0); + } + + // complex types require two elements along 1st dimension (since column-major order) + Value matrixA = cu.invokeMember("DeviceArray", cudaType, numElements, numDim, "F"); + Value x = cu.invokeMember("DeviceArray", cudaType, numElements); + Value y = cu.invokeMember("DeviceArray", cudaType, numElements); + + // set matrix + // A: identity matrix + for (int j = 0; j < numDim; j++) { + for (int i = 0; i < numElements; i++) { + // complex types require two elements along 1st dimension (since column-major + // order) + Value row = matrixA.getArrayElement(i); + row.setArrayElement(j, ((!isComplex & (i == j)) || (isComplex && (i == (2 * j)))) ? 
1.0 : 0.0); + } + } + + // set vectors + // x = (1, 2, ..., numDim) + // y = (1, 2, ..., numDim) + for (int i = 0; i < numElements; i++) { + x.setArrayElement(i, i); + y.setArrayElement(i, i); + } + Value tgemv = polyglot.eval("grcuda", "BLAS::cublas" + typeChar + "gemv"); + final int cublasOpN = 0; + tgemv.execute(cublasOpN, numDim, numDim, + alpha, + matrixA, numDim, + x, 1, + beta, + y, 1); + assertOutputVectorIsCorrect(numElements, y, (Integer i) -> i); + } catch (Exception e) { + System.out.println("warning: cuBLAS not enabled, skipping test"); + cuBLASAvailable = false; + assumeNoException(e); + } + } + + /** + * BLAS Level-3 Test. + */ + @Test + public void testTgemm() { + try (Context polyglot = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", this.policy).option("grcuda.InputPrefetch", String.valueOf(this.inputPrefetch)).allowAllAccess( + true).build()) { + Value cu = polyglot.eval("grcuda", "CU"); + int numDim = 10; + boolean isComplex = (typeChar == 'C') || (typeChar == 'Z'); + String cudaType = ((typeChar == 'D') || (typeChar == 'Z')) ? "double" : "float"; + int numElements = isComplex ? numDim * 2 : numDim; + Value alpha = cu.invokeMember("DeviceArray", cudaType, isComplex ? 2 : 1); + Value beta = cu.invokeMember("DeviceArray", cudaType, isComplex ? 
2 : 1); + alpha.setArrayElement(0, -1); + beta.setArrayElement(0, 2); + if (isComplex) { + alpha.setArrayElement(1, 0); + beta.setArrayElement(1, 0); + } + + // complex types require two elements along 1st dimension (since column-major order) + Value matrixA = cu.invokeMember("DeviceArray", cudaType, numElements, numDim, "F"); + Value matrixB = cu.invokeMember("DeviceArray", cudaType, numElements, numDim, "F"); + Value matrixC = cu.invokeMember("DeviceArray", cudaType, numElements, numDim, "F"); + + // set matrix + // A: identity matrix + for (int j = 0; j < numDim; j++) { + for (int i = 0; i < numElements; i++) { + // complex types require two elements along 1st dimension (since column-major + // order) + Value row = matrixA.getArrayElement(i); + row.setArrayElement(j, ((!isComplex & (i == j)) || (isComplex && (i == (2 * j)))) ? 1.0 : 0.0); + } + } + // B == C + for (int j = 0; j < numDim; j++) { + for (int i = 0; i < numElements; i++) { + Value row = matrixB.getArrayElement(i); + row.setArrayElement(j, i + numElements * j); + } + } + for (int j = 0; j < numDim; j++) { + for (int i = 0; i < numElements; i++) { + Value row = matrixC.getArrayElement(i); + row.setArrayElement(j, i + numElements * j); + } + } + Value tgemm = polyglot.eval("grcuda", "BLAS::cublas" + typeChar + "gemm"); + final int cublasOpN = 0; + tgemm.execute(cublasOpN, cublasOpN, numDim, numDim, numDim, + alpha, + matrixA, numDim, + matrixB, numDim, + beta, + matrixC, numDim); + assertOutputMatrixIsCorrect(numDim, numElements, matrixC, (Integer i) -> i); + } catch (Exception e) { + System.out.println("warning: cuBLAS not enabled, skipping test"); + cuBLASAvailable = false; + assumeNoException(e); + } + } + + /** + * Validation function for vectors. 
+     */
+    public static void assertOutputVectorIsCorrect(int len, Value deviceArray,
+                    Function<Integer, Integer> outFunc, char typeChar) {
+        boolean hasDouble = (typeChar == 'D') || (typeChar == 'Z');
+        for (int i = 0; i < len; i++) {
+            if (hasDouble) {
+                double expected = outFunc.apply(i);
+                double actual = deviceArray.getArrayElement(i).asDouble();
+                assertEquals(expected, actual, 1e-5);
+            } else {
+                float expected = outFunc.apply(i);
+                float actual = deviceArray.getArrayElement(i).asFloat();
+                assertEquals(expected, actual, 1e-5f);
+            }
+        }
+    }
+
+    void assertOutputVectorIsCorrect(int len, Value deviceArray,
+                    Function<Integer, Integer> outFunc) {
+        CUBLASTest.assertOutputVectorIsCorrect(len, deviceArray, outFunc, this.typeChar);
+    }
+
+    /**
+     * Validation function for matrix.
+     */
+    private void assertOutputMatrixIsCorrect(int numDim, int numElements, Value matrix,
+                    Function<Integer, Integer> outFunc) {
+        boolean hasDouble = (typeChar == 'D') || (typeChar == 'Z');
+        for (int j = 0; j < numDim; j++) {
+            for (int i = 0; i < numElements; i++) {
+                int idx = i + numElements * j;
+                if (hasDouble) {
+                    double expected = outFunc.apply(idx);
+                    double actual = matrix.getArrayElement(i).getArrayElement(j).asDouble();
+                    assertEquals(expected, actual, 1e-5);
+                } else {
+                    float expected = outFunc.apply(idx);
+                    float actual = matrix.getArrayElement(i).getArrayElement(j).asFloat();
+                    assertEquals(expected, actual, 1e-5f);
+                }
+            }
+        }
+    }
+}
diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUBLASWithScheduleTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUBLASWithScheduleTest.java
new file mode 100644
index 00000000..14728239
--- /dev/null
+++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUBLASWithScheduleTest.java
@@ -0,0 +1,486 @@
+/*
+ * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+package com.nvidia.grcuda.test.cudalibraries;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assume.assumeNoException;
+import static org.junit.Assume.assumeTrue;
+
+import java.util.Collection;
+import java.util.function.Function;
+
+import com.nvidia.grcuda.test.util.GrCUDATestOptionsStruct;
+import com.nvidia.grcuda.test.util.GrCUDATestUtil;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+import org.graalvm.polyglot.Context;
+import org.graalvm.polyglot.Value;
+
+@RunWith(Parameterized.class)
+public class CUBLASWithScheduleTest {
+
+    private static final int NUM_THREADS_PER_BLOCK = 32;
+
+    private static final String SCALE_KERNEL = "extern \"C\" __global__ void scale(double* y, const double* x, double alpha, int n) {\n" +
+                    "    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" +
+                    "    if (idx < n) {\n" +
+                    "        y[idx] = alpha * x[idx];\n" +
+                    "    }" +
+                    "}\n";
+
+    @Parameterized.Parameters
+    public static Collection<Object[]> data() {
+        return GrCUDATestUtil.getAllOptionCombinationsSingleGPU();
+    }
+
+    private final GrCUDATestOptionsStruct options;
+    private final char typeChar = 'D';
+
+    /**
+     * Set to false if we discover that cuBLAS is not available.
+     */
+    private static boolean cuBLASAvailable = true;
+
+    @Before
+    public void skipIfcuBLASNotAvailable() {
+        assumeTrue(cuBLASAvailable);
+    }
+
+    public CUBLASWithScheduleTest(GrCUDATestOptionsStruct options) {
+        this.options = options;
+    }
+
+    /**
+     * Test 2 independent kernels followed by a BLAS kernel;
+     * A ---> C
+     * B --/
+     */
+    @Test
+    public void testTaxpyJoinPattern() {
+        // x = (0, 1, 2, ..., numElements-1)
+        // y = (0, 2, 4, ..., 2*(numElements-1))
+        // z = (0, 0, 0, ..., 0)
+        // z := 2 * x
+        // y := 2 * y
+        // z := -1 * y + z
+        // z = -2 * (0, 1, 2, ..., numElements-1)
+        try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) {
+            String cudaType = "double";
+            int numElements = 1000;
+            final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK;
+
+            Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray");
+            Value alpha = deviceArrayConstructor.execute(cudaType, 1);
+            alpha.setArrayElement(0, -1);
+
+            // Create some arrays;
+            Value x = deviceArrayConstructor.execute(cudaType, numElements);
+            Value y = deviceArrayConstructor.execute(cudaType, numElements);
+            Value z = deviceArrayConstructor.execute(cudaType, numElements);
+            assertEquals(numElements, x.getArraySize());
+            assertEquals(numElements, y.getArraySize());
+            assertEquals(numElements, z.getArraySize());
+
+            for (int i = 0; i < numElements; ++i) {
+                x.setArrayElement(i, i);
+                y.setArrayElement(i, 2 * i);
+                z.setArrayElement(i, 0);
+            }
+
+            // Define kernels;
+            Value taxpy = context.eval("grcuda", "BLAS::cublas" + typeChar + "axpy");
+            Value buildkernel = context.eval("grcuda", "buildkernel");
+            Value scaleKernel = buildkernel.execute(SCALE_KERNEL, "scale", "pointer, const pointer, double, sint32");
+            Value configuredScaleKernel = scaleKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK);
+
+            // Perform the computation;
+            configuredScaleKernel.execute(z, x, 2, numElements);
+            configuredScaleKernel.execute(y, y, 2, numElements);
+            taxpy.execute(numElements, alpha, y, 1, z, 1);
+
+            assertOutputVectorIsCorrect(numElements, z, (Integer i) -> -2 * i);
+        } catch (Exception e) {
+            System.out.println("warning: cuBLAS not enabled, skipping test");
+            cuBLASAvailable = false;
+            assumeNoException(e);
+        }
+    }
+
+    /**
+     * Test a BLAS kernel followed by 2 independent kernels;
+     * A--->B
+     *  \-->C
+     */
+    @Test
+    public void testTaxpyForkPattern() {
+        // x = (0, 1, 2, ..., numElements-1)
+        // y = (0, 2, 4, ..., 2*(numElements-1))
+        // z = (0, 0, 0, ..., 0)
+        // y := -1 * x + y
+        // y = (0, 1, 2, ..., numElements-1)
+        // z := 2 * x
+        // y := 2 * y
+        try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) {
+            String cudaType = "double";
+            int numElements = 1000;
+            final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK;
+
+            Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray");
+            Value alpha = deviceArrayConstructor.execute(cudaType, 1);
+            alpha.setArrayElement(0, -1);
+
+            // Create some arrays;
+            Value x = deviceArrayConstructor.execute(cudaType, numElements);
+            Value y = deviceArrayConstructor.execute(cudaType, numElements);
+            Value z = deviceArrayConstructor.execute(cudaType, numElements);
+            assertEquals(numElements, x.getArraySize());
+            assertEquals(numElements, y.getArraySize());
+            assertEquals(numElements, z.getArraySize());
+
+            for (int i = 0; i < numElements; ++i) {
+                x.setArrayElement(i, i);
+                y.setArrayElement(i, 2 * i);
+                z.setArrayElement(i, 0);
+            }
+
+            // Define kernels;
+            Value taxpy = context.eval("grcuda", "BLAS::cublas" + typeChar + "axpy");
+            Value buildkernel = context.eval("grcuda", "buildkernel");
+            Value scaleKernel = buildkernel.execute(SCALE_KERNEL, "scale", "pointer, const pointer, double, sint32");
+            Value configuredScaleKernel = scaleKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK);
+
+            // Perform the computation;
+            taxpy.execute(numElements, alpha, x, 1, y, 1);
+            configuredScaleKernel.execute(z, x, 2, numElements);
+            configuredScaleKernel.execute(y, y, 2, numElements);
+
+            assertOutputVectorIsCorrect(numElements, z, (Integer i) -> 2 * i);
+            assertOutputVectorIsCorrect(numElements, y, (Integer i) -> 2 * i);
+        } catch (Exception e) {
+            System.out.println("warning: cuBLAS not enabled, skipping test");
+            cuBLASAvailable = false;
+            assumeNoException(e);
+        }
+    }
+
+    /**
+     * Test 2 independent kernels followed by a BLAS kernel followed by 2 independent kernels;
+     * A--->C--->D
+     * B---/ \-->E
+     */
+    @Test
+    public void testTaxpyJoinForkPattern() {
+        // x = (0, 1, 2, ..., numElements-1)
+        // y = (0, 2, 4, ..., 2*(numElements-1))
+        // z = (0, 0, 0, ..., 0)
+        // z := 2 * x
+        // y := 2 * y
+        // z := -1 * y + z
+        // z := -1 * z
+        // y := 0.5 * y
+        try
(Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + String cudaType = "double"; + int numElements = 1000; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value alpha = deviceArrayConstructor.execute(cudaType, 1); + alpha.setArrayElement(0, -1); + + // Create some arrays; + Value x = deviceArrayConstructor.execute(cudaType, numElements); + Value y = deviceArrayConstructor.execute(cudaType, numElements); + Value z = deviceArrayConstructor.execute(cudaType, numElements); + assertEquals(numElements, x.getArraySize()); + assertEquals(numElements, y.getArraySize()); + assertEquals(numElements, z.getArraySize()); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, i); + y.setArrayElement(i, 2 * i); + z.setArrayElement(i, 0); + } + + // Define kernels; + Value taxpy = context.eval("grcuda", "BLAS::cublas" + typeChar + "axpy"); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value scaleKernel = buildkernel.execute(SCALE_KERNEL, "scale", "pointer, const pointer, double, sint32"); + Value configuredScaleKernel = scaleKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredScaleKernel.execute(z, x, 2, numElements); // z = 0, 2, 4, ... + configuredScaleKernel.execute(y, y, 2, numElements); // y = 0, 4, 8, ... + taxpy.execute(numElements, alpha, y, 1, z, 1); // z = 0, -2, -4, ... + configuredScaleKernel.execute(z, z, -1, numElements); // z = 0, 2, 4, ... + configuredScaleKernel.execute(y, y, 0.5, numElements); // y = 0, 2, 4, ... 
+
+            assertOutputVectorIsCorrect(numElements, z, (Integer i) -> 2 * i);
+            assertOutputVectorIsCorrect(numElements, y, (Integer i) -> 2 * i);
+        } catch (Exception e) {
+            System.out.println("warning: cuBLAS not enabled, skipping test");
+            cuBLASAvailable = false;
+            assumeNoException(e);
+        }
+    }
+
+    /**
+     * Test a BLAS kernel followed by 2 independent kernels followed by a BLAS kernel;
+     * A--->B--->D
+     *  \-->C---/
+     */
+    @Test
+    public void testTaxpyForkJoinPattern() {
+        // x = (0, 1, 2, ..., numElements-1)
+        // y = (0, 2, 4, ..., 2*(numElements-1))
+        // z = (0, 0, 0, ..., 0)
+        // y := -1 * x + y
+        // y = (0, 1, 2, ..., numElements-1)
+        // z := 2 * x
+        // y := 4 * y
+        // y := -1 * z + y
+        try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) {
+            String cudaType = "double";
+            int numElements = 1000;
+            final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK;
+
+            Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray");
+            Value alpha = deviceArrayConstructor.execute(cudaType, 1);
+            alpha.setArrayElement(0, -1);
+
+            // Create some arrays;
+            Value x = deviceArrayConstructor.execute(cudaType, numElements);
+            Value y = deviceArrayConstructor.execute(cudaType, numElements);
+            Value z = deviceArrayConstructor.execute(cudaType, numElements);
+            assertEquals(numElements, x.getArraySize());
+            assertEquals(numElements, y.getArraySize());
+            assertEquals(numElements, z.getArraySize());
+
+            for (int i = 0; i < numElements; ++i) {
+                x.setArrayElement(i, i);
+                y.setArrayElement(i, 2 * i);
+                z.setArrayElement(i, 0);
+            }
+
+            // Define kernels;
+            Value taxpy = context.eval("grcuda", "BLAS::cublas" + typeChar + "axpy");
+            Value buildkernel = context.eval("grcuda", "buildkernel");
+            Value scaleKernel = buildkernel.execute(SCALE_KERNEL, "scale", "pointer, const pointer, double, sint32");
+            Value configuredScaleKernel = scaleKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK);
+
+            // Perform the computation;
+            taxpy.execute(numElements, alpha, x, 1, y, 1);
+            configuredScaleKernel.execute(z, x, 2, numElements);
+            configuredScaleKernel.execute(y, y, 4, numElements);
+            taxpy.execute(numElements, alpha, z, 1, y, 1);
+
+            assertOutputVectorIsCorrect(numElements, z, (Integer i) -> 2 * i);
+        } catch (Exception e) {
+            System.out.println("warning: cuBLAS not enabled, skipping test");
+            cuBLASAvailable = false;
+            assumeNoException(e);
+        }
+    }
+
+    /**
+     * Test a BLAS kernel followed by 2 independent kernels followed by a BLAS kernel,
+     * plus an independent chain of 2 extra kernels;
+     *      /-->E-->F
+     * A--->B--->D
+     *  \-->C---/
+     */
+    @Test
+    public void testTaxpyIndependentCompPattern() {
+        // x = (0, 1, 2, ..., numElements-1)
+        // y = (0, 2, 4, ..., 2*(numElements-1))
+        // z = (0, 0, 0, ..., 0)
+        // y := -1 * x + y
+        // y = (0, 1, 2, ..., numElements-1)
+        // z := 2 * x
+        // y := 4 * y
+        // x := 2 * x
+        // x := 2 * x
+        // y := -1 * z + y
+        try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) {
+            String cudaType = "double";
+            int numElements = 1000;
+            final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK;
+
+            Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray");
+            Value alpha = deviceArrayConstructor.execute(cudaType, 1);
+            alpha.setArrayElement(0, -1);
+
+            // Create some arrays;
+            Value x = deviceArrayConstructor.execute(cudaType, numElements);
+            Value y = deviceArrayConstructor.execute(cudaType, numElements);
+            Value z = deviceArrayConstructor.execute(cudaType, numElements);
+            assertEquals(numElements, x.getArraySize());
+            assertEquals(numElements, y.getArraySize());
+            assertEquals(numElements, z.getArraySize());
+
+            for (int i = 0; i < numElements; ++i) {
+                x.setArrayElement(i, i);
+                y.setArrayElement(i, 2 * i);
+                z.setArrayElement(i, 0);
+            }
+
+            // Define kernels;
+            Value taxpy = context.eval("grcuda", "BLAS::cublas" + typeChar + "axpy");
+            Value buildkernel = context.eval("grcuda", "buildkernel");
+            Value scaleKernel = buildkernel.execute(SCALE_KERNEL,
"scale", "pointer, const pointer, double, sint32"); + Value configuredScaleKernel = scaleKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + taxpy.execute(numElements, alpha, x, 1, y, 1); + configuredScaleKernel.execute(z, x, 2, numElements); + configuredScaleKernel.execute(y, y, 4, numElements); + configuredScaleKernel.execute(x, x, 2, numElements); + configuredScaleKernel.execute(x, x, 2, numElements); + taxpy.execute(numElements, alpha, z, 1, y, 1); + + assertOutputVectorIsCorrect(numElements, z, (Integer i) -> 2 * i); + assertOutputVectorIsCorrect(numElements, x, (Integer i) -> 4 * i); + } catch (Exception e) { + System.out.println("warning: cuBLAS not enabled, skipping test"); + cuBLASAvailable = false; + assumeNoException(e); + } + } + + /** + * BLAS Level-3 Test, doing 2 matrix computations on independent data, and syncing them + * afterwards with an axpy kernel; + */ + @Test + public void testGemmScheduling() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + Value cu = context.eval("grcuda", "CU"); + int numDim = 100; + String cudaType = "double"; + Value alpha = cu.invokeMember("DeviceArray", cudaType, 1); + Value beta = cu.invokeMember("DeviceArray", cudaType, 1); + Value alpha2 = cu.invokeMember("DeviceArray", cudaType, 1); + Value beta2 = cu.invokeMember("DeviceArray", cudaType, 1); + Value alpha3 = cu.invokeMember("DeviceArray", cudaType, 1); + alpha.setArrayElement(0, -1); + beta.setArrayElement(0, 2); + alpha2.setArrayElement(0, -1); + beta2.setArrayElement(0, 2); + alpha3.setArrayElement(0, -2); + + Value matrixA = cu.invokeMember("DeviceArray", cudaType, numDim, numDim, "F"); + Value matrixB = cu.invokeMember("DeviceArray", cudaType, numDim, numDim, "F"); + Value matrixC = cu.invokeMember("DeviceArray", cudaType, numDim, numDim, "F"); + Value matrixE = cu.invokeMember("DeviceArray", cudaType, numDim, numDim, "F"); + Value matrixF = cu.invokeMember("DeviceArray", cudaType, numDim, 
numDim, "F");
+            Value matrixG = cu.invokeMember("DeviceArray", cudaType, numDim, numDim, "F");
+
+            // Initialize matrices
+            // A, E: identity matrix
+            for (int j = 0; j < numDim; j++) {
+                for (int i = 0; i < numDim; i++) {
+                    Value row = matrixA.getArrayElement(i);
+                    row.setArrayElement(j, (i == j) ? 1.0 : 0.0);
+                    row = matrixE.getArrayElement(i);
+                    row.setArrayElement(j, (i == j) ? 1.0 : 0.0);
+                }
+            }
+            // B == C == F == G
+            for (int j = 0; j < numDim; j++) {
+                for (int i = 0; i < numDim; i++) {
+                    Value row = matrixB.getArrayElement(i);
+                    row.setArrayElement(j, i + numDim * j);
+                    row = matrixC.getArrayElement(i);
+                    row.setArrayElement(j, i + numDim * j);
+                    row = matrixF.getArrayElement(i);
+                    row.setArrayElement(j, i + numDim * j);
+                    row = matrixG.getArrayElement(i);
+                    row.setArrayElement(j, i + numDim * j);
+                }
+            }
+            Value tgemm = context.eval("grcuda", "BLAS::cublas" + typeChar + "gemm");
+            Value taxpy = context.eval("grcuda", "BLAS::cublas" + typeChar + "axpy");
+            final int cublasOpN = 0;
+            // Schedule 2 independent GEMMs;
+            tgemm.execute(cublasOpN, cublasOpN, numDim, numDim, numDim,
+                            alpha,
+                            matrixA, numDim,
+                            matrixB, numDim,
+                            beta,
+                            matrixC, numDim);
+            tgemm.execute(cublasOpN, cublasOpN, numDim, numDim, numDim,
+                            alpha2,
+                            matrixE, numDim,
+                            matrixF, numDim,
+                            beta2,
+                            matrixG, numDim);
+            // Schedule 1 axpy that joins the 2 GEMMs;
+            taxpy.execute(numDim * numDim, alpha3, matrixC, 1, matrixG, 1);
+            assertOutputMatrixIsCorrect(numDim, numDim, matrixC, (Integer i) -> i);
+            assertOutputMatrixIsCorrect(numDim, numDim, matrixG, (Integer i) -> -i);
+        } catch (Exception e) {
+            System.out.println("warning: cuBLAS not enabled, skipping test");
+            cuBLASAvailable = false;
+            assumeNoException(e);
+        }
+    }
+
+    /**
+     * Validation function for vectors.
+     */
+    private void assertOutputVectorIsCorrect(int len, Value deviceArray,
+                    Function<Integer, Integer> outFunc) {
+        for (int i = 0; i < len; i++) {
+            double expected = outFunc.apply(i);
+            double actual = deviceArray.getArrayElement(i).asDouble();
+            assertEquals(expected, actual, 1e-5);
+        }
+    }
+
+    /**
+     * Validation function for matrix.
+     */
+    private void assertOutputMatrixIsCorrect(int numDim, int numElements, Value matrix,
+                    Function<Integer, Integer> outFunc) {
+        for (int j = 0; j < numDim; j++) {
+            for (int i = 0; i < numElements; i++) {
+                int idx = i + numElements * j;
+                double expected = outFunc.apply(idx);
+                double actual = matrix.getArrayElement(i).getArrayElement(j).asDouble();
+                assertEquals(expected, actual, 1e-5);
+            }
+        }
+    }
+}
diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUMLTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUMLTest.java
new file mode 100644
index 00000000..8a61cb16
--- /dev/null
+++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUMLTest.java
@@ -0,0 +1,118 @@
+/*
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NECSTLab nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.cudalibraries; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionPolicyEnum; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Before; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Arrays; +import java.util.Collection; + +import static org.junit.Assume.assumeNoException; +import static org.junit.Assume.assumeTrue; + +@RunWith(Parameterized.class) +public class CUMLTest { + + @Parameterized.Parameters + public static Collection data() { + + return GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {ExecutionPolicyEnum.SYNC.toString(), ExecutionPolicyEnum.ASYNC.toString()}, + {true, false}, + {'S', 'D'} + })); + } + + private final String policy; + private final boolean inputPrefetch; + private final char typeChar; + + public CUMLTest(String policy, boolean inputPrefetch, 
char typeChar) { + this.policy = policy; + this.inputPrefetch = inputPrefetch; + this.typeChar = typeChar; + } + + /** + * Set to false if we discover that cuML is not available; + */ + private static boolean cuMLAvailable = true; + + @Before + public void skipIfcuMLNotAvailable() { + assumeTrue(cuMLAvailable); + } + + @Test + public void testDbscan() { + try (Context polyglot = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", this.policy) + .option("grcuda.InputPrefetch", String.valueOf(this.inputPrefetch)).allowAllAccess(true).build()) { + Value cu = polyglot.eval("grcuda", "CU"); + int numRows = 100; + int numCols = 2; + String cudaType = (typeChar == 'D') ? "double" : "float"; + Value input = cu.invokeMember("DeviceArray", cudaType, numRows, numCols); + Value labels = cu.invokeMember("DeviceArray", "int", numRows); + for (int i = 0; i < numRows; i++) { + for (int j = 0; j < numCols; j++) { + input.getArrayElement(i).setArrayElement(j, i / 10 + j); + } + labels.setArrayElement(i, 0); + } + double eps = 0.5; + int minSamples = 5; + int maxBytesPerChunk = 0; + int verbose = 1; + try { + Value dbscan = polyglot.eval("grcuda", "ML::cuml" + typeChar + "pDbscanFit"); + try { + dbscan.execute(input, numRows, numCols, eps, minSamples, labels, maxBytesPerChunk, verbose); + CUBLASTest.assertOutputVectorIsCorrect(numRows, labels, (Integer i) -> i / 10, this.typeChar); + } catch (Exception e) { + System.out.println("warning: failed to launch cuML, skipping test"); + cuMLAvailable = false; + assumeNoException(e); + } + } catch (Exception e) { + System.out.println("warning: cuML not enabled, skipping test"); + cuMLAvailable = false; + assumeNoException(e); + } + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUSPARSETest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUSPARSETest.java new file mode 100644 index 00000000..482211bc --- /dev/null +++ 
b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/cudalibraries/CUSPARSETest.java @@ -0,0 +1,507 @@ +/* + * Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.cudalibraries; + +import static com.nvidia.grcuda.cudalibraries.cusparse.cusparseproxy.CUSPARSEProxySpMV.CUSPARSESpMVMatrixType; +import static org.junit.Assert.assertEquals; + +import java.util.Arrays; +import java.util.Collection; +import java.util.List; +import java.util.concurrent.ThreadLocalRandom; +import java.util.logging.Logger; +import java.util.stream.Collectors; + +import com.oracle.truffle.api.TruffleLogger; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionPolicyEnum; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; + +@RunWith(Parameterized.class) +public class CUSPARSETest { + + @Parameterized.Parameters + public static Collection data() { + return GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {ExecutionPolicyEnum.SYNC.toString(), ExecutionPolicyEnum.ASYNC.toString()}, + {true, false}, + {'S', 'C', 'D', 'Z'} + })); + } + + private final String policy; + private final boolean inputPrefetch; + private final char type; + + public CUSPARSETest(String policy, boolean inputPrefetch, char type) { + this.policy = policy; + this.inputPrefetch = inputPrefetch; + this.type = type; + } + + private int asCudaOrdinalDataType(char 
type) { + switch (type) { + case 'C': + return CUSPARSERegistry.CUDADataType.CUDA_C_32F.ordinal(); + case 'Z': + return CUSPARSERegistry.CUDADataType.CUDA_C_64F.ordinal(); + case 'S': + return CUSPARSERegistry.CUDADataType.CUDA_R_32F.ordinal(); + case 'D': + return CUSPARSERegistry.CUDADataType.CUDA_R_64F.ordinal(); + } + throw new RuntimeException("Type \"" + type + "\" is not allowed"); + } + + /** + * SPARSE SpMV function test with CSR matrix. + */ + + @Test + public void TestSpMVCSR() { + try (Context polyglot = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", + this.policy).option("grcuda.InputPrefetch", String.valueOf(this.inputPrefetch)).option( + "grcuda.CuSPARSEEnabled", String.valueOf(true)).allowAllAccess(true).build()) { + + final int numElements = 1000; + final boolean isComplex = this.type == 'C' || this.type == 'Z'; + final boolean isDouble = this.type == 'D' || this.type == 'Z'; + final int complexScaleSize = isComplex ? 2 : 1; + final String grcudaDataType = (this.type == 'D' || this.type == 'Z') ? 
"double" : "float"; + + Value cu = polyglot.eval("grcuda", "CU"); + + // creating variables for cusparse functions as DeviceArrays + Value alpha = cu.invokeMember("DeviceArray", grcudaDataType, complexScaleSize); + Value beta = cu.invokeMember("DeviceArray", grcudaDataType, complexScaleSize); + Value rowPtr = cu.invokeMember("DeviceArray", "int", (numElements + 1)); + Value colIdx = cu.invokeMember("DeviceArray", "int", numElements); + Value nnzVec = cu.invokeMember("DeviceArray", grcudaDataType, numElements * complexScaleSize); + Value dnVec = cu.invokeMember("DeviceArray", grcudaDataType, numElements * complexScaleSize); + Value outVec = cu.invokeMember("DeviceArray", grcudaDataType, numElements * complexScaleSize); + + // variables initialization + + alpha.setArrayElement(0, 1); + beta.setArrayElement(0, 0); + if (isComplex) { + alpha.setArrayElement(1, 0); + beta.setArrayElement(1, 0); + } + + // populating arrays + float edgeValue = (float) Math.random(); + + for (int i = 0; i < numElements; ++i) { + rowPtr.setArrayElement(i, i); + colIdx.setArrayElement(i, i); + for (int j = 0; j < complexScaleSize; ++j) { + if(j == 0){ + nnzVec.setArrayElement((i * complexScaleSize), edgeValue); + dnVec.setArrayElement((i * complexScaleSize), 1.0); + outVec.setArrayElement((i * complexScaleSize),0.0); + } else { + nnzVec.setArrayElement((i * complexScaleSize + j), 0.0); + dnVec.setArrayElement((i * complexScaleSize + j), 0.0); + outVec.setArrayElement((i * complexScaleSize + j), 0.0); + } + } + } + + rowPtr.setArrayElement(numElements, numElements); + + Value cusparseSpMV = polyglot.eval("grcuda", "SPARSE::cusparseSpMV"); + + int cudaDataType = asCudaOrdinalDataType(this.type); + + // order of the arguments should be the following + cusparseSpMV.execute( + CUSPARSERegistry.CUSPARSEOperation.CUSPARSE_OPERATION_NON_TRANSPOSE.ordinal(), + alpha, + numElements, + numElements, + numElements, + rowPtr, + colIdx, + nnzVec, + 
CUSPARSERegistry.CUSPARSEIndexType.CUSPARSE_INDEX_32I.ordinal(), + CUSPARSERegistry.CUSPARSEIndexBase.CUSPARSE_INDEX_BASE_ZERO.ordinal(), + cudaDataType, + dnVec, + cudaDataType, + beta, + outVec, + CUSPARSERegistry.CUSPARSESpMVAlg.CUSPARSE_SPMV_ALG_DEFAULT.ordinal(), + CUSPARSESpMVMatrixType.SPMV_MATRIX_TYPE_CSR.ordinal()); + + + for (int i = 0; i < numElements; ++i) { + for (int j = 0; j < complexScaleSize; j++) { + if(j == 0) { + assertEquals(edgeValue, outVec.getArrayElement(i * complexScaleSize).asFloat(), 1e-5); + } else { + assertEquals(0.0, outVec.getArrayElement(i * complexScaleSize + j).asFloat(), 1e-5); + } + } + } + } + } + + /** + * SPARSE SpMV function test with complex data type and COO matrix + */ + + @Test + public void TestSpMVCOO() { + try (Context polyglot = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", this.policy).option("grcuda.InputPrefetch", String.valueOf(this.inputPrefetch)).option( + "grcuda.CuSPARSEEnabled", String.valueOf(true)).allowAllAccess(true).build()) { + + final int numElements = 10000; + final boolean isComplex = this.type == 'C' || this.type == 'Z'; + final boolean isDouble = this.type == 'D' || this.type == 'Z'; + final int complexScaleSize = isComplex ? 2 : 1; + final String grcudaDataType = (this.type == 'D' || this.type == 'Z') ? 
"double" : "float"; + + // creating context variables + Value cu = polyglot.eval("grcuda", "CU"); + + // creating variables for cusparse functions as DeviceArrays + Value alpha = cu.invokeMember("DeviceArray", grcudaDataType, complexScaleSize); + Value beta = cu.invokeMember("DeviceArray", grcudaDataType, complexScaleSize); + Value coordX = cu.invokeMember("DeviceArray", "int", numElements); + Value coordY = cu.invokeMember("DeviceArray", "int", numElements); + Value nnzVec = cu.invokeMember("DeviceArray", grcudaDataType, numElements * complexScaleSize); + Value dnVec = cu.invokeMember("DeviceArray", grcudaDataType, numElements * complexScaleSize); + Value outVec = cu.invokeMember("DeviceArray", grcudaDataType, numElements * complexScaleSize); + + // variables initialization + alpha.setArrayElement(0, 1); + beta.setArrayElement(0, 0); + + if (isComplex) { + alpha.setArrayElement(1, 0); + beta.setArrayElement(1, 0); + } + + // populating arrays + float edgeValue = (float) Math.random(); + + for (int i = 0; i < numElements; i++) { + coordX.setArrayElement(i, i); + coordY.setArrayElement(i, i); + for (int j = 0; j < complexScaleSize; ++j) { + if (j == 0){ + nnzVec.setArrayElement(i * complexScaleSize, edgeValue); + dnVec.setArrayElement(i * complexScaleSize, 1.0); + outVec.setArrayElement(i * complexScaleSize, 0.0); + } + if(j > 0){ + nnzVec.setArrayElement(i * complexScaleSize + j, 0.0); + dnVec.setArrayElement(i * complexScaleSize + j, 0.0); + outVec.setArrayElement(i * complexScaleSize + j, 0.0); + } + } + } + + Value cusparseSpMV = polyglot.eval("grcuda", "SPARSE::cusparseSpMV"); + + int cudaDataType = this.asCudaOrdinalDataType(this.type); + + // order of the arguments should be the following + cusparseSpMV.execute( + CUSPARSERegistry.CUSPARSEOperation.CUSPARSE_OPERATION_NON_TRANSPOSE.ordinal(), + alpha, + numElements, + numElements, + numElements, + coordX, + coordY, + nnzVec, + CUSPARSERegistry.CUSPARSEIndexType.CUSPARSE_INDEX_32I.ordinal(), + 
CUSPARSERegistry.CUSPARSEIndexBase.CUSPARSE_INDEX_BASE_ZERO.ordinal(), + cudaDataType, + dnVec, + cudaDataType, + beta, + outVec, + CUSPARSERegistry.CUSPARSESpMVAlg.CUSPARSE_SPMV_ALG_DEFAULT.ordinal(), + CUSPARSESpMVMatrixType.SPMV_MATRIX_TYPE_COO.ordinal()); + + for (int i = 0; i < numElements; ++i) { + for (int j = 0; j < complexScaleSize; j++) { + if(j == 0) { + assertEquals(edgeValue, outVec.getArrayElement(i * complexScaleSize).asFloat(), 1e-5); + } else { + assertEquals(0.0, outVec.getArrayElement(i * complexScaleSize + j).asFloat(), 1e-5); + } + } + } + } + } + + /** + * SPARSE Sgemvi function test + */ + + @Test + public void TestTGeMVI() { + try (Context polyglot = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", + this.policy).option("grcuda.InputPrefetch", String.valueOf(this.inputPrefetch)).option( + "grcuda.CuSPARSEEnabled", String.valueOf(true)).allowAllAccess(true).build()) { + if (this.type != 'S') { + System.out.println("warning: TGeMVI tests with T=" + this.type + ", ExecutionPolicy=" + this.policy + ", InputPrefetch=" + this.inputPrefetch + " are not supported, skipping test"); + return; + } + final int numElements = 1000; + final boolean isComplex = this.type == 'C' || this.type == 'Z'; + final boolean isDouble = this.type == 'D' || this.type == 'Z'; + int complexScaleSize = isComplex ? 2 : 1; + final String grcudaDataType = (this.type == 'D' || this.type == 'Z') ? 
"double" : "float"; + + // creating context variables + Value cu = polyglot.eval("grcuda", "CU"); + + // creating variables for cusparse functions as DeviceArrays + Value alpha = cu.invokeMember("DeviceArray", grcudaDataType, complexScaleSize); + Value beta = cu.invokeMember("DeviceArray", grcudaDataType, complexScaleSize); + int rows = numElements; // m + int cols = numElements; // n + int lda = numElements; // leading dim of A + int nnz = 2; // number of nnz + Value spVec = cu.invokeMember("DeviceArray", grcudaDataType, nnz * complexScaleSize); // x + Value outVec = cu.invokeMember("DeviceArray", grcudaDataType, numElements * complexScaleSize); // output + Value matA = cu.invokeMember("DeviceArray", grcudaDataType, numElements * numElements * complexScaleSize); + // variables initialization + alpha.setArrayElement(0, 1); + beta.setArrayElement(0, 0); + + if (isComplex) { + alpha.setArrayElement(1, 0); + beta.setArrayElement(1, 0); + } + + Value xInd = cu.invokeMember("DeviceArray", "int", nnz); // must be the same + + float edgeValue = (float) Math.random(); + // Do this since there's an high chance that, + // for small enough numElements + // two integers might come up equal + List indices = ThreadLocalRandom + .current() + .ints(0, numElements) + .distinct() + .limit(nnz) + .boxed() + .collect(Collectors.toList()); + + // fill sparse vector and related arguments + for (int i = 0; i < nnz; ++i) { + int idxNnz = indices.get(i); + xInd.setArrayElement(i, idxNnz); // set indices vector + for (int j = 0; j < complexScaleSize; ++j) { + spVec.setArrayElement(i * complexScaleSize + j, j == 0 ? 1.0 : 0.0); + } + } + + // fill dense matrix + for (int i = 0; i < numElements; i++) { + for (int j = 0; j < numElements; j++) { + for (int k = 0; k < complexScaleSize; ++k) { + matA.setArrayElement((i * numElements + j) * complexScaleSize + k, k == 0 ? 
edgeValue : 0.0); + } + } + } + + Value cusparseTgemvi = polyglot.eval("grcuda", "SPARSE::cusparse" + this.type + "gemvi"); + + // order of the arguments should be the following + // transA, m, n, alpha, A, lda, nnz, x, xInd, beta, y, idxBases + cusparseTgemvi.execute( + CUSPARSERegistry.CUSPARSEOperation.CUSPARSE_OPERATION_NON_TRANSPOSE.ordinal(), + rows, + cols, + alpha, + matA, + lda * complexScaleSize, + nnz, + spVec, + xInd, + beta, + outVec, + CUSPARSERegistry.CUSPARSEIndexBase.CUSPARSE_INDEX_BASE_ZERO.ordinal(), + this.type); + + float expectedResult = nnz * edgeValue; + + for (int i = 0; i < numElements; i++) { + for (int j = 0; j < complexScaleSize; ++j) { + if (isDouble) { + assertEquals(j == 0 ? expectedResult : 0.0, outVec.getArrayElement(i * complexScaleSize + j).asDouble(), 1e-5f); +// System.out.println("out_vec[" + (i * complexScaleSize + j) + "] -> " + +// outVec.getArrayElement(i * complexScaleSize + j).asDouble()); + } else { + assertEquals(j == 0 ? expectedResult : 0.0, outVec.getArrayElement(i * complexScaleSize + j).asFloat(), 1e-5f); +// System.out.println("out_vec[" + (i * complexScaleSize + j) + "] -> " + +// outVec.getArrayElement(i * complexScaleSize + j).asFloat()); + + } + } + } + } + } + + /** + * Libraries Integration Test + */ + +// @Test +// public void TestLibrariesIntegration() { +// // y = M x, z = M v +// // A = z + y, with axpy (a = 1) +// try (Context polyglot = GrCUDATestUtil.buildTestContext().option("grcuda.ExecutionPolicy", +// this.policy).option("grcuda.InputPrefetch", String.valueOf(this.inputPrefetch)).option( +// "grcuda.CuSPARSEEnabled", String.valueOf(true)).allowAllAccess(true).build()) { +// // context creation +// Value cu = polyglot.eval("grcuda", "CU"); +// +// // variables creation +// int numElements = 1000; +// +// Value alphaX = cu.invokeMember("DeviceArray", "float", 1); +// Value betaX = cu.invokeMember("DeviceArray", "float", 1); +// Value alphaV = cu.invokeMember("DeviceArray", "float", 1); +// Value 
betaV = cu.invokeMember("DeviceArray", "float", 1); +// Value coordXX = cu.invokeMember("DeviceArray", "int", numElements); +// Value coordYX = cu.invokeMember("DeviceArray", "int", numElements); +// Value coordXV = cu.invokeMember("DeviceArray", "int", numElements); +// Value coordYV = cu.invokeMember("DeviceArray", "int", numElements); +// Value nnzVecX = cu.invokeMember("DeviceArray", "float", numElements); +// Value nnzVecV = cu.invokeMember("DeviceArray", "float", numElements); +// Value dnVecZ = cu.invokeMember("DeviceArray", "float", numElements); +// Value outVecZ = cu.invokeMember("DeviceArray", "float", numElements); +// Value dnVecY = cu.invokeMember("DeviceArray", "float", numElements); +// Value outVecY = cu.invokeMember("DeviceArray", "float", numElements); +// +// alphaX.setArrayElement(0, 1); +// betaX.setArrayElement(0, 0); +// alphaV.setArrayElement(0, 1); +// betaV.setArrayElement(0, 0); +// +// // initial checks +// assertEquals(numElements, coordXX.getArraySize()); +// assertEquals(numElements, coordYX.getArraySize()); +// assertEquals(numElements, nnzVecX.getArraySize()); +// assertEquals(numElements, coordXV.getArraySize()); +// assertEquals(numElements, coordYV.getArraySize()); +// assertEquals(numElements, nnzVecV.getArraySize()); +// +// // initialization +// +// float edgeValueX = (float) Math.random(); +// +// // y = M x +// for (int i = 0; i < numElements; i++) { +// coordXX.setArrayElement(i, i); +// coordYX.setArrayElement(i, i); +// nnzVecX.setArrayElement(i, edgeValueX); +// dnVecY.setArrayElement(i, 1.0); +// outVecY.setArrayElement(i, 0.0); +// } +// +// float edgeValueV = (float) Math.random(); +// +// // z = M v +// for (int i = 0; i < numElements; i++) { +// coordXV.setArrayElement(i, i); +// coordYV.setArrayElement(i, i); +// nnzVecV.setArrayElement(i, edgeValueV); +// dnVecZ.setArrayElement(i, 1.0); +// outVecZ.setArrayElement(i, 0.0); +// } +// +// Value cusparseSpMV = polyglot.eval("grcuda", "SPARSE::cusparseSpMV"); +// +// 
cusparseSpMV.execute( +// CUSPARSERegistry.cusparseOperation_t.CUSPARSE_OPERATION_NON_TRANSPOSE.ordinal(), +// alphaX, +// numElements, +// numElements, +// numElements, +// coordXX, +// coordYX, +// nnzVecX, +// CUSPARSERegistry.cusparseIndexType_t.CUSPARSE_INDEX_32I.ordinal(), +// CUSPARSERegistry.cusparseIndexBase_t.CUSPARSE_INDEX_BASE_ZERO.ordinal(), +// CUSPARSERegistry.cudaDataType.CUDA_R_32F.ordinal(), +// dnVecY, +// CUSPARSERegistry.cudaDataType.CUDA_R_32F.ordinal(), +// betaX, +// outVecY, +// CUSPARSERegistry.cusparseSpMVAlg_t.CUSPARSE_SPMV_ALG_DEFAULT.ordinal()); +// +// cusparseSpMV.execute( +// CUSPARSERegistry.cusparseOperation_t.CUSPARSE_OPERATION_NON_TRANSPOSE.ordinal(), +// alphaV, +// numElements, +// numElements, +// numElements, +// coordXV, +// coordYV, +// nnzVecV, +// CUSPARSERegistry.cusparseIndexType_t.CUSPARSE_INDEX_32I.ordinal(), +// CUSPARSERegistry.cusparseIndexBase_t.CUSPARSE_INDEX_BASE_ZERO.ordinal(), +// CUSPARSERegistry.cudaDataType.CUDA_R_32F.ordinal(), +// dnVecZ, +// CUSPARSERegistry.cudaDataType.CUDA_R_32F.ordinal(), +// betaV, +// outVecZ, +// CUSPARSERegistry.cusparseSpMVAlg_t.CUSPARSE_SPMV_ALG_DEFAULT.ordinal()); +// +// Value saxpy = polyglot.eval("grcuda", "BLAS::cublas" + this.type + "axpy"); +// +// saxpy.execute(numElements, alphaX, outVecY, 1, outVecZ, 1); +// +// for (int i = 1; i < numElements; i++) { +// assertEquals(outVecZ.getArrayElement(i).asFloat(), edgeValueV + edgeValueX, 1e-5); +// } +// } +// } + +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/DeviceArrayCopyFunctionTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/DeviceArrayCopyFunctionTest.java new file mode 100644 index 00000000..b410eb13 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/DeviceArrayCopyFunctionTest.java @@ -0,0 +1,948 @@ +/* + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. 
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.functions; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArray; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArrayView; +import com.nvidia.grcuda.functions.DeviceArrayCopyFunction; +import com.nvidia.grcuda.runtime.computation.arraycomputation.ArrayCopyFunctionExecutionInitializer; +import com.nvidia.grcuda.test.util.GrCUDATestOptionsStruct; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import com.nvidia.grcuda.test.util.TestLogHandler; +import com.nvidia.grcuda.test.util.mock.DeviceArrayMock; +import com.nvidia.grcuda.test.util.mock.MultiDimDeviceArrayMock; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; +import com.nvidia.grcuda.runtime.LittleEndianNativeArrayView; +import com.nvidia.grcuda.runtime.OffheapMemory; +import org.junit.experimental.runners.Enclosed; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.List; + +@RunWith(Enclosed.class) +public class DeviceArrayCopyFunctionTest { + + public static class DeviceArrayCopyFunctionTestNotParameterized { + + + @Test + public void testIfSlowPathIsChosenCorrectly() { + try (Context ctx = Context.newBuilder().allowAllAccess(true).allowExperimentalOptions(true).logHandler(new TestLogHandler()) + .option("log.grcuda.com.nvidia.grcuda.level", "SEVERE").build()) { + ctx.getEngine(); // ctx is required to set the logger level. 
Perform an access to suppress compiler warnings about it being unused; + DeviceArray array1d = new DeviceArrayMock(); + DeviceArray array1d2 = new DeviceArrayMock(); + DeviceArrayCopyFunction copyFunction = new DeviceArrayCopyFunction(array1d, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + assertTrue(copyFunction.canUseMemcpy(array1d2)); + + long[] dimensions = {2, 2}; + MultiDimDeviceArray array2d = new MultiDimDeviceArrayMock(dimensions, false); + MultiDimDeviceArray array2d2 = new MultiDimDeviceArrayMock(dimensions, false); + copyFunction = new DeviceArrayCopyFunction(array2d, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + assertTrue(copyFunction.canUseMemcpy(array2d2)); + + // Inconsistent memory layouts; + MultiDimDeviceArray array2d3 = new MultiDimDeviceArrayMock(dimensions, true); + copyFunction = new DeviceArrayCopyFunction(array2d, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + assertFalse(copyFunction.canUseMemcpy(array2d3)); + + // We can copy from a single row, if the layout is row-major; + MultiDimDeviceArrayView view1 = new MultiDimDeviceArrayView(array2d2, 1, 0, 0); + copyFunction = new DeviceArrayCopyFunction(view1, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + assertTrue(copyFunction.canUseMemcpy(array2d)); + copyFunction = new DeviceArrayCopyFunction(array1d, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + assertTrue(copyFunction.canUseMemcpy(array2d)); + + // We cannot copy from a single row, if the destination layout is column-major; + MultiDimDeviceArrayView view2 = new MultiDimDeviceArrayView(array2d3, 1, 0, 0); + copyFunction = new DeviceArrayCopyFunction(view2, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + assertFalse(copyFunction.canUseMemcpy(array2d)); + copyFunction = new DeviceArrayCopyFunction(view2, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + assertFalse(copyFunction.canUseMemcpy(array2d3)); + copyFunction = new DeviceArrayCopyFunction(array1d,
DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + assertFalse(copyFunction.canUseMemcpy(array2d3)); + } + } + + @Test + public void testDeviceToDeviceDependencies() { + try (Context ctx = Context.newBuilder().allowAllAccess(true).allowExperimentalOptions(true).logHandler(new TestLogHandler()) + .option("log.grcuda.com.nvidia.grcuda.level", "SEVERE").build()) { + ctx.getEngine(); // ctx is required to set the logger level. Perform an access to suppress compiler warnings about it being unused; + DeviceArray array1d = new DeviceArrayMock(); + DeviceArray array1d2 = new DeviceArrayMock(); + ArrayCopyFunctionExecutionInitializer init = new ArrayCopyFunctionExecutionInitializer(array1d, array1d2, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + List<ComputationArgumentWithValue> deps = init.initialize(); + assertEquals(2, deps.size()); + assertEquals(array1d, deps.get(0).getArgumentValue()); + assertEquals(array1d2, deps.get(1).getArgumentValue()); + assertFalse(deps.get(0).isConst()); + assertTrue(deps.get(1).isConst()); + + init = new ArrayCopyFunctionExecutionInitializer(array1d, array1d2, DeviceArrayCopyFunction.CopyDirection.TO_POINTER); + deps = init.initialize(); + assertEquals(2, deps.size()); + assertEquals(array1d, deps.get(0).getArgumentValue()); + assertEquals(array1d2, deps.get(1).getArgumentValue()); + assertTrue(deps.get(0).isConst()); + assertFalse(deps.get(1).isConst()); + + int[] array = {1, 2, 3}; + init = new ArrayCopyFunctionExecutionInitializer(array1d, array, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + deps = init.initialize(); + assertEquals(1, deps.size()); + assertEquals(array1d, deps.get(0).getArgumentValue()); + assertFalse(deps.get(0).isConst()); + + int[] array2 = {1, 2, 3}; + init = new ArrayCopyFunctionExecutionInitializer(array1d, array2, DeviceArrayCopyFunction.CopyDirection.TO_POINTER); + deps = init.initialize(); + assertEquals(1, deps.size()); + assertEquals(array1d, deps.get(0).getArgumentValue()); + assertTrue(deps.get(0).isConst()); + } + }
+ + @Test + public void testDeviceArrayCopyFromOffheapMemory() { + final int numElements = 1000; + final int numBytesPerInt = 4; + final int numBytes = numElements * numBytesPerInt; + try (OffheapMemory hostMemory = new OffheapMemory(numBytes)) { + // create off-heap host memory of integers: [1, 2, 3, 4, ..., 1000] + LittleEndianNativeArrayView hostArray = hostMemory.getLittleEndianView(); + for (int i = 0; i < numElements; ++i) { + hostArray.setInt(i, i + 1); + } + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + // create DeviceArray and copy content from off-heap host memory into it + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + Value deviceArray = createDeviceArray.execute("int", numElements); + deviceArray.invokeMember("copyFrom", hostMemory.getPointer(), numElements); + + // Verify content of device array + for (int i = 0; i < numElements; ++i) { + assertEquals(i + 1, deviceArray.getArrayElement(i).asInt()); + } + } + } + } + + @Test + public void testDeviceArrayCopyToOffheapMemory() { + final int numElements = 1000; + final int numBytesPerInt = 4; + final int numBytes = numElements * numBytesPerInt; + try (OffheapMemory hostMemory = new OffheapMemory(numBytes)) { + // create off-heap host memory of integers and initialize its elements with placeholder values.
+ LittleEndianNativeArrayView hostArray = hostMemory.getLittleEndianView(); + for (int i = 0; i < numElements; ++i) { + hostArray.setInt(i, i); + } + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + // create DeviceArray and set its content [1, 2, 3, 4, ..., 1000] + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + Value deviceArray = createDeviceArray.execute("int", numElements); + for (int i = 0; i < numElements; ++i) { + deviceArray.setArrayElement(i, i + 1); + } + // copy content of device array to off-heap host memory + deviceArray.invokeMember("copyTo", hostMemory.getPointer(), numElements); + + // Verify content of the off-heap host memory + for (int i = 0; i < numElements; ++i) { + assertEquals(i + 1, hostArray.getInt(i)); + } + } + } + } + + @Test + public void testDeviceArrayCopyFromDeviceArray() { + final int numElements = 1000; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array and initialize its elements. + Value sourceDeviceArray = createDeviceArray.execute("int", numElements); + for (int i = 0; i < numElements; ++i) { + sourceDeviceArray.setArrayElement(i, i + 1); + } + // create destination device array and initialize its elements to zero. + Value destinationDeviceArray = createDeviceArray.execute("int", numElements); + for (int i = 0; i < numElements; ++i) { + destinationDeviceArray.setArrayElement(i, 0); + } + destinationDeviceArray.invokeMember("copyFrom", sourceDeviceArray, numElements); + // Verify content of device array + for (int i = 0; i < numElements; ++i) { + assertEquals(i + 1, destinationDeviceArray.getArrayElement(i).asInt()); + } + } + } + + @Test + public void testDeviceArrayCopyToDeviceArray() { + final int numElements = 1000; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array and initialize its elements.
+ Value sourceDeviceArray = createDeviceArray.execute("int", numElements); + for (int i = 0; i < numElements; ++i) { + sourceDeviceArray.setArrayElement(i, i + 1); + } + // create destination device array and initialize its elements to zero. + Value destinationDeviceArray = createDeviceArray.execute("int", numElements); + for (int i = 0; i < numElements; ++i) { + destinationDeviceArray.setArrayElement(i, 0); + } + sourceDeviceArray.invokeMember("copyTo", destinationDeviceArray, numElements); + // Verify content of device array + for (int i = 0; i < numElements; ++i) { + assertEquals(i + 1, destinationDeviceArray.getArrayElement(i).asInt()); + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyFromDeviceArray() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array and initialize its elements. + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + + // create destination device array and initialize its elements to zero.
+ Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + + // Set element j of each row to j; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, j); + } + } + destinationDeviceArray.invokeMember("copyFrom", sourceDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals(j, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyToDeviceArray() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array and initialize its elements. + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + // Set element j of each row to j; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, j); + } + } + // create destination device array and initialize its elements to zero. + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + + sourceDeviceArray.invokeMember("copyTo", destinationDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals(j, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyToDeviceArrayRow() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array and initialize its elements.
+ Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + // Set each element of row 3 to j; + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(3).setArrayElement(j, j); + } + + // create destination device array initialize its elements to zero. + Value destinationDeviceArray = createDeviceArray.execute("int", numElements2); + + sourceDeviceArray.getArrayElement(3).invokeMember("copyTo", destinationDeviceArray, sourceDeviceArray.getArrayElement(3).getArraySize()); + // Verify content of device array + for (int j = 0; j < numElements2; ++j) { + assertEquals(j, destinationDeviceArray.getArrayElement(j).asInt()); + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyFromDeviceArrayRow() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array initialize its elements. + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + + // create destination device array.
+ Value destinationDeviceArray = createDeviceArray.execute("int", numElements2); + // Set each value to j; + for (int j = 0; j < numElements2; ++j) { + destinationDeviceArray.setArrayElement(j, 42 + j); + } + + sourceDeviceArray.getArrayElement(3).invokeMember("copyFrom", destinationDeviceArray, sourceDeviceArray.getArrayElement(3).getArraySize()); + // Verify content of device array + for (int j = 0; j < numElements2; ++j) { + assertEquals(42 + j, sourceDeviceArray.getArrayElement(3).getArrayElement(j).asInt()); + } + // Everything else is unmodified; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + if (i != 3) { + assertEquals(i * numElements2 + j, sourceDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyFromDeviceArrayF() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array initialize its elements. + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "F"); + + // create destination device array initialize its elements to zero. 
+ Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "F"); + + // Set each row to j; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + destinationDeviceArray.invokeMember("copyFrom", sourceDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals(i * numElements2 + j, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyToDeviceArrayF() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array initialize its elements. + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "F"); + // Set each row to j; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + // create destination device array initialize its elements to zero. 
+ Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "F"); + + sourceDeviceArray.invokeMember("copyTo", destinationDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals(i * numElements2 + j, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyFromDeviceArrayC() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array initialize its elements. + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "C"); + + // create destination device array initialize its elements to zero. + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "C"); + + // Set each row to j; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + destinationDeviceArray.invokeMember("copyFrom", sourceDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals(i * numElements2 + j, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyToDeviceArrayC() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array initialize its elements. 
+ Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "C"); + // Set each row to j; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, j); + } + } + // create destination device array initialize its elements to zero. + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "C"); + + sourceDeviceArray.invokeMember("copyTo", destinationDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals(j, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyToDeviceArrayRowC() { + final int numElements1 = 5; + final int numElements2 = 7; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // Create device array initialize its elements; + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + // Initialize elements with unique values. 
+ // Values are still written as (row, col), even if the storage is "C"; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + // Initialize destination array with unique values, to ensure that it's modified correctly; + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + destinationDeviceArray.getArrayElement(i).setArrayElement(j, -(i * numElements2 + j)); + } + } + + // This copies the 4th row of the source array into the 4th row of the destination array; + sourceDeviceArray.getArrayElement(3).invokeMember("copyTo", destinationDeviceArray.getArrayElement(3), sourceDeviceArray.getArrayElement(3).getArraySize()); + + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals((i == 3 ? 1 : -1) * (i * numElements2 + j), destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyFromDeviceArrayRowC() { + final int numElements1 = 10; + final int numElements2 = 25; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // Create device array initialize its elements; + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + // Initialize elements with unique values. 
+ // Values are still written as (row, col), even if the storage is "C"; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + // Initialize destination array with unique values, to ensure that it's modified correctly; + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + destinationDeviceArray.getArrayElement(i).setArrayElement(j, -(i * numElements2 + j)); + } + } + + sourceDeviceArray.getArrayElement(3).invokeMember("copyFrom", destinationDeviceArray.getArrayElement(3), destinationDeviceArray.getArrayElement(3).getArraySize()); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals((i == 3 ? -1 : 1) * (i * numElements2 + j), sourceDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyToDeviceArrayRowF() { + final int numElements1 = 5; + final int numElements2 = 7; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // Create device array initialize its elements; + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "F"); + // Initialize elements with unique values. 
+ // Values are still written as (row, col), even if the storage is "F"; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + // Initialize destination array with unique values, to ensure that it's modified correctly; + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "F"); + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + destinationDeviceArray.getArrayElement(i).setArrayElement(j, -(i * numElements2 + j)); + } + } + + // This copies the 4th column of the source array into the 4th column of the destination array; + sourceDeviceArray.getArrayElement(3).invokeMember("copyTo", destinationDeviceArray.getArrayElement(3), sourceDeviceArray.getArrayElement(3).getArraySize()); + + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals((i == 3 ? 1 : -1) * (i * numElements2 + j), destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyFromDeviceArrayRowF() { + final int numElements1 = 5; + final int numElements2 = 7; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // Create device array initialize its elements; + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "F"); + // Initialize elements with unique values.
+ // Values are still written as (row, col), even if the storage is "F"; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + // Initialize destination array with unique values, to ensure that it's modified correctly; + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2, "F"); + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + destinationDeviceArray.getArrayElement(i).setArrayElement(j, -(i * numElements2 + j)); + } + } + + // This copies the 4th column of the source array into the 4th column of the destination array; + sourceDeviceArray.getArrayElement(3).invokeMember("copyFrom", destinationDeviceArray.getArrayElement(3), sourceDeviceArray.getArrayElement(3).getArraySize()); + + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals((i == 3 ? -1 : 1) * (i * numElements2 + j), sourceDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testDeviceCopyExecTime() { + final int numElements = 1000000; + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + List<Integer> array = Arrays.asList(new Integer[numElements]); + long s1 = System.currentTimeMillis(); + for (int i = 0; i < numElements; i++) { + array.set(i, i); + } + long e1 = System.currentTimeMillis(); + // System.out.println("- init java array=" + (e1 - s1) + " ms"); + + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + // create device array initialize its elements.
+ Value sourceDeviceArray = createDeviceArray.execute("int", numElements); + long s2 = System.currentTimeMillis(); + for (int i = 0; i < numElements; ++i) { + sourceDeviceArray.setArrayElement(i, i); + } + long e2 = System.currentTimeMillis(); + // System.out.println("- init grcuda array=" + (e2 - s2) + " ms"); + + // create destination device array initialize its elements to zero. + Value destinationDeviceArray = createDeviceArray.execute("int", numElements); + for (int i = 0; i < numElements; ++i) { + destinationDeviceArray.setArrayElement(i, 0); + } + + long s3 = System.currentTimeMillis(); + destinationDeviceArray.invokeMember("copyFrom", sourceDeviceArray, numElements); + long e3 = System.currentTimeMillis(); + // System.out.println("- grcuda memcpy=" + (e3 - s3) + " ms"); + + long s4 = System.currentTimeMillis(); + for (int i = 0; i < numElements; i++) { + destinationDeviceArray.setArrayElement(i, array.get(i)); + } + long e4 = System.currentTimeMillis(); + // System.out.println("- java memcpy=" + (e4 - s4) + " ms"); + + long s5 = System.currentTimeMillis(); + destinationDeviceArray.invokeMember("copyFrom", array, numElements); + long e5 = System.currentTimeMillis(); + // System.out.println("- grcuda memcpy - slow path=" + (e5 - s5) + " ms"); + + // Verify content of device array + for (int i = 0; i < numElements; ++i) { + assertEquals(i, destinationDeviceArray.getArrayElement(i).asInt()); + } + } + } + + @Test + public void testDeviceArrayCopyFromArrayList() { + final int numElements = 1000000; + List<Integer> array = Arrays.asList(new Integer[numElements]); + for (int i = 0; i < numElements; ++i) { + array.set(i, i + 1); + } + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + // create DeviceArray and copy content from array list memory into it; + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + Value deviceArray = createDeviceArray.execute("int", numElements); + + long s1 = System.currentTimeMillis(); +
deviceArray.invokeMember("copyFrom", array, numElements); + long e1 = System.currentTimeMillis(); + // System.out.println("- copy from java array=" + (e1 - s1) + " ms"); + + // Verify content of device array; + for (int i = 0; i < numElements; ++i) { + assertEquals(i + 1, deviceArray.getArrayElement(i).asInt()); + } + } + } + + @Test + public void testDeviceArrayCopyToArrayList() { + final int numElements = 1000000; + List<Integer> array = Arrays.asList(new Integer[numElements]); + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + // create DeviceArray and copy content from array list memory into it; + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + Value deviceArray = createDeviceArray.execute("int", numElements); + long s1 = System.currentTimeMillis(); + for (int i = 0; i < numElements; ++i) { + deviceArray.setArrayElement(i, i + 1); + } + long e1 = System.currentTimeMillis(); + // System.out.println("- init device array=" + (e1 - s1) + " ms"); + + long s2 = System.currentTimeMillis(); + deviceArray.invokeMember("copyTo", array, numElements); + long e2 = System.currentTimeMillis(); + // System.out.println("- copy to device array=" + (e2 - s2) + " ms"); + + // Verify content of java array; + for (int i = 0; i < numElements; ++i) { + assertEquals(i + 1, array.get(i).intValue()); + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyFromArrayList() { + final int numElements1 = 500; + final int numElements2 = 2000; + ArrayList<Integer> array = new ArrayList<>(numElements1 * numElements2); + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + + // create DeviceArray and copy content from array list memory into it; + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + Value deviceArray = createDeviceArray.execute("int", numElements1, numElements2); + + // Set each value to its index + 1; + long s1 = System.currentTimeMillis(); + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { +
deviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j + 1); + } + } + long e1 = System.currentTimeMillis(); + // System.out.println("- init 2d device array=" + (e1 - s1) + " ms"); + + long s2 = System.currentTimeMillis(); + deviceArray.invokeMember("copyTo", array, numElements1 * numElements2); + long e2 = System.currentTimeMillis(); + // System.out.println("- copy to 2d device array=" + (e2 - s2) + " ms"); + + // Verify content of java array; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + array.add(i * numElements2 + j + 1); + assertEquals(i * numElements2 + j + 1, array.get(i * numElements2 + j).intValue()); + } + } + } + } + + @Test + public void testMultiDimDeviceArrayCopyToArrayList() { + final int numElements1 = 500; + final int numElements2 = 2000; + ArrayList<Integer> array = new ArrayList<>(numElements1 * numElements2); + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { + + // Set each value to its index + 1; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + array.add(i * numElements2 + j + 1); + } + } + + // create DeviceArray and copy content from array list memory into it; + Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); + Value deviceArray = createDeviceArray.execute("int", numElements1, numElements2); + long s1 = System.currentTimeMillis(); + deviceArray.invokeMember("copyFrom", array, numElements1 * numElements2); + long e1 = System.currentTimeMillis(); + // System.out.println("- copy to device array=" + (e1 - s1) + " ms"); + + // Verify content of device array; + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + array.add(i * numElements2 + j + 1); + assertEquals(i * numElements2 + j + 1, deviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + } + + @RunWith(Parameterized.class) + public static class DeviceArrayCopyFunctionParametrized { + @Parameterized.Parameters + public
static Collection<Object[]> data() { + return GrCUDATestUtil.getAllOptionCombinationsSingleGPU(); + } + + private final GrCUDATestOptionsStruct options; + + public DeviceArrayCopyFunctionParametrized(GrCUDATestOptionsStruct options) { + this.options = options; + } + + + private static final int NUM_THREADS_PER_BLOCK = 32; + + private static final String ADD_ONE_KERNEL = + "extern \"C\" __global__ void add(int* x, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " x[idx] = x[idx] + 1;\n" + + " }" + + "}\n"; + + @Test + public void testDeviceDeviceMemcpyDependency() { + final int numElements1 = 25; + final int numElements2 = 50; + final int numBlocks = (numElements1 * numElements2 + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + Value createDeviceArray = context.eval("grcuda", "DeviceArray"); + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value addKernel = buildkernel.execute(ADD_ONE_KERNEL, "add", "pointer, sint32"); + + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + + // Call kernel; + addKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK).execute(sourceDeviceArray, numElements1 * numElements2); + // Copy values from source to destination, using copyFrom; + destinationDeviceArray.invokeMember("copyFrom", sourceDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals((i * numElements2 + j) + 1, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public
void testDeviceDeviceMemcpyDependencySingleRow() { + final int numElements1 = 25; + final int numElements2 = 50; + final int numBlocks = (numElements1 * numElements2 + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + Value createDeviceArray = context.eval("grcuda", "DeviceArray"); + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value addKernel = buildkernel.execute(ADD_ONE_KERNEL, "add", "pointer, sint32"); + + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + + // Call kernel; + addKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK).execute(sourceDeviceArray, numElements1 * numElements2); + // Copy values from source to destination, using copyFrom; + destinationDeviceArray.getArrayElement(3).invokeMember("copyFrom", sourceDeviceArray.getArrayElement(3), sourceDeviceArray.getArrayElement(3).getArraySize()); + // Verify content of device array + for (int i = 0; i < destinationDeviceArray.getArrayElement(3).getArraySize(); ++i) { + assertEquals(sourceDeviceArray.getArrayElement(3).getArrayElement(i).asInt(), destinationDeviceArray.getArrayElement(3).getArrayElement(i).asInt()); + } + } + } + + @Test + public void testDeviceDeviceMemcpyDependencyTwoKernelsCopyFrom() { + final int numElements1 = 25; + final int numElements2 = 50; + final int numBlocks = (numElements1 * numElements2 + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + Value createDeviceArray = context.eval("grcuda", "DeviceArray"); + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, 
numElements2); + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value addKernel = buildkernel.execute(ADD_ONE_KERNEL, "add", "pointer, sint32"); + + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + destinationDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j + 10); + } + } + + // Call kernel; + addKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK).execute(sourceDeviceArray, numElements1 * numElements2); + addKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK).execute(destinationDeviceArray, numElements1 * numElements2); + // Copy values from source to destination, using copyFrom; + destinationDeviceArray.invokeMember("copyFrom", sourceDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals((i * numElements2 + j) + 1, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + + @Test + public void testDeviceDeviceMemcpyDependencyTwoKernelsCopyTo() { + final int numElements1 = 25; + final int numElements2 = 50; + final int numBlocks = (numElements1 * numElements2 + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + Value createDeviceArray = context.eval("grcuda", "DeviceArray"); + Value sourceDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + Value destinationDeviceArray = createDeviceArray.execute("int", numElements1, numElements2); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value addKernel = buildkernel.execute(ADD_ONE_KERNEL, "add", "pointer, 
sint32"); + + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + sourceDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j); + } + } + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + destinationDeviceArray.getArrayElement(i).setArrayElement(j, i * numElements2 + j + 10); + } + } + + // Call kernel; + addKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK).execute(sourceDeviceArray, numElements1 * numElements2); + addKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK).execute(destinationDeviceArray, numElements1 * numElements2); + // Copy values from source to destination, using copyFrom; + sourceDeviceArray.invokeMember("copyTo", destinationDeviceArray, numElements1 * numElements2); + // Verify content of device array + for (int i = 0; i < numElements1; ++i) { + for (int j = 0; j < numElements2; ++j) { + assertEquals((i * numElements2 + j) + 1, destinationDeviceArray.getArrayElement(i).getArrayElement(j).asInt()); + } + } + } + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayFreeTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/DeviceArrayFreeTest.java similarity index 82% rename from projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayFreeTest.java rename to projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/DeviceArrayFreeTest.java index 4ae7216d..15ff44d7 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayFreeTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/DeviceArrayFreeTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -25,10 +32,11 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.test; +package com.nvidia.grcuda.test.functions; import static org.junit.Assert.assertTrue; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; import org.graalvm.polyglot.Context; import org.graalvm.polyglot.PolyglotException; import org.graalvm.polyglot.Value; @@ -42,7 +50,7 @@ public class DeviceArrayFreeTest { @Test public void testCanInvokeFreeDeviceArray() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { // create DeviceArray Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); Value deviceArray = createDeviceArray.execute("int", 1000); @@ -56,7 +64,7 @@ public void testCanInvokeFreeDeviceArray() { @Test(expected = PolyglotException.class) public void testDeviceArrayAccessAfterFreeThrows() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { // create DeviceArray Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); Value deviceArray = createDeviceArray.execute("int", 1000); @@ -67,7 +75,7 @@ public void testDeviceArrayAccessAfterFreeThrows() { @Test(expected = PolyglotException.class) public void testDeviceArrayDoubleFreeThrows() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { // create DeviceArray Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); Value deviceArray = createDeviceArray.execute("int", 1000); @@ -82,7 +90,7 @@ public void testDeviceArrayDoubleFreeThrows() { @Test public void testCanInvokeFreeMultiDimDeviceArray() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { // create DeviceArray Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); Value deviceArray = createDeviceArray.execute("int", 100, 100); @@ -96,7 +104,7 @@ 
public void testCanInvokeFreeMultiDimDeviceArray() { @Test(expected = PolyglotException.class) public void testMultiDimDeviceArrayAccessAfterFreeThrows() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { // create DeviceArray Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); Value deviceArray = createDeviceArray.execute("int", 100, 100); @@ -107,7 +115,7 @@ public void testMultiDimDeviceArrayAccessAfterFreeThrows() { @Test(expected = PolyglotException.class) public void testMultiDimDeviceArrayDoubleFreeThrows() { - try (Context ctx = Context.newBuilder().allowAllAccess(true).build()) { + try (Context ctx = GrCUDATestUtil.buildTestContext().build()) { // create DeviceArray Value createDeviceArray = ctx.eval("grcuda", "DeviceArray"); Value deviceArray = createDeviceArray.execute("int", 100, 100); diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/map/MapFunctionTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/map/MapFunctionTest.java index 5154e48e..b76f6a51 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/map/MapFunctionTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/functions/map/MapFunctionTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -9,7 +10,10 @@ * * Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
- * * Neither the name of NVIDIA CORPORATION nor the names of its + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. * diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ComputationArgumentTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ComputationArgumentTest.java new file mode 100644 index 00000000..1c0f2fca --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ComputationArgumentTest.java @@ -0,0 +1,109 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.runtime; + +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.TypeException; +import org.junit.Test; + +import java.util.ArrayList; + +import static org.junit.Assert.assertEquals; + +public class ComputationArgumentTest { + + @Test + public void testSignatureParsingOld() throws TypeException { + String signature = "pointer, const pointer, double, sint32"; + Boolean[] isArray = {true, true, false, false}; + Type[] types = {Type.NFI_POINTER, Type.NFI_POINTER, Type.DOUBLE, Type.SINT32}; + ArrayList<ComputationArgument> params = ComputationArgument.parseParameterSignature(signature); + for (int i = 0; i < params.size(); i++) { + assertEquals(i, params.get(i).getPosition()); + assertEquals(isArray[i], params.get(i).isArray()); + assertEquals(types[i], params.get(i).getType()); + } + } + + @Test + public void testSignatureParsingOldWithConst() throws TypeException { + String signature = "pointer, const pointer, double, sint32"; + Boolean[] isArray = {true, true, false, false}; + Boolean[] isConst = {false, true, true, true}; + Type[] types = {Type.NFI_POINTER, Type.NFI_POINTER, Type.DOUBLE, Type.SINT32}; + 
ArrayList<ComputationArgument> params = ComputationArgument.parseParameterSignature(signature); + for (int i = 0; i < params.size(); i++) { + assertEquals(i, params.get(i).getPosition()); + assertEquals(isArray[i], params.get(i).isArray()); + assertEquals(types[i], params.get(i).getType()); + assertEquals(isConst[i], params.get(i).isConst()); + } + } + + @Test + public void testSignatureParsingNIDL() throws TypeException { + String signature = "x: in pointer sint32, y: inout pointer float, z: out pointer float, n: sint32, n_blocks: sint64, block_size: char"; + String[] names = {"x", "y", "z", "n", "n_blocks", "block_size"}; + Boolean[] isArray = {true, true, true, false, false, false}; + Boolean[] isConst = {true, false, false, true, true, true}; + Type[] types = {Type.SINT32, Type.FLOAT, Type.FLOAT, Type.SINT32, Type.SINT64, Type.CHAR}; + ArrayList<ComputationArgument> params = ComputationArgument.parseParameterSignature(signature); + for (int i = 0; i < params.size(); i++) { + assertEquals(names[i], params.get(i).getName()); + assertEquals(i, params.get(i).getPosition()); + assertEquals(isArray[i], params.get(i).isArray()); + assertEquals(types[i], params.get(i).getType()); + assertEquals(isConst[i], params.get(i).isConst()); + } + } + + @Test + public void testSignatureParsingWithParentheses() throws TypeException { + String signature = "(sint64, sint32, pointer const, const pointer, sint32, pointer, sint32): sint32\""; + Boolean[] isArray = {false, false, true, true, false, true, false}; + Boolean[] isConst = {true, true, true, true, true, false, true}; + Type[] types = {Type.SINT64, Type.SINT32, Type.NFI_POINTER, Type.NFI_POINTER, Type.SINT32, Type.NFI_POINTER, Type.SINT32}; + ArrayList<ComputationArgument> params = ComputationArgument.parseParameterSignature(signature); + for (int i = 0; i < params.size(); i++) { + assertEquals(i, params.get(i).getPosition()); + assertEquals(isArray[i], params.get(i).isArray()); + assertEquals(types[i], params.get(i).getType()); + assertEquals(isConst[i], params.get(i).isConst()); + } + 
} + + @Test(expected = TypeException.class) + public void testSignatureParsingWithWrongParentheses() throws TypeException { + String signature = "(sint64, sint32, pointer const), const pointer, sint32, pointer, sint32): sint32\""; + ComputationArgument.parseParameterSignature(signature); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/CreateStreamTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/CreateStreamTest.java new file mode 100644 index 00000000..38a9fef7 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/CreateStreamTest.java @@ -0,0 +1,218 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.runtime; + +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; + +import java.util.HashSet; +import java.util.Set; +import java.util.stream.IntStream; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertTrue; + +public class CreateStreamTest { + + /** + * Simply check if we can create a CUDA stream without blowing things up! 
+ */ + @Test + public void createStreamSimpleTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream = createStream.execute(); + assertNotNull(stream); + assertTrue(stream.isNativePointer()); + } + } + + /** + * Check that we can create many different streams; + */ + @Test + public void createManyStreamsTest() { + int numStreams = 8; + Set<Long> streamSet = new HashSet<>(); + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + IntStream.range(0, numStreams).forEach(i -> { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream = createStream.execute(); + streamSet.add(stream.asNativePointer()); + assertNotNull(stream); + assertTrue(stream.isNativePointer()); + }); + } + assertEquals(numStreams, streamSet.size()); + } + + private static final int NUM_THREADS_PER_BLOCK = 32; + + private static final String SQUARE_KERNEL = + "extern \"C\" __global__ void square(float* x, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " x[idx] = x[idx] * x[idx];\n" + + " }" + + "}\n"; + + /** + * Execute a simple kernel on a non-default stream; + */ + @Test + public void useStreamTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream = createStream.execute(); + assertNotNull(stream); + assertTrue(stream.isNativePointer()); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 
2.0); + } + // Set the custom stream; + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream); + configuredSquareKernel.execute(x, numElements); + + // Wait for the computations to end; + Value syncStream = context.eval("grcuda", "cudaDeviceSynchronize"); + syncStream.execute(); + + for (int i = 0; i < numElements; i++) { + assertEquals(4.0, x.getArrayElement(i).asFloat(), 0.01); + } + } + } + + /** + * Execute two simple kernels on non-default streams; + */ + @Test + public void useTwoStreamsTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream1 = createStream.execute(); + Value stream2 = createStream.execute(); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + y.setArrayElement(i, 4.0); + } + // Set the custom streams; + Value configuredSquareKernel1 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream1); + Value configuredSquareKernel2 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream2); + + configuredSquareKernel1.execute(x, numElements); + configuredSquareKernel2.execute(y, numElements); + + // Wait for the computations to end; + Value syncStream = context.eval("grcuda", "cudaDeviceSynchronize"); + syncStream.execute(); + + for (int i = 0; i < numElements; i++) { + assertEquals(4.0, x.getArrayElement(i).asFloat(), 0.01); + assertEquals(16.0, y.getArrayElement(i).asFloat(), 0.01); + } + } + } + + /** + 
* Execute two simple kernels on non-default streams, and synchronize each stream independently; + */ + @Test + public void syncStreamsTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream1 = createStream.execute(); + Value stream2 = createStream.execute(); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + y.setArrayElement(i, 4.0); + } + // Set the custom streams; + Value configuredSquareKernel1 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream1); + Value configuredSquareKernel2 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream2); + + configuredSquareKernel1.execute(x, numElements); + configuredSquareKernel2.execute(y, numElements); + + Value syncStream = context.eval("grcuda", "cudaStreamSynchronize"); + syncStream.execute(stream1); + syncStream.execute(stream2); + + for (int i = 0; i < numElements; i++) { + assertEquals(4.0, x.getArrayElement(i).asFloat(), 0.01); + assertEquals(16.0, y.getArrayElement(i).asFloat(), 0.01); + } + } + } + + + @Test + public void streamDestroyTest() { + int numStreams = 8; + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Set<Value> streamSet = new HashSet<>(); + IntStream.range(0, numStreams).forEach(i -> { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream = createStream.execute(); + streamSet.add(stream); + assertNotNull(stream); + }); + Value destroyStream = 
context.eval("grcuda", "cudaStreamDestroy"); + streamSet.forEach(destroyStream::execute); + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ExecutionDAGExportTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ExecutionDAGExportTest.java new file mode 100644 index 00000000..4b36c2e7 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ExecutionDAGExportTest.java @@ -0,0 +1,204 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.runtime; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.GraphExport; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.mock.*; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; + +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; + +import com.nvidia.grcuda.test.util.mock.ArgumentMock; +import com.nvidia.grcuda.test.util.mock.GrCUDAExecutionContextMockBuilder; +import com.nvidia.grcuda.test.util.mock.KernelExecutionMock; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + + +@RunWith(Parameterized.class) +public class ExecutionDAGExportTest { + + @Parameterized.Parameters + public static Collection<Object[]> data() { + return Arrays.asList(new Object[][]{ + {RetrieveNewStreamPolicyEnum.ALWAYS_NEW}, + {RetrieveNewStreamPolicyEnum.REUSE}, + }); + } + + private final RetrieveNewStreamPolicyEnum policy; + + public ExecutionDAGExportTest(RetrieveNewStreamPolicyEnum policy) { + this.policy = policy; + } + + @Test + public void complexFrontierExportTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = 
new AsyncGrCUDAExecutionContextMock(); + + // A(1,2) -> B(1) -> D(1,3) -> E(1,4) -> F(4) + // \----> C(2) + // The final frontier is composed of C(2), D(3), E(1), F(4); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + GraphExport graphExport = new GraphExport(dag); + //graphExport.graphGenerator("../graphComplexFrontierExportTest"); + } + + @Test + public void streamSelection2ExportTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + + // A(1,2) -> B(1) -> D(1,3) + // \----> C(2) + // E(4) -> F(4, 5) + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(4), new ArgumentMock(5))).schedule(); + + ExecutionDAG dag = context.getDag(); + GraphExport graphExport = new GraphExport(dag); + +// if (policy==RetrieveNewStreamPolicyEnum.ALWAYS_NEW){ +// 
graphExport.graphGenerator("../streamSelection2ExportTestAlwaysNew"); +// } else { +// graphExport.graphGenerator("../streamSelection2ExportTestReuse"); +// } + + } + + @Test + public void streamSelectionSimpleWithSyncExportTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + // Create 4 mock kernel executions. In this case, kernel 3 requires 1 and 2 to finish, + // and kernel 4 requires kernel 3 to finish. The final frontier is composed of kernel 3 (arguments "1" and "2" are active), + // and kernel 4 (argument "3" is active); + // A(1) -> C(1, 2, 3) -> D(3) + // B(2) / + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), + new ArgumentMock(2), + new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + + ExecutionDAG dag = context.getDag(); + GraphExport graphExport = new GraphExport(dag); + + +// if (policy==RetrieveNewStreamPolicyEnum.ALWAYS_NEW){ +// graphExport.graphGenerator("../streamSelectionSimpleWithSyncExportTestAlwaysNew"); +// } else { +// graphExport.graphGenerator("../streamSelectionSimpleWithSyncExportTestReuse"); +// } + + } + + @Test + public void disjointArgumentStreamCross2Test() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.policy).setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT).build(); + + // A(1,2,7) -> D(1,3,5) + // X + // B(3,4,8) -> E(2,4,6) + // X + // C(5,6,9) -> F(7,8,9) + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2), new ArgumentMock(7))).schedule(); + new 
KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(3), new ArgumentMock(4), new ArgumentMock(8))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(5), new ArgumentMock(6), new ArgumentMock(9))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(3), new ArgumentMock(5))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(2), new ArgumentMock(4), new ArgumentMock(6))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(7), new ArgumentMock(8), new ArgumentMock(9))).schedule(); + + ExecutionDAG dag = context.getDag(); + GraphExport graphExport = new GraphExport(dag); + +// if (policy==RetrieveNewStreamPolicyEnum.ALWAYS_NEW){ +// graphExport.graphGenerator("../disjointArgumentStreamCross2TestAlwaysNew"); +// } else { +// graphExport.graphGenerator("../disjointArgumentStreamCross2TestReuse"); +// } + } + + @Test + public void syncParentsOfParentsTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.policy).setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT).build(); + + // A(1,2) -> B(1) + // \-> C(2,3) -> D(2) + // \-> E(3) + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(2), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + + ExecutionDAG dag = context.getDag(); + GraphExport graphExport = new GraphExport(dag); + +// if (policy==RetrieveNewStreamPolicyEnum.ALWAYS_NEW){ +// 
graphExport.graphGenerator("../syncParentsOfParentsTestAlwaysNew"); +// } else { +// graphExport.graphGenerator("../syncParentsOfParentsTestReuse"); +// } + } + + +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ExecutionDAGMockTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ExecutionDAGMockTest.java new file mode 100644 index 00000000..4bd6ee21 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/ExecutionDAGMockTest.java @@ -0,0 +1,304 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.runtime; + +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.mock.ArgumentMock; +import com.nvidia.grcuda.test.util.mock.AsyncGrCUDAExecutionContextMock; +import com.nvidia.grcuda.test.util.mock.DeviceArrayMock; +import com.nvidia.grcuda.test.util.mock.GrCUDAExecutionContextMockBuilder; +import com.nvidia.grcuda.test.util.mock.KernelExecutionMock; +import com.nvidia.grcuda.test.util.mock.SyncExecutionMock; +import com.oracle.truffle.api.interop.InvalidArrayIndexException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; + +import java.util.Arrays; +import java.util.Collections; +import java.util.HashSet; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + +public class ExecutionDAGMockTest 
{ + + @Test + public void executionDAGConstructorTest() { + ExecutionDAG dag = new ExecutionDAG(DependencyPolicyEnum.NO_CONST); + assertTrue(dag.getVertices().isEmpty()); + assertTrue(dag.getEdges().isEmpty()); + assertTrue(dag.getFrontier().isEmpty()); + assertEquals(0, dag.getNumVertices()); + assertEquals(0, dag.getNumEdges()); + } + + @Test + public void addVertexToDAGTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(); + // Create two mock kernel executions; + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2), new ArgumentMock(3))).schedule(); + + ExecutionDAG dag = context.getDag(); + + assertEquals(1, dag.getNumVertices()); + assertEquals(0, dag.getNumEdges()); + assertEquals(1, dag.getFrontier().size()); + assertTrue(dag.getFrontier().get(0).isFrontier()); + assertTrue(dag.getFrontier().get(0).isStart()); + + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2), new ArgumentMock(3))).schedule(); + + assertEquals(2, dag.getNumVertices()); + assertEquals(1, dag.getNumEdges()); + assertEquals(1, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(dag.getVertices().get(1), dag.getFrontier().get(0)); + assertFalse(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(1).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertFalse(dag.getVertices().get(1).isStart()); + // Check if the first vertex is a parent of the second; + assertEquals(dag.getVertices().get(0), dag.getVertices().get(1).getParentVertices().get(0)); + // Check if the second vertex is a child of the first; + assertEquals(dag.getVertices().get(1), dag.getVertices().get(0).getChildVertices().get(0)); + } + + @Test + public void dependencyPipelineSimpleMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(); + // 
Create 4 mock kernel executions. In this case, kernel 3 requires 1 and 2 to finish, + // and kernel 4 requires kernel 3 to finish. The final frontier is composed of kernel 3 (arguments "1" and "2" are active), + // and kernel 4 (argument "3" is active); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(4, dag.getNumVertices()); + assertEquals(3, dag.getNumEdges()); + assertEquals(2, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(2), dag.getVertices().get(3))), + new HashSet<>(dag.getFrontier())); + assertFalse(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertFalse(dag.getVertices().get(1).isFrontier()); + assertTrue(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + assertTrue(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + // Check if the third vertex is a child of first and second; + assertEquals(2, dag.getVertices().get(2).getParents().size()); + assertEquals(new HashSet<>(dag.getVertices().get(2).getParentVertices()), + new HashSet<>(Arrays.asList(dag.getVertices().get(0), dag.getVertices().get(1)))); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(0).getChildVertices().get(0)); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(1).getChildVertices().get(0)); + // Check if the fourth vertex is a child of the third; + 
assertEquals(1, dag.getVertices().get(3).getParents().size()); + assertEquals(1, dag.getVertices().get(2).getChildren().size()); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(3).getParentVertices().get(0)); + assertEquals(dag.getVertices().get(3), dag.getVertices().get(2).getChildVertices().get(0)); + } + + @Test + public void complexFrontierMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(); + + // A(1,2) -> B(1) -> D(1,3) -> E(1,4) -> F(4) + // \----> C(2) + // The final frontier is composed of C(2), D(3), E(1), F(4); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(6, dag.getNumVertices()); + assertEquals(5, dag.getNumEdges()); + assertEquals(4, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(2), dag.getVertices().get(3), dag.getVertices().get(4), dag.getVertices().get(5))), + new HashSet<>(dag.getFrontier())); + + assertFalse(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertFalse(dag.getVertices().get(1).isFrontier()); + assertFalse(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); +
assertTrue(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + assertTrue(dag.getVertices().get(4).isFrontier()); + assertFalse(dag.getVertices().get(4).isStart()); + assertTrue(dag.getVertices().get(5).isFrontier()); + assertFalse(dag.getVertices().get(5).isStart()); + } + + @Test + public void complexFrontierWithSyncMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum.NO_CONST, + RetrieveNewStreamPolicyEnum.REUSE, RetrieveParentStreamPolicyEnum.DISJOINT); + + // This time, simulate the synchronization process between kernels; + // A(1,2) -> B(1) -> D(1,3) -> E(1,4) -> F(4) + // \-> C(2) + // Synchronize C, then F + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(6, dag.getNumVertices()); + assertEquals(5, dag.getNumEdges()); + assertEquals(4, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(2), dag.getVertices().get(3), + dag.getVertices().get(4), dag.getVertices().get(5))), + new HashSet<>(dag.getFrontier())); + + assertFalse(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertFalse(dag.getVertices().get(1).isFrontier()); + assertFalse(dag.getVertices().get(1).isStart()); + 
assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + assertTrue(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + assertTrue(dag.getVertices().get(4).isFrontier()); + assertFalse(dag.getVertices().get(4).isStart()); + assertTrue(dag.getVertices().get(5).isFrontier()); + assertFalse(dag.getVertices().get(5).isStart()); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + assertEquals(3, dag.getFrontier().size()); + assertFalse(dag.getVertices().get(2).isFrontier()); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + assertEquals(0, dag.getFrontier().size()); + assertFalse(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(4).isFrontier()); + assertFalse(dag.getVertices().get(5).isFrontier()); + } + + @Test + public void concurrentReadMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setArchitecturePascalOrNewer(true).build(); + + // This time, simulate a computation on the GPU, and a concurrent CPU read. + // As the array is not modified, there should be no dependency between them. 
+ // However, we have to schedule the write to ensure that the GPU computation has finished before we update data; + DeviceArrayMock x = new DeviceArrayMock(context); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(x, true))).schedule(); + assertTrue(x.canSkipSchedulingRead()); + assertFalse(x.canSkipSchedulingWrite()); + } + + @Test + public void concurrentReadMockTest2() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setArchitecturePascalOrNewer(false).build(); + // This time, simulate a computation on the GPU, and a concurrent CPU read & write. + // As the GPU is pre-pascal, and we are running the kernel on the default stream, we must have a sync; + DeviceArrayMock x = new DeviceArrayMock(context); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(x, true))).schedule(); + assertFalse(x.canSkipSchedulingRead()); + assertFalse(x.canSkipSchedulingWrite()); + } + + @Test + public void concurrentReadNoConstMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setDependencyPolicy(DependencyPolicyEnum.NO_CONST) + .setArchitecturePascalOrNewer(true).build(); + // This time, simulate a computation on the GPU, and a concurrent CPU read & write. 
+ // As we are not considering "const", there should be a dependency; + DeviceArrayMock x = new DeviceArrayMock(context); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(x, true))).schedule(); + assertFalse(x.canSkipSchedulingRead()); + assertFalse(x.canSkipSchedulingWrite()); + } + + // Test that if we have a kernel that uses an array read-only, and we schedule a write on CPU, + // the scheduling of the write is not skipped and we have a dependency between the kernel and the write; + @Test + public void writeIsNotSkippedMockTest() throws UnsupportedTypeException, InvalidArrayIndexException { + AsyncGrCUDAExecutionContextMock context = new GrCUDAExecutionContextMockBuilder() + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setRetrieveNewStreamPolicy(RetrieveNewStreamPolicyEnum.REUSE) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setArchitecturePascalOrNewer(true).build(); + + DeviceArray array1 = new DeviceArrayMock(context); + ExecutionDAG dag = context.getDag(); + // K1(const A1); + KernelExecutionMock k = new KernelExecutionMock(context, Collections.singletonList(new ComputationArgumentWithValue("array1", Type.NFI_POINTER, ComputationArgument.Kind.POINTER_IN, array1))); + k.schedule(); + assertEquals(2, array1.getArrayUpToDateLocations().size()); + assertTrue(array1.isArrayUpdatedInLocation(0)); + assertTrue(array1.isArrayUpdatedOnCPU()); + // Write on the array; + array1.writeArrayElement(0, 0, null, null); + // Check that the array update status is tracked correctly; + assertEquals(1, array1.getArrayUpToDateLocations().size()); + assertTrue(array1.isArrayUpdatedOnCPU()); + // Check the existence of a dependency; + assertEquals(2, dag.getNumVertices()); + assertEquals(1, dag.getNumEdges()); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/StreamAttachTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/StreamAttachTest.java
new file mode 100644 index 00000000..77c6cc09 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/StreamAttachTest.java @@ -0,0 +1,374 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.runtime; + +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; + +import java.util.HashSet; +import java.util.Set; +import java.util.stream.IntStream; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertTrue; + +public class StreamAttachTest { + + /** + * Simply check if we can attach an array to a CUDA stream without blowing things up! + */ + @Test + public void attachStreamSimpleTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream = createStream.execute(); + + final int numElements = 100; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + + Value streamAttach = context.eval("grcuda", "cudaStreamAttachMemAsync"); + assertNotNull(streamAttach); + assertTrue(streamAttach.canExecute()); + streamAttach.execute(stream, x); + + // Synchronize and destroy the stream; + Value streamSync = context.eval("grcuda", "cudaStreamSynchronize"); + Value streamDestroy = context.eval("grcuda", "cudaStreamDestroy"); + streamSync.execute(stream); + streamDestroy.execute(stream); + } + } + + /** + * Check that we can attach many different streams to different arrays; + */ + @Test + public void attachManyStreamsTest() { + int numStreams = 8; + Set streamSet = new HashSet<>(); + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value streamAttach = context.eval("grcuda", "cudaStreamAttachMemAsync"); + Value streamSync = context.eval("grcuda", "cudaStreamSynchronize"); + Value streamDestroy = context.eval("grcuda",
"cudaStreamDestroy"); + final int numElements = 100; + + IntStream.range(0, numStreams).forEach(i -> { + Value x = deviceArrayConstructor.execute("float", numElements); + Value stream = createStream.execute(); + streamAttach.execute(stream, x); + streamSet.add(stream); + }); + // Sync and destroy; + streamSet.forEach(s -> { + streamSync.execute(s); + streamDestroy.execute(s); + }); + } + } + + /** + * Check that we can attach multiple arrays to the same stream; + */ + @Test + public void attachManyArraysToStreamTest() { + int numArrays = 4; + + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value streamAttach = context.eval("grcuda", "cudaStreamAttachMemAsync"); + Value streamSync = context.eval("grcuda", "cudaStreamSynchronize"); + Value streamDestroy = context.eval("grcuda", "cudaStreamDestroy"); + final int numElements = 100; + + Value stream = createStream.execute(); + + IntStream.range(0, numArrays).forEach(i -> { + Value x = deviceArrayConstructor.execute("float", numElements); + streamAttach.execute(stream, x); + }); + // Sync and destroy; + streamSync.execute(stream); + streamDestroy.execute(stream); + } + } + + /** + * Check that we can attach the same array to multiple streams, in sequence; + */ + @Test + public void attachManyStreamsToArrayTest() { + int numStreams = 4; + Set streamSet = new HashSet<>(); + + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value streamAttach = context.eval("grcuda", "cudaStreamAttachMemAsync"); + Value streamSync = context.eval("grcuda", "cudaStreamSynchronize"); + Value streamDestroy = context.eval("grcuda", "cudaStreamDestroy"); + final int numElements = 100; + Value x = 
deviceArrayConstructor.execute("float", numElements); + + + IntStream.range(0, numStreams).forEach(i -> { + Value stream = createStream.execute(); + streamAttach.execute(stream, x); + streamSync.execute(stream); + streamSet.add(stream); + }); + // Sync and destroy; + streamSet.forEach(streamDestroy::execute); + } + } + + private static final int NUM_THREADS_PER_BLOCK = 32; + + private static final String SQUARE_KERNEL = + "extern \"C\" __global__ void square(float* x, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " x[idx] = x[idx] * x[idx];\n" + + " }" + + "}\n"; + + /** + * Execute a simple kernel on a non-default stream with attached memory; + */ + @Test + public void useAttachedStreamTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value streamAttach = context.eval("grcuda", "cudaStreamAttachMemAsync"); + Value streamSync = context.eval("grcuda", "cudaStreamSynchronize"); + Value streamDestroy = context.eval("grcuda", "cudaStreamDestroy"); + + Value stream = createStream.execute(); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value x = deviceArrayConstructor.execute("float", numElements); + + // Attach the array to the stream; + streamAttach.execute(stream, x); + streamSync.execute(stream); + + // Build the kernel; + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + // Execute the kernel on the custom stream; + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream); + configuredSquareKernel.execute(x, numElements); + streamSync.execute(stream); + + for 
(int i = 0; i < numElements; i++) { + assertEquals(4.0, x.getArrayElement(i).asFloat(), 0.01); + } + + streamDestroy.execute(stream); + } + } + + /** + * Execute two simple kernels on non-default streams with attached memory. Array reads synchronize only a single stream, + * which would cause errors if executed on non-attached memory in pre-Pascal GPUs; + */ + @Test + public void useTwoStreamsTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value streamAttach = context.eval("grcuda", "cudaStreamAttachMemAsync"); + Value streamSync = context.eval("grcuda", "cudaStreamSynchronize"); + Value streamDestroy = context.eval("grcuda", "cudaStreamDestroy"); + + Value stream1 = createStream.execute(); + Value stream2 = createStream.execute(); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + + // Attach each array to its stream; + streamAttach.execute(stream1, x); + streamSync.execute(stream1); + streamAttach.execute(stream2, y); + streamSync.execute(stream2); + + // Build the kernel; + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + y.setArrayElement(i, 4.0); + } + // Execute each kernel on its own stream; + Value configuredSquareKernel1 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream1); + configuredSquareKernel1.execute(x, numElements); + Value configuredSquareKernel2 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream2); + configuredSquareKernel2.execute(y, numElements); + + // Sync just one stream
before accessing the data; + Value syncStream = context.eval("grcuda", "cudaStreamSynchronize"); + syncStream.execute(stream1); + assertEquals(4.0, x.getArrayElement(0).asFloat(), 0.01); + syncStream.execute(stream2); + assertEquals(16.0, y.getArrayElement(0).asFloat(), 0.01); + + // Check the other values; + for (int i = 1; i < numElements; i++) { + assertEquals(4.0, x.getArrayElement(i).asFloat(), 0.01); + assertEquals(16.0, y.getArrayElement(i).asFloat(), 0.01); + } + + streamDestroy.execute(stream1); + streamDestroy.execute(stream2); + } + } + + /** + * Execute two simple kernels on non-default streams, and synchronize each stream independently; + */ + @Test + public void syncStreamsTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value stream1 = createStream.execute(); + Value stream2 = createStream.execute(); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + y.setArrayElement(i, 4.0); + } + // Set the custom streams; + Value configuredSquareKernel1 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream1); + Value configuredSquareKernel2 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream2); + + configuredSquareKernel1.execute(x, numElements); + configuredSquareKernel2.execute(y, numElements); + + Value syncStream = context.eval("grcuda", "cudaStreamSynchronize"); + syncStream.execute(stream1); + syncStream.execute(stream2); + + for (int i = 0; i <
numElements; i++) { + assertEquals(4.0, x.getArrayElement(i).asFloat(), 0.01); + assertEquals(16.0, y.getArrayElement(i).asFloat(), 0.01); + } + } + } + + /** + * Execute a simple kernel on a non-default stream with attached memory, + * then move the memory back to the default stream and execute another kernel; + */ + @Test + public void useDefaultAttachedStreamTest() { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { + Value createStream = context.eval("grcuda", "cudaStreamCreate"); + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value streamAttach = context.eval("grcuda", "cudaStreamAttachMemAsync"); + Value streamSync = context.eval("grcuda", "cudaStreamSynchronize"); + Value streamDestroy = context.eval("grcuda", "cudaStreamDestroy"); + + Value stream = createStream.execute(); + + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value x = deviceArrayConstructor.execute("float", numElements); + + // Attach the array to the stream; + streamAttach.execute(stream, x); + streamSync.execute(stream); + + // Build the kernel; + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + // Execute the kernel on the custom stream; + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK, stream); + configuredSquareKernel.execute(x, numElements); + streamSync.execute(stream); + + for (int i = 0; i < numElements; i++) { + assertEquals(4.0, x.getArrayElement(i).asFloat(), 0.01); + } + + // Reset the visibility of the array; + streamAttach.execute(stream, x, 0x01); + configuredSquareKernel.execute(x, numElements); + + streamSync.execute(stream); + + for (int i = 0; i < numElements; i++) { + assertEquals(16.0, x.getArrayElement(i).asFloat(), 0.01); + } + + // Reset
the array to use the default stream, by providing just the array; + streamAttach.execute(x); + Value configuredSquareKernel2 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + configuredSquareKernel2.execute(x, numElements); + + Value deviceSync = context.eval("grcuda", "cudaDeviceSynchronize"); + deviceSync.execute(); + + for (int i = 0; i < numElements; i++) { + assertEquals(256, x.getArrayElement(i).asFloat(), 0.01); + } + + streamDestroy.execute(stream); + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/DeviceArrayLocationMockTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/DeviceArrayLocationMockTest.java new file mode 100644 index 00000000..d9b7a95b --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/DeviceArrayLocationMockTest.java @@ -0,0 +1,157 @@ +package com.nvidia.grcuda.test.runtime.array; + +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArray; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArrayView; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.test.util.mock.ArgumentMock; +import com.nvidia.grcuda.test.util.mock.AsyncGrCUDAExecutionContextMock; +import com.nvidia.grcuda.test.util.mock.DeviceArrayMock; +import com.nvidia.grcuda.test.util.mock.DeviceArrayReadExecutionMock; +import com.nvidia.grcuda.test.util.mock.DeviceArrayWriteExecutionMock; +import com.nvidia.grcuda.test.util.mock.GrCUDAExecutionContextMockBuilder; +import com.nvidia.grcuda.test.util.mock.KernelExecutionMock; +import com.nvidia.grcuda.test.util.mock.MultiDimDeviceArrayMock; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; + +import java.util.Arrays; + +import static 
org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +public class DeviceArrayLocationMockTest { + + @Test + public void testIfInitializedCorrectlyPrePascal() { + AsyncGrCUDAExecutionContextMock context = new GrCUDAExecutionContextMockBuilder().setArchitecturePascalOrNewer(false).build(); + DeviceArray array1 = new DeviceArrayMock(context); + DeviceArray array2 = new DeviceArrayMock(context); + assertEquals(1, array1.getArrayUpToDateLocations().size()); + assertEquals(1, array2.getArrayUpToDateLocations().size()); + assertEquals(array1.getArrayUpToDateLocations(), array2.getArrayUpToDateLocations()); + assertTrue(array1.getArrayUpToDateLocations().contains(context.getCurrentGPU())); + } + + @Test + public void testIfInitializedCorrectlyPostPascal() { + AsyncGrCUDAExecutionContextMock context = new GrCUDAExecutionContextMockBuilder().setArchitecturePascalOrNewer(true).build(); + DeviceArray array1 = new DeviceArrayMock(context); + DeviceArray array2 = new DeviceArrayMock(context); + assertEquals(1, array1.getArrayUpToDateLocations().size()); + assertEquals(1, array2.getArrayUpToDateLocations().size()); + assertEquals(array1.getArrayUpToDateLocations(), array2.getArrayUpToDateLocations()); + assertTrue(array1.isArrayUpdatedInLocation(CPUDevice.CPU_DEVICE_ID)); + } + + @Test + public void testIfLocationAdded() { + AsyncGrCUDAExecutionContextMock context = new GrCUDAExecutionContextMockBuilder() + .setArchitecturePascalOrNewer(true) + .setNumberOfAvailableGPUs(1) + .setNumberOfGPUsToUse(1).build(); + DeviceArray array1 = new DeviceArrayMock(context); + array1.addArrayUpToDateLocations(2); + assertEquals(2, array1.getArrayUpToDateLocations().size()); + assertTrue(array1.isArrayUpdatedInLocation(2)); + } + + @Test + public void testIfLocationReset() { + AsyncGrCUDAExecutionContextMock context = new GrCUDAExecutionContextMockBuilder().setArchitecturePascalOrNewer(true) + .setNumberOfAvailableGPUs(1) + .setNumberOfGPUsToUse(1).build(); + DeviceArray 
array1 = new DeviceArrayMock(context); + array1.resetArrayUpToDateLocations(2); + assertEquals(1, array1.getArrayUpToDateLocations().size()); + assertTrue(array1.isArrayUpdatedInLocation(2)); + } + + /** + * Test that, when using multi-dimensional arrays, the array views' locations are propagated correctly; + */ + @Test + public void testMultiDimLocation() { + AsyncGrCUDAExecutionContextMock context = new GrCUDAExecutionContextMockBuilder().setArchitecturePascalOrNewer(true) + .setNumberOfAvailableGPUs(2) + .setNumberOfGPUsToUse(2).build(); + + long[] dimensions = {2, 2}; + MultiDimDeviceArray array1 = new MultiDimDeviceArrayMock(context, dimensions, false); + assertTrue(array1.isArrayUpdatedOnCPU()); + array1.addArrayUpToDateLocations(2); + array1.addArrayUpToDateLocations(3); + assertEquals(3, array1.getArrayUpToDateLocations().size()); + // Create a view, parameters don't matter; + MultiDimDeviceArrayView view = new MultiDimDeviceArrayView(array1, 1, 0, 0); + assertEquals(array1.getArrayUpToDateLocations(), view.getArrayUpToDateLocations()); + // Add a location to the view, check that it is propagated; + view.addArrayUpToDateLocations(4); + assertEquals(4, view.getArrayUpToDateLocations().size()); + assertEquals(array1.getArrayUpToDateLocations(), view.getArrayUpToDateLocations()); + // Reset locations on the view, check that the parent is updated; + view.resetArrayUpToDateLocations(10); + assertEquals(1, view.getArrayUpToDateLocations().size()); + assertEquals(array1.getArrayUpToDateLocations(), view.getArrayUpToDateLocations()); + // Reset locations on the parent, check that the view is updated; + array1.resetArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + assertEquals(1, view.getArrayUpToDateLocations().size()); + assertTrue(view.isArrayUpdatedOnCPU()); + assertTrue(view.isArrayUpdatedInLocation(CPUDevice.CPU_DEVICE_ID)); + assertEquals(array1.getArrayUpToDateLocations(), view.getArrayUpToDateLocations()); + // Reset locations on the view (again), but also 
on the parent, and check consistency; + view.resetArrayUpToDateLocations(10); + array1.addArrayUpToDateLocations(2); + assertEquals(2, view.getArrayUpToDateLocations().size()); + assertEquals(array1.getArrayUpToDateLocations(), view.getArrayUpToDateLocations()); + } + + /** + * Test that the location of arrays in a complex DAG is propagated correctly, also when using 2 GPUs; + */ + @Test + public void complexFrontierWithSyncMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContextMock context = new GrCUDAExecutionContextMockBuilder() + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setArchitecturePascalOrNewer(true) + .setNumberOfAvailableGPUs(2) + .setNumberOfGPUsToUse(2).build(); + + DeviceArray array1 = new DeviceArrayMock(context); + DeviceArray array2 = new DeviceArrayMock(context); + DeviceArray array3 = new DeviceArrayMock(context); + // K1(const A1, A2); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(array1, true), new ArgumentMock(array2, false))).schedule(); + assertEquals(2, array1.getArrayUpToDateLocations().size()); + assertTrue(array1.isArrayUpdatedInLocation(0)); + assertTrue(array1.isArrayUpdatedOnCPU()); + assertEquals(1, array2.getArrayUpToDateLocations().size()); + assertTrue(array2.isArrayUpdatedInLocation(0)); + // Set another GPU; + // K2(const A1, A3); + context.setCurrentGPU(1); + KernelExecutionMock k2 = new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(array1, true), new ArgumentMock(array3, false))); + k2.setStream(new CUDAStream(0, 0, context.getCurrentGPU())); + k2.schedule(); + assertEquals(3, array1.getArrayUpToDateLocations().size()); + assertTrue(array1.isArrayUpdatedInLocation(1)); + assertEquals(1, array3.getArrayUpToDateLocations().size()); + assertTrue(array3.isArrayUpdatedInLocation(1)); + assertEquals(1, array2.getArrayUpToDateLocations().size()); // A2 is unmodified; + assertTrue(array2.isArrayUpdatedInLocation(0)); + // Write on 2 arrays, read on another array. 
The CPU will be the exclusive owner of the first 2 arrays, + // and share with GPU 0 the other array; + new DeviceArrayWriteExecutionMock(array1, 0, 0).schedule(); + new DeviceArrayReadExecutionMock(array2, 0).schedule(); + new DeviceArrayWriteExecutionMock(array3, 0, 0).schedule(); + assertEquals(1, array1.getArrayUpToDateLocations().size()); + assertTrue(array1.isArrayUpdatedOnCPU()); + assertEquals(1, array3.getArrayUpToDateLocations().size()); + assertTrue(array3.isArrayUpdatedOnCPU()); + assertEquals(2, array2.getArrayUpToDateLocations().size()); + assertTrue(array2.isArrayUpdatedOnCPU()); + assertTrue(array2.isArrayUpdatedInLocation(0)); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/DeviceArrayTest.java similarity index 87% rename from projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayTest.java rename to projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/DeviceArrayTest.java index e0a358f7..970f3d5b 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/DeviceArrayTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/DeviceArrayTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -25,13 +32,16 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ -package com.nvidia.grcuda.test; +package com.nvidia.grcuda.test.runtime.array; import java.util.Arrays; import java.util.Collection; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; + +import com.nvidia.grcuda.test.util.GrCUDATestUtil; import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.PolyglotAccess; import org.graalvm.polyglot.Value; import org.junit.Test; import org.junit.runner.RunWith; @@ -64,7 +74,7 @@ public DeviceArrayTest(String dataTypeString, Object testValue, int arrayLength) @Test public void testDeviceArrayCreationFromArrayExpression() { - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().allowPolyglotAccess(PolyglotAccess.ALL).build()) { Value deviceArray = context.eval("grcuda", dataTypeString + "[" + arrayLength + "]"); assertTrue(deviceArray.hasArrayElements()); assertEquals(arrayLength, deviceArray.getArraySize()); @@ -73,7 +83,7 @@ public void testDeviceArrayCreationFromArrayExpression() { @Test public void testDeviceArrayCreationFromDeviceArrayConstructor() { - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayFunc = context.eval("grcuda", "DeviceArray"); Value deviceArray = deviceArrayFunc.execute(dataTypeString, arrayLength); assertTrue(deviceArray.hasArrayElements()); @@ -139,7 +149,7 @@ 
private void setElement(Value array, int index, Number value) { @Test public void testDeviceArrayGetValue() { - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArray = context.eval("grcuda", dataTypeString + "[" + arrayLength + "]"); assertTrue(deviceArray.hasArrayElements()); assertEquals(arrayLength, deviceArray.getArraySize()); @@ -150,7 +160,7 @@ public void testDeviceArrayGetValue() { @Test public void testDeviceArraySetAndGetsSetValue() { - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArray = context.eval("grcuda", dataTypeString + "[" + arrayLength + "]"); assertTrue(deviceArray.hasArrayElements()); assertEquals(arrayLength, deviceArray.getArraySize()); diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/MultiDimArrayTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/MultiDimArrayTest.java similarity index 91% rename from projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/MultiDimArrayTest.java rename to projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/MultiDimArrayTest.java index 690f7579..9d92e808 100644 --- a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/MultiDimArrayTest.java +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/array/MultiDimArrayTest.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -25,19 +32,21 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.test; +package com.nvidia.grcuda.test.runtime.array; -import static org.junit.Assert.assertEquals; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; import org.graalvm.polyglot.Context; import org.graalvm.polyglot.Value; import org.junit.Test; +import static org.junit.Assert.assertEquals; + public class MultiDimArrayTest { @Test public void test2DimArrayRowMajorFromConstructor() { // 2-dimensional array through DeviceArray constructor (row-major) - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayContructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 19; final int numDim2 = 53; @@ -60,7 +69,7 @@ public void test2DimArrayRowMajorFromConstructor() { @Test public void test2DimArrayRowMajorFromPolyglotExpr() { // 2-dimensional array through polyglot expression "int[19][53]" - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { final int numDim1 = 19; final int numDim2 = 53; String code = String.format("int[%d][%d]", numDim1, numDim2); @@ -83,7 +92,7 @@ public void test2DimArrayRowMajorFromPolyglotExpr() { @Test public void test2DimArrayColMajorFromConstructor() { // 2-dimensional array through DeviceArray constructor (column-major) - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 19; final int numDim2 = 53; @@ -105,7 +114,7 @@ public void test2DimArrayColMajorFromConstructor() { @Test public void test3DimArrayRowMajorFromConstructor() { - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { // 3-dimensional array through DeviceArray 
constructor (row-major) Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 5; @@ -136,13 +145,14 @@ public void test3DimArrayRowMajorFromConstructor() { @Test public void test3DimArrayColMajorFromConstructor() { // 3-dimensional array through DeviceArray constructor (column-major) - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 5; final int numDim2 = 3; final int numDim3 = 2; Value matrix = deviceArrayConstructor.execute("int", numDim1, numDim2, numDim3, "F"); assertEquals(numDim1, matrix.getArraySize()); + Value a = matrix.getArrayElement(0); assertEquals(numDim2, matrix.getArrayElement(0).getArraySize()); assertEquals(numDim3, matrix.getArrayElement(0).getArrayElement(0).getArraySize()); for (int i = 0; i < numDim1; i++) { @@ -166,7 +176,7 @@ public void test3DimArrayColMajorFromConstructor() { @Test public void test4DimArrayRowMajorFromConstructor() { // 4-dimensional array through DeviceArray constructor (row-major) - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 7; final int numDim2 = 5; @@ -203,7 +213,7 @@ public void test4DimArrayRowMajorFromConstructor() { @Test public void test4DimArrayColMajorFromConstructor() { // 4-dimensional array through DeviceArray constructor (column-major) - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 7; final int numDim2 = 5; @@ -239,7 +249,7 @@ public void test4DimArrayColMajorFromConstructor() { @Test(expected = 
ArrayIndexOutOfBoundsException.class) public void test2DimArrayOutOfBoundsOnReadAccess() { - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 19; final int numDim2 = 53; @@ -253,7 +263,7 @@ public void test2DimArrayOutOfBoundsOnReadAccess() { @Test(expected = ArrayIndexOutOfBoundsException.class) public void test2DimArrayOutOfBoundsOnWriteAccess() { - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 19; final int numDim2 = 53; @@ -279,7 +289,7 @@ public void test2DimArrayOutOfBoundsOnWriteAccess() { @Test public void test2DimArrayAsKernelArgument() { - try (Context context = Context.newBuilder().allowAllAccess(true).build()) { + try (Context context = GrCUDATestUtil.buildTestContext().build()) { final Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); final int numDim1 = 19; final int numDim2 = 53; @@ -291,6 +301,7 @@ public void test2DimArrayAsKernelArgument() { matrix.getArrayElement(i).setArrayElement(j, i * numDim2 + j); } } + final Value buildKernel = context.eval("grcuda", "buildkernel"); final Value kernel = buildKernel.execute(INC2D_KERNEL_SOURCE, "inc2d", INC2D_KERNEL_SIGNATURE); final int blocks = 80; diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAComputationsWithGPU.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAComputationsWithGPU.java new file mode 100644 index 00000000..51873458 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAComputationsWithGPU.java @@ -0,0 +1,370 @@ +package 
com.nvidia.grcuda.test.runtime.executioncontext; + +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; + +/** + * Class that contains static functions that are used as building blocks for GrCUDA tests that use the GPU; + */ +public class GrCUDAComputationsWithGPU { + + static final int NUM_THREADS_PER_BLOCK = 32; + + static final String SQUARE_WITH_CONST = + "extern \"C\" __global__ void square(const float* x, float *y, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " y[idx] = x[idx] * x[idx];\n" + + " }" + + "}\n"; + + static final String DIFF_KERNEL = + "extern \"C\" __global__ void diff(float* x, float* y, float* z, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " z[idx] = x[idx] - y[idx];\n" + + " }\n" + + "}"; + + static final String REDUCE_KERNEL = + "extern \"C\" __global__ void reduce(float *x, float *res, int n) {\n" + + " __shared__ float cache[" + NUM_THREADS_PER_BLOCK + "];\n" + + " int i = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (i < n) {\n" + + " cache[threadIdx.x] = x[i];\n" + + " }\n" + + " __syncthreads();\n" + + " i = " + NUM_THREADS_PER_BLOCK + " / 2;\n" + + " while (i > 0) {\n" + + " if (threadIdx.x < i) {\n" + + " cache[threadIdx.x] += cache[threadIdx.x + i];\n" + + " }\n" + + " __syncthreads();\n" + + " i /= 2;\n" + + " }\n" + + " if (threadIdx.x == 0) {\n" + + " atomicAdd(res, cache[0]);\n" + + " }\n" + + "}"; + + static final String SQUARE_INPLACE_KERNEL = + "extern \"C\" __global__ void square(float* x, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " x[idx] = x[idx] * x[idx];\n" + + " }" + + "}\n"; + + static final String DIFF_SINGLE_KERNEL = + "extern \"C\" __global__ void diff(const float* x, float* z, float val, int n) {\n" + + " int idx = blockIdx.x * 
blockDim.x + threadIdx.x;\n" + + "        if (idx < n) {\n" + + "            z[idx] = x[idx] - val;\n" + + "        }\n" + + "}"; + + /** + * Test a simple join pattern with the following shape; + * (X) --> (Z) + * (Y) -/ + * @param context a GrCUDA context with the specified options + */ + static void simpleJoin(Context context) { + // FIXME: this test fails randomly with small values (< 100000, more or less), + // but the same computation doesn't fail in Graalpython. + final int numElements = 100000; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + Value res = deviceArrayConstructor.execute("float", 1); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 1.0 / (i + 1)); + y.setArrayElement(i, 2.0 / (i + 1)); + } + res.setArrayElement(0, 0.0); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_INPLACE_KERNEL, "square", "pointer, sint32"); + Value diffKernel = buildkernel.execute(DIFF_KERNEL, "diff", "const pointer, const pointer, pointer, sint32"); + Value reduceKernel = buildkernel.execute(REDUCE_KERNEL, "reduce", "const pointer, pointer, sint32"); + assertNotNull(squareKernel); + assertNotNull(diffKernel); + assertNotNull(reduceKernel); + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredDiffKernel = diffKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredReduceKernel = reduceKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, numElements); + configuredSquareKernel.execute(y, numElements); + configuredDiffKernel.execute(x, y, z, numElements); + 
configuredReduceKernel.execute(z, res, numElements); + + // Verify the output; + float resScalar = res.getArrayElement(0).asFloat(); + assertEquals(-4.93, resScalar, 0.01); + } + + /** + * Test a simple join pattern with the following shape. In this case, also test the copy functions by moving data + * from X, Y into X2, Y2; + * (X) -copy-> (X2) --> (Z) + * (Y) -copy-> (Y2) -/ + * @param context a GrCUDA context with the specified options + */ + static void arrayCopyWithJoin(Context context) { + final int numElements = 100000; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + Value x2 = deviceArrayConstructor.execute("float", numElements); + Value y2 = deviceArrayConstructor.execute("float", numElements); + Value res = deviceArrayConstructor.execute("float", 1); + Value res2 = deviceArrayConstructor.execute("float", 1); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 1.0 / (i + 1)); + y.setArrayElement(i, 2.0 / (i + 1)); + } + res.setArrayElement(0, 0.0); + + x2.invokeMember("copyFrom", x, numElements); + y2.invokeMember("copyFrom", y, numElements); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_INPLACE_KERNEL, "square", "pointer, sint32"); + Value diffKernel = buildkernel.execute(DIFF_KERNEL, "diff", "const pointer, const pointer, pointer, sint32"); + Value reduceKernel = buildkernel.execute(REDUCE_KERNEL, "reduce", "const pointer, pointer, sint32"); + assertNotNull(squareKernel); + assertNotNull(diffKernel); + assertNotNull(reduceKernel); + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredDiffKernel = 
diffKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredReduceKernel = reduceKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x2, numElements); + configuredSquareKernel.execute(y2, numElements); + configuredDiffKernel.execute(x2, y2, z, numElements); + configuredReduceKernel.execute(z, res, numElements); + + res.invokeMember("copyTo", res2, 1); + + // Verify the output; + float resScalar = res2.getArrayElement(0).asFloat(); + assertEquals(-4.93, resScalar, 0.01); + } + + /** + * Execute two kernels with read only arguments; + * (X) --> (Y) + * \-> (Z) + * @param context a GrCUDA context with the specified options + */ + static void parallelKernelsWithReadOnlyArgs(Context context) { + final int numElements = 10; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_WITH_CONST, "square", "const pointer, pointer, sint32"); + + assertNotNull(squareKernel); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, y, numElements); + configuredSquareKernel.execute(x, z, numElements); + + // Verify the output; + assertEquals(4.0, y.getArrayElement(0).asFloat(), 0.1); + assertEquals(4.0, z.getArrayElement(0).asFloat(), 0.1); + assertEquals(4.0, y.getArrayElement(numElements - 1).asFloat(), 0.1); + assertEquals(4.0, z.getArrayElement(numElements - 1).asFloat(), 0.1); + } + + /** + * Execute a fork, with read 
only arguments. + * Read the input array x before syncing the computation. Depending on the GPU, this might sync the device; + * (X) --> (Y) + * \-> (Z) + * @param context a GrCUDA context with the specified options + */ + static void simpleForkReadInput(Context context) { + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_WITH_CONST, "square", "const pointer, pointer, sint32"); + + assertNotNull(squareKernel); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, y, numElements); + configuredSquareKernel.execute(x, z, numElements); + + // Read the array x before syncing the computation. 
Depending on the GPU, this might sync the device; + for (int i = 0; i < numElements; ++i) { + assertEquals(2.0, x.getArrayElement(i).asFloat(), 0.1); + } + + // Verify the output; + assertEquals(4.0, y.getArrayElement(0).asFloat(), 0.1); + assertEquals(4.0, z.getArrayElement(0).asFloat(), 0.1); + assertEquals(4.0, y.getArrayElement(numElements - 1).asFloat(), 0.1); + assertEquals(4.0, z.getArrayElement(numElements - 1).asFloat(), 0.1); + } + + /** + * Execute a fork, with a read only argument in the children kernels; + * A(1) --> B(1r, 2) + * \-> C(1r, 3) + * @param context a GrCUDA context with the specified options + */ + static void forkWithReadOnly(Context context) { + final int numElements = 10; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_INPLACE_KERNEL, "square", "pointer, sint32"); + Value diffKernel = buildkernel.execute(DIFF_SINGLE_KERNEL, "diff", "const pointer, pointer, float, sint32"); + + assertNotNull(squareKernel); + assertNotNull(diffKernel); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredDiffKernel = diffKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, numElements); + configuredDiffKernel.execute(x, y, 1.0, numElements); + configuredDiffKernel.execute(x, z, 1.0, numElements); + + // Verify the output; + assertEquals(3.0, y.getArrayElement(0).asFloat(), 0.1); + assertEquals(3.0, z.getArrayElement(0).asFloat(), 0.1); + 
assertEquals(3.0, y.getArrayElement(numElements - 1).asFloat(), 0.1); + assertEquals(3.0, z.getArrayElement(numElements - 1).asFloat(), 0.1); + assertEquals(4.0, x.getArrayElement(0).asFloat(), 0.1); + assertEquals(4.0, x.getArrayElement(numElements - 1).asFloat(), 0.1); + } + + /** + * Execute a diamond, i.e. a fork with read-only arguments in the children followed by a join; + * A(1) --> B(1r, 2) -> D(1) + * \-> C(1r, 3) -/ + * @param context a GrCUDA context with the specified options + */ + static void dependencyPipelineDiamond(Context context) { + final int numElements = 10; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_INPLACE_KERNEL, "square", "pointer, sint32"); + Value diffKernel = buildkernel.execute(DIFF_SINGLE_KERNEL, "diff", "const pointer, pointer, float, sint32"); + + assertNotNull(squareKernel); + assertNotNull(diffKernel); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredDiffKernel = diffKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, numElements); + configuredDiffKernel.execute(x, y, 1.0, numElements); + configuredDiffKernel.execute(x, z, 1.0, numElements); + configuredSquareKernel.execute(x, numElements); + + // Verify the output; + assertEquals(16.0, x.getArrayElement(0).asFloat(), 0.1); + assertEquals(16.0, x.getArrayElement(numElements - 1).asFloat(), 0.1); + assertEquals(3.0, y.getArrayElement(0).asFloat(), 0.1); + assertEquals(3.0, 
z.getArrayElement(0).asFloat(), 0.1); + assertEquals(3.0, y.getArrayElement(numElements - 1).asFloat(), 0.1); + assertEquals(3.0, z.getArrayElement(numElements - 1).asFloat(), 0.1); + } + + /** + * Execute a join followed by an extra kernel, using read-only arguments; + * (X) --> (Z) -> (R) + * (Y) -/ + * @param context a GrCUDA context with the specified options + */ + static void joinWithExtraKernel(Context context) { + final int numElements = 10000; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + Value w = deviceArrayConstructor.execute("float", numElements); + Value res = deviceArrayConstructor.execute("float", 1); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 1.0 / (i + 1)); + } + res.setArrayElement(0, 0.0); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_WITH_CONST, "square", "const pointer, pointer, sint32"); + Value diffKernel = buildkernel.execute(DIFF_KERNEL, "diff", "const pointer, const pointer, pointer, sint32"); + Value reduceKernel = buildkernel.execute(REDUCE_KERNEL, "reduce", "const pointer, pointer, sint32"); + assertNotNull(squareKernel); + assertNotNull(diffKernel); + assertNotNull(reduceKernel); + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredDiffKernel = diffKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredReduceKernel = reduceKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, y, numElements); + configuredSquareKernel.execute(x, z, numElements); + configuredDiffKernel.execute(y, z, w, numElements); + 
configuredReduceKernel.execute(w, res, numElements); + + // Verify the output; + float resScalar = res.getArrayElement(0).asFloat(); + assertEquals(0, resScalar, 0.01); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAExecutionContextTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAExecutionContextTest.java new file mode 100644 index 00000000..9a0e4bf3 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAExecutionContextTest.java @@ -0,0 +1,213 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.runtime.executioncontext; + +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.test.util.GrCUDATestOptionsStruct; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Collection; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; + +@RunWith(Parameterized.class) +public class GrCUDAExecutionContextTest { + + /** + * Tests are executed for each combination of GrCUDA options (including the {@link AsyncGrCUDAExecutionContext} scheduling policies) on a single GPU; + * @return the collection of option combinations to test + */ + + @Parameterized.Parameters + public static Collection data() { + return GrCUDATestUtil.getAllOptionCombinationsSingleGPU(); + } + + private final GrCUDATestOptionsStruct options; + + public GrCUDAExecutionContextTest(GrCUDATestOptionsStruct options) { + this.options = options; + } + + private static final int NUM_THREADS_PER_BLOCK = GrCUDAComputationsWithGPU.NUM_THREADS_PER_BLOCK; + + private static final String SQUARE_KERNEL = GrCUDAComputationsWithGPU.SQUARE_INPLACE_KERNEL; + + private static final String SQUARE_2_KERNEL = GrCUDAComputationsWithGPU.SQUARE_WITH_CONST; + + @Test + public void dependencyKernelSimpleTest() { + try (Context context = 
GrCUDATestUtil.createContextFromOptions(this.options)) { + final int numElements = 10; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + + assertNotNull(squareKernel); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, numElements); + + // Verify the output; + assertEquals(4.0, x.getArrayElement(1).asFloat(), 0.1); + } + } + + @Test + public void dependency2KernelsSimpleTest() { + + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + final int numElements = 10; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + + assertNotNull(squareKernel); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + y.setArrayElement(i, 4.0); + } + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, numElements); + configuredSquareKernel.execute(y, numElements); + + // Verify the output; + assertEquals(4.0, x.getArrayElement(0).asFloat(), 0.1); + assertEquals(16.0, y.getArrayElement(0).asFloat(), 0.1); + } + } + + @Test + public void 
simpleJoinTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + GrCUDAComputationsWithGPU.simpleJoin(context); + } + } + + /** + * The read on "y" has to sync on the stream where the kernel is running, although that kernel doesn't use "y". + * This is due to the pre-Pascal limitations on managed memory accesses, + * and the inability to access an array while it is being used by a running kernel; + */ + @Test + public void dependencyPipelineSimple3Test() { + + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_2_KERNEL, "square", "const pointer, pointer, sint32"); + assertNotNull(squareKernel); + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, z, numElements); + + // Verify the output; + assertEquals(4.0, z.getArrayElement(0).asFloat(), 0.1); + assertEquals(0.0, y.getArrayElement(0).asFloat(), 0.1); + } + } + + /** + * The read on "y" has to sync on the stream where the kernel is running, although that kernel doesn't use "y". + * This is due to the pre-Pascal limitations on managed memory accesses, + * and the inability to access an array while it is being used by a running kernel. 
+ * In this case, also perform an operation on y, instead of leaving it uninitialized; + */ + @Test + public void dependencyPipelineSimple4Test() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + final int numElements = 100; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value z = deviceArrayConstructor.execute("float", numElements); + + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + } + // Access the y array; + y.setArrayElement(0, 0); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_2_KERNEL, "square", "const pointer, pointer, sint32"); + assertNotNull(squareKernel); + + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, z, numElements); + // Verify the output; + assertEquals(0.0, y.getArrayElement(0).asFloat(), 0.1); + assertEquals(4.0, z.getArrayElement(0).asFloat(), 0.1); + } + } + + @Test + public void dependencyPipelineWithArrayCopyTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + GrCUDAComputationsWithGPU.arrayCopyWithJoin(context); + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAExecutionContextWithConstDependencyTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAExecutionContextWithConstDependencyTest.java new file mode 100644 index 00000000..eb5f44a1 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAExecutionContextWithConstDependencyTest.java @@ -0,0 +1,103 @@ +/* + * 
Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.runtime.executioncontext; + +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.test.util.GrCUDATestOptionsStruct; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.graalvm.polyglot.Context; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Collection; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; + +@RunWith(Parameterized.class) +public class GrCUDAExecutionContextWithConstDependencyTest { + + /** + * Tests are executed for each combination of GrCUDA options, including the {@link AsyncGrCUDAExecutionContext} policies; + * @return the collection of option combinations used to parameterize the tests + */ + + @Parameterized.Parameters + public static Collection data() { + return GrCUDATestUtil.getAllOptionCombinationsSingleGPU(); + } + + private final GrCUDATestOptionsStruct options; + + public GrCUDAExecutionContextWithConstDependencyTest(GrCUDATestOptionsStruct options) { + this.options = options; + } + + @Test + public void parallelKernelsWithReadOnlyArgsTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + GrCUDAComputationsWithGPU.parallelKernelsWithReadOnlyArgs(context); + } + } + + @Test + public void simpleForkReadInputTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + GrCUDAComputationsWithGPU.simpleForkReadInput(context); + } + } + + @Test + public void forkWithReadOnlyTest() { + // Test a computation of form A(1) --> B(1r, 2) + // \-> C(1r, 3) + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + GrCUDAComputationsWithGPU.forkWithReadOnly(context); + } + } + + @Test + public void dependencyPipelineDiamondTest() { + // Test a computation of form A(1) --> B(1r, 2) -> D(1) + // \-> C(1r, 3) -/ + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + 
GrCUDAComputationsWithGPU.dependencyPipelineDiamond(context); + } + } + + @Test + public void joinWithExtraKernelTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + GrCUDAComputationsWithGPU.joinWithExtraKernel(context); + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAMultiGPUExecutionContextTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAMultiGPUExecutionContextTest.java new file mode 100644 index 00000000..642102da --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAMultiGPUExecutionContextTest.java @@ -0,0 +1,268 @@ +package com.nvidia.grcuda.test.runtime.executioncontext; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.test.util.GrCUDATestOptionsStruct; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import org.graalvm.polyglot.Context; +import org.graalvm.polyglot.Value; +import org.junit.AfterClass; +import org.junit.Before; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Collection; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assume.assumeTrue; + +@RunWith(Parameterized.class) +public class GrCUDAMultiGPUExecutionContextTest { + + // FIXME: add multi-gpu policies; + + /** + * Tests are executed for each combination of GrCUDA options, including the {@link AsyncGrCUDAExecutionContext} policies; + * @return the collection of option combinations used to parameterize the tests + */ + + @Parameterized.Parameters + public static Collection data() { + return GrCUDATestUtil.getAllOptionCombinationsMultiGPU(); + } + + private final GrCUDATestOptionsStruct options; + + /** + * Set to false if we discover that only a single GPU is available. 
Doing other tests is not useful; + */ + private static boolean multipleGPUs = true; + + public GrCUDAMultiGPUExecutionContextTest(GrCUDATestOptionsStruct options) { + this.options = options; + } + + private static final int NUM_THREADS_PER_BLOCK = 32; + + private static final String SQUARE_KERNEL = + "extern \"C\" __global__ void square(float* x, int n) {\n" + + " int idx = blockIdx.x * blockDim.x + threadIdx.x;\n" + + " if (idx < n) {\n" + + " x[idx] = x[idx] * x[idx];\n" + + " }" + + "}\n"; + + @Before + public void skipIfSingleGPU() { + assumeTrue(multipleGPUs); + } + + private boolean checkIfEnoughGPUsAreAvailable(Context context) { + Value deviceCount = context.eval("grcuda", "cudaGetDeviceCount()"); + if (deviceCount.asInt() < 2) { + // The system does not have multiple GPUs, skip all further multi-GPU tests; + multipleGPUs = false; + System.out.println("warning: only 1 GPU available, skipping further multi-GPU tests"); + return false; + } else if (this.options.numberOfGPUs > deviceCount.asInt()) { + // If the test asks for more GPUs than available, skip it; + return false; + } + // We have enough GPUs for this test; + return true; + } + + //////////////////////////////////////////////////////// + // Basic multi-GPU testing, with manual GPU selection // + //////////////////////////////////////////////////////// + + /** + * Execute 2 independent kernels, 2 times in a row, manually specifying the GPU for them; + */ + @Test + public void dependency2KernelsManualGPUChoiceTest() { + int numOfGPUs = 2; + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options, numOfGPUs)) { + + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + + Value setDevice = context.eval("grcuda", "cudaSetDevice"); + + final int numElements = 10000; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", 
numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + Value[] inputs = {x, y}; + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + assertNotNull(squareKernel); + + // init arrays with values + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + y.setArrayElement(i, 4.0); + } + Value configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + for (int i = 0; i < numOfGPUs; i++) { + setDevice.execute(i); + // Perform the computation, twice; + configuredSquareKernel.execute(inputs[i], numElements); + configuredSquareKernel.execute(inputs[i], numElements); + } + + // Verify the output; + assertEquals(16.0, x.getArrayElement(0).asFloat(), 0.1); + assertEquals(256.0, y.getArrayElement(0).asFloat(), 0.1); + } + } + + /////////////////////////////////////////////////////////// + // Basic multi-GPU testing, with automatic GPU selection // + /////////////////////////////////////////////////////////// + + /** + * Execute 2 independent kernels, 2 times in a row; + */ + @Test + public void dependency2KernelsSimpleTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + + final int numElements = 10; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + assertNotNull(squareKernel); + + // init arrays with values + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + y.setArrayElement(i, 4.0); + } + + Value 
configuredSquareKernel = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the computation; + configuredSquareKernel.execute(x, numElements); + configuredSquareKernel.execute(y, numElements); + + // Perform the computation a second time; + configuredSquareKernel.execute(x, numElements); + configuredSquareKernel.execute(y, numElements); + + // Verify the output; + assertEquals(16.0, x.getArrayElement(0).asFloat(), 0.1); + assertEquals(256.0, y.getArrayElement(0).asFloat(), 0.1); + } + } + + /** + * Test with 3 kernels: kernel0 does not have dependencies. + * kernel1 is the parent of kernel2; + */ + @Test + public void dependencyKernelsTestA() { + + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + + final int numElements = 10000; + final int numBlocks = (numElements + NUM_THREADS_PER_BLOCK - 1) / NUM_THREADS_PER_BLOCK; + Value deviceArrayConstructor = context.eval("grcuda", "DeviceArray"); + Value x = deviceArrayConstructor.execute("float", numElements); + Value y = deviceArrayConstructor.execute("float", numElements); + + Value buildkernel = context.eval("grcuda", "buildkernel"); + Value squareKernel = buildkernel.execute(SQUARE_KERNEL, "square", "pointer, sint32"); + assertNotNull(squareKernel); + + // init arrays with values + for (int i = 0; i < numElements; ++i) { + x.setArrayElement(i, 2.0); + y.setArrayElement(i, 4.0); + } + + Value configuredK0 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredK1 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + Value configuredK2 = squareKernel.execute(numBlocks, NUM_THREADS_PER_BLOCK); + + // Perform the two independent computations; + configuredK0.execute(x, numElements); + configuredK1.execute(y, numElements); + + // Perform the dependent computation; + configuredK2.execute(y, numElements); + // Verify the output; + assertEquals(4.0, x.getArrayElement(0).asFloat(), 0.1); + assertEquals(256.0, 
y.getArrayElement(0).asFloat(), 0.1); + } + } + + ////////////////////////////////////////// + // Call existing tests, using multi-GPU // + ////////////////////////////////////////// + + /** + * Test a join pattern (x) & (y) -> (z), with data in x and y being copied from other arrays; + */ + @Test + public void dependencyPipelineWithArrayCopyTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + GrCUDAComputationsWithGPU.arrayCopyWithJoin(context); + } + } + + @Test + public void parallelKernelsWithReadOnlyArgsTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + GrCUDAComputationsWithGPU.parallelKernelsWithReadOnlyArgs(context); + } + } + + @Test + public void simpleForkReadInputTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + GrCUDAComputationsWithGPU.simpleForkReadInput(context); + } + } + + @Test + public void forkWithReadOnlyTest() { + // Test a computation of form A(1) --> B(1r, 2) + // \-> C(1r, 3) + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + GrCUDAComputationsWithGPU.forkWithReadOnly(context); + } + } + + @Test + public void dependencyPipelineDiamondTest() { + // Test a computation of form A(1) --> B(1r, 2) -> D(1) + // \-> C(1r, 3) -/ + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + GrCUDAComputationsWithGPU.dependencyPipelineDiamond(context); + } + } + + @Test + public void joinWithExtraKernelTest() { + try (Context context = GrCUDATestUtil.createContextFromOptions(this.options)) { + assumeTrue(checkIfEnoughGPUsAreAvailable(context)); + GrCUDAComputationsWithGPU.joinWithExtraKernel(context); + } + } 
+} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAStreamPolicyMockTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAStreamPolicyMockTest.java new file mode 100644 index 00000000..68c1d9a0 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/GrCUDAStreamPolicyMockTest.java @@ -0,0 +1,208 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.runtime.executioncontext; + +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.GrCUDAOptions; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.GrCUDAStreamPolicy; +import com.nvidia.grcuda.runtime.stream.policy.RoundRobinDeviceSelectionPolicy; +import com.nvidia.grcuda.runtime.stream.policy.TransferTimeDeviceSelectionPolicy; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.mock.AsyncGrCUDAExecutionContextMock; +import com.nvidia.grcuda.test.util.mock.DeviceListMock; +import com.nvidia.grcuda.test.util.mock.DeviceMock; +import com.nvidia.grcuda.test.util.mock.GrCUDADevicesManagerMock; +import com.nvidia.grcuda.test.util.mock.GrCUDAStreamPolicyMock; +import com.nvidia.grcuda.test.util.mock.OptionValuesMockBuilder; +import org.junit.Test; + +import java.io.File; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashSet; + +import static org.junit.Assert.assertEquals; + +public class GrCUDAStreamPolicyMockTest { + + private static AsyncGrCUDAExecutionContextMock createContext(int 
numberOfGPUs, DeviceSelectionPolicyEnum deviceSelectionPolicy) { + return new AsyncGrCUDAExecutionContextMock( + RetrieveNewStreamPolicyEnum.ALWAYS_NEW, + RetrieveParentStreamPolicyEnum.DISJOINT, + deviceSelectionPolicy, + true, numberOfGPUs, numberOfGPUs, + new GrCUDAOptionMap(new OptionValuesMockBuilder() + .add(GrCUDAOptions.DependencyPolicy, DependencyPolicyEnum.WITH_CONST.toString()) + .add(GrCUDAOptions.InputPrefetch, false) + .add(GrCUDAOptions.BandwidthMatrix, System.getenv("GRCUDA_HOME") + File.separatorChar + + "projects" + File.separatorChar + "resources" + File.separatorChar + + "connection_graph" + File.separatorChar + "datasets" + File.separatorChar + "connection_graph_test.csv").build()) + ); + } + + private RoundRobinDeviceSelectionPolicy getRoundRobinPolicy(int numGPUs) { + GrCUDADevicesManagerMock devicesManager = new GrCUDADevicesManagerMock(new DeviceListMock(numGPUs), numGPUs); + return new RoundRobinDeviceSelectionPolicy(devicesManager); + } + + @Test + public void roundRobinTest() { + RoundRobinDeviceSelectionPolicy policy = getRoundRobinPolicy(4); + Device d = policy.retrieve(null); + assertEquals(0, d.getDeviceId()); + assertEquals(1, policy.getInternalState()); + d = policy.retrieve(null); + assertEquals(1, d.getDeviceId()); + assertEquals(2, policy.getInternalState()); + d = policy.retrieve(null); + assertEquals(2, d.getDeviceId()); + assertEquals(3, policy.getInternalState()); + d = policy.retrieve(null); + assertEquals(3, d.getDeviceId()); + assertEquals(0, policy.getInternalState()); + d = policy.retrieve(null); + assertEquals(0, d.getDeviceId()); + assertEquals(1, policy.getInternalState()); + d = policy.retrieve(null, Collections.singletonList(new Device(0, null))); + assertEquals(0, d.getDeviceId()); + assertEquals(2, policy.getInternalState()); + d = policy.retrieve(null); + assertEquals(2, d.getDeviceId()); + assertEquals(3, policy.getInternalState()); + d = policy.retrieve(null, Collections.singletonList(new Device(3, null))); + 
assertEquals(3, d.getDeviceId()); + assertEquals(0, policy.getInternalState()); + d = policy.retrieve(null); + assertEquals(0, d.getDeviceId()); + assertEquals(1, policy.getInternalState()); + d = policy.retrieve(null, Arrays.asList(new Device(3, null), new Device(1, null))); + assertEquals(3, d.getDeviceId()); + assertEquals(2, policy.getInternalState()); + d = policy.retrieve(null, Arrays.asList(new Device(2, null), new Device(1, null))); + assertEquals(1, d.getDeviceId()); + assertEquals(3, policy.getInternalState()); + d = policy.retrieve(null, Arrays.asList(new Device(0, null), new Device(1, null))); + assertEquals(1, d.getDeviceId()); + } + + @Test + public void testStreamAwareRetrieve() { + AsyncGrCUDAExecutionContextMock context = createContext(4, DeviceSelectionPolicyEnum.STREAM_AWARE); + GrCUDAStreamPolicyMock streamPolicy = (GrCUDAStreamPolicyMock) context.getStreamManager().getStreamPolicy(); + DeviceMock d = (DeviceMock) streamPolicy.getDeviceSelectionPolicy().retrieve(null); + assertEquals(0, d.getDeviceId()); + assertEquals(0, d.getNumberOfBusyStreams()); + // Add 1 busy stream on device 0; + d.createStream(); + DeviceMock d1 = (DeviceMock) streamPolicy.getDeviceSelectionPolicy().retrieve(null); + // Add 2 busy streams on device 1; + d1.createStream(); + d1.createStream(); + assertEquals(2, d1.getNumberOfBusyStreams()); + DeviceMock d2 = (DeviceMock) streamPolicy.getDeviceSelectionPolicy().retrieve(null); + assertEquals(2, d2.getDeviceId()); + // Add 1 busy stream on device 2; + d2.createStream(); + assertEquals(1, d2.getNumberOfBusyStreams()); + DeviceMock d3 = (DeviceMock) streamPolicy.getDeviceSelectionPolicy().retrieve(null); + assertEquals(3, d3.getDeviceId()); + assertEquals(0, d3.getNumberOfBusyStreams()); + // Add 1 busy stream on device 3; + d3.createStream(); + // Test retrieval on a subset of devices; + d2 = (DeviceMock) streamPolicy.getDeviceSelectionPolicy().retrieve(null, Arrays.asList(d2, d3)); + assertEquals(2, d2.getDeviceId()); + d 
= (DeviceMock) streamPolicy.getDeviceSelectionPolicy().retrieve(null, Arrays.asList(d, d1)); + assertEquals(0, d.getDeviceId()); + } + + @Test + public void createBandwidthMatrixTest() { + AsyncGrCUDAExecutionContextMock context = createContext(2, DeviceSelectionPolicyEnum.MINMAX_TRANSFER_TIME); + GrCUDAStreamPolicyMock streamPolicy = (GrCUDAStreamPolicyMock) context.getStreamManager().getStreamPolicy(); + double[][] bGold = { + {30, 45, 10}, + {45, 60, 20}, + {10, 20, 0} + }; + double[][] b = ((TransferTimeDeviceSelectionPolicy) streamPolicy.getDeviceSelectionPolicy()).getLinkBandwidth(); + for (int i = 0; i < b.length; i++) { + for (int j = 0; j < b[i].length; j++) { + assertEquals(bGold[i][j], b[i][j], 1e-6); + } + } + } + + @Test + public void bandwidthComputationMinMaxTest() { + AsyncGrCUDAExecutionContextMock context = createContext(2, DeviceSelectionPolicyEnum.MINMAX_TRANSFER_TIME); + TransferTimeDeviceSelectionPolicy deviceSelectionPolicy = (TransferTimeDeviceSelectionPolicy) ((GrCUDAStreamPolicyMock) context.getStreamManager().getStreamPolicy()).getDeviceSelectionPolicy(); + // If data is updated on the target device, we have infinite bandwidth (regardless of what's on the matrix diagonal); + double b = deviceSelectionPolicy.computeBandwidth(0, new HashSet<>(Arrays.asList(0, 1, CPUDevice.CPU_DEVICE_ID))); + assertEquals(Double.POSITIVE_INFINITY, b, 1e-6); + // If the data is updated on another device, take the worst bandwidth; + b = deviceSelectionPolicy.computeBandwidth(0, new HashSet<>(Arrays.asList(1, CPUDevice.CPU_DEVICE_ID))); + assertEquals(10, b, 1e-6); + } + + @Test + public void bandwidthComputationMinMinTest() { + AsyncGrCUDAExecutionContextMock context = createContext(2, DeviceSelectionPolicyEnum.MINMIN_TRANSFER_TIME); + TransferTimeDeviceSelectionPolicy deviceSelectionPolicy = (TransferTimeDeviceSelectionPolicy) ((GrCUDAStreamPolicyMock) context.getStreamManager().getStreamPolicy()).getDeviceSelectionPolicy(); + // If data is updated on the 
target device, we have infinite bandwidth (regardless of what's on the matrix diagonal); + double b = deviceSelectionPolicy.computeBandwidth(0, new HashSet<>(Arrays.asList(0, 1, CPUDevice.CPU_DEVICE_ID))); + assertEquals(Double.POSITIVE_INFINITY, b, 1e-6); + // If the data is updated on another device, take the worst bandwidth; + b = deviceSelectionPolicy.computeBandwidth(0, new HashSet<>(Arrays.asList(1, CPUDevice.CPU_DEVICE_ID))); + assertEquals(45, b, 1e-6); + } + + @Test(expected = IllegalStateException.class) + public void bandwidthComputationWithNoUpdatedLocationTest() { + AsyncGrCUDAExecutionContextMock context = createContext(2, DeviceSelectionPolicyEnum.MINMAX_TRANSFER_TIME); + TransferTimeDeviceSelectionPolicy deviceSelectionPolicy = (TransferTimeDeviceSelectionPolicy) ((GrCUDAStreamPolicyMock) context.getStreamManager().getStreamPolicy()).getDeviceSelectionPolicy(); + // If the data is not available on any device, give an error; + double b = deviceSelectionPolicy.computeBandwidth(0, new HashSet<>()); + } + + @Test(expected = ArrayIndexOutOfBoundsException.class) + public void bandwidthComputationOutOfBoundsLocationTest() { + AsyncGrCUDAExecutionContextMock context = createContext(2, DeviceSelectionPolicyEnum.MINMAX_TRANSFER_TIME); + TransferTimeDeviceSelectionPolicy deviceSelectionPolicy = (TransferTimeDeviceSelectionPolicy) ((GrCUDAStreamPolicyMock) context.getStreamManager().getStreamPolicy()).getDeviceSelectionPolicy(); + // If the data is not available on any device, give an error; + double b = deviceSelectionPolicy.computeBandwidth(10, new HashSet<>(Collections.singletonList(1))); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/WithConstDependencyComputationMockTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/WithConstDependencyComputationMockTest.java new file mode 100644 index 00000000..4c1c29fc --- /dev/null +++ 
b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/executioncontext/WithConstDependencyComputationMockTest.java @@ -0,0 +1,475 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.runtime.executioncontext; + +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.mock.ArgumentMock; +import com.nvidia.grcuda.test.util.mock.AsyncGrCUDAExecutionContextMock; +import com.nvidia.grcuda.test.util.mock.GrCUDAExecutionContextMockBuilder; +import com.nvidia.grcuda.test.util.mock.KernelExecutionMock; +import com.nvidia.grcuda.test.util.mock.SyncExecutionMock; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; + +import java.util.Arrays; +import java.util.Collections; +import java.util.HashSet; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertTrue; + +public class WithConstDependencyComputationMockTest { + + @Test + public void addVertexToDAGTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum.WITH_CONST); + // Create two mock kernel executions; + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(2))).schedule(); + + ExecutionDAG dag = context.getDag(); + + assertEquals(1, dag.getNumVertices()); + assertEquals(0, dag.getNumEdges()); + assertEquals(1, dag.getFrontier().size()); + assertTrue(dag.getFrontier().get(0).isFrontier()); + assertTrue(dag.getFrontier().get(0).isStart()); + + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(3))).schedule(); + + assertEquals(2, dag.getNumVertices()); + assertEquals(0, dag.getNumEdges()); + 
assertEquals(2, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(dag.getVertices().get(0), dag.getFrontier().get(0)); + assertEquals(dag.getVertices().get(1), dag.getFrontier().get(1)); + assertTrue(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(1).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertTrue(dag.getVertices().get(1).isStart()); + // Check that no children or parents are present; + assertEquals(0, dag.getVertices().get(0).getChildVertices().size()); + assertEquals(0, dag.getVertices().get(1).getParentVertices().size()); + } + + + @Test + public void dependencyPipelineSimpleMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum.WITH_CONST); + // Create 4 mock kernel executions. In this case, C requires A and B to finish, + // and D requires C to finish. The final frontier is composed of C (arguments "1" and "2" are active), + // and D (argument "2" is active); + // A(1r) -> C(1, 2) -> D(2) + // B(1r) -/ + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1, true))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1, true))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(4, dag.getNumVertices()); + assertEquals(3, dag.getNumEdges()); + assertEquals(2, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(2), dag.getVertices().get(3))), + new HashSet<>(dag.getFrontier())); + assertFalse(dag.getVertices().get(0).isFrontier()); + 
assertTrue(dag.getVertices().get(0).isStart()); + assertFalse(dag.getVertices().get(1).isFrontier()); + assertTrue(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + assertTrue(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + // Check if the third vertex is a child of first and second; + assertEquals(2, dag.getVertices().get(2).getParents().size()); + assertEquals(new HashSet<>(dag.getVertices().get(2).getParentVertices()), + new HashSet<>(Arrays.asList(dag.getVertices().get(0), dag.getVertices().get(1)))); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(0).getChildVertices().get(0)); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(1).getChildVertices().get(0)); + // Check if the fourth vertex is a child of the third; + assertEquals(1, dag.getVertices().get(3).getParents().size()); + assertEquals(1, dag.getVertices().get(2).getChildren().size()); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(3).getParentVertices().get(0)); + assertEquals(dag.getVertices().get(3), dag.getVertices().get(2).getChildVertices().get(0)); + } + + @Test + public void forkedComputationTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum.WITH_CONST); + + // A(1) --> B(1R) + // \-> C(1R) + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1, true))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1, true))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(3, dag.getNumVertices()); + assertEquals(2, dag.getNumEdges()); + assertEquals(3, dag.getFrontier().size()); + + assertTrue(dag.getVertices().get(0).isFrontier()); + 
assertTrue(dag.getVertices().get(0).isStart()); + assertTrue(dag.getVertices().get(1).isFrontier()); + assertFalse(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + + assertEquals(dag.getVertices().get(0), dag.getVertices().get(1).getParentVertices().get(0)); + assertEquals(1, dag.getVertices().get(2).getParentVertices().size()); + assertFalse(dag.getVertices().get(2).getParentVertices().contains(dag.getVertices().get(1))); + assertFalse(dag.getVertices().get(1).getChildVertices().contains(dag.getVertices().get(2))); + + // Add a fourth computation D that depends on both B and C; + // A(1) -> B(1R) -> D(1) + // \-> C(1R) / + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + assertEquals(4, dag.getNumVertices()); + assertEquals(4, dag.getNumEdges()); + assertEquals(1, dag.getFrontier().size()); + } + + @Test + public void complexFrontierMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum.WITH_CONST); + + // A(1R,2) -> B(1) -> D(1R,3) + // \----> C(2R) \----> E(1R,4) -> F(4) + // The final frontier is composed by A(2), B(1), C(2), D(1, 3), E(1), F(4); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2, true))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG 
structure; + assertEquals(6, dag.getNumVertices()); + assertEquals(5, dag.getNumEdges()); + assertEquals(6, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(dag.getVertices()), new HashSet<>(dag.getFrontier())); + + assertTrue(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertTrue(dag.getVertices().get(1).isFrontier()); + assertFalse(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + assertTrue(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + assertTrue(dag.getVertices().get(4).isFrontier()); + assertFalse(dag.getVertices().get(4).isStart()); + assertTrue(dag.getVertices().get(5).isFrontier()); + assertFalse(dag.getVertices().get(5).isStart()); + // Check that D is a child of B, and that C and D are not connected; + assertEquals(dag.getVertices().get(1), dag.getVertices().get(3).getParentVertices().get(0)); + assertFalse(dag.getVertices().get(3).getParentVertices().contains(dag.getVertices().get(2))); + assertFalse(dag.getVertices().get(2).getChildVertices().contains(dag.getVertices().get(3))); + } + + @Test + public void complexFrontier2MockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST).build(); + + // A(1R,2) -> B(1) -> D(1R,3) ---------> G(1,3,4) + // \- C(2R) \- E(1R,4) ----> F(4) -/ + // The final frontier is composed by A(2), C(2R), G(1, 3, 4); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2, true))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new 
ArgumentMock(1, true), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(3), new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(7, dag.getNumVertices()); + assertEquals(7, dag.getNumEdges()); + assertEquals(3, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(0), dag.getVertices().get(2), dag.getVertices().get(6))), + new HashSet<>(dag.getFrontier())); + + assertTrue(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertFalse(dag.getVertices().get(1).isFrontier()); + assertFalse(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + assertFalse(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + assertFalse(dag.getVertices().get(4).isFrontier()); + assertFalse(dag.getVertices().get(4).isStart()); + assertFalse(dag.getVertices().get(5).isFrontier()); + assertFalse(dag.getVertices().get(5).isStart()); + assertTrue(dag.getVertices().get(6).isFrontier()); + assertFalse(dag.getVertices().get(6).isStart()); + // Check that D is a child of B, and that C and D are not connected; + assertEquals(dag.getVertices().get(1), dag.getVertices().get(3).getParentVertices().get(0)); + assertFalse(dag.getVertices().get(3).getParentVertices().contains(dag.getVertices().get(2))); + assertFalse(dag.getVertices().get(2).getChildVertices().contains(dag.getVertices().get(3))); + // Check that D and E are not connected; + 
assertFalse(dag.getVertices().get(4).getParentVertices().contains(dag.getVertices().get(3))); + assertFalse(dag.getVertices().get(3).getChildVertices().contains(dag.getVertices().get(4))); + // Check that G is a child of exactly D and F; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(3), dag.getVertices().get(5))), new HashSet<>(dag.getVertices().get(6).getParentVertices())); + // Check that E and G are not connected; + assertFalse(dag.getVertices().get(6).getParentVertices().contains(dag.getVertices().get(4))); + assertFalse(dag.getVertices().get(4).getChildVertices().contains(dag.getVertices().get(6))); + } + + @Test + public void dependencyPipelineSimpleWithSyncMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST).build(); + // Create 4 mock kernel executions. In this case, kernel 3 requires 1 and 2 to finish, + // and kernel 4 requires kernel 3 to finish. 
The final frontier is composed of kernel 3 (arguments "1" and "2" are active), + // and kernel 4 (argument "3" is active); + // A(1r) -> C(1, 2) -> D(2) + // B(1r) -/ + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1, true))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1, true))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(4, dag.getNumVertices()); + assertEquals(3, dag.getNumEdges()); + assertEquals(2, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(2), dag.getVertices().get(3))), + new HashSet<>(dag.getFrontier())); + assertFalse(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertFalse(dag.getVertices().get(1).isFrontier()); + assertTrue(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + assertTrue(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + // Check if the third vertex is a child of first and second; + assertEquals(2, dag.getVertices().get(2).getParents().size()); + assertEquals(new HashSet<>(dag.getVertices().get(2).getParentVertices()), + new HashSet<>(Arrays.asList(dag.getVertices().get(0), dag.getVertices().get(1)))); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(0).getChildVertices().get(0)); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(1).getChildVertices().get(0)); + // Check if the fourth vertex is a child of the third; + assertEquals(1, dag.getVertices().get(3).getParents().size()); + assertEquals(1, 
dag.getVertices().get(2).getChildren().size()); + assertEquals(dag.getVertices().get(2), dag.getVertices().get(3).getParentVertices().get(0)); + assertEquals(dag.getVertices().get(3), dag.getVertices().get(2).getChildVertices().get(0)); + + // Finish the computation; + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + assertEquals(0, dag.getFrontier().size()); + } + + @Test + public void forkedComputationWithSyncTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST).build(); + + // A(1) --> B(1R) + // \-> C(1R) + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1, true))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1, true))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(3, dag.getNumVertices()); + assertEquals(2, dag.getNumEdges()); + assertEquals(3, dag.getFrontier().size()); + + assertTrue(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertTrue(dag.getVertices().get(1).isFrontier()); + assertFalse(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + + assertEquals(dag.getVertices().get(0), dag.getVertices().get(1).getParentVertices().get(0)); + assertEquals(1, dag.getVertices().get(2).getParentVertices().size()); + assertFalse(dag.getVertices().get(2).getParentVertices().contains(dag.getVertices().get(1))); + assertFalse(dag.getVertices().get(1).getChildVertices().contains(dag.getVertices().get(2))); + + // Add a fourth computation D that depends on both B and C; + // A(1) -> B(1R) -> D(1) + // \-> C(1R) / + new 
KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + assertEquals(4, dag.getNumVertices()); + assertEquals(4, dag.getNumEdges()); + assertEquals(1, dag.getFrontier().size()); + + // Finish the computation; + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + assertEquals(0, dag.getFrontier().size()); + } + + @Test + public void complexFrontierWithSyncMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum.WITH_CONST, + RetrieveNewStreamPolicyEnum.REUSE, RetrieveParentStreamPolicyEnum.DISJOINT); + + // A(1R,2) -> B(1) ---> D(1R,3) + // \-> C(2R) \-> E(1R,4) -> F(4) + // The final frontier is composed by A(1R,2), B(1), C(2R), D(1R,3), E(1R,4), F(4); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2, true))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(6, dag.getNumVertices()); + assertEquals(5, dag.getNumEdges()); + assertEquals(6, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(dag.getVertices()), + new HashSet<>(dag.getFrontier())); + + assertTrue(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertTrue(dag.getVertices().get(1).isFrontier()); + assertFalse(dag.getVertices().get(1).isStart()); + 
assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + assertTrue(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + assertTrue(dag.getVertices().get(4).isFrontier()); + assertFalse(dag.getVertices().get(4).isStart()); + assertTrue(dag.getVertices().get(5).isFrontier()); + assertFalse(dag.getVertices().get(5).isStart()); + // Check that D is a child of B, and that C and D are not connected; + assertEquals(dag.getVertices().get(1), dag.getVertices().get(3).getParentVertices().get(0)); + assertFalse(dag.getVertices().get(3).getParentVertices().contains(dag.getVertices().get(2))); + assertFalse(dag.getVertices().get(2).getChildVertices().contains(dag.getVertices().get(3))); + // Check that D and E are not connected; + assertFalse(dag.getVertices().get(4).getParentVertices().contains(dag.getVertices().get(3))); + assertFalse(dag.getVertices().get(3).getChildVertices().contains(dag.getVertices().get(4))); + + // Synchronize computations; + // A(1R,2) -> B(1) ---> D(1R,3) + // \-> C(2R) \-> E(1R,4) -> F(4) + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + assertEquals(4, dag.getFrontier().size()); + + // Note that syncing F(4) will also sync B(1) although it's on a different stream; + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + assertEquals(1, dag.getFrontier().size()); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + assertEquals(0, dag.getFrontier().size()); + } + + @Test + public void complexFrontier2WithSyncMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum.WITH_CONST, + RetrieveNewStreamPolicyEnum.REUSE, RetrieveParentStreamPolicyEnum.DISJOINT); + + // A(1R,2) -> B(1) -> D(1R,3) ---------> G(1, 3, 4) + // \-> C(2R) \-> E(1R,4) -> F(4) -/ + // The final 
frontier is composed by A(1R,2), C(2R), G(1, 3, 4); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2, true))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(3), new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check the DAG structure; + assertEquals(7, dag.getNumVertices()); + assertEquals(7, dag.getNumEdges()); + assertEquals(3, dag.getFrontier().size()); + // Check updates to frontier and start status; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(0), dag.getVertices().get(2), dag.getVertices().get(6))), + new HashSet<>(dag.getFrontier())); + + assertTrue(dag.getVertices().get(0).isFrontier()); + assertTrue(dag.getVertices().get(0).isStart()); + assertFalse(dag.getVertices().get(1).isFrontier()); + assertFalse(dag.getVertices().get(1).isStart()); + assertTrue(dag.getVertices().get(2).isFrontier()); + assertFalse(dag.getVertices().get(2).isStart()); + assertFalse(dag.getVertices().get(3).isFrontier()); + assertFalse(dag.getVertices().get(3).isStart()); + assertFalse(dag.getVertices().get(4).isFrontier()); + assertFalse(dag.getVertices().get(4).isStart()); + assertFalse(dag.getVertices().get(5).isFrontier()); + assertFalse(dag.getVertices().get(5).isStart()); + assertTrue(dag.getVertices().get(6).isFrontier()); + assertFalse(dag.getVertices().get(6).isStart()); + // Check that D is a child of B, and that C and D are not 
connected; + assertEquals(dag.getVertices().get(1), dag.getVertices().get(3).getParentVertices().get(0)); + assertFalse(dag.getVertices().get(3).getParentVertices().contains(dag.getVertices().get(2))); + assertFalse(dag.getVertices().get(2).getChildVertices().contains(dag.getVertices().get(3))); + // Check that D and E are not connected; + assertFalse(dag.getVertices().get(4).getParentVertices().contains(dag.getVertices().get(3))); + assertFalse(dag.getVertices().get(3).getChildVertices().contains(dag.getVertices().get(4))); + // Check that G is a child of exactly D and F; + assertEquals(new HashSet<>(Arrays.asList(dag.getVertices().get(3), dag.getVertices().get(5))), new HashSet<>(dag.getVertices().get(6).getParentVertices())); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + assertEquals(1, dag.getFrontier().size()); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + assertEquals(0, dag.getFrontier().size()); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/ComplexExecutionDAGMockTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/ComplexExecutionDAGMockTest.java new file mode 100644 index 00000000..b85d3818 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/ComplexExecutionDAGMockTest.java @@ -0,0 +1,153 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.runtime.stream; + +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import com.nvidia.grcuda.test.util.mock.ArgumentMock; +import com.nvidia.grcuda.test.util.mock.DeviceArrayMock; +import com.nvidia.grcuda.test.util.mock.GrCUDAExecutionContextMockBuilder; +import com.nvidia.grcuda.test.util.mock.GrCUDAStreamManagerMock; +import com.nvidia.grcuda.test.util.mock.KernelExecutionMock; +import com.nvidia.grcuda.test.util.mock.SyncExecutionMock; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import java.util.List; + +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.executeMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.imageMockComputation; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +@RunWith(Parameterized.class) +public class ComplexExecutionDAGMockTest { + + @Parameterized.Parameters + public static Collection data() { + + return GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {RetrieveNewStreamPolicyEnum.ALWAYS_NEW, RetrieveNewStreamPolicyEnum.REUSE}, + {RetrieveParentStreamPolicyEnum.DISJOINT, RetrieveParentStreamPolicyEnum.SAME_AS_PARENT}, + 
{DependencyPolicyEnum.WITH_CONST, DependencyPolicyEnum.NO_CONST} + })); + } + + private final RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy; + private final RetrieveParentStreamPolicyEnum retrieveParentStreamPolicy; + private final DependencyPolicyEnum dependencyPolicy; + + public ComplexExecutionDAGMockTest(RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy, + RetrieveParentStreamPolicyEnum retrieveParentStreamPolicy, + DependencyPolicyEnum dependencyPolicy) { + this.retrieveNewStreamPolicy = retrieveNewStreamPolicy; + this.retrieveParentStreamPolicy = retrieveParentStreamPolicy; + this.dependencyPolicy = dependencyPolicy; + } + + @Test + public void hitsMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy).setRetrieveParentStreamPolicy(this.retrieveParentStreamPolicy) + .setDependencyPolicy(this.dependencyPolicy).build(); + + int numIterations = 10; + KernelExecutionMock c1 = null; + KernelExecutionMock c2 = null; + for (int i = 0; i < numIterations; i++) { + // hub1 -> auth2 + c1 = new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1, true), new ArgumentMock(2))); + c1.schedule(); + // auth1 -> hub2 + c2 = new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(3, true), new ArgumentMock(4))); + c2.schedule(); + + // Without disjoint policy the computation collapses on a single stream after the first iteration; + int stream = (retrieveParentStreamPolicy.equals(RetrieveParentStreamPolicyEnum.DISJOINT) || i == 0) ? 
0 : 1; + assertEquals(stream, c1.getStream().getStreamNumber()); + assertEquals(1, c2.getStream().getStreamNumber()); + + // auth2 -> auth_norm + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(2, true), new ArgumentMock(5))).schedule(); + // hub2 -> hub_norm + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(4, true), new ArgumentMock(6))).schedule(); + // auth2, auth_norm -> auth1 + c1 = new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(2, true), new ArgumentMock(5, true), new ArgumentMock(3))); + c1.schedule(); + // hub2, hub_norm -> hub1 + c2 = new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(4, true), new ArgumentMock(6, true), new ArgumentMock(1))); + c2.schedule(); + } + + assertEquals(2, context.getStreamManager().getNumberOfStreams()); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + assertTrue(context.getStreamManager().isStreamFree(c1.getStream())); + int activeComps = retrieveParentStreamPolicy.equals(RetrieveParentStreamPolicyEnum.DISJOINT) ? 
2 : 0; + assertEquals(activeComps, context.getStreamManager().getNumActiveComputationsOnStream(c2.getStream())); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + assertTrue(context.getStreamManager().isStreamFree(c1.getStream())); + assertTrue(context.getStreamManager().isStreamFree(c2.getStream())); + } + + @Test + public void imageMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy).setRetrieveParentStreamPolicy(this.retrieveParentStreamPolicy) + .setDependencyPolicy(this.dependencyPolicy).build(); + executeMockComputation(imageMockComputation(context)); + + int numStreams = 3; + if (retrieveParentStreamPolicy.equals(RetrieveParentStreamPolicyEnum.DISJOINT) && dependencyPolicy.equals(DependencyPolicyEnum.WITH_CONST)) { + numStreams = 4; + } + else if (retrieveParentStreamPolicy.equals(RetrieveParentStreamPolicyEnum.SAME_AS_PARENT) && dependencyPolicy.equals(DependencyPolicyEnum.NO_CONST)) { + numStreams = 1; + } + assertEquals(numStreams, context.getStreamManager().getNumberOfStreams()); + for (CUDAStream stream : ((GrCUDAStreamManagerMock) context.getStreamManager()).getStreams()) { + assertTrue(context.getStreamManager().isStreamFree(stream)); + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/ExecutionDAGExportMultiGPUTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/ExecutionDAGExportMultiGPUTest.java new file mode 100644 index 00000000..3e7e7836 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/ExecutionDAGExportMultiGPUTest.java @@ -0,0 +1,225 @@ +package com.nvidia.grcuda.test.runtime.stream; + +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; 
+import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.executioncontext.GraphExport; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import com.nvidia.grcuda.test.util.mock.GrCUDAExecutionContextMockBuilder; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Arrays; +import java.util.Collection; + +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.executeMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.executeMockComputationAndValidate; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.forkJoinMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.hitsMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.imageMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.joinPipeline2MockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.joinPipeline3MockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.joinPipeline4MockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.joinPipelineMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.manyIndependentKernelsMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.manyKernelsMockComputation; +import static org.junit.Assert.assertEquals; + +@RunWith(Parameterized.class) +public class ExecutionDAGExportMultiGPUTest{ + + @Parameterized.Parameters + public static Collection data() { + 
return GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {RetrieveNewStreamPolicyEnum.ALWAYS_NEW, RetrieveNewStreamPolicyEnum.REUSE} + })); + } + + private final RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy; + + public ExecutionDAGExportMultiGPUTest(RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy) { + this.retrieveNewStreamPolicy = retrieveNewStreamPolicy; + } + + private final static int IMAGE_NUM_STREAMS = 4; + private final static int HITS_NUM_STREAMS = 2; + + // Test the STREAM_AWARE policy on 2, 3 and 4 GPUs, on the image pipeline, HITS, and many-kernels DAGs. + // In each case, validate that each computation is mapped to the expected GPU, + // and check the total number of streams created; + + @Test + public void lessBusyWithThreeGPUImageTest() throws UnsupportedTypeException { + int numGPU = 3; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(imageMockComputation(context), + Arrays.asList( + 0, 1, 2, + 0, 1, + 2, 0, + 2, 2, 1, 0, 0)); + graphExport(context.getDag(), "lessBusyWithThreeGPUImageTest"); + } + + @Test + public void lessBusyWithTwoGPUHitsTest() throws UnsupportedTypeException { + int numGPU = 2; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + 
executeMockComputationAndValidate(hitsMockComputation(context), + Arrays.asList(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)); + graphExport(context.getDag(), "lessBusyWithTwoGPUHitsTest"); + } + + @Test + public void lessBusyManyKernelsWithFourGPUTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(manyKernelsMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 1, 2, 3, 0, 2, 0, 0, 2, 1)); + assertEquals(6, context.getStreamManager().getNumberOfStreams()); + graphExport(context.getDag(), "lessBusyManyKernelsWithFourGPUTest"); + } + + @Test + public void roundRobinTest() throws UnsupportedTypeException { + int[] gpus = {1, 4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.ROUND_ROBIN) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(manyIndependentKernelsMockComputation(context), + Arrays.asList(0, 1 % numGPU, 2 % numGPU, 3 % numGPU, 4 % numGPU, 5 % numGPU, 6 % numGPU, 7 % numGPU, 8 % numGPU, 9 % numGPU)); + graphExport(context.getDag(), "roundRobinTest" + numGPU + "GPU"); + } + } + + @Test + public void roundRobinForkJoinWithTwoGPUTest() throws UnsupportedTypeException { + int numGPU = 2; + AsyncGrCUDAExecutionContext context = new 
GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.ROUND_ROBIN) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(forkJoinMockComputation(context), + Arrays.asList(0, 1, 0, 0, 0)); + graphExport(context.getDag(), "roundRobinForkJoinWithTwoGPUTest"); + } + + @Test + public void minTransferWithDepTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(joinPipelineMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 2, 3)); + graphExport(context.getDag(), "minTransferWithDepTest" + numGPU + "GPU"); + } + } + + @Test + public void minTransferWithThreeGPUImageTest() throws UnsupportedTypeException { + int numGPU = 3; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(imageMockComputation(context), + Arrays.asList( + 0, 0, 0, + 0, 0, + 0, 0, + 0, 0, 0, 0, 0)); + graphExport(context.getDag(), 
"minTransferWithThreeGPUImageTest"); + } + + @Test + public void minTransferManyKernelsWithFourGPUTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // The last 4 computations are scheduled on GPU0 as all devices contain just 1 required array and GPU0 is first; + executeMockComputationAndValidate(manyKernelsMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 1, 2, 3, 0, 2, 0, 0, 0, 0)); + graphExport(context.getDag(), "minTransferManyKernelsWithFourGPUTest"); + } + + @Test + public void minTransferDisjointWithDep4MultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Computation 5/7 is scheduled on 1 because GPU0 is not considered a parent, + // A is read-only in both Comp1 and Comp5; + executeMockComputationAndValidate(joinPipeline4MockComputation(context), + Arrays.asList(0, 1, 2, 3, 1, 3, 3)); + graphExport(context.getDag(), "minTransferDisjointWithDep4MultiGPUTest" + numGPU + "GPU"); + } + } + + @Test + public void minTransferDisjointManyKernelsWithFourGPUTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new 
GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Unlike the MULTIGPU_EARLY_DISJOINT case, the last 4 computations are split between GPU0 and GPU2; + executeMockComputationAndValidate(manyKernelsMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 1, 2, 3, 0, 2, 0, 0, 2, 2)); + graphExport(context.getDag(), "minTransferDisjointManyKernelsWithFourGPUTest"); + } + + public void graphExport(ExecutionDAG dag, String name) { + GraphExport graphExport = new GraphExport(dag); + +// if (retrieveNewStreamPolicy == RetrieveNewStreamPolicyEnum.ALWAYS_NEW) { +// graphExport.graphGenerator("../" + name + "AlwaysNew"); +// } else { +// graphExport.graphGenerator("../" + name + "Reuse"); +// } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/GrCUDAStreamManagerMockTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/GrCUDAStreamManagerMockTest.java new file mode 100644 index 00000000..bb35c4af --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/GrCUDAStreamManagerMockTest.java @@ -0,0 +1,450 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.runtime.stream; + +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.mock.ArgumentMock; +import com.nvidia.grcuda.test.util.mock.GrCUDAExecutionContextMockBuilder; +import com.nvidia.grcuda.test.util.mock.GrCUDAStreamManagerMock; +import com.nvidia.grcuda.test.util.mock.KernelExecutionMock; +import com.nvidia.grcuda.test.util.mock.SyncExecutionMock; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; + +@RunWith(Parameterized.class) +public class GrCUDAStreamManagerMockTest { + /** + * Tests are executed for each of the {@link RetrieveNewStreamPolicyEnum} values; + * + * @return the current stream policy + */ + @Parameterized.Parameters + public static Collection data() { + return Arrays.asList(new Object[][]{ + {RetrieveNewStreamPolicyEnum.ALWAYS_NEW}, + {RetrieveNewStreamPolicyEnum.REUSE}, + }); + } + + private final RetrieveNewStreamPolicyEnum policy; + + public GrCUDAStreamManagerMockTest(RetrieveNewStreamPolicyEnum policy) { + this.policy = policy; + } + + @Test + public void streamSelectionSimpleMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + // Create 4 mock kernel executions. In this case, kernel 3 requires 1 and 2 to finish, + // and kernel 4 requires kernel 3 to finish. 
The final frontier is composed of kernel 3 (arguments "1" and "2" are active), + // and kernel 4 (argument "3" is active); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), + new ArgumentMock(2), + new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(2, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(3, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(0).getComputation().getStream())); + assertEquals(1, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(1).getComputation().getStream())); + } + + @Test + public void streamSelectionMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + + // A(1,2) -> B(1) -> D(1,3) -> E(1,4) -> F(4) + // \----> C(2) + // The final frontier is composed by C(2), D(3), E(1), F(4); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new 
KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(1, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(4).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(5).getComputation().getStream().getStreamNumber()); + assertEquals(6, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(0).getComputation().getStream())); + } + + @Test + public void streamSelection2MockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + + // A(1,2) -> B(1) -> D(1,3) + // \----> C(2) + // E(4) -> F(4, 5) + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new 
ArgumentMock(4), new ArgumentMock(5))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(2, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(4).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(5).getComputation().getStream().getStreamNumber()); + assertEquals(4, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(0).getComputation().getStream())); + assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(4).getComputation().getStream())); + } + + @Test + public void streamSelectionSimpleWithSyncMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + // Create 4 mock kernel executions. In this case, kernel 3 requires 1 and 2 to finish, + // and kernel 4 requires kernel 3 to finish. 
The final frontier is composed of kernel 3 (arguments "1" and "2" are active), + // and kernel 4 (argument "3" is active); + // A(1) -> C(1, 2, 3) -> D(3) + // B(2) / + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), + new ArgumentMock(2), + new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(2, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(3, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(3).getComputation().getStream())); + assertEquals(1, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(1).getComputation().getStream())); + + // Synchronize computations; + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + // The stream has no active computation; + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(1).getComputation().getStream())); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(3).getComputation().getStream())); + } + + @Test + public void streamSelectionWithSyncMockTest() throws UnsupportedTypeException { + 
AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + + // A(1,2) -> B(1) -> D(1,3) -> E(1,4) -> F(4) + // \-> C(2) + // The final frontier is composed by C(2), D(3), E(1), F(4); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(4, context.getDag().getFrontier().size()); + // In this simple test, do not use disjoint stream assignment; + assertEquals(1, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(4).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(5).getComputation().getStream().getStreamNumber()); + assertEquals(6, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(5).getComputation().getStream())); + + // All computations are on the same stream, so syncing one will terminate all of them; + new SyncExecutionMock(context, Collections.singletonList(new 
ArgumentMock(4))).schedule(); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(5).getComputation().getStream())); + } + + @Test + public void streamSelection2WithSyncMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + + // A(1,2) -> B(1) -> D(1,3) + // \-> C(2) + // E(4) -> F(4, 5) + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(4), new ArgumentMock(5))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(2, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(4).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(5).getComputation().getStream().getStreamNumber()); + assertEquals(4, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(0).getComputation().getStream())); + assertEquals(2, 
context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(4).getComputation().getStream())); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(2).getComputation().getStream())); + assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(4).getComputation().getStream())); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(4))).schedule(); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(5).getComputation().getStream())); + } + + @Test + public void generateManyStreamsTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder().setRetrieveNewStreamPolicy(this.policy).build(); + + // Create 2 parallel branches on dependent computations, and check that the total amount of streams created is what is expected; + int numLoops = 10; + for (int i = 0; i < numLoops * 2; i += 2) { + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(i))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(i + 1))).schedule(); + // Sync point; + new SyncExecutionMock(context, Arrays.asList(new ArgumentMock(i), new ArgumentMock(i + 1))).schedule(); + } + + ExecutionDAG dag = context.getDag(); + // Check that kernels have been given the right stream; + int numStreams = this.policy == RetrieveNewStreamPolicyEnum.REUSE ? 2 : numLoops * 2; + int streamCheck1 = this.policy == RetrieveNewStreamPolicyEnum.REUSE ? 0 : numLoops * 2 - 2; + int streamCheck2 = this.policy == RetrieveNewStreamPolicyEnum.REUSE ? 
1 : numLoops * 2 - 1; + + assertEquals(numStreams, context.getStreamManager().getNumberOfStreams()); + assertEquals(streamCheck1, dag.getVertices().get(numLoops * 3 - 3).getComputation().getStream().getStreamNumber()); + assertEquals(streamCheck2, dag.getVertices().get(numLoops * 3 - 2).getComputation().getStream().getStreamNumber()); + } + + @Test + public void disjointArgumentStreamTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.policy).setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT).build(); + + // A(1,2) -> B(1) + // \-> C(2) + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(2, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(1).getComputation().getStream())); + assertEquals(1, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(2).getComputation().getStream())); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + assertEquals(1, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(2).getComputation().getStream())); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(0).getComputation().getStream())); + new 
SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(0).getComputation().getStream())); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(1).getComputation().getStream())); + } + + @Test + public void disjointArgumentStreamCrossTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.policy).setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT).build(); + + // A(1,2) -> C(1,3) + // X + // B(3,4) -> D(2,4) + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(3), new ArgumentMock(4))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(2), new ArgumentMock(4))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(2, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(2).getComputation().getStream())); + assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(3).getComputation().getStream())); 
+ } + + @Test + public void disjointArgumentStreamCross2Test() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.policy).setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT).build(); + + // A(1,2,7) -> D(1,3,5) + // X + // B(3,4,8) -> E(2,4,6) + // X + // C(5,6,9) -> F(7,8,9) + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(2), new ArgumentMock(7))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(3), new ArgumentMock(4), new ArgumentMock(8))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(5), new ArgumentMock(6), new ArgumentMock(9))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(1), new ArgumentMock(3), new ArgumentMock(5))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(2), new ArgumentMock(4), new ArgumentMock(6))).schedule(); + new KernelExecutionMock(context, + Arrays.asList(new ArgumentMock(7), new ArgumentMock(8), new ArgumentMock(9))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(3, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(2, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(4).getComputation().getStream().getStreamNumber()); + assertEquals(2, dag.getVertices().get(5).getComputation().getStream().getStreamNumber()); + assertEquals(2, 
context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(3).getComputation().getStream())); + assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(4).getComputation().getStream())); + assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(5).getComputation().getStream())); + } + + @Test + public void syncParentsOfParentsTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.policy).setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT).build(); + + // A(1,2) -> B(1) + // \-> C(2,3) -> D(2) + // \-> E(3) + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(2), new ArgumentMock(3))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + + ExecutionDAG dag = context.getDag(); + + // Check that kernels have been given the right stream; + assertEquals(3, context.getStreamManager().getNumberOfStreams()); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(3).getComputation().getStream().getStreamNumber()); + assertEquals(2, dag.getVertices().get(4).getComputation().getStream().getStreamNumber()); + assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(0).getComputation().getStream())); + 
assertEquals(2, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(2).getComputation().getStream())); + assertEquals(1, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(4).getComputation().getStream())); + + // Syncing E(3) will also sync the computations on streams 1 and 0; + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(3))).schedule(); + assertEquals(1, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(3).getComputation().getStream())); + assertEquals(1, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(1).getComputation().getStream())); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(4).getComputation().getStream())); + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + assertEquals(1, context.getStreamManager().getNumActiveComputationsOnStream(dag.getVertices().get(3).getComputation().getStream())); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(1).getComputation().getStream())); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + assertFalse(((GrCUDAStreamManagerMock) context.getStreamManager()).getActiveComputationsMap().containsKey(dag.getVertices().get(3).getComputation().getStream())); + } + + @Test + public void repeatedSyncTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.policy).build(); + + int numTest = 10; + ExecutionDAG dag = context.getDag(); + + for (int i = 0; i < numTest; i++) { + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new KernelExecutionMock(context, Collections.singletonList(new 
ArgumentMock(2))).schedule(); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(1), new ArgumentMock(2))).schedule(); + + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(1))).schedule(); + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(2))).schedule(); + assertEquals(0, dag.getVertices().get(0).getComputation().getStream().getStreamNumber()); + assertEquals(1, dag.getVertices().get(1).getComputation().getStream().getStreamNumber()); + assertEquals(0, dag.getVertices().get(2).getComputation().getStream().getStreamNumber()); + } + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/MultiGPUComplexDAGMockTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/MultiGPUComplexDAGMockTest.java new file mode 100644 index 00000000..03804b8e --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/MultiGPUComplexDAGMockTest.java @@ -0,0 +1,213 @@ +package com.nvidia.grcuda.test.runtime.stream; + +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.GrCUDAOptions; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import com.nvidia.grcuda.test.util.mock.AsyncGrCUDAExecutionContextMock; +import com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock; +import com.nvidia.grcuda.test.util.mock.OptionValuesMockBuilder; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.io.File; +import java.util.ArrayList; +import java.util.Arrays; +import 
java.util.Collection; +import java.util.List; + +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.bsMultiGPUMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.cgMultiGPUMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.executeMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.executeMockComputationAndValidate; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.iterationsCg; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.mlMultiGPUMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.mmulMultiGPUMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.vecMultiGPUMockComputation; +import static org.junit.Assert.assertEquals; +import static org.junit.Assume.assumeTrue; + +@RunWith(Parameterized.class) +public class MultiGPUComplexDAGMockTest { + + private final static boolean DEBUG = false; + + @Parameterized.Parameters + public static Collection<Object[]> data() { + return GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {RetrieveNewStreamPolicyEnum.ALWAYS_NEW}, + {RetrieveParentStreamPolicyEnum.SAME_AS_PARENT, RetrieveParentStreamPolicyEnum.DISJOINT, RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT, RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT}, + {DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE, DeviceSelectionPolicyEnum.MINMAX_TRANSFER_TIME, DeviceSelectionPolicyEnum.MINMIN_TRANSFER_TIME}, + {2, 4, 8} + })); + } + + private final RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy; + private final RetrieveParentStreamPolicyEnum retrieveParentStreamPolicy; + private final DeviceSelectionPolicyEnum deviceSelectionPolicy; + private final int numberOfGPUs; + + public MultiGPUComplexDAGMockTest( + RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy, + RetrieveParentStreamPolicyEnum retrieveParentStreamPolicy, + 
DeviceSelectionPolicyEnum deviceSelectionPolicy, + int numberOfGPUs) { + this.retrieveNewStreamPolicy = retrieveNewStreamPolicy; + this.retrieveParentStreamPolicy = retrieveParentStreamPolicy; + this.deviceSelectionPolicy = deviceSelectionPolicy; + this.numberOfGPUs = numberOfGPUs; + } + + private AsyncGrCUDAExecutionContextMock buildContext() { + AsyncGrCUDAExecutionContextMock context = new AsyncGrCUDAExecutionContextMock( + this.retrieveNewStreamPolicy, + this.retrieveParentStreamPolicy, + this.deviceSelectionPolicy, + true, this.numberOfGPUs, this.numberOfGPUs, + new GrCUDAOptionMap(new OptionValuesMockBuilder() + .add(GrCUDAOptions.DependencyPolicy, DependencyPolicyEnum.WITH_CONST.toString()) + .add(GrCUDAOptions.InputPrefetch, false) + .add(GrCUDAOptions.BandwidthMatrix, System.getenv("GRCUDA_HOME") + File.separatorChar + + "projects" + File.separatorChar + "resources" + File.separatorChar + + "connection_graph" + File.separatorChar + "datasets" + File.separatorChar + "connection_graph_8_v100.csv").build())); + if (MultiGPUComplexDAGMockTest.DEBUG) { + System.out.println(this); + } + return context; + } + + @Override + public String toString() { + return "options{" + + "new-stream=" + retrieveNewStreamPolicy + + ", parent-stream=" + retrieveParentStreamPolicy + + ", device-selection=" + deviceSelectionPolicy + + ", gpu-num=" + numberOfGPUs + + '}'; + } + + @Test + public void vecMultiGPUMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContextMock context = buildContext(); + List<Integer> scheduling = new ArrayList<>(); + for (int i = 0; i < 2 * GrCUDAComputationsMock.partitionsVec / this.numberOfGPUs; i++) { + for (int j = 0; j < this.numberOfGPUs / 2; j++) { + scheduling.add(j * 2); + scheduling.add(1 + j * 2); + scheduling.add(j * 2); + } + } + for (int i = 0; i < GrCUDAComputationsMock.partitionsVec; i++) { + scheduling.add(0); // Sync computations are associated with device 0, even if they are run by the CPU; + } + 
executeMockComputationAndValidate(vecMultiGPUMockComputation(context), scheduling, DEBUG); + assertEquals(2 * GrCUDAComputationsMock.partitionsVec, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void bsMultiGPUMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContextMock context = buildContext(); + List<Integer> scheduling = new ArrayList<>(); + for (int i = 0; i < GrCUDAComputationsMock.partitionsBs; i++) { + scheduling.add(i % this.numberOfGPUs); + } + for (int i = 0; i < GrCUDAComputationsMock.partitionsBs; i++) { + scheduling.add(0); // Sync computations are associated with device 0, even if they are run by the CPU; + } + executeMockComputationAndValidate(bsMultiGPUMockComputation(context), scheduling, DEBUG); + assertEquals(GrCUDAComputationsMock.partitionsBs, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void mlMultiGPUMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContextMock context = buildContext(); + // Skip policies that we know are uninteresting or suboptimal; + assumeTrue(this.retrieveParentStreamPolicy == RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT); + List<Integer> scheduling = new ArrayList<>(); + // RR1; + for (int i = 0; i < GrCUDAComputationsMock.partitionsMl; i++) { + scheduling.add(i % this.numberOfGPUs); + } + // RR11; + for (int i = 0; i < GrCUDAComputationsMock.partitionsMl; i++) { + scheduling.add(0); + } + // RR12, RR2; + for (int i = 0; i < GrCUDAComputationsMock.partitionsMl; i++) { + scheduling.add(i % this.numberOfGPUs); + scheduling.add(i % this.numberOfGPUs); + } + // RR3, RRSF; + scheduling.add(0); + scheduling.add(0); + // NB1; + for (int i = 0; i < GrCUDAComputationsMock.partitionsMl; i++) { + scheduling.add(i % this.numberOfGPUs); + } + // NB2, NB3, 
NB4, NBSF, AMAX, sync; + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + executeMockComputationAndValidate(mlMultiGPUMockComputation(context, true), scheduling, DEBUG); + assertEquals(3 * GrCUDAComputationsMock.partitionsMl - 1, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void cgMultiGPUMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContextMock context = buildContext(); + // Skip policies that we know are uninteresting or suboptimal; + assumeTrue(this.retrieveParentStreamPolicy != RetrieveParentStreamPolicyEnum.SAME_AS_PARENT); + List<Integer> scheduling = new ArrayList<>(); + // MVMA; + for (int i = 0; i < GrCUDAComputationsMock.partitionsCg; i++) { + scheduling.add(i % this.numberOfGPUs); + scheduling.add(i % this.numberOfGPUs); + } + // CPY, L2; + scheduling.add(0); + scheduling.add(0); + // Main computation; + for (int iter = 0; iter < iterationsCg; iter++) { + // MMUL; + for (int i = 0; i < GrCUDAComputationsMock.partitionsCg; i++) { + scheduling.add(i % this.numberOfGPUs); + } + // DOT, SYNC, SAXPY1, SAXPY2, L2, SYNC, SAXPY3; + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + scheduling.add(0); + } + scheduling.add(0); // Sync computations are associated with device 0, even if they are run by the CPU; + executeMockComputationAndValidate(cgMultiGPUMockComputation(context, true), scheduling, DEBUG); + assertEquals(4 * (GrCUDAComputationsMock.partitionsCg + 1), context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void mmulMultiGPUMockTest() throws UnsupportedTypeException { + AsyncGrCUDAExecutionContextMock context = buildContext(); + List<Integer> scheduling = new ArrayList<>(); + // Skip policies that we know are uninteresting or suboptimal; + assumeTrue(this.retrieveParentStreamPolicy != RetrieveParentStreamPolicyEnum.SAME_AS_PARENT && 
this.retrieveParentStreamPolicy != RetrieveParentStreamPolicyEnum.DISJOINT); + for (int i = 0; i < GrCUDAComputationsMock.partitionsMmul; i++) { + scheduling.add(i % this.numberOfGPUs); + } + for (int i = 0; i < GrCUDAComputationsMock.partitionsMmul; i++) { + scheduling.add(0); // Copy all on device 0; + } + scheduling.add(0); // Sync computations are associated with device 0, even if they are run by the CPU; + executeMockComputationAndValidate(mmulMultiGPUMockComputation(context), scheduling, DEBUG); + assertEquals(GrCUDAComputationsMock.partitionsMmul, context.getStreamManager().getNumberOfStreams()); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/MultiGPUExecutionDAGMockTest.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/MultiGPUExecutionDAGMockTest.java new file mode 100644 index 00000000..921429a7 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/runtime/stream/MultiGPUExecutionDAGMockTest.java @@ -0,0 +1,598 @@ +package com.nvidia.grcuda.test.runtime.stream; + +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.test.util.GrCUDATestUtil; +import com.nvidia.grcuda.test.util.mock.GrCUDAExecutionContextMockBuilder; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Arrays; +import java.util.Collection; + +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.executeMockComputation; +import static 
com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.executeMockComputationAndValidate; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.forkJoinMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.hitsMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.imageMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.joinPipeline2MockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.joinPipeline3MockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.joinPipeline4MockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.joinPipelineMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.manyIndependentKernelsMockComputation; +import static com.nvidia.grcuda.test.util.mock.GrCUDAComputationsMock.manyKernelsMockComputation; +import static org.junit.Assert.assertEquals; + +@RunWith(Parameterized.class) +public class MultiGPUExecutionDAGMockTest { + + @Parameterized.Parameters + public static Collection<Object[]> data() { + return GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {RetrieveNewStreamPolicyEnum.ALWAYS_NEW, RetrieveNewStreamPolicyEnum.REUSE} + })); + } + + private final RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy; + + public MultiGPUExecutionDAGMockTest(RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy) { + this.retrieveNewStreamPolicy = retrieveNewStreamPolicy; + } + + private final static int IMAGE_NUM_STREAMS = 4; + private final static int HITS_NUM_STREAMS = 2; + + @Test + public void deviceSelectionAlwaysOneImageTest() throws UnsupportedTypeException { + // Test that no matter how many GPUs we have, the SINGLE_GPU policy always selects GPU 0; + for (int i = 1; i < 4; i++) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + 
.setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.SINGLE_GPU) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(i).setNumberOfAvailableGPUs(i).build(); + executeMockComputation(imageMockComputation(context)); + context.getDeviceList().forEach(d -> d.getStreams().forEach(s -> assertEquals(0, s.getStreamDeviceId()))); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getDevice(0).getNumberOfFreeStreams()); + assertEquals(0, context.getStreamManager().getDevice(0).getNumberOfBusyStreams()); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getDevice(0).getStreams().size()); + } + } + + @Test + public void lessBusyWithOneGPUImageTest() throws UnsupportedTypeException { + // Test that no matter how many GPUs we have, the STREAM_AWARE policy with just 1 GPU always selects GPU 0; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(1).setNumberOfAvailableGPUs(1).build(); + executeMockComputation(imageMockComputation(context)); + context.getDeviceList().forEach(d -> d.getStreams().forEach(s -> assertEquals(0, s.getStreamDeviceId()))); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getDevice(0).getNumberOfFreeStreams()); + assertEquals(0, context.getStreamManager().getDevice(0).getNumberOfBusyStreams()); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getDevice(0).getStreams().size()); + } 
+ + @Test + public void deviceSelectionAlwaysOneHitsTest() throws UnsupportedTypeException { + // Test that no matter how many GPUs we have, the SINGLE_GPU policy always selects GPU 0; + for (int i = 1; i < 4; i++) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.SINGLE_GPU) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(i).setNumberOfAvailableGPUs(i).build(); + executeMockComputation(hitsMockComputation(context)); + context.getDeviceList().forEach(d -> d.getStreams().forEach(s -> assertEquals(0, s.getStreamDeviceId()))); + assertEquals(HITS_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + assertEquals(HITS_NUM_STREAMS, context.getStreamManager().getDevice(0).getNumberOfFreeStreams()); + assertEquals(0, context.getStreamManager().getDevice(0).getNumberOfBusyStreams()); + assertEquals(HITS_NUM_STREAMS, context.getStreamManager().getDevice(0).getStreams().size()); + } + } + + @Test + public void lessBusyWithOneGPUHitsTest() throws UnsupportedTypeException { + // Test that no matter how many GPUs we have, the STREAM_AWARE policy with just 1 GPU always selects GPU 0; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(1).setNumberOfAvailableGPUs(1).build(); + executeMockComputation(hitsMockComputation(context)); + context.getDeviceList().forEach(d -> d.getStreams().forEach(s -> assertEquals(0, s.getStreamDeviceId()))); + assertEquals(HITS_NUM_STREAMS, 
context.getStreamManager().getNumberOfStreams()); + assertEquals(HITS_NUM_STREAMS, context.getStreamManager().getDevice(0).getNumberOfFreeStreams()); + assertEquals(0, context.getStreamManager().getDevice(0).getNumberOfBusyStreams()); + assertEquals(HITS_NUM_STREAMS, context.getStreamManager().getDevice(0).getStreams().size()); + } + + // Test the STREAM_AWARE policy on 2 and 3 GPUs, on the image pipeline and HITS DAGs. + // In each case, validate the mapping of each computation on the right GPUs, + // and the total number of streams created; + + @Test + public void lessBusyWithTwoGPUImageTest() throws UnsupportedTypeException { + int numGPU = 2; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(imageMockComputation(context), + Arrays.asList( + 0, 1, 0, + 0, 1, + 0, 1, + 0, 0, 1, 0, 0)); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void lessBusyWithThreeGPUImageTest() throws UnsupportedTypeException { + int numGPU = 3; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(imageMockComputation(context), + Arrays.asList( + 0, 1, 2, + 0, 1, + 2, 0, + 2, 2, 1, 0, 0)); + assertEquals(IMAGE_NUM_STREAMS, 
context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void lessBusyWithTwoGPUHitsTest() throws UnsupportedTypeException { + int numGPU = 2; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(hitsMockComputation(context), + Arrays.asList(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)); + assertEquals(HITS_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void lessBusyWithThreeGPUHitsTest() throws UnsupportedTypeException { + int numGPU = 3; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Same as 2 GPUs, it never makes sense to use the 3rd GPU; + executeMockComputationAndValidate(hitsMockComputation(context), + Arrays.asList(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)); + assertEquals(HITS_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void lessBusyManyKernelsWithFourGPUTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + 
.setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(manyKernelsMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 1, 2, 3, 0, 2, 0, 0, 2, 1)); + assertEquals(6, context.getStreamManager().getNumberOfStreams()); + } + + // (X) --> (Z) --> (A) + // (Y) -/ \-> (B) + @Test + public void lessBusyForkJoinWithTwoGPUTest() throws UnsupportedTypeException { + int numGPU = 2; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // FIXME: When using stream-aware and 2 GPUs, the 5th kernel should be scheduled on device 2 as device 1 has synced the computation on it, + // and device 2 is the first device with fewer streams active (0, in this case). + // Currently this does not happen, because we cannot know if the computation on device 2 is actually over when we do the scheduling, + // although this does not affect correctness. 
+ executeMockComputationAndValidate(forkJoinMockComputation(context), + Arrays.asList(0, 1, 0, 0, 0)); + assertEquals(3, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void roundRobinTest() throws UnsupportedTypeException { + int[] gpus = {1, 4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.ROUND_ROBIN) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(manyIndependentKernelsMockComputation(context), + Arrays.asList(0, 1 % numGPU, 2 % numGPU, 3 % numGPU, 4 % numGPU, 5 % numGPU, 6 % numGPU, 7 % numGPU, 8 % numGPU, 9 % numGPU)); + assertEquals(10, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void roundRobinWithThreeGPUImageTest() throws UnsupportedTypeException { + int numGPU = 3; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.ROUND_ROBIN) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(imageMockComputation(context), + Arrays.asList( + 0, 1, 2, + 0, 1, + 2, 0, + 2, 2, 1, 0, 0)); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void roundRobinWithFourGPUHitsTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + 
.setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.ROUND_ROBIN) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(hitsMockComputation(context), + Arrays.asList(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)); + assertEquals(HITS_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void roundRobinManyKernelsWithFourGPUTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.STREAM_AWARE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(manyKernelsMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 1, 2, 3, 0, 2, 0, 0, 2, 1)); + assertEquals(6, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void roundRobinForkJoinWithTwoGPUTest() throws UnsupportedTypeException { + int numGPU = 2; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.ROUND_ROBIN) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(forkJoinMockComputation(context), + Arrays.asList(0, 1, 0, 0, 0)); + assertEquals(3, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void minTransferTest() throws UnsupportedTypeException { + int[] gpus 
= {1, 4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(manyIndependentKernelsMockComputation(context), + Arrays.asList(0, 1 % numGPU, 2 % numGPU, 3 % numGPU, 4 % numGPU, 5 % numGPU, 6 % numGPU, 7 % numGPU, 8 % numGPU, 9 % numGPU)); + assertEquals(10, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void minTransferWithDepTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(joinPipelineMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 2, 3)); + assertEquals(4, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void minTransferWithDepMultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + 
.setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(joinPipelineMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 0, 3)); + assertEquals(4, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void minTransferWithDep2MultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Computation 5/7 is scheduled on 0 because that's the ideal device chosen by MULTIGPU_EARLY_DISJOINT + // (device 0 and 1 have the same amount of data), even though device 1 would have a suitable parent. + // This also creates a new stream. 
+ // Computation 7/7 is scheduled on 0 because 0 has A,B while device 1,2,3 have only one array each; + executeMockComputationAndValidate(joinPipeline2MockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 3, 0)); + assertEquals(5, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void minTransferWithDep3MultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Computation 7/7 is scheduled on 0 because all devices have 1 array each, but GPU0 comes first + executeMockComputationAndValidate(joinPipeline3MockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 3, 0)); + assertEquals(4, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void minTransferWithDep4MultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Computation 7/7 is scheduled on 3 because GPU3 has both A and C + executeMockComputationAndValidate(joinPipeline4MockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 3, 3)); + assertEquals(5, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test 
+ public void minTransferWithThreeGPUImageTest() throws UnsupportedTypeException { + int numGPU = 3; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(imageMockComputation(context), + Arrays.asList( + 0, 0, 0, + 0, 0, + 0, 0, + 0, 0, 0, 0, 0)); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void minTransferWithFourGPUHitsTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // After the first iteration, GPU0 has a3 up to date, GPU1 has a4. 
So GPU0 is chosen as it comes first, + // and the scheduling collapses to GPU0; + executeMockComputationAndValidate(hitsMockComputation(context), + Arrays.asList(0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)); + assertEquals(3, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void minTransferManyKernelsWithFourGPUTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // The last 4 computations are scheduled on GPU0 as all devices contain just 1 required array and GPU0 is first; + executeMockComputationAndValidate(manyKernelsMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 1, 2, 3, 0, 2, 0, 0, 0, 0)); + assertEquals(7, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void minTransferForkJoinWithTwoGPUTest() throws UnsupportedTypeException { + int numGPU = 2; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(forkJoinMockComputation(context), + Arrays.asList(0, 1, 0, 0, 0)); + assertEquals(3, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void minTransferDisjointWithDepMultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int 
numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(joinPipelineMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 0, 3)); + assertEquals(4, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void minTransferDisjointWithDep2MultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Computation 5/7 is scheduled on 1 because 0 is not considered a parent: + // "A" is read-only in both cases, and no edge is added to the graph. 
+ // Same for computation 6/7; + // Computation 7/7 is scheduled on 1 because 1 has A,B while device 3 only has C; + executeMockComputationAndValidate(joinPipeline2MockComputation(context), + Arrays.asList(0, 1, 2, 3, 1, 3, 1)); + assertEquals(4, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void minTransferDisjointWithDep3MultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Computation 5/7 has GPU0 and GPU1 as parents. Both have the same data, but 0 comes first; + executeMockComputationAndValidate(joinPipeline3MockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 3, 0)); + assertEquals(4, context.getStreamManager().getNumberOfStreams()); + } + } + + @Test + public void minTransferDisjointWithDep4MultiGPUTest() throws UnsupportedTypeException { + int[] gpus = {4, 8}; + for (int numGPU : gpus) { + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // Computation 5/7 is scheduled on 1 because GPU0 is not considered a parent, + // A is read-only in both Comp1 and Comp5; + executeMockComputationAndValidate(joinPipeline4MockComputation(context), + Arrays.asList(0, 1, 2, 3, 1, 3, 3)); + 
assertEquals(4, context.getStreamManager().getNumberOfStreams()); + } + } + + + @Test + public void minTransferDisjointWithThreeGPUImageTest() throws UnsupportedTypeException { + int numGPU = 3; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(imageMockComputation(context), + Arrays.asList( + 0, 0, 0, + 0, 0, + 0, 0, + 0, 0, 0, 0, 0)); + assertEquals(IMAGE_NUM_STREAMS, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void minTransferDisjointWithFourGPUHitsTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(hitsMockComputation(context), + Arrays.asList(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)); + assertEquals(2, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void minTransferDisjointManyKernelsWithFourGPUTest() throws UnsupportedTypeException { + int numGPU = 4; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) 
+ .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + // The last 4 computations are split between GPU0 and GPU2 (two each), based on which device already holds the required arrays; + executeMockComputationAndValidate(manyKernelsMockComputation(context), + Arrays.asList(0, 1, 2, 3, 0, 1, 2, 3, 0, 2, 0, 0, 2, 2)); + assertEquals(6, context.getStreamManager().getNumberOfStreams()); + } + + @Test + public void minTransferDisjointForkJoinWithTwoGPUTest() throws UnsupportedTypeException { + int numGPU = 2; + AsyncGrCUDAExecutionContext context = new GrCUDAExecutionContextMockBuilder() + .setRetrieveNewStreamPolicy(this.retrieveNewStreamPolicy) + .setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT) + .setDeviceSelectionPolicy(DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE) + .setDependencyPolicy(DependencyPolicyEnum.WITH_CONST) + .setNumberOfGPUsToUse(numGPU).setNumberOfAvailableGPUs(numGPU).build(); + executeMockComputationAndValidate(forkJoinMockComputation(context), + Arrays.asList(0, 1, 0, 0, 0)); + assertEquals(3, context.getStreamManager().getNumberOfStreams()); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/GrCUDATestOptionsStruct.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/GrCUDATestOptionsStruct.java new file mode 100644 index 00000000..a56644d4 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/GrCUDATestOptionsStruct.java @@ -0,0 +1,87 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.util; + +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; + +public class GrCUDATestOptionsStruct { + public final ExecutionPolicyEnum policy; + public final boolean inputPrefetch; + public final RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy; + public final RetrieveParentStreamPolicyEnum retrieveParentStreamPolicy; + public final DependencyPolicyEnum dependencyPolicy; + public final DeviceSelectionPolicyEnum deviceSelectionPolicy; + public final boolean forceStreamAttach; + public final boolean timeComputation; + public final int numberOfGPUs; + + /** + * A simple struct that holds a combination of GrCUDA options, extracted from the output of {@link GrCUDATestUtil#getAllOptionCombinationsSingleGPU} + */ + public GrCUDATestOptionsStruct(ExecutionPolicyEnum policy, + boolean inputPrefetch, + RetrieveNewStreamPolicyEnum retrieveNewStreamPolicy, + RetrieveParentStreamPolicyEnum retrieveParentStreamPolicy, + DependencyPolicyEnum dependencyPolicy, + DeviceSelectionPolicyEnum deviceSelectionPolicy, + boolean forceStreamAttach, + boolean timeComputation, + int numberOfGPUs) { + this.policy = policy; + this.inputPrefetch = inputPrefetch; + this.retrieveNewStreamPolicy = retrieveNewStreamPolicy; + this.retrieveParentStreamPolicy = retrieveParentStreamPolicy; + this.dependencyPolicy = dependencyPolicy; + this.deviceSelectionPolicy = deviceSelectionPolicy; + this.forceStreamAttach = forceStreamAttach; + this.timeComputation = timeComputation; + this.numberOfGPUs = numberOfGPUs; + } + + @Override + public String toString() { + return "GrCUDATestOptionsStruct{" + + "policy=" + policy + + ", inputPrefetch=" + 
inputPrefetch + + ", retrieveNewStreamPolicy=" + retrieveNewStreamPolicy + + ", retrieveParentStreamPolicy=" + retrieveParentStreamPolicy + + ", dependencyPolicy=" + dependencyPolicy + + ", deviceSelectionPolicy=" + deviceSelectionPolicy + + ", forceStreamAttach=" + forceStreamAttach + + ", timeComputation=" + timeComputation + + ", numberOfGPUs=" + numberOfGPUs + + '}'; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/GrCUDATestUtil.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/GrCUDATestUtil.java new file mode 100644 index 00000000..4566e007 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/GrCUDATestUtil.java @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.util; + +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicy; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import org.graalvm.polyglot.Context; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.List; + +public class GrCUDATestUtil { + public static Collection<Object[]> crossProduct(List<Object[]> sets) { + int solutions = 1; + List<Object[]> combinations = new ArrayList<>(); + for (Object[] objects : sets) { + solutions *= objects.length; + } + for(int i = 0; i < solutions; i++) { + int j = 1; + List<Object> current = new ArrayList<>(); + for(Object[] set : sets) { + current.add(set[(i / j) % set.length]); + j *= set.length; + } + combinations.add(current.toArray(new Object[0])); + } + return combinations; + } + + /** + * Return a list of {@link GrCUDATestOptionsStruct}, where each element is a combination of input policy options.
+ * Useful to perform tests that cover all cases; + * @return the cross-product of all options + */ + public static Collection<Object[]> getAllOptionCombinationsSingleGPU() { + Collection<Object[]> options = GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {ExecutionPolicyEnum.SYNC, ExecutionPolicyEnum.ASYNC}, + {true, false}, // InputPrefetch + {RetrieveNewStreamPolicyEnum.REUSE, RetrieveNewStreamPolicyEnum.ALWAYS_NEW}, + {RetrieveParentStreamPolicyEnum.SAME_AS_PARENT, RetrieveParentStreamPolicyEnum.DISJOINT}, + {DependencyPolicyEnum.NO_CONST, DependencyPolicyEnum.WITH_CONST}, + {DeviceSelectionPolicyEnum.SINGLE_GPU}, + {true, false}, // ForceStreamAttach + {true, false}, // With and without timing of kernels + {1}, // Number of GPUs + })); + List<Object[]> combinations = new ArrayList<>(); + options.forEach(optionArray -> { + GrCUDATestOptionsStruct newStruct = new GrCUDATestOptionsStruct( + (ExecutionPolicyEnum) optionArray[0], (boolean) optionArray[1], + (RetrieveNewStreamPolicyEnum) optionArray[2], (RetrieveParentStreamPolicyEnum) optionArray[3], + (DependencyPolicyEnum) optionArray[4], (DeviceSelectionPolicyEnum) optionArray[5], + (boolean) optionArray[6], (boolean) optionArray[7], (int) optionArray[8]); + if (!isOptionRedundantForSync(newStruct)) { + combinations.add(new GrCUDATestOptionsStruct[]{newStruct}); + } + }); + // Check that the number of options is correct <(sync + async) * logging>; + assert(combinations.size() == (2 * 2 + 2 * 2 * 2 * 2 * 2) * 2); + return combinations; + } + + /** + * Return a list of {@link GrCUDATestOptionsStruct}, where each element is a combination of input policy options. + * Cover testing options for multi-GPU systems.
Do not consider the sync scheduling as it does not support multiple GPUs; + * @return the cross-product of all options + */ + public static Collection<Object[]> getAllOptionCombinationsMultiGPU() { + Collection<Object[]> options = GrCUDATestUtil.crossProduct(Arrays.asList(new Object[][]{ + {ExecutionPolicyEnum.ASYNC}, + {true, false}, // InputPrefetch + {RetrieveNewStreamPolicyEnum.REUSE, RetrieveNewStreamPolicyEnum.ALWAYS_NEW}, // Simplify number of tests, don't use all options; + {RetrieveParentStreamPolicyEnum.SAME_AS_PARENT, RetrieveParentStreamPolicyEnum.DISJOINT, RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT, RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT}, + {DependencyPolicyEnum.WITH_CONST, DependencyPolicyEnum.NO_CONST}, // Simplify number of tests, don't use all options; + {DeviceSelectionPolicyEnum.SINGLE_GPU, DeviceSelectionPolicyEnum.STREAM_AWARE, DeviceSelectionPolicyEnum.ROUND_ROBIN, + DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE, DeviceSelectionPolicyEnum.MINMIN_TRANSFER_TIME, DeviceSelectionPolicyEnum.MINMAX_TRANSFER_TIME}, + {false, true}, // ForceStreamAttach, simplify number of tests, don't use all options; + {true, false}, // With and without timing of kernels + {2, 4, 8}, // Number of GPUs + })); + List<Object[]> combinations = new ArrayList<>(); + options.forEach(optionArray -> { + GrCUDATestOptionsStruct newStruct = new GrCUDATestOptionsStruct( + (ExecutionPolicyEnum) optionArray[0], (boolean) optionArray[1], + (RetrieveNewStreamPolicyEnum) optionArray[2], (RetrieveParentStreamPolicyEnum) optionArray[3], + (DependencyPolicyEnum) optionArray[4], (DeviceSelectionPolicyEnum) optionArray[5], + (boolean) optionArray[6], (boolean) optionArray[7], (int) optionArray[8]); + combinations.add(new GrCUDATestOptionsStruct[]{newStruct}); + }); + // Check that the number of options is correct; + assert(combinations.size() == (2 * 2 * 4 * 2 * 6 * 2 * 2 * 3)); + return combinations; + } + + public static Context createContextFromOptions(GrCUDATestOptionsStruct options, int
numberOfGPUs) { + return buildTestContext() + .option("grcuda.ExecutionPolicy", options.policy.toString()) + .option("grcuda.InputPrefetch", String.valueOf(options.inputPrefetch)) + .option("grcuda.RetrieveNewStreamPolicy", options.retrieveNewStreamPolicy.toString()) + .option("grcuda.RetrieveParentStreamPolicy", options.retrieveParentStreamPolicy.toString()) + .option("grcuda.DependencyPolicy", options.dependencyPolicy.toString()) + .option("grcuda.DeviceSelectionPolicy", options.deviceSelectionPolicy.toString()) + .option("grcuda.ForceStreamAttach", String.valueOf(options.forceStreamAttach)) + .option("grcuda.EnableComputationTimers", String.valueOf(options.timeComputation)) + .option("grcuda.NumberOfGPUs", String.valueOf(numberOfGPUs)) + .build(); + } + + public static Context createContextFromOptions(GrCUDATestOptionsStruct options) { + return GrCUDATestUtil.createContextFromOptions(options, options.numberOfGPUs); + } + + public static Context.Builder buildTestContext() { + return Context.newBuilder() + .allowAllAccess(true) + .allowExperimentalOptions(true) + .logHandler(new TestLogHandler()) + .option("log.grcuda.com.nvidia.grcuda.level", "WARNING") +// .option("log.grcuda." + GrCUDALogger.COMPUTATION_LOGGER + ".level", "FINE") // Uncomment to print kernel log; + ; + } + + /** + * If the execution policy is "sync", we don't need to test all combinations of flags that are specific to + * the async scheduler. 
So we can simply keep the default values for them (as they are unused anyway) + * and flag all other combinations as redundant; + * @param options a combination of input options for GrCUDA + * @return if the option combination is redundant for the sync scheduler + */ + private static boolean isOptionRedundantForSync(GrCUDATestOptionsStruct options) { + if (options.policy.equals(ExecutionPolicyEnum.SYNC)) { + return options.retrieveNewStreamPolicy.equals(RetrieveNewStreamPolicyEnum.ALWAYS_NEW) || + options.retrieveParentStreamPolicy.equals(RetrieveParentStreamPolicyEnum.DISJOINT) || + options.dependencyPolicy.equals(DependencyPolicyEnum.WITH_CONST); + } + return false; + } +} + + diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/TestLogHandler.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/TestLogHandler.java new file mode 100644 index 00000000..c5e5fad0 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/TestLogHandler.java @@ -0,0 +1,61 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.util; + +import java.util.logging.Handler; +import java.util.logging.LogRecord; + +public final class TestLogHandler extends Handler { + private volatile boolean closed; + + public TestLogHandler() { + } + + @Override + public void publish(LogRecord record) { + if (closed) { + throw new IllegalStateException("Closed handler"); + } + System.out.println("[" + record.getLoggerName() + "] " + record.getLevel() + ": " + record.getMessage()); + } + + @Override + public void flush() { + if (closed) { + throw new IllegalStateException("Closed handler"); + } + } + + @Override + public void close() throws SecurityException { + closed = true; + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/ArgumentMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/ArgumentMock.java new file mode 100644 index 00000000..a898a952 --- /dev/null +++ 
b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/ArgumentMock.java @@ -0,0 +1,49 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; + +public class ArgumentMock extends ComputationArgumentWithValue { + public ArgumentMock(Object value) { + super("argument_mock_nonconst", Type.NFI_POINTER, Kind.POINTER_INOUT, value); + } + + public ArgumentMock(Object value, boolean isConst) { + super(isConst ? "argument_mock_const" : "argument_mock_nonconst", Type.NFI_POINTER, isConst ? Kind.POINTER_IN : Kind.POINTER_INOUT, value); + } + + @Override + public String toString() { + return this.getArgumentValue().toString() + (isArray ? "" : ", scalar") + (isConst ? "-const" : ""); + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/AsyncGrCUDAExecutionContextMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/AsyncGrCUDAExecutionContextMock.java new file mode 100644 index 00000000..abc06336 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/AsyncGrCUDAExecutionContextMock.java @@ -0,0 +1,110 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.GrCUDAOptions; +import com.nvidia.grcuda.runtime.computation.streamattach.PostPascalStreamAttachPolicy; +import com.nvidia.grcuda.runtime.computation.streamattach.StreamAttachArchitecturePolicy; +import com.nvidia.grcuda.runtime.computation.streamattach.PrePascalStreamAttachPolicy; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; + +/** + * Mock class to test the GrCUDAExecutionContextTest, it has a null CUDARuntime; + */ +public class AsyncGrCUDAExecutionContextMock extends AsyncGrCUDAExecutionContext { + + // Store it here to avoid using a 
mocked runtime; + private final boolean architectureIsPascalOrNewer; + + public void setCurrentGPU(int gpu) { + GrCUDADevicesManagerMock devicesManager = (GrCUDADevicesManagerMock) this.getStreamManager().getStreamPolicy().getDevicesManager(); + devicesManager.setCurrentGPU(gpu); + } + + public int getCurrentGPU() { + return this.getStreamManager().getStreamPolicy().getDevicesManager().getCurrentGPU().getDeviceId(); + } + + public AsyncGrCUDAExecutionContextMock() { + this(DependencyPolicyEnum.NO_CONST); + } + + public AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum dependencyPolicy) { + this(dependencyPolicy, RetrieveNewStreamPolicyEnum.ALWAYS_NEW, RetrieveParentStreamPolicyEnum.SAME_AS_PARENT); + } + + public AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum dependencyPolicy, + RetrieveNewStreamPolicyEnum retrieveStreamPolicy, + RetrieveParentStreamPolicyEnum parentStreamPolicyEnum) { + this(dependencyPolicy, retrieveStreamPolicy, parentStreamPolicyEnum, DeviceSelectionPolicyEnum.SINGLE_GPU, true, 1, 1); + } + + public AsyncGrCUDAExecutionContextMock(DependencyPolicyEnum dependencyPolicy, + RetrieveNewStreamPolicyEnum retrieveStreamPolicy, + RetrieveParentStreamPolicyEnum parentStreamPolicyEnum, + DeviceSelectionPolicyEnum deviceSelectionPolicyEnum, + boolean architectureIsPascalOrNewer, + int numberOfAvailableGPUs, + int numberOfGPUsToUse) { + super(null, + new GrCUDAOptionMap(new OptionValuesMockBuilder() + .add(GrCUDAOptions.DependencyPolicy, dependencyPolicy.toString()) + .add(GrCUDAOptions.InputPrefetch, false).build()), + new GrCUDAStreamManagerMock(null, retrieveStreamPolicy, parentStreamPolicyEnum, deviceSelectionPolicyEnum, GrCUDAOptionMap.DEFAULT_BANDWIDTH_MATRIX, numberOfAvailableGPUs, numberOfGPUsToUse)); + this.architectureIsPascalOrNewer = architectureIsPascalOrNewer; + } + + public AsyncGrCUDAExecutionContextMock(RetrieveNewStreamPolicyEnum retrieveStreamPolicy, + RetrieveParentStreamPolicyEnum parentStreamPolicyEnum, + 
DeviceSelectionPolicyEnum deviceSelectionPolicyEnum,
+                                           boolean architectureIsPascalOrNewer,
+                                           int numberOfAvailableGPUs,
+                                           int numberOfGPUsToUse,
+                                           GrCUDAOptionMap options) {
+        super(null, options,
+                new GrCUDAStreamManagerMock(null, retrieveStreamPolicy, parentStreamPolicyEnum, deviceSelectionPolicyEnum, options.getBandwidthMatrix(), numberOfAvailableGPUs, numberOfGPUsToUse));
+        this.architectureIsPascalOrNewer = architectureIsPascalOrNewer;
+    }
+
+    public StreamAttachArchitecturePolicy getArrayStreamArchitecturePolicy() {
+        return architectureIsPascalOrNewer ? new PostPascalStreamAttachPolicy() : new PrePascalStreamAttachPolicy();
+    }
+
+    @Override
+    public boolean isArchitecturePascalOrNewer() {
+        return architectureIsPascalOrNewer;
+    }
+}
diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayMock.java
new file mode 100644
index 00000000..58f1abe7
--- /dev/null
+++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayMock.java
@@ -0,0 +1,58 @@
+package com.nvidia.grcuda.test.util.mock;
+
+import com.nvidia.grcuda.Type;
+import com.nvidia.grcuda.runtime.CPUDevice;
+import com.nvidia.grcuda.runtime.LittleEndianNativeArrayView;
+import com.nvidia.grcuda.runtime.array.DeviceArray;
+import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext;
+import com.oracle.truffle.api.dsl.Cached;
+import com.oracle.truffle.api.interop.InteropLibrary;
+import com.oracle.truffle.api.interop.InvalidArrayIndexException;
+import com.oracle.truffle.api.interop.UnsupportedTypeException;
+import com.oracle.truffle.api.library.CachedLibrary;
+import com.oracle.truffle.api.profiles.ValueProfile;
+
+public class DeviceArrayMock extends DeviceArray {
+    public DeviceArrayMock() {
+        this(0, Type.SINT32);
+    }
+
+    public DeviceArrayMock(int numElements) {
+        this(numElements, Type.SINT32);
+    }
+
+    public
DeviceArrayMock(int numElements, Type type) { + super(new AsyncGrCUDAExecutionContextMock(), numElements, type); + } + + public DeviceArrayMock(AbstractGrCUDAExecutionContext context) { + super(context, 0, Type.SINT32); + if (context.isArchitecturePascalOrNewer()) { + this.addArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } else { + this.addArrayUpToDateLocations(context.getCurrentGPU()); + } + } + + @Override + public void writeArrayElement(long index, Object value, + @CachedLibrary(limit = "3") InteropLibrary valueLibrary, + @Cached.Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException, InvalidArrayIndexException { + if (this.canSkipSchedulingWrite()) { + // Fast path, don't do anything here; + } else { + new DeviceArrayWriteExecutionMock(this, index, value).schedule(); + } + } + + @Override + protected LittleEndianNativeArrayView allocateMemory() { + return null; + } + + @Override + public String toString() { + return this.getElementType() + "[" + this.getArraySize() + "]"; + } +} + diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayReadExecutionMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayReadExecutionMock.java new file mode 100644 index 00000000..12439fe0 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayReadExecutionMock.java @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.util.mock; + +import java.util.stream.Collectors; + +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.nvidia.grcuda.runtime.computation.arraycomputation.DeviceArrayReadExecution; + +/** + * Mock class that represents a synchronous read execution, + * it can be used to synchronize previous computations using the specified arguments; + */ +public class DeviceArrayReadExecutionMock extends DeviceArrayReadExecution { + + public DeviceArrayReadExecutionMock(DeviceArray array, + long index) { + super(array, index, null); + } + + @Override + public Object execute() { + this.setComputationFinished(); + return NoneValue.get(); + } + + @Override + public boolean canUseStream() { return false; } + + @Override + public void associateArraysToStreamImpl() { } + + @Override + public String toString() { + return "sync read" + "; args=[" + + this.argumentsThatCanCreateDependencies.stream().map(Object::toString).collect(Collectors.joining(", ")) + + "]"; + } +} + diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayWriteExecutionMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayWriteExecutionMock.java new file mode 100644 index 00000000..c42e9dd2 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceArrayWriteExecutionMock.java @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.util.mock; + +import java.util.stream.Collectors; + +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.nvidia.grcuda.runtime.computation.arraycomputation.DeviceArrayWriteExecution; + +/** + * Mock class that represents a synchronous write execution, + * it can be used to synchronize previous computations using the specified arguments; + */ +public class DeviceArrayWriteExecutionMock extends DeviceArrayWriteExecution { + + public DeviceArrayWriteExecutionMock(DeviceArray array, + long index, + Object value) { + super(array, index, value, null, null); + } + + @Override + public Object execute() { + this.setComputationFinished(); + return NoneValue.get(); + } + + @Override + public boolean canUseStream() { return false; } + + @Override + public void associateArraysToStreamImpl() { } + + @Override + public String toString() { + return "sync write" + "; args=[" + + this.argumentsThatCanCreateDependencies.stream().map(Object::toString).collect(Collectors.joining(", ")) + + "]"; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceListMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceListMock.java new file mode 100644 index 00000000..effde17e --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceListMock.java @@ -0,0 +1,17 @@ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.DeviceList; + +public class DeviceListMock extends DeviceList { + public DeviceListMock(int numDevices) { + super(numDevices, null); + } + + @Override + public void initializeDeviceList(int numDevices, CUDARuntime runtime) { + for (int deviceOrdinal = 0; deviceOrdinal < numDevices; deviceOrdinal++) { + devices.set(deviceOrdinal, new DeviceMock(deviceOrdinal, null)); + } + } +} diff --git 
a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceMock.java new file mode 100644 index 00000000..354730aa --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/DeviceMock.java @@ -0,0 +1,34 @@ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +public class DeviceMock extends Device { + + public DeviceMock(int deviceId, CUDARuntime runtime) { + super(deviceId, runtime); + } + + /** + * Create a fake CUDA stream on this device + */ + @Override + public CUDAStream createStream() { + CUDAStream newStream = new CUDAStream(0, GrCUDAStreamManagerMock.numUserAllocatedStreams++, deviceId); + this.getStreams().add(newStream); + return newStream; + } + + @Override + public void cleanup() { + this.freeStreams.clear(); + this.getStreams().clear(); + } + + @Override + public String toString() { + return "MockGPU(id=" + deviceId + ")"; + } +} + diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAComputationsMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAComputationsMock.java new file mode 100644 index 00000000..fe7eaad6 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAComputationsMock.java @@ -0,0 +1,617 @@ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import static org.junit.Assert.assertEquals; + +public 
class GrCUDAComputationsMock {
+
+    /**
+     * Schedule a sequence of mock GrCUDAComputationalElements for execution;
+     */
+    public static void executeMockComputation(List<GrCUDAComputationalElement> computations) throws UnsupportedTypeException {
+        executeMockComputationAndValidateInner(computations, new ArrayList<>(), false, false);
+    }
+
+    /**
+     * Schedule a sequence of mock GrCUDAComputationalElements for execution;
+     */
+    public static void executeMockComputation(List<GrCUDAComputationalElement> computations, boolean debug) throws UnsupportedTypeException {
+        executeMockComputationAndValidateInner(computations, new ArrayList<>(), false, debug);
+    }
+
+    /**
+     * Schedule a sequence of mock GrCUDAComputationalElements for execution,
+     * and validate that the GPU scheduling of each computation is the one expected.
+     * @param computations a sequence of computations to be scheduled
+     * @param gpuScheduling a list of gpu identifiers. Each identifier "i" represents the GPU scheduling for the i-th computation;
+     * @throws UnsupportedTypeException
+     */
+    public static void executeMockComputationAndValidate(List<GrCUDAComputationalElement> computations, List<Integer> gpuScheduling) throws UnsupportedTypeException {
+        executeMockComputationAndValidate(computations, gpuScheduling, false);
+    }
+
+    /**
+     * Schedule a sequence of mock GrCUDAComputationalElements for execution,
+     * and validate that the GPU scheduling of each computation is the one expected.
+     * @param computations a sequence of computations to be scheduled
+     * @param gpuScheduling a list of gpu identifiers.
Each identifier "i" represents the GPU scheduling for the i-th computation;
+     * @param debug if true, print debug information about the scheduling
+     * @throws UnsupportedTypeException
+     */
+    public static void executeMockComputationAndValidate(List<GrCUDAComputationalElement> computations, List<Integer> gpuScheduling, boolean debug) throws UnsupportedTypeException {
+        executeMockComputationAndValidateInner(computations, gpuScheduling, true, debug);
+    }
+
+    private static void executeMockComputationAndValidateInner(
+            List<GrCUDAComputationalElement> computations,
+            List<Integer> gpuScheduling,
+            boolean validate,
+            boolean debug) throws UnsupportedTypeException {
+        if (validate) {
+            assertEquals(computations.size(), gpuScheduling.size());
+        }
+        for (int i = 0; i < computations.size(); i++) {
+            GrCUDAComputationalElement c = computations.get(i);
+            c.schedule();
+            int actual = c.getStream().getStreamDeviceId();
+            if (debug) {
+                System.out.println(c);
+            }
+            if (validate) {
+                int expected = gpuScheduling.get(i);
+                if (expected != actual) {
+                    System.out.println("wrong GPU allocation for kernel " + i + "=" + c + "; expected=" + expected + "; actual=" + actual);
+                }
+                assertEquals(expected, actual);
+            }
+        }
+    }
+
+    //////////////////////////////////////////////////////////////////////////////
+    // Simple GPU computations, to test standard DAG patterns (e.g.
fork-join), //
+    // and corner-cases; //
+    //////////////////////////////////////////////////////////////////////////////
+
+    // Simply schedule 10 kernels on independent data;
+    public static List<GrCUDAComputationalElement> manyIndependentKernelsMockComputation(AsyncGrCUDAExecutionContext context) {
+        return Arrays.asList(
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10)))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(new DeviceArrayMock(10))))
+        );
+    }
+
+    // (Ar) --> (A, B) --> (A, B, C) -> (A, B, D)
+    // (Br) -/ / /
+    // (Cr) ----------/ /
+    // (Dr) -----------------------/
+    public static List<GrCUDAComputationalElement> joinPipelineMockComputation(AsyncGrCUDAExecutionContext context) {
+        DeviceArrayMock a = new DeviceArrayMock(10);
+        DeviceArrayMock b = new DeviceArrayMock(10);
+        DeviceArrayMock c = new DeviceArrayMock(10);
+        DeviceArrayMock d = new DeviceArrayMock(100);
+        return Arrays.asList(
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a, true))),
+                new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(b, true))),
+                new
KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(c, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(d, true))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a), new ArgumentMock(b))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a), new ArgumentMock(b), new ArgumentMock(c))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a), new ArgumentMock(b), new ArgumentMock(d))) + ); + } + + // (Ar) --> (Ar, B) --> (A, B, C) + // (Br) -/ / + // (Cr) --> (C, D) -/ + // (Dr) -/ + public static List joinPipeline2MockComputation(AsyncGrCUDAExecutionContext context) { + DeviceArrayMock a = new DeviceArrayMock(10); + DeviceArrayMock b = new DeviceArrayMock(10); + DeviceArrayMock c = new DeviceArrayMock(10); + DeviceArrayMock d = new DeviceArrayMock(100); + return Arrays.asList( + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(b, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(c, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(d, true))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a, true), new ArgumentMock(b))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(c), new ArgumentMock(d))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a), new ArgumentMock(b), new ArgumentMock(c))) + ); + } + + // (Ar) --> (Ar, B) --> (B, C) + // (Br) -/ / + // (Cr) --> (C, D) -/ + // (Dr) -/ + public static List joinPipeline3MockComputation(AsyncGrCUDAExecutionContext context) { + DeviceArrayMock a = new DeviceArrayMock(10); + DeviceArrayMock b = new DeviceArrayMock(10); + DeviceArrayMock c = new DeviceArrayMock(10); + DeviceArrayMock d = new DeviceArrayMock(100); + return Arrays.asList( + new KernelExecutionMock(context, 
Collections.singletonList(new ArgumentMock(a, false))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(b, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(c, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(d, true))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a, true), new ArgumentMock(b))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(c), new ArgumentMock(d))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(b), new ArgumentMock(c))) + ); + } + + // (Ar) --> (Ar, B) -> (A, C, D) -> (A, C) + // (Br) -/ / + // (Cr) -----------/ + // (Dr) ----------/ + public static List joinPipeline4MockComputation(AsyncGrCUDAExecutionContext context) { + DeviceArrayMock a = new DeviceArrayMock(10); + DeviceArrayMock b = new DeviceArrayMock(10); + DeviceArrayMock c = new DeviceArrayMock(10); + DeviceArrayMock d = new DeviceArrayMock(100); + return Arrays.asList( + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(b, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(c, true))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(d, true))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a, true), new ArgumentMock(b))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a), new ArgumentMock(c), new ArgumentMock(d))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a), new ArgumentMock(c))) + ); + } + + // (X) --> (Z) --> (A) + // (Y) -/ \-> (B) + public static List forkJoinMockComputation(AsyncGrCUDAExecutionContext context) { + DeviceArrayMock a1 = new DeviceArrayMock(10); + DeviceArrayMock a2 = new DeviceArrayMock(10); + return Arrays.asList( + new KernelExecutionMock(context, 
Collections.singletonList(new ArgumentMock(a1))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a2))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a1), new ArgumentMock(a2))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a1))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a2))) + ); + } + + + // K0 -> K4 -> K8 ---> K10 + // K1 -> K5 / \--> K11 + // K2 -> K6 -> K9 -\-> K12 + // K3 -> K7 /------\-> K13 + public static List manyKernelsMockComputation(AsyncGrCUDAExecutionContext context) { + DeviceArrayMock a = new DeviceArrayMock(10); + DeviceArrayMock b = new DeviceArrayMock(10); + DeviceArrayMock c = new DeviceArrayMock(10); + DeviceArrayMock d = new DeviceArrayMock(10); + return Arrays.asList( + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(b))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(c))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(d))), + + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(b))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(c))), + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(d))), + + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a), new ArgumentMock(b))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(c), new ArgumentMock(d))), + + new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(a, true))), + // When using stream-aware and 4 GPUs, this is scheduled on device 2 (of 4) as device 1 has synced the computation on it (with K8), + // and device 2 is the first device with fewer streams; + new 
KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a, true), new ArgumentMock(b))), + // When using stream-aware and 4 GPUs, this is scheduled on device 3 (reuse the stream of K9); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a, true), new ArgumentMock(c))), + // When using stream-aware and 4 GPUs, this is scheduled on device 2 (device with fewer streams, device 1 has 2); + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a, true), new ArgumentMock(d))) + ); + } + + /////////////////////////////////////////////////////////////////////// + // Complex GPU benchmarks, inspired by the real benchmarks in GrCUDA // + /////////////////////////////////////////////////////////////////////// + + // 0: K0(1c, 2) -> 2: K3(2c, 5) -> 4: K5(2c, 5c, 3) -> Repeat -> S(3) + // \--------------\/ + // /--------------/\ + // 1: K1(3c, 4) -> 3: K4(4c, 6) -> 5: K6(4c, 6c, 1) -> Repeat -> S(1) + public static List hitsMockComputation(AsyncGrCUDAExecutionContext context) { + DeviceArrayMock a1 = new DeviceArrayMock(10); + DeviceArrayMock a2 = new DeviceArrayMock(10); + DeviceArrayMock a3 = new DeviceArrayMock(10); + DeviceArrayMock a4 = new DeviceArrayMock(10); + DeviceArrayMock a5 = new DeviceArrayMock(1); + DeviceArrayMock a6 = new DeviceArrayMock(1); + List computations = new ArrayList<>(); + int numIterations = 2; + for (int i = 0; i < numIterations; i++) { + // hub1 -> auth2 + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a1, true), new ArgumentMock(a2)))); + // auth1 -> hub2 + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a3, true), new ArgumentMock(a4)))); + // auth2 -> auth_norm + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a2, true), new ArgumentMock(a5)))); + // hub2 -> hub_norm + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a4, true), new ArgumentMock(a6)))); + // auth2, auth_norm -> auth1 + 
computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a2, true), new ArgumentMock(a5, true), new ArgumentMock(a3)))); + // hub2, hub_norm -> hub1 + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a4, true), new ArgumentMock(a6, true), new ArgumentMock(a1)))); + } + computations.add(new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(a3)))); + computations.add(new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(a1)))); + return computations; + } + + // 0: B1(1c, 2) -> 3: S1(2c, 5) -------------------------------------------------------------> 10: C(10c, 2c, 5c, 11) -> X + // 1: B2(1c, 3) -> 4: S2(3c, 6) -------------------------------------> 9: C(9c, 3c, 6c, 10) / + // 2: B3(1c, 4) -> 5: E1(4c, 7) -> 7: E3(4, 7, 8) -> 8: U(1c, 4, 9) / + // \-> 6: E2(4c, 8) / + public static List imageMockComputation(AsyncGrCUDAExecutionContext context) { + DeviceArrayMock a1 = new DeviceArrayMock(10); + DeviceArrayMock a2 = new DeviceArrayMock(10); + DeviceArrayMock a3 = new DeviceArrayMock(10); + DeviceArrayMock a4 = new DeviceArrayMock(10); + DeviceArrayMock a5 = new DeviceArrayMock(10); + DeviceArrayMock a6 = new DeviceArrayMock(10); + DeviceArrayMock a7 = new DeviceArrayMock(1); + DeviceArrayMock a8 = new DeviceArrayMock(1); + DeviceArrayMock a9 = new DeviceArrayMock(10); + DeviceArrayMock a10 = new DeviceArrayMock(10); + DeviceArrayMock a11 = new DeviceArrayMock(10); + return Arrays.asList( + // blur + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a1, true), new ArgumentMock(a2))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a1, true), new ArgumentMock(a3))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a1, true), new ArgumentMock(a4))), + // sobel + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a2, true), new ArgumentMock(a5))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a3, 
true), new ArgumentMock(a6))), + // extend + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a4, true), new ArgumentMock(a7))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a4, true), new ArgumentMock(a8))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a4), new ArgumentMock(a7), new ArgumentMock(a8))), + // unsharpen + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a1, true), new ArgumentMock(a4), new ArgumentMock(a9))), + // combine + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a9, true), new ArgumentMock(a3, true), + new ArgumentMock(a6, true), new ArgumentMock(a10))), + new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(a10, true), new ArgumentMock(a2, true), + new ArgumentMock(a5, true), new ArgumentMock(a11))), + new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(a11))) + ); + } + + public static final int partitionsVec = 16; + + // K1(Xr, X1) --> K3(X1r, Y1r, R) + // K2(Yr, Y1) -/ + // A simple join pattern, with X, Y, X1, Y1, R being split into P partitions, to parallelize the computation + // on multiple GPUs; + public static List vecMultiGPUMockComputation(AsyncGrCUDAExecutionContext context) { + // Arrays have P partitions; + int P = partitionsVec; + int N = 1000; + DeviceArrayMock[] x = new DeviceArrayMock[P]; + DeviceArrayMock[] y = new DeviceArrayMock[P]; + DeviceArrayMock[] x1 = new DeviceArrayMock[P]; + DeviceArrayMock[] y1 = new DeviceArrayMock[P]; + DeviceArrayMock[] res = new DeviceArrayMock[P]; + for (int i = 0; i < P; i++) { + x[i] = new DeviceArrayMock(N); + y[i] = new DeviceArrayMock(N); + x1[i] = new DeviceArrayMock(N); + y1[i] = new DeviceArrayMock(N); + res[i] = new DeviceArrayMock(1); + } + List computations = new ArrayList<>(); + // Schedule the computations; + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(x[i], true), new ArgumentMock(x1[i])), 
"SQ1-" + i)); + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(y[i], true), new ArgumentMock(y1[i])), "SQ2-" + i)); + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(x1[i], true), new ArgumentMock(y1[i], true), new ArgumentMock(res[i])), "SUM-" + i)); + } + for (int i = 0; i < P; i++) { + computations.add(new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(res[i])))); + } + return computations; + } + + public static final int partitionsBs = 24; + + // K(X1r, Y1) + // K(X2r, Y2) + // ... + // K(XPr, YP) + // + // Many independent computations, on different data; + public static List bsMultiGPUMockComputation(AsyncGrCUDAExecutionContext context) { + // Arrays have P partitions; + int P = partitionsBs; + int N = 1000; + DeviceArrayMock[] x = new DeviceArrayMock[P]; + DeviceArrayMock[] y = new DeviceArrayMock[P]; + + for (int i = 0; i < P; i++) { + x[i] = new DeviceArrayMock(N); + y[i] = new DeviceArrayMock(N); + } + List computations = new ArrayList<>(); + // Schedule the computations; + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(x[i], true), new ArgumentMock(y[i])), "BS-" + i)); + } + for (int i = 0; i < P; i++) { + computations.add(new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(y[i])))); + } + return computations; + } + + public static final int partitionsMl = 16; + + /** + * DAG that represents the B6-ML benchmark; + * The "c" before the variable name denotes a const argument; + * + * RR1_1(cX1, M1, STD1) ---> RR11_0(M1, STD1, cMi, cSTDi) -> ... -> RR11_P(M1, STD1, Mi, STDi) ---> RR12_1(cX1, Z1, cM1, cSTD1) -> RR2_1(cZ1, cRC, R2) ---> RR3(R2, cRI) -> RRSF(R2) -> AMAX(cR1, cR2, R) + * ... / \ ... 
/ / + * RR1_P(cXP, MP, STDP) -/ \-> RR12_P(cXP, ZP, cM1, cSTD1) -> RR2_P(cZP, cRC, R2) -/ / + * / + * NB1_1(cX1, NBF, R1) ---> NB2(cR1, NBAMAX) -> NB3(cR1, cNBAMAX, NBL) -> NB4(cR1, cNBL) -> NBSF(R1) -----------------------------------------------------------------------------/ + * ... / + * NB1_P(cXP, NBF, R1) -/ + * + * @param context the context where computations are scheduled + * @param fixConst if true, manually correct the "const" flag in some computations, to avoid the creation of + * fake dependencies in data that is shared between devices, but every device modifies a distinct part + * @return the sequence of computations to schedule + */ + public static List<GrCUDAComputationalElement> mlMultiGPUMockComputation(AsyncGrCUDAExecutionContext context, boolean fixConst) { + // Arrays have P partitions; + int P = partitionsMl; + int N = 100000; + int S = N / P; + int F = 1024; + int C = 16; + DeviceArrayMock[] x = new DeviceArrayMock[P]; + DeviceArrayMock[] z = new DeviceArrayMock[P]; + DeviceArrayMock[] mean = new DeviceArrayMock[P]; + DeviceArrayMock[] std = new DeviceArrayMock[P]; + for (int i = 0; i < P; i++) { + x[i] = new DeviceArrayMock(S * F); + z[i] = new DeviceArrayMock(S * F); + mean[i] = new DeviceArrayMock(F); + std[i] = new DeviceArrayMock(F); + } + DeviceArrayMock nbfeat = new DeviceArrayMock(C * F); + DeviceArrayMock ridgecoeff = new DeviceArrayMock(C * F); + DeviceArrayMock ridgeint = new DeviceArrayMock(C); + DeviceArrayMock nbamax = new DeviceArrayMock(N); + DeviceArrayMock nbl = new DeviceArrayMock(N); + DeviceArrayMock r1 = new DeviceArrayMock(C * N); + DeviceArrayMock r2 = new DeviceArrayMock(C * N); + DeviceArrayMock r = new DeviceArrayMock(N); + + List<GrCUDAComputationalElement> computations = new ArrayList<>(); + // Schedule Ridge Regression; + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(x[i], true), + new ArgumentMock(mean[i]), + new ArgumentMock(std[i])), + "RR1-" + i)); + } + for (int i = 0; i < P; i++) { +
computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(mean[0]), + new ArgumentMock(std[0]), + new ArgumentMock(mean[i], true), + new ArgumentMock(std[i], true)), + "RR11-" + i)); + } + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(x[i], true), + new ArgumentMock(z[i]), + new ArgumentMock(mean[0], true), + new ArgumentMock(std[0], true)), + "RR12-" + i)); + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(z[i], true), + new ArgumentMock(ridgecoeff, true), + new ArgumentMock(r2, fixConst)), // NOT CONST, BUT WE AVOID FAKE DEPENDENCIES; + "RR2-" + i)); + } + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(r2), + new ArgumentMock(ridgeint, true)), + "RR3")); + computations.add(new KernelExecutionMock(context, Collections.singletonList( + new ArgumentMock(r2)), + "RRSM")); + + // Schedule Naive Bayes; + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(x[i], true), + new ArgumentMock(nbfeat, true), + new ArgumentMock(r1, fixConst)), // NOT CONST, BUT WE AVOID FAKE DEPENDENCIES; + "NB1-" + i)); + } + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(r1, !fixConst), // IT SHOULD BE CONST, BUT IF SO SKIP A DEPENDENCY WITH NB-1; + new ArgumentMock(nbamax)), + "NB2")); + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(r1, true), + new ArgumentMock(nbamax, true), + new ArgumentMock(nbl)), + "NB3")); + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(r1), + new ArgumentMock(nbl, true)), + "NB4")); + computations.add(new KernelExecutionMock(context, Collections.singletonList( + new ArgumentMock(r1)), + "NBSM")); + + // Combine the two computations; + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(r1, 
true), + new ArgumentMock(r2, true), + new ArgumentMock(r)), + "AMAX")); + // Synchronize; + computations.add(new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(r)))); + return computations; + } + + public static final int partitionsCg = 16; + public static final int iterationsCg = 3; + + /** + * DAG that represents the B9-CG benchmark; + * The "c" before the variable name denotes a const argument; + * + * F(A1) -> MVMA(cA1, cX, cB, R) ----> CPY(P, cR) -> L2(R, T1) -> (*) MMUL(cA1, cP, Y) ----> DOT(cP, cY, T2) -> SYNC(T1, T2) -> AXPY(X, cX, cP) -> AXPY(R, cR, cY) -> L2(cR, cT1) -> SYNC(T1) -> AXPY(P, cR, cP) -> jump to (*) + * ... / \ ... / + * F(AP) -> MVMA(cAP, cX, cB, R) -/ \---> MMUL(cAP, cP, Y) -/ + * + * @param context the context where computations are scheduled + * @param fixConst if true, manually correct the "const" flag in some computations, to avoid the creation of + * fake dependencies in data that is shared between devices, but every device modifies a distinct part + * @return the sequence of computations to schedule + */ + public static List<GrCUDAComputationalElement> cgMultiGPUMockComputation(AsyncGrCUDAExecutionContext context, boolean fixConst) { + // Arrays have P partitions; + int P = partitionsCg; + int N = 1000; + int S = N / P; + DeviceArrayMock[] A = new DeviceArrayMock[P]; + for (int i = 0; i < P; i++) { + A[i] = new DeviceArrayMock(S * N); + } + DeviceArrayMock x = new DeviceArrayMock(N); + DeviceArrayMock b = new DeviceArrayMock(N); + DeviceArrayMock p = new DeviceArrayMock(N); + DeviceArrayMock r = new DeviceArrayMock(N); + DeviceArrayMock y = new DeviceArrayMock(N); + DeviceArrayMock t1 = new DeviceArrayMock(1); + DeviceArrayMock t2 = new DeviceArrayMock(1); + + List<GrCUDAComputationalElement> computations = new ArrayList<>(); + // Initialization of CG; + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Collections.singletonList(new ArgumentMock(A[i])), "PRE-" + i)); + computations.add(new KernelExecutionMock(context, Arrays.asList( +
new ArgumentMock(A[i], true), + new ArgumentMock(x, true), + new ArgumentMock(b, true), + new ArgumentMock(r, fixConst)), + "MVMA-" + i)); + } + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(p), + new ArgumentMock(r, !fixConst)), + "CPY")); + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(r, true), + new ArgumentMock(t1)), + "L2-1")); + // Iterative computation; + for (int iter = 0; iter < iterationsCg; iter++) { + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(A[i], true), + new ArgumentMock(p, true), + new ArgumentMock(y, fixConst)), + "MUL-" + i)); + } + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(p, true), + new ArgumentMock(y, !fixConst), + new ArgumentMock(t2)), + "DOT")); + computations.add(new SyncExecutionMock(context, Arrays.asList(new ArgumentMock(t1), new ArgumentMock(t2)))); + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(x), + new ArgumentMock(x, true), + new ArgumentMock(p, true)), + "SAXPY-1")); + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(r), + new ArgumentMock(r, true), + new ArgumentMock(y, true)), + "SAXPY-2")); + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(r, true), + new ArgumentMock(t1)), + "L2-2")); + computations.add(new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(t1, true)))); + computations.add(new KernelExecutionMock(context, Arrays.asList( + new ArgumentMock(p), + new ArgumentMock(r, true), + new ArgumentMock(p, true)), + "SAXPY-3")); + } + // Synchronize; + computations.add(new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(x)))); + return computations; + } + + public static final int partitionsMmul = 16; + + /** + * K(X1r, Y, Z1) ----> C(Z, cZ1) -> ... 
-> C(Z, cZP); + * K(X2r, Y, Z2) -/ / + * ... / + * K(XPr, Y, ZP) -/ + * + * Partition a matrix-vector multiplication on different devices; + * @param context the context where computations are scheduled + * @return the sequence of computations to schedule + */ + public static List<GrCUDAComputationalElement> mmulMultiGPUMockComputation(AsyncGrCUDAExecutionContext context) { + // Arrays have P partitions; + int P = partitionsMmul; + int N = 1000; + int S = N / P; + DeviceArrayMock[] x = new DeviceArrayMock[P]; + DeviceArrayMock[] z = new DeviceArrayMock[P]; + for (int i = 0; i < P; i++) { + x[i] = new DeviceArrayMock(N * S); + z[i] = new DeviceArrayMock(S); + } + DeviceArrayMock y = new DeviceArrayMock(N); + DeviceArrayMock z_out = new DeviceArrayMock(N); + List<GrCUDAComputationalElement> computations = new ArrayList<>(); + // Schedule the computations; + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(x[i], true), new ArgumentMock(y, true), new ArgumentMock(z[i])), "MUL-" + i)); + } + for (int i = 0; i < P; i++) { + computations.add(new KernelExecutionMock(context, Arrays.asList(new ArgumentMock(z_out), new ArgumentMock(z[i], true)), "CPY-" + i)); + } + computations.add(new SyncExecutionMock(context, Collections.singletonList(new ArgumentMock(z_out, true)))); + return computations; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDADevicesManagerMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDADevicesManagerMock.java new file mode 100644 index 00000000..9fd895d2 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDADevicesManagerMock.java @@ -0,0 +1,29 @@ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.stream.policy.GrCUDADevicesManager; + +public class GrCUDADevicesManagerMock extends GrCUDADevicesManager { + + private int currentGPU = 0; + final private int
numberOfGPUsToUse; + + public GrCUDADevicesManagerMock(DeviceListMock deviceList, int numberOfGPUsToUse) { + super(null, deviceList); + this.numberOfGPUsToUse = numberOfGPUsToUse; + } + + @Override + public Device getCurrentGPU(){ + return this.getDevice(this.currentGPU); + } + + @Override + public int getNumberOfGPUsToUse(){ + return numberOfGPUsToUse; + } + + public void setCurrentGPU(int deviceId) { + currentGPU = deviceId; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAExecutionContextMockBuilder.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAExecutionContextMockBuilder.java new file mode 100644 index 00000000..c13d73d2 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAExecutionContextMockBuilder.java @@ -0,0 +1,86 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; + +public class GrCUDAExecutionContextMockBuilder { + + DependencyPolicyEnum dependencyPolicy = DependencyPolicyEnum.NO_CONST; + RetrieveNewStreamPolicyEnum retrieveStreamPolicy = RetrieveNewStreamPolicyEnum.REUSE; + RetrieveParentStreamPolicyEnum parentStreamPolicyEnum = RetrieveParentStreamPolicyEnum.SAME_AS_PARENT; + DeviceSelectionPolicyEnum deviceSelectionPolicyEnum = DeviceSelectionPolicyEnum.SINGLE_GPU; + boolean isArchitecturePascalOrNewer = true; + int numberOfAvailableGPUs = 1; + int numberOfGPUsToUse = 1; + + public AsyncGrCUDAExecutionContextMock build() { + return new AsyncGrCUDAExecutionContextMock(dependencyPolicy, retrieveStreamPolicy, parentStreamPolicyEnum, deviceSelectionPolicyEnum, isArchitecturePascalOrNewer, numberOfAvailableGPUs, numberOfGPUsToUse); + } + + public GrCUDAExecutionContextMockBuilder 
setDependencyPolicy(DependencyPolicyEnum dependencyPolicy) { + this.dependencyPolicy = dependencyPolicy; + return this; + } + + public GrCUDAExecutionContextMockBuilder setRetrieveNewStreamPolicy(RetrieveNewStreamPolicyEnum retrieveStreamPolicy) { + this.retrieveStreamPolicy = retrieveStreamPolicy; + return this; + } + + public GrCUDAExecutionContextMockBuilder setDeviceSelectionPolicy(DeviceSelectionPolicyEnum deviceSelectionPolicyEnum) { + this.deviceSelectionPolicyEnum = deviceSelectionPolicyEnum; + return this; + } + + public GrCUDAExecutionContextMockBuilder setRetrieveParentStreamPolicy(RetrieveParentStreamPolicyEnum retrieveStreamPolicy) { + this.parentStreamPolicyEnum = retrieveStreamPolicy; + return this; + } + + public GrCUDAExecutionContextMockBuilder setArchitecturePascalOrNewer(boolean isArchitecturePascalOrNewer) { + this.isArchitecturePascalOrNewer = isArchitecturePascalOrNewer; + return this; + } + + public GrCUDAExecutionContextMockBuilder setNumberOfAvailableGPUs(int numberOfAvailableGPUs) { + this.numberOfAvailableGPUs = numberOfAvailableGPUs; + return this; + } + + public GrCUDAExecutionContextMockBuilder setNumberOfGPUsToUse(int numberOfGPUsToUse) { + this.numberOfGPUsToUse = numberOfGPUsToUse; + return this; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAStreamManagerMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAStreamManagerMock.java new file mode 100644 index 00000000..5c5891fc --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAStreamManagerMock.java @@ -0,0 +1,106 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.stream.GrCUDAStreamManager; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; + +public class GrCUDAStreamManagerMock extends GrCUDAStreamManager { + + /** + * Static number of fake GPU streams, as we don't have access to the CUDA runtime in the mocked class hierarchy; + */ + public static int numUserAllocatedStreams = 0; + + GrCUDAStreamManagerMock(CUDARuntime runtime, + RetrieveNewStreamPolicyEnum retrieveStreamPolicy, + RetrieveParentStreamPolicyEnum parentStreamPolicy, + DeviceSelectionPolicyEnum deviceSelectionPolicyEnum, + String bandwidthMatrixPath, + int numberOfAvailableGPUs, + int numberOfGPUsToUse) { + // Possibly use a number of GPUs lower than the number of available GPUs; + super(runtime, false, new GrCUDAStreamPolicyMock(retrieveStreamPolicy, parentStreamPolicy, deviceSelectionPolicyEnum, bandwidthMatrixPath, numberOfAvailableGPUs, numberOfGPUsToUse)); + // Reset the number of streams; + numUserAllocatedStreams = 0; + } + + @Override + public void assignEventStart(ExecutionDAG.DAGVertex vertex) { } + + @Override + public void assignEventStop(ExecutionDAG.DAGVertex vertex) { } + + @Override + public void syncStream(CUDAStream stream) { } + + @Override + protected void setComputationFinishedInner(GrCUDAComputationalElement computation) { + computation.setComputationFinished(); + } + + @Override 
+ protected void syncStreamsUsingEvents(ExecutionDAG.DAGVertex vertex) { } + + @Override + protected void syncDevice() { } + + /** + * Return all the streams allocated on every device; + * @return all the streams that have been created + */ + public List<CUDAStream> getStreams() { + List<CUDAStream> allStreams = new ArrayList<>(); + this.getStreamPolicy().getDevicesManager().getDeviceList().forEach(d -> allStreams.addAll(d.getStreams())); + return allStreams; + } + + public Map<CUDAStream, Set<GrCUDAComputationalElement>> getActiveComputationsMap() { + Map<CUDAStream, Set<GrCUDAComputationalElement>> activeComputations = new HashMap<>(); + for (Map.Entry<CUDAStream, Set<ExecutionDAG.DAGVertex>> e : this.activeComputationsPerStream.entrySet()) { + activeComputations.put(e.getKey(), e.getValue().stream().map(ExecutionDAG.DAGVertex::getComputation).collect(Collectors.toSet())); + } + return activeComputations; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAStreamPolicyMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAStreamPolicyMock.java new file mode 100644 index 00000000..0dcd2252 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/GrCUDAStreamPolicyMock.java @@ -0,0 +1,33 @@ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.GrCUDAOptions; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicy; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.GrCUDAStreamPolicy; + +public class GrCUDAStreamPolicyMock extends GrCUDAStreamPolicy { + + public GrCUDAStreamPolicyMock( + RetrieveNewStreamPolicyEnum retrieveNewStreamPolicyEnum, + RetrieveParentStreamPolicyEnum retrieveParentStreamPolicyEnum, + DeviceSelectionPolicyEnum deviceSelectionPolicyEnum, + String bandwidthMatrixPath, + int
numberOfAvailableGPUs, + int numberOfGPUsToUse) { + super( + new GrCUDADevicesManagerMock(new DeviceListMock(numberOfAvailableGPUs), numberOfGPUsToUse), + retrieveNewStreamPolicyEnum, + retrieveParentStreamPolicyEnum, + deviceSelectionPolicyEnum, + bandwidthMatrixPath, + GrCUDAOptionMap.DEFAULT_DATA_THRESHOLD + ); + } + + public DeviceSelectionPolicy getDeviceSelectionPolicy() { + return this.deviceSelectionPolicy; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/KernelExecutionMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/KernelExecutionMock.java new file mode 100644 index 00000000..2330dbba --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/KernelExecutionMock.java @@ -0,0 +1,97 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; + +import java.util.List; +import java.util.stream.Collectors; + +/** + * Mock class to test the DAG execution; + */ +public class KernelExecutionMock extends GrCUDAComputationalElement { + + /** + * Simulate an execution by forcing a wait that lasts the given number of milliseconds; + */ + private final int durationMs; + + private final String name; + + public KernelExecutionMock(AbstractGrCUDAExecutionContext grCUDAExecutionContext, List<ComputationArgumentWithValue> args) { + this(grCUDAExecutionContext, args, "kernel"); + } + + public KernelExecutionMock(AbstractGrCUDAExecutionContext grCUDAExecutionContext, List<ComputationArgumentWithValue> args, String name) { + this(grCUDAExecutionContext, args, name, 0); + } + + public KernelExecutionMock(AbstractGrCUDAExecutionContext grCUDAExecutionContext, List<ComputationArgumentWithValue> args, String name, int durationMs) { + super(grCUDAExecutionContext, args); + this.name = name; + this.durationMs = durationMs; + }
+ public String getName() { + return name; + } + + @Override + public Object execute() { + if (this.durationMs > 0) { + try { + Thread.sleep(this.durationMs); + } catch (InterruptedException e) { + System.out.println("ERROR; failed to pause " + this + " for " + this.durationMs + " msec"); + e.printStackTrace(); + } + } + return NoneValue.get(); + } + + @Override + public boolean canUseStream() { return true; } + + @Override + public void associateArraysToStreamImpl() { } + + @Override + public String toString() { + return this.getName() + ": args={" + + this.argumentsThatCanCreateDependencies.stream().map(Object::toString).collect(Collectors.joining(", ")) + + "}" + "; stream=" + this.getStream().getStreamNumber() + "; gpu=" + this.getStream().getStreamDeviceId(); + } +} + diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/MultiDimDeviceArrayMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/MultiDimDeviceArrayMock.java new file mode 100644 index 00000000..3c313fd0 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/MultiDimDeviceArrayMock.java @@ -0,0 +1,27 @@ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.LittleEndianNativeArrayView; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArray; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; + +public class MultiDimDeviceArrayMock extends MultiDimDeviceArray { + public MultiDimDeviceArrayMock(long[] dimensions, boolean columnMajor) { + super(new AsyncGrCUDAExecutionContextMock(), Type.SINT32, dimensions, columnMajor); + } + + public MultiDimDeviceArrayMock(AbstractGrCUDAExecutionContext context, long[] dimensions, boolean columnMajor) { + super(context, Type.SINT32, dimensions, columnMajor); + if (context.isArchitecturePascalOrNewer()) { + 
this.addArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } else { + this.addArrayUpToDateLocations(context.getCurrentGPU()); + } + } + + @Override + protected LittleEndianNativeArrayView allocateMemory() { + return null; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/OptionValuesMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/OptionValuesMock.java new file mode 100644 index 00000000..06d9236f --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/OptionValuesMock.java @@ -0,0 +1,75 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.GrCUDALanguage; +import org.graalvm.options.OptionDescriptors; +import org.graalvm.options.OptionKey; +import org.graalvm.options.OptionValues; + +import java.util.HashMap; +import java.util.Map; + +public class OptionValuesMock implements OptionValues { + + private final Map<OptionKey<?>, Object> values; + + public OptionValuesMock() { + this.values = new HashMap<>(); + GrCUDALanguage.getOptionDescriptorsStatic().forEach(o -> values.put(o.getKey(), o.getKey().getDefaultValue())); + } + + @Override + public OptionDescriptors getDescriptors() { + return GrCUDALanguage.getOptionDescriptorsStatic(); + } + + @Override + public <T> void set(OptionKey<T> optionKey, T value) { + this.values.put(optionKey, value); + } + + @Override + public <T> T get(OptionKey<T> optionKey) { + return (T) this.values.get(optionKey); + } + + @Override + public boolean hasBeenSet(OptionKey<?> optionKey) { + return values.containsKey(optionKey); + } + + @Override + public boolean hasSetOptions() { + return OptionValues.super.hasSetOptions(); + } +} + diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/OptionValuesMockBuilder.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/OptionValuesMockBuilder.java new file mode 100644 index 00000000..9abfe0f6 --- /dev/null +++
b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/OptionValuesMockBuilder.java @@ -0,0 +1,20 @@ +package com.nvidia.grcuda.test.util.mock; + +import org.graalvm.options.OptionKey; + +public class OptionValuesMockBuilder { + private final OptionValuesMock options; + + public OptionValuesMockBuilder() { + this.options = new OptionValuesMock(); + } + + public <T> OptionValuesMockBuilder add(OptionKey<T> optionKey, T value) { + this.options.set(optionKey, value); + return this; + } + + public OptionValuesMock build() { + return options; + } +} diff --git a/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/SyncExecutionMock.java b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/SyncExecutionMock.java new file mode 100644 index 00000000..3bc913b1 --- /dev/null +++ b/projects/com.nvidia.grcuda.test/src/com/nvidia/grcuda/test/util/mock/SyncExecutionMock.java @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.test.util.mock; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; + +import java.util.List; +import java.util.stream.Collectors; + +/** + * Mock class that represents a synchronous execution; + * it can be used to synchronize previous computations on the specified arguments. + */ +public class SyncExecutionMock extends GrCUDAComputationalElement { + + public SyncExecutionMock(AsyncGrCUDAExecutionContext grCUDAExecutionContext, List<ComputationArgumentWithValue> args) { + super(grCUDAExecutionContext, args); + } + + @Override + public Object execute() { + this.setComputationFinished(); + return NoneValue.get(); + } + + @Override + public boolean canUseStream() { return false; } + + @Override + public void associateArraysToStreamImpl() { } + + @Override + public String toString() { + return "sync: args={" + + this.argumentsThatCanCreateDependencies.stream().map(Object::toString).collect(Collectors.joining(", ")) + + "}"; + } +} + diff --git 
a/projects/com.nvidia.grcuda/.checkstyle_checks.xml b/projects/com.nvidia.grcuda/.checkstyle_checks.xml index be8529a9..af15b5a1 100644 --- a/projects/com.nvidia.grcuda/.checkstyle_checks.xml +++ b/projects/com.nvidia.grcuda/.checkstyle_checks.xml @@ -1,31 +1,38 @@ @@ -168,8 +175,8 @@ diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Binding.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Binding.java index 508a44f8..f44c1e6a 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Binding.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Binding.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,6 +35,8 @@ */ package com.nvidia.grcuda; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; + import java.util.ArrayList; import java.util.Arrays; import java.util.stream.Collectors; @@ -35,7 +44,7 @@ public abstract class Binding { protected final boolean hasCxxMangledName; protected final String name; - protected final Parameter[] parameters; + protected final ComputationArgument[] computationArguments; protected String[] namespaceList; protected String mangledName; protected String libraryFileName; @@ -45,19 +54,19 @@ public abstract class Binding { * * @param name a C style name or as fully qualified C++ name (e.g., * `namespace1::namespace2::name`). - * @param parameterList list of parameter names, types, and directions + * @param computationArgumentList list of parameter names, types, and directions * @param hasCxxMangledName true if `name` is a C++ name and the symbol name is therefore * mangled. 
*/ - public Binding(String name, ArrayList<Parameter> parameterList, boolean hasCxxMangledName) { + public Binding(String name, ArrayList<ComputationArgument> computationArgumentList, boolean hasCxxMangledName) { String[] identifierList = name.trim().split("::"); this.name = identifierList[identifierList.length - 1]; this.namespaceList = new String[identifierList.length - 1]; if (identifierList.length > 1) { System.arraycopy(identifierList, 0, namespaceList, 0, identifierList.length - 1); } - Parameter[] params = new Parameter[parameterList.size()]; - this.parameters = parameterList.toArray(params); + ComputationArgument[] params = new ComputationArgument[computationArgumentList.size()]; + this.computationArguments = computationArgumentList.toArray(params); this.hasCxxMangledName = hasCxxMangledName; } @@ -91,14 +100,14 @@ public String getSymbolName() { mangled += name.length() + name; } // add arguments - if (parameters.length == 0) { + if (computationArguments.length == 0) { mangled += 'v'; // f() -> f(void) -> void } else { - ArrayList<Parameter> processedSymbolParameters = new ArrayList<>(parameters.length); - ArrayList<Integer> referencePositions = new ArrayList<>(parameters.length); + ArrayList<ComputationArgument> processedSymbolComputationArguments = new ArrayList<>(computationArguments.length); + ArrayList<Integer> referencePositions = new ArrayList<>(computationArguments.length); int lastReference = 0; - for (Parameter currentParam : parameters) { - if (currentParam.getKind() == Parameter.Kind.BY_VALUE) { + for (ComputationArgument currentParam : computationArguments) { + if (currentParam.getKind() == ComputationArgument.Kind.BY_VALUE) { // parameter of primitive type passed by-value: is not a symbol parameter mangled += currentParam.getMangledType(); } else { @@ -106,8 +115,8 @@ public String getSymbolName() { // -> check whether we've emitted a pointer of this (kind, type) already seen boolean paramProcessed = false; - for (int i = 0; i < processedSymbolParameters.size(); i++) { - Parameter p = processedSymbolParameters.get(i); + 
for (int i = 0; i < processedSymbolComputationArguments.size(); i++) { + ComputationArgument p = processedSymbolComputationArguments.get(i); if (p.getKind() == currentParam.getKind() && p.getType() == currentParam.getType()) { // found repetition -> apply substitution rule int occurrencePos = referencePositions.get(i); @@ -122,8 +131,8 @@ public String getSymbolName() { mangled += currentParam.getMangledType(); // count "T*" as 1 symbol and "const T*" as 2 symbols - lastReference += currentParam.getKind() == Parameter.Kind.POINTER_IN ? 2 : 1; - processedSymbolParameters.add(currentParam); + lastReference += currentParam.getKind() == ComputationArgument.Kind.POINTER_IN ? 2 : 1; + processedSymbolComputationArguments.add(currentParam); referencePositions.add(lastReference - 1); } } @@ -146,7 +155,7 @@ public String getLibraryFileName() { } public String getNIDLParameterSignature() { - return Arrays.stream(parameters).map(Parameter::toNFISignatureElement).collect(Collectors.joining(", ")); + return Arrays.stream(computationArguments).map(ComputationArgument::toNFISignatureElement).collect(Collectors.joining(", ")); } public String toNIDLString() { @@ -155,7 +164,7 @@ public String toNIDLString() { @Override public String toString() { - String argString = Arrays.stream(parameters).map(Object::toString).collect(Collectors.joining(", ", "[", "]")); + String argString = Arrays.stream(computationArguments).map(Object::toString).collect(Collectors.joining(", ", "[", "]")); return "Binding(name=" + name + ", argumentList=" + argString + ", cxxnamespace=" + String.join("::", namespaceList) + ", hasCxxMangledName=" + hasCxxMangledName + ", symbol=" + getSymbolName() + ")"; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/CUDAEvent.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/CUDAEvent.java new file mode 100644 index 00000000..602b7853 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/CUDAEvent.java @@ -0,0 +1,95 @@ +/* + * Copyright 
(c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda; + +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.library.ExportLibrary; +import com.oracle.truffle.api.library.ExportMessage; + +import java.util.Objects; + +@ExportLibrary(InteropLibrary.class) +public class CUDAEvent extends GPUPointer { + + private final long eventNumber; + /** + * Keep track of whether this event has been destroyed by {@link com.nvidia.grcuda.runtime.CUDARuntime#cudaEventDestroy} + */ + private boolean isAlive = true; + + public CUDAEvent(long rawPointer, long eventNumber) { + super(rawPointer); + this.eventNumber = eventNumber; + } + + public long getEventNumber() { + return eventNumber; + } + + public boolean isDefaultStream() { return false; } + + /** + * Keep track of whether this event has been destroyed by {@link com.nvidia.grcuda.runtime.CUDARuntime#cudaEventDestroy} + */ + public boolean isAlive() { + return isAlive; + } + + /** + * Set the event as destroyed by the CUDA runtime; + */ + public void setDead() { + isAlive = false; + } + + @Override + public String toString() { + return "CUDAEvent(eventNumber=" + this.eventNumber + "; address=0x" + Long.toHexString(this.getRawPointer()) + ")"; + } + + @ExportMessage + public Object toDisplayString(boolean allowSideEffect) { + return this.toString(); + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + CUDAEvent that = (CUDAEvent) o; + return (eventNumber == that.eventNumber && this.getRawPointer() == that.getRawPointer()); + } + + @Override + public int hashCode() { + return Objects.hash(super.hashCode(), eventNumber); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/DeviceArray.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/DeviceArray.java deleted file mode 100644 index 4e58d604..00000000 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/DeviceArray.java +++ /dev/null @@ -1,409 +0,0 @@ -/* - 
* Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. - * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of NVIDIA CORPORATION nor the names of its - * contributors may be used to endorse or promote products derived - * from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY - * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR - * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR - * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, - * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, - * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR - * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY - * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- */ -package com.nvidia.grcuda; - -import java.util.Arrays; - -import com.nvidia.grcuda.functions.DeviceArrayCopyFunction; -import com.nvidia.grcuda.gpu.CUDARuntime; -import com.nvidia.grcuda.gpu.LittleEndianNativeArrayView; -import com.oracle.truffle.api.CompilerDirectives; -import com.oracle.truffle.api.CompilerDirectives.CompilationFinal; -import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; -import com.oracle.truffle.api.dsl.Cached; -import com.oracle.truffle.api.dsl.Cached.Shared; -import com.oracle.truffle.api.interop.ArityException; -import com.oracle.truffle.api.interop.InteropLibrary; -import com.oracle.truffle.api.interop.InvalidArrayIndexException; -import com.oracle.truffle.api.interop.TruffleObject; -import com.oracle.truffle.api.interop.UnknownIdentifierException; -import com.oracle.truffle.api.interop.UnsupportedMessageException; -import com.oracle.truffle.api.interop.UnsupportedTypeException; -import com.oracle.truffle.api.library.CachedLibrary; -import com.oracle.truffle.api.library.ExportLibrary; -import com.oracle.truffle.api.library.ExportMessage; -import com.oracle.truffle.api.profiles.ValueProfile; - -@ExportLibrary(InteropLibrary.class) -public final class DeviceArray implements TruffleObject { - - private static final String POINTER = "pointer"; - private static final String COPY_FROM = "copyFrom"; - private static final String COPY_TO = "copyTo"; - private static final String FREE = "free"; - private static final String IS_MEMORY_FREED = "isMemoryFreed"; - private static final String ACCESSED_FREED_MEMORY_MESSAGE = "memory of array freed"; - - private static final MemberSet PUBLIC_MEMBERS = new MemberSet(COPY_FROM, COPY_TO, FREE, IS_MEMORY_FREED); - private static final MemberSet MEMBERS = new MemberSet(POINTER, COPY_FROM, COPY_TO, FREE, IS_MEMORY_FREED); - - @ExportLibrary(InteropLibrary.class) - public static final class MemberSet implements TruffleObject { - - @CompilationFinal(dimensions = 1) private final String[] values; 
- - public MemberSet(String... values) { - this.values = values; - } - - @ExportMessage - @SuppressWarnings("static-method") - public boolean hasArrayElements() { - return true; - } - - @ExportMessage - public long getArraySize() { - return values.length; - } - - @ExportMessage - public boolean isArrayElementReadable(long index) { - return index >= 0 && index < values.length; - } - - @ExportMessage - public Object readArrayElement(long index) throws InvalidArrayIndexException { - if ((index < 0) || (index >= values.length)) { - CompilerDirectives.transferToInterpreter(); - throw InvalidArrayIndexException.create(index); - } - return values[(int) index]; - } - - @TruffleBoundary - public boolean constainsValue(String name) { - return Arrays.asList(values).contains(name); - } - } - - private final CUDARuntime runtime; - - /** Data type of elements stored in the array. */ - private final Type elementType; - - /** Total number of elements stored in the array. */ - private final long numElements; - - /** - * Total number of bytes allocated and used to store the array data (includes padding). - */ - private final long sizeBytes; - - /** Mutable view onto the underlying memory buffer. */ - private final LittleEndianNativeArrayView nativeView; - - /** Flag set when underlying off-heap memory has been freed. 
*/ - private boolean arrayFreed = false; - - public DeviceArray(CUDARuntime runtime, long numElements, Type elementType) { - this.runtime = runtime; - this.numElements = numElements; - this.elementType = elementType; - this.sizeBytes = numElements * elementType.getSizeBytes(); - this.nativeView = runtime.cudaMallocManaged(this.sizeBytes); - } - - long getSizeBytes() { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - return sizeBytes; - } - - public long getPointer() { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - return nativeView.getStartAddress(); - } - - public Type getElementType() { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - return elementType; - } - - @Override - public String toString() { - if (arrayFreed) { - return "DeviceArray(memory freed)"; - } else { - return "DeviceArray(elementType=" + elementType + ", numElements=" + numElements + ", nativeView=" + nativeView + ')'; - } - } - - @Override - protected void finalize() throws Throwable { - if (!arrayFreed) { - runtime.cudaFree(nativeView); - } - super.finalize(); - } - - public void copyFrom(long fromPointer, long numCopyElements) throws IndexOutOfBoundsException { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - long numBytesToCopy = numCopyElements * elementType.getSizeBytes(); - if (numBytesToCopy > getSizeBytes()) { - CompilerDirectives.transferToInterpreter(); - throw new IndexOutOfBoundsException(); - } - runtime.cudaMemcpy(getPointer(), fromPointer, numBytesToCopy); - } - - public void copyTo(long toPointer, long numCopyElements) throws IndexOutOfBoundsException { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new 
GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - long numBytesToCopy = numCopyElements * elementType.getSizeBytes(); - if (numBytesToCopy > getSizeBytes()) { - CompilerDirectives.transferToInterpreter(); - throw new IndexOutOfBoundsException(); - } - runtime.cudaMemcpy(toPointer, getPointer(), numBytesToCopy); - } - - public void freeMemory() { - if (arrayFreed) { - throw new GrCUDAException("device array already freed"); - } - runtime.cudaFree(nativeView); - arrayFreed = true; - } - - public boolean isMemoryFreed() { - return arrayFreed; - } - - // Implementation of InteropLibrary - - @ExportMessage - @SuppressWarnings("static-method") - boolean hasArrayElements() { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - return true; - } - - @ExportMessage - public long getArraySize() { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - return numElements; - } - - @ExportMessage - boolean isArrayElementReadable(long index) { - return !arrayFreed && index >= 0 && index < numElements; - } - - @ExportMessage - boolean isArrayElementModifiable(long index) { - return index >= 0 && index < numElements; - } - - @SuppressWarnings("static-method") - @ExportMessage - boolean isArrayElementInsertable(@SuppressWarnings("unused") long index) { - return false; - } - - @ExportMessage - Object readArrayElement(long index, - @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws InvalidArrayIndexException { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - if ((index < 0) || (index >= numElements)) { - CompilerDirectives.transferToInterpreter(); - throw InvalidArrayIndexException.create(index); - - } - switch (elementTypeProfile.profile(elementType)) { - case CHAR: - return 
nativeView.getByte(index); - case SINT16: - return nativeView.getShort(index); - case SINT32: - return nativeView.getInt(index); - case SINT64: - return nativeView.getLong(index); - case FLOAT: - return nativeView.getFloat(index); - case DOUBLE: - return nativeView.getDouble(index); - } - return null; - } - - @ExportMessage - public void writeArrayElement(long index, Object value, - @CachedLibrary(limit = "3") InteropLibrary valueLibrary, - @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException, InvalidArrayIndexException { - if (arrayFreed) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); - } - if ((index < 0) || (index >= numElements)) { - CompilerDirectives.transferToInterpreter(); - throw InvalidArrayIndexException.create(index); - } - try { - switch (elementTypeProfile.profile(elementType)) { - - case CHAR: - nativeView.setByte(index, valueLibrary.asByte(value)); - break; - case SINT16: - nativeView.setShort(index, valueLibrary.asShort(value)); - break; - case SINT32: - nativeView.setInt(index, valueLibrary.asInt(value)); - break; - case SINT64: - nativeView.setLong(index, valueLibrary.asLong(value)); - break; - case FLOAT: - // going via "double" to allow floats to be initialized with doubles - nativeView.setFloat(index, (float) valueLibrary.asDouble(value)); - break; - case DOUBLE: - nativeView.setDouble(index, valueLibrary.asDouble(value)); - break; - } - } catch (UnsupportedMessageException e) { - CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{value}, "value cannot be coerced to " + elementType); - } - } - - @ExportMessage - @SuppressWarnings("static-method") - boolean hasMembers() { - return true; - } - - @ExportMessage - @SuppressWarnings("static-method") - Object getMembers(boolean includeInternal) { - return includeInternal ? 
MEMBERS : PUBLIC_MEMBERS; - } - - @ExportMessage - @SuppressWarnings("static-method") - boolean isMemberReadable(String memberName, - @Shared("memberName") @Cached("createIdentityProfile()") ValueProfile memberProfile) { - String name = memberProfile.profile(memberName); - return POINTER.equals(name) || COPY_FROM.equals(name) || COPY_TO.equals(name) || FREE.equals(name) || IS_MEMORY_FREED.equals(name); - } - - @ExportMessage - Object readMember(String memberName, - @Shared("memberName") @Cached("createIdentityProfile()") ValueProfile memberProfile) throws UnknownIdentifierException { - if (!isMemberReadable(memberName, memberProfile)) { - CompilerDirectives.transferToInterpreter(); - throw UnknownIdentifierException.create(memberName); - } - if (POINTER.equals(memberName)) { - return getPointer(); - } - if (COPY_FROM.equals(memberName)) { - return new DeviceArrayCopyFunction(this, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); - } - if (COPY_TO.equals(memberName)) { - return new DeviceArrayCopyFunction(this, DeviceArrayCopyFunction.CopyDirection.TO_POINTER); - } - if (FREE.equals(memberName)) { - return new DeviceArrayFreeFunction(); - } - if (IS_MEMORY_FREED.equals(memberName)) { - return isMemoryFreed(); - } - CompilerDirectives.transferToInterpreter(); - throw UnknownIdentifierException.create(memberName); - } - - @ExportMessage - @SuppressWarnings("static-method") - boolean isMemberInvocable(String memberName) { - return COPY_FROM.equals(memberName) || COPY_TO.equals(memberName) || FREE.equals(memberName); - } - - @ExportMessage - Object invokeMember(String memberName, - Object[] arguments, - @CachedLibrary("this") InteropLibrary interopRead, - @CachedLibrary(limit = "1") InteropLibrary interopExecute) - throws UnsupportedTypeException, ArityException, UnsupportedMessageException, UnknownIdentifierException { - return interopExecute.execute(interopRead.readMember(this, memberName), arguments); - } - - @ExportMessage - @SuppressWarnings("static-method") - 
boolean isPointer() { - return true; - } - - @ExportMessage - long asPointer() { - return getPointer(); - } - - @ExportLibrary(InteropLibrary.class) - final class DeviceArrayFreeFunction implements TruffleObject { - @ExportMessage - @SuppressWarnings("static-method") - boolean isExecutable() { - return true; - } - - @ExportMessage - Object execute(Object[] arguments) throws ArityException { - if (arguments.length != 0) { - CompilerDirectives.transferToInterpreter(); - throw ArityException.create(0, arguments.length); - } - freeMemory(); - return NoneValue.get(); - } - } -} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/FunctionBinding.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/FunctionBinding.java index 9961db1b..f58c6160 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/FunctionBinding.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/FunctionBinding.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,6 +35,8 @@ */ package com.nvidia.grcuda; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; + import java.util.ArrayList; import java.util.Arrays; import java.util.stream.Collectors; @@ -36,18 +45,18 @@ public final class FunctionBinding extends Binding { private final Type returnType; - private FunctionBinding(String name, ArrayList<Parameter> parameterList, + private FunctionBinding(String name, ArrayList<ComputationArgument> computationArgumentList, Type returnType, boolean hasCxxMangledName) { - super(name, parameterList, hasCxxMangledName); + super(name, computationArgumentList, hasCxxMangledName); this.returnType = returnType; } - public static FunctionBinding newCxxBinding(String name, ArrayList<Parameter> parameterList, Type returnType) { - return new FunctionBinding(name, parameterList, returnType, true); + public static FunctionBinding newCxxBinding(String name, ArrayList<ComputationArgument> computationArgumentList, Type returnType) { + return new FunctionBinding(name, computationArgumentList, returnType, true); } - public static FunctionBinding newCBinding(String name, ArrayList<Parameter> parameterList, Type returnType) { - return new FunctionBinding(name, parameterList, returnType, false); + public static FunctionBinding newCBinding(String name, ArrayList<ComputationArgument> computationArgumentList, Type returnType) { + return new FunctionBinding(name, computationArgumentList, returnType, false); } @Override @@ -61,6 +70,6 @@ public String toNIDLString() { } public String toNFISignature() { - return "(" + Arrays.stream(parameters).map(Parameter::toNFISignatureElement).collect(Collectors.joining(", ")) + "): " + returnType.getNFITypeName(); + return "(" + Arrays.stream(computationArguments).map(ComputationArgument::toNFISignatureElement).collect(Collectors.joining(", ")) + "): " + returnType.getNFITypeName(); } } diff --git 
a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GPUPointer.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GPUPointer.java index ae3db5e5..bbbb62de 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GPUPointer.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GPUPointer.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. - * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -33,8 +40,10 @@ import com.oracle.truffle.api.library.ExportLibrary; import com.oracle.truffle.api.library.ExportMessage; +import java.util.Objects; + @ExportLibrary(InteropLibrary.class) -public final class GPUPointer implements TruffleObject { +public class GPUPointer implements TruffleObject { private final long rawPointer; @@ -53,12 +62,25 @@ public String toString() { @ExportMessage @SuppressWarnings("static-method") - boolean isPointer() { + public boolean isPointer() { return true; } @ExportMessage - long asPointer() { + public long asPointer() { return rawPointer; } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + GPUPointer that = (GPUPointer) o; + return rawPointer == that.rawPointer; + } + + @Override + public int hashCode() { + return Objects.hash(rawPointer); + } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAContext.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAContext.java index 66db2019..42dcf976 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAContext.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAContext.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,14 +35,10 @@ */ package com.nvidia.grcuda; -import java.util.ArrayList; -import java.util.concurrent.ConcurrentHashMap; -import java.util.concurrent.atomic.AtomicInteger; - -import org.graalvm.options.OptionKey; - -import com.nvidia.grcuda.cublas.CUBLASRegistry; -import com.nvidia.grcuda.cuml.CUMLRegistry; +import com.nvidia.grcuda.cudalibraries.cublas.CUBLASRegistry; +import com.nvidia.grcuda.cudalibraries.cuml.CUMLRegistry; +import com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry; +import com.nvidia.grcuda.cudalibraries.tensorrt.TensorRTRegistry; import com.nvidia.grcuda.functions.BindAllFunction; import com.nvidia.grcuda.functions.BindFunction; import com.nvidia.grcuda.functions.BindKernelFunction; @@ -43,27 +46,45 @@ import com.nvidia.grcuda.functions.DeviceArrayFunction; import com.nvidia.grcuda.functions.GetDeviceFunction; import com.nvidia.grcuda.functions.GetDevicesFunction; +import com.nvidia.grcuda.functions.GetOptionsFunction; import com.nvidia.grcuda.functions.map.MapFunction; import com.nvidia.grcuda.functions.map.ShredFunction; -import com.nvidia.grcuda.gpu.CUDARuntime; -import com.nvidia.grcuda.tensorrt.TensorRTRegistry; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionPolicyEnum; +import 
com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.GraphExport; +import com.nvidia.grcuda.runtime.executioncontext.SyncGrCUDAExecutionContext; import com.oracle.truffle.api.CallTarget; -import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; +import com.oracle.truffle.api.TruffleLanguage; import com.oracle.truffle.api.TruffleLanguage.Env; +import com.oracle.truffle.api.TruffleLogger; +import com.oracle.truffle.api.nodes.Node; + +import java.util.ArrayList; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.atomic.AtomicInteger; /** - * Context for the grCUDA language holds reference to CUDA runtime, a function registry and device + * Context for the GrCUDA language holds reference to CUDA runtime, a function registry and device * resources. */ public final class GrCUDAContext { +// private static final TruffleLanguage.ContextReference REFERENCE = TruffleLanguage.ContextReference.create(GrCUDALanguage.class); + private static final String ROOT_NAMESPACE = "CU"; + private static final TruffleLogger LOGGER = GrCUDALogger.getLogger(GrCUDALogger.GRCUDA_LOGGER); + + private final GrCUDAOptionMap grCUDAOptionMap; + private final Env env; - private final CUDARuntime cudaRuntime; + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; private final Namespace rootNamespace; private final ArrayList disposables = new ArrayList<>(); - private AtomicInteger moduleId = new AtomicInteger(0); + private final AtomicInteger moduleId = new AtomicInteger(0); private volatile boolean cudaInitialized = false; // this is used to look up pre-existing call targets for "map" operations, see MapArrayNode @@ -71,44 +92,91 @@ public final class GrCUDAContext { public GrCUDAContext(Env env) { this.env = env; - this.cudaRuntime = new CUDARuntime(this, env); + + this.grCUDAOptionMap = new GrCUDAOptionMap(env.getOptions()); + + // Retrieve the execution policy; + 
ExecutionPolicyEnum executionPolicy = grCUDAOptionMap.getExecutionPolicy(); + + // FIXME: TensorRT is currently incompatible with the async scheduler. TensorRT is supported in CUDA 11.4, and we cannot test it. + // Once Nvidia adds support for it, we want to remove this limitation; + if (grCUDAOptionMap.isTensorRTEnabled() && (executionPolicy == ExecutionPolicyEnum.ASYNC)) { + LOGGER.warning("TensorRT and the asynchronous scheduler are not compatible. Switching to the synchronous scheduler."); + executionPolicy = ExecutionPolicyEnum.SYNC; + } + + // Initialize the execution policy; + LOGGER.info("using " + executionPolicy.toString() + " execution policy"); + switch (executionPolicy) { + case SYNC: + this.grCUDAExecutionContext = new SyncGrCUDAExecutionContext(this, env); + break; + case ASYNC: + this.grCUDAExecutionContext = new AsyncGrCUDAExecutionContext(this, env); + break; + default: + LOGGER.severe("Cannot create an ExecutionContext. The selected execution policy is not valid: " + executionPolicy); + throw new GrCUDAException("selected execution policy is not valid: " + executionPolicy); + } Namespace namespace = new Namespace(ROOT_NAMESPACE); namespace.addNamespace(namespace); namespace.addFunction(new BindFunction()); + namespace.addFunction(new DeviceArrayFunction(this.grCUDAExecutionContext)); namespace.addFunction(new BindAllFunction(this)); - namespace.addFunction(new DeviceArrayFunction(cudaRuntime)); namespace.addFunction(new MapFunction()); namespace.addFunction(new ShredFunction()); - namespace.addFunction(new BindKernelFunction(cudaRuntime)); - namespace.addFunction(new BuildKernelFunction(cudaRuntime)); - namespace.addFunction(new GetDevicesFunction(cudaRuntime)); - namespace.addFunction(new GetDeviceFunction(cudaRuntime)); - cudaRuntime.registerCUDAFunctions(namespace); - if (this.getOption(GrCUDAOptions.CuMLEnabled)) { - Namespace ml = new Namespace(CUMLRegistry.NAMESPACE); - namespace.addNamespace(ml); - new 
CUMLRegistry(this).registerCUMLFunctions(ml); + namespace.addFunction(new BindKernelFunction(this.grCUDAExecutionContext)); + namespace.addFunction(new BuildKernelFunction(this.grCUDAExecutionContext)); + namespace.addFunction(new GetDevicesFunction(this.grCUDAExecutionContext)); + namespace.addFunction(new GetDeviceFunction(this.grCUDAExecutionContext)); + namespace.addFunction(new GetOptionsFunction(grCUDAOptionMap)); + this.grCUDAExecutionContext.getCudaRuntime().registerCUDAFunctions(namespace); + if (grCUDAOptionMap.isCuMLEnabled()) { + if (this.getCUDARuntime().isArchitectureIsPascalOrNewer()) { + Namespace ml = new Namespace(CUMLRegistry.NAMESPACE); + namespace.addNamespace(ml); + new CUMLRegistry(this).registerCUMLFunctions(ml); + } else { + LOGGER.warning("cuML is supported only on GPUs with compute capability >= 6.0 (Pascal and newer). It cannot be enabled."); + } } - if (this.getOption(GrCUDAOptions.CuBLASEnabled)) { - Namespace blas = new Namespace(CUBLASRegistry.NAMESPACE); - namespace.addNamespace(blas); - new CUBLASRegistry(this).registerCUBLASFunctions(blas); + if (grCUDAOptionMap.isCuBLASEnabled()) { + if (this.getCUDARuntime().isArchitectureIsPascalOrNewer() || executionPolicy.equals(ExecutionPolicyEnum.SYNC)) { + Namespace blas = new Namespace(CUBLASRegistry.NAMESPACE); + namespace.addNamespace(blas); + new CUBLASRegistry(this).registerCUBLASFunctions(blas); + } else { + LOGGER.warning("cuBLAS with asynchronous scheduler is supported only on GPUs with compute capability >= 6.0 (Pascal and newer). 
It cannot be enabled."); + } } - if (this.getOption(GrCUDAOptions.TensorRTEnabled)) { + if (grCUDAOptionMap.isTensorRTEnabled()) { Namespace trt = new Namespace(TensorRTRegistry.NAMESPACE); namespace.addNamespace(trt); new TensorRTRegistry(this).registerTensorRTFunctions(trt); } + if (grCUDAOptionMap.isCuSPARSEEnabled()) { + Namespace sparse = new Namespace(CUSPARSERegistry.NAMESPACE); + namespace.addNamespace(sparse); + new CUSPARSERegistry(this).registerCUSPARSEFunctions(sparse); + } this.rootNamespace = namespace; } +// public static GrCUDAContext get(Node node) { +// return REFERENCE.get(node); +// } + public Env getEnv() { return env; } + public AbstractGrCUDAExecutionContext getGrCUDAExecutionContext() { + return grCUDAExecutionContext; + } + public CUDARuntime getCUDARuntime() { - return cudaRuntime; + return this.grCUDAExecutionContext.getCudaRuntime(); } public Namespace getRootNamespace() { @@ -141,8 +209,33 @@ public ConcurrentHashMap, CallTarget> getMapCallTargets() { return uncachedMapCallTargets; } - @TruffleBoundary - public T getOption(OptionKey key) { - return env.getOptions().get(key); + + /** + * Compute the maximum number of concurrent threads that can be spawned by GrCUDA. + * This value is usually smaller than or equal to the number of logical CPU threads available on the machine. + * @return the maximum number of concurrent threads that can be spawned by GrCUDA + */ + public int getNumberOfThreads() { + return Runtime.getRuntime().availableProcessors(); + } + + public GrCUDAOptionMap getOptions() { + return grCUDAOptionMap; } + + /** + * Clean up the GrCUDA context at the end of the execution. If the ExportDAG option is enabled, + * the scheduling DAG will be dumped before the cleanup. 
+ */ + public void cleanup() { + if (grCUDAOptionMap.getExportDAGPath().equals("true")){ + System.out.println("Please specify the destination path for the scheduling DAG export"); + } else if (!grCUDAOptionMap.getExportDAGPath().equals("false")){ + ExecutionDAG dag = grCUDAExecutionContext.getDag(); + GraphExport graphExport = new GraphExport(dag); + graphExport.graphGenerator(grCUDAOptionMap.getExportDAGPath()); + } + this.grCUDAExecutionContext.cleanup(); + } + } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAException.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAException.java index 73c4ab13..5f74e421 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAException.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAException.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. - * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -31,22 +38,19 @@ import java.util.Arrays; import java.util.Optional; -import com.oracle.truffle.api.TruffleException; +import com.oracle.truffle.api.exception.AbstractTruffleException; import com.oracle.truffle.api.interop.InteropException; import com.oracle.truffle.api.nodes.Node; -public final class GrCUDAException extends RuntimeException implements TruffleException { +public final class GrCUDAException extends AbstractTruffleException { private static final long serialVersionUID = 8614211550329856579L; - private final Node node; - public GrCUDAException(String message) { this(message, null); } public GrCUDAException(String message, Node node) { - super(message); - this.node = node; + super(message, node); } public GrCUDAException(InteropException e) { @@ -62,12 +66,7 @@ public GrCUDAException(int errorCode, String message, String[] functionName) { } public static String format(String... name) { - Optional result = Arrays.asList(name).stream().reduce((a, b) -> a + "::" + b); + Optional result = Arrays.stream(name).reduce((a, b) -> a + "::" + b); return result.orElse(""); } - - @Override - public Node getLocation() { - return node; - } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAInternalException.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAInternalException.java index b716a428..0ffb0574 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAInternalException.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAInternalException.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,34 +35,22 @@ */ package com.nvidia.grcuda; -import com.oracle.truffle.api.TruffleException; +import com.oracle.truffle.api.exception.AbstractTruffleException; import com.oracle.truffle.api.interop.InteropException; import com.oracle.truffle.api.nodes.Node; -public final class GrCUDAInternalException extends RuntimeException implements TruffleException { +public final class GrCUDAInternalException extends AbstractTruffleException { private static final long serialVersionUID = 8614211550329856579L; - private final Node node; - public GrCUDAInternalException(String message) { this(message, null); } public GrCUDAInternalException(String message, Node node) { - super(message); - this.node = node; + super(message, node); } public GrCUDAInternalException(InteropException e) { this(e.getMessage()); } - - public boolean isInternalError() { - return true; - } - - @Override - public Node getLocation() { - return node; - } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDALanguage.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDALanguage.java index dcf106ed..31519d4c 
100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDALanguage.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDALanguage.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. - * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,6 +35,8 @@ */ package com.nvidia.grcuda; + +import com.oracle.truffle.api.TruffleLogger; import org.graalvm.options.OptionDescriptors; import com.nvidia.grcuda.nodes.ExpressionNode; @@ -36,16 +45,18 @@ import com.oracle.truffle.api.CallTarget; import com.oracle.truffle.api.Truffle; import com.oracle.truffle.api.TruffleLanguage; -import com.oracle.truffle.api.interop.TruffleObject; + /** - * grCUDA Truffle language that exposes the GPU device and CUDA runtime to polyglot Graal languages. + * GrCUDA Truffle language that exposes the GPU device and CUDA runtime to polyglot Graal languages. 
*/ -@TruffleLanguage.Registration(id = GrCUDALanguage.ID, name = "grcuda", version = "0.1", internal = false) +@TruffleLanguage.Registration(id = GrCUDALanguage.ID, name = "grcuda", version = "0.1", internal = false, contextPolicy = TruffleLanguage.ContextPolicy.SHARED) public final class GrCUDALanguage extends TruffleLanguage { public static final String ID = "grcuda"; + public static final TruffleLogger LOGGER = TruffleLogger.getLogger(ID, "com.nvidia.grcuda"); + @Override protected GrCUDAContext createContext(Env env) { if (!env.isNativeAccessAllowed()) { @@ -54,15 +65,6 @@ protected GrCUDAContext createContext(Env env) { return new GrCUDAContext(env); } - @Override - protected boolean isObjectOfLanguage(Object object) { - if (!(object instanceof TruffleObject)) { - return false; - } - TruffleObject truffleObject = (TruffleObject) object; - return truffleObject instanceof DeviceArray; - } - @Override protected CallTarget parse(ParsingRequest request) { ExpressionNode expression = new ParserAntlr().parse(request.getSource()); @@ -74,13 +76,35 @@ public static GrCUDALanguage getCurrentLanguage() { return TruffleLanguage.getCurrentLanguage(GrCUDALanguage.class); } + public static GrCUDAContext getCurrentContext() { + return getCurrentContext(GrCUDALanguage.class); + } + @Override protected void disposeContext(GrCUDAContext cxt) { cxt.disposeAll(); } @Override - protected OptionDescriptors getOptionDescriptors() { + public OptionDescriptors getOptionDescriptors() { + return GrCUDALanguage.getOptionDescriptorsStatic(); + } + + /** + * We make the list of option descriptors available statically so it can be used when mocking the language, without having to create a context; + * @return the list of option descriptors, with default values available; + */ + public static OptionDescriptors getOptionDescriptorsStatic() { return new GrCUDAOptionsOptionDescriptors(); } + + @Override + protected boolean isThreadAccessAllowed(Thread thread, boolean singleThreaded) { + return 
true; + } + + @Override + protected void finalizeContext(GrCUDAContext context) { + context.cleanup(); + } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDALogger.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDALogger.java new file mode 100644 index 00000000..0cc1791c --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDALogger.java @@ -0,0 +1,32 @@ +package com.nvidia.grcuda; + +import com.oracle.truffle.api.TruffleLogger; + +public class GrCUDALogger { + + public static final String DEFAULT_LOGGER_LEVEL = "INFO"; + + public static final String GRCUDA_LOGGER = "com.nvidia.grcuda"; + + public static final String CUDALIBRARIES_LOGGER = "com.nvidia.grcuda.cudalibraries"; + + public static final String FUNCTIONS_LOGGER = "com.nvidia.grcuda.functions"; + + public static final String NODES_LOGGER = "com.nvidia.grcuda.nodes"; + + public static final String PARSER_LOGGER = "com.nvidia.grcuda.parser"; + + public static final String RUNTIME_LOGGER = "com.nvidia.grcuda.runtime"; + + public static final String ARRAY_LOGGER = "com.nvidia.grcuda.runtime.array"; + + public static final String COMPUTATION_LOGGER = "com.nvidia.grcuda.runtime.computation"; + + public static final String EXECUTIONCONTEXT_LOGGER = "com.nvidia.grcuda.runtime.executioncontext"; + + public static final String STREAM_LOGGER = "com.nvidia.grcuda.runtime.stream"; + + public static TruffleLogger getLogger(String name) { + return TruffleLogger.getLogger(GrCUDALanguage.ID, name); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAOptionMap.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAOptionMap.java new file mode 100644 index 00000000..b7c18795 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAOptionMap.java @@ -0,0 +1,381 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda; + +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.nvidia.grcuda.runtime.computation.memadvise.MemAdviserEnum; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.oracle.truffle.api.TruffleLogger; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.InvalidArrayIndexException; +import com.oracle.truffle.api.interop.StopIterationException; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnknownKeyException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.library.ExportLibrary; +import com.oracle.truffle.api.library.ExportMessage; +import org.graalvm.options.OptionKey; +import org.graalvm.options.OptionValues; + +import java.io.File; +import java.util.HashMap; +import java.util.Iterator; +import java.util.Map; +import java.util.NoSuchElementException; +import java.util.Objects; + +@ExportLibrary(InteropLibrary.class) +public class GrCUDAOptionMap implements TruffleObject { + + /** + * Store options using the option name and its value; + */ + private final HashMap optionsMap; + /** + * Store a mapping between GrCUDA's Truffle options and their names, as strings. 
+ * OptionKeys are assumed to be immutable, so this map must be read-only as well; + */ + private final HashMap, String> optionNames; + + private static final TruffleLogger LOGGER = TruffleLogger.getLogger(GrCUDALanguage.ID, "com.nvidia.grcuda.GrCUDAOptionMap"); + + public static final ExecutionPolicyEnum DEFAULT_EXECUTION_POLICY = ExecutionPolicyEnum.ASYNC; + public static final DependencyPolicyEnum DEFAULT_DEPENDENCY_POLICY = DependencyPolicyEnum.NO_CONST; + public static final RetrieveNewStreamPolicyEnum DEFAULT_RETRIEVE_STREAM_POLICY = RetrieveNewStreamPolicyEnum.REUSE; + public static final RetrieveParentStreamPolicyEnum DEFAULT_PARENT_STREAM_POLICY = RetrieveParentStreamPolicyEnum.SAME_AS_PARENT; + public static final DeviceSelectionPolicyEnum DEFAULT_DEVICE_SELECTION_POLICY = DeviceSelectionPolicyEnum.SINGLE_GPU; + public static final MemAdviserEnum DEFAULT_MEM_ADVISE_POLICY = MemAdviserEnum.NONE; + public static final boolean DEFAULT_INPUT_PREFETCH = false; // Value obtained from the input flags; + public static final boolean DEFAULT_FORCE_STREAM_ATTACH = false; + public static final boolean DEFAULT_TENSORRT_ENABLED = false; + public static final boolean DEFAULT_ENABLE_COMPUTATION_TIMERS = false; + public static final Integer DEFAULT_NUMBER_OF_GPUs = 1; + public static final String DEFAULT_BANDWIDTH_MATRIX = System.getenv("GRCUDA_HOME") + File.separatorChar + + "projects" + File.separatorChar + "resources" + File.separatorChar + + "connection_graph" + File.separatorChar + "datasets" + File.separatorChar + "connection_graph.csv"; + public static final double DEFAULT_DATA_THRESHOLD = 0.1; + public static final String DEFAULT_EXPORT_DAG = "false"; + + public GrCUDAOptionMap(OptionValues options) { + optionsMap = new HashMap<>(); + optionNames = new HashMap<>(); + + // Store the name and value of each option; + // Map each OptionKey to its name, to retrieve values inside GrCUDA; + options.getDescriptors().forEach(o -> { + optionsMap.put(o.getName(), 
options.get(o.getKey())); + optionNames.put(o.getKey(), o.getName()); + }); + + // Parse individual options; + + // Stream retrieval policy; + optionsMap.replace(optionNames.get(GrCUDAOptions.RetrieveNewStreamPolicy), parseRetrieveStreamPolicy(options.get(GrCUDAOptions.RetrieveNewStreamPolicy))); + // How streams are obtained from parent computations; + optionsMap.replace(optionNames.get(GrCUDAOptions.RetrieveParentStreamPolicy), parseParentStreamPolicy(options.get(GrCUDAOptions.RetrieveParentStreamPolicy))); + // Dependency computation policy; + optionsMap.replace(optionNames.get(GrCUDAOptions.DependencyPolicy), parseDependencyPolicy(options.get(GrCUDAOptions.DependencyPolicy))); + // Execution policy; + optionsMap.replace(optionNames.get(GrCUDAOptions.ExecutionPolicy), parseExecutionPolicy(options.get(GrCUDAOptions.ExecutionPolicy))); + // Device selection policy; + optionsMap.replace(optionNames.get(GrCUDAOptions.DeviceSelectionPolicy), parseDeviceSelectionPolicy(options.get(GrCUDAOptions.DeviceSelectionPolicy))); + // Memory advise policy; + optionsMap.replace(optionNames.get(GrCUDAOptions.MemAdvisePolicy), parseMemAdvisePolicy(options.get(GrCUDAOptions.MemAdvisePolicy))); + } + + /** + * Obtain the option value starting from the OptionKey; + */ + private Object getOptionValueFromOptionKey(OptionKey optionKey) { + return optionsMap.get(optionNames.get(optionKey)); + } + + // Enforces immutability; + public HashMap getOptions(){ + return new HashMap<>(optionsMap); + } + + private static ExecutionPolicyEnum parseExecutionPolicy(String policyString) { + if (policyString.equals(ExecutionPolicyEnum.SYNC.toString())) return ExecutionPolicyEnum.SYNC; + else if (policyString.equals(ExecutionPolicyEnum.ASYNC.toString())) return ExecutionPolicyEnum.ASYNC; + else { + LOGGER.severe("unknown execution policy=" + policyString + "; using default=" + DEFAULT_EXECUTION_POLICY); + return DEFAULT_EXECUTION_POLICY; + } + } + + private static DependencyPolicyEnum 
parseDependencyPolicy(String policyString) {
+        if (policyString.equals(DependencyPolicyEnum.WITH_CONST.toString())) return DependencyPolicyEnum.WITH_CONST;
+        else if (policyString.equals(DependencyPolicyEnum.NO_CONST.toString())) return DependencyPolicyEnum.NO_CONST;
+        else {
+            LOGGER.warning("Warning: unknown dependency policy=" + policyString + "; using default=" + DEFAULT_DEPENDENCY_POLICY);
+            return DEFAULT_DEPENDENCY_POLICY;
+        }
+    }
+
+    private static RetrieveNewStreamPolicyEnum parseRetrieveStreamPolicy(String policyString) {
+        if (policyString.equals(RetrieveNewStreamPolicyEnum.REUSE.toString())) return RetrieveNewStreamPolicyEnum.REUSE;
+        else if (policyString.equals(RetrieveNewStreamPolicyEnum.ALWAYS_NEW.toString())) return RetrieveNewStreamPolicyEnum.ALWAYS_NEW;
+        else {
+            LOGGER.warning("Warning: unknown new stream retrieval policy=" + policyString + "; using default=" + DEFAULT_RETRIEVE_STREAM_POLICY);
+            return DEFAULT_RETRIEVE_STREAM_POLICY;
+        }
+    }
+
+    private static RetrieveParentStreamPolicyEnum parseParentStreamPolicy(String policyString) {
+        if (Objects.equals(policyString, RetrieveParentStreamPolicyEnum.DISJOINT.toString())) return RetrieveParentStreamPolicyEnum.DISJOINT;
+        else if (Objects.equals(policyString, RetrieveParentStreamPolicyEnum.SAME_AS_PARENT.toString())) return RetrieveParentStreamPolicyEnum.SAME_AS_PARENT;
+        else if (Objects.equals(policyString, RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT.toString())) return RetrieveParentStreamPolicyEnum.MULTIGPU_EARLY_DISJOINT;
+        else if (Objects.equals(policyString, RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT.toString())) return RetrieveParentStreamPolicyEnum.MULTIGPU_DISJOINT;
+        else {
+            LOGGER.warning("Warning: unknown parent stream retrieval policy=" + policyString + "; using default=" + DEFAULT_PARENT_STREAM_POLICY);
+            return DEFAULT_PARENT_STREAM_POLICY;
+        }
+    }
+
+    private static DeviceSelectionPolicyEnum parseDeviceSelectionPolicy(String policyString) {
+        if (Objects.equals(policyString, DeviceSelectionPolicyEnum.SINGLE_GPU.toString())) return DeviceSelectionPolicyEnum.SINGLE_GPU;
+        else if (Objects.equals(policyString, DeviceSelectionPolicyEnum.ROUND_ROBIN.toString())) return DeviceSelectionPolicyEnum.ROUND_ROBIN;
+        else if (Objects.equals(policyString, DeviceSelectionPolicyEnum.STREAM_AWARE.toString())) return DeviceSelectionPolicyEnum.STREAM_AWARE;
+        else if (Objects.equals(policyString, DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE.toString())) return DeviceSelectionPolicyEnum.MIN_TRANSFER_SIZE;
+        else if (Objects.equals(policyString, DeviceSelectionPolicyEnum.MINMIN_TRANSFER_TIME.toString())) return DeviceSelectionPolicyEnum.MINMIN_TRANSFER_TIME;
+        else if (Objects.equals(policyString, DeviceSelectionPolicyEnum.MINMAX_TRANSFER_TIME.toString())) return DeviceSelectionPolicyEnum.MINMAX_TRANSFER_TIME;
+        else {
+            LOGGER.warning("Warning: unknown device selection policy=" + policyString + "; using default=" + DEFAULT_DEVICE_SELECTION_POLICY);
+            return DEFAULT_DEVICE_SELECTION_POLICY;
+        }
+    }
+
+    private static MemAdviserEnum parseMemAdvisePolicy(String policyString) {
+        if (Objects.equals(policyString, MemAdviserEnum.ADVISE_READ_MOSTLY.toString())) return MemAdviserEnum.ADVISE_READ_MOSTLY;
+        else if (Objects.equals(policyString, MemAdviserEnum.ADVISE_PREFERRED_LOCATION.toString())) return MemAdviserEnum.ADVISE_PREFERRED_LOCATION;
+        else if (Objects.equals(policyString, MemAdviserEnum.NONE.toString())) return MemAdviserEnum.NONE;
+        else {
+            LOGGER.warning("Warning: unknown memory advice policy=" + policyString + "; using default=" + DEFAULT_MEM_ADVISE_POLICY);
+            return DEFAULT_MEM_ADVISE_POLICY;
+        }
+    }
+
+    public Boolean isCuBLASEnabled(){
+        return (Boolean) getOptionValueFromOptionKey(GrCUDAOptions.CuBLASEnabled);
+    }
+
+    public String getCuBLASLibrary(){
+        return (String) getOptionValueFromOptionKey(GrCUDAOptions.CuBLASLibrary);
+    }
+
+    public Boolean isCuMLEnabled(){
+        return (Boolean) getOptionValueFromOptionKey(GrCUDAOptions.CuMLEnabled);
+    }
+
+    public String getCuMLLibrary(){
+        return (String) getOptionValueFromOptionKey(GrCUDAOptions.CuMLLibrary);
+    }
+
+    public Boolean isCuSPARSEEnabled(){
+        return (Boolean) getOptionValueFromOptionKey(GrCUDAOptions.CuSPARSEEnabled);
+    }
+
+    public String getCuSPARSELibrary(){
+        return (String) getOptionValueFromOptionKey(GrCUDAOptions.CuSPARSELibrary);
+    }
+
+    public ExecutionPolicyEnum getExecutionPolicy(){
+        return (ExecutionPolicyEnum) getOptionValueFromOptionKey(GrCUDAOptions.ExecutionPolicy);
+    }
+
+    public DependencyPolicyEnum getDependencyPolicy(){
+        return (DependencyPolicyEnum) getOptionValueFromOptionKey(GrCUDAOptions.DependencyPolicy);
+    }
+
+    public RetrieveNewStreamPolicyEnum getRetrieveNewStreamPolicy(){
+        return (RetrieveNewStreamPolicyEnum) getOptionValueFromOptionKey(GrCUDAOptions.RetrieveNewStreamPolicy);
+    }
+
+    public RetrieveParentStreamPolicyEnum getRetrieveParentStreamPolicy(){
+        return (RetrieveParentStreamPolicyEnum) getOptionValueFromOptionKey(GrCUDAOptions.RetrieveParentStreamPolicy);
+    }
+
+    public Boolean isForceStreamAttach(){
+        return (Boolean) getOptionValueFromOptionKey(GrCUDAOptions.ForceStreamAttach);
+    }
+
+    public Boolean isInputPrefetch(){
+        return (Boolean) getOptionValueFromOptionKey(GrCUDAOptions.InputPrefetch);
+    }
+
+    public Boolean isTimeComputation() { return (Boolean) getOptionValueFromOptionKey(GrCUDAOptions.EnableComputationTimers); }
+
+    public Boolean isTensorRTEnabled(){
+        return (Boolean) getOptionValueFromOptionKey(GrCUDAOptions.TensorRTEnabled);
+    }
+
+    public String getTensorRTLibrary(){
+        return (String) getOptionValueFromOptionKey(GrCUDAOptions.TensorRTLibrary);
+    }
+
+    public Integer getNumberOfGPUs() {
+        return (Integer) getOptionValueFromOptionKey(GrCUDAOptions.NumberOfGPUs);
+    }
+
+    public String getBandwidthMatrix() { return (String) getOptionValueFromOptionKey(GrCUDAOptions.BandwidthMatrix); }
+
+    public Double getDataThreshold() {
+        return (Double) getOptionValueFromOptionKey(GrCUDAOptions.DataThreshold);
+    }
+
+    public String getExportDAGPath() {
+        return (String) getOptionValueFromOptionKey(GrCUDAOptions.ExportDAG);
+    }
+
+    public MemAdviserEnum getMemAdvisePolicy() {
+        return (MemAdviserEnum) getOptionValueFromOptionKey(GrCUDAOptions.MemAdvisePolicy);
+    }
+
+    public DeviceSelectionPolicyEnum getDeviceSelectionPolicy() {
+        return (DeviceSelectionPolicyEnum) getOptionValueFromOptionKey(GrCUDAOptions.DeviceSelectionPolicy);
+    }
+
+    public void setNumberOfGPUs(int numberOfGPUs) {
+        LOGGER.info("updated the number of GPUs to use from " + getNumberOfGPUs() + " to " + numberOfGPUs);
+        optionsMap.replace(optionNames.get(GrCUDAOptions.NumberOfGPUs), numberOfGPUs);
+    }
+
+    // Implement InteropLibrary;
+
+    @ExportMessage
+    public final boolean hasHashEntries(){
+        return true;
+    }
+
+    @ExportMessage
+    public final Object readHashValue(Object key) throws UnknownKeyException, UnsupportedMessageException {
+        Object value;
+        if (key instanceof String){
+            value = this.optionsMap.get(key);
+        }
+        else {
+            throw UnsupportedMessageException.create();
+        }
+        if (value == null) throw UnknownKeyException.create(key);
+        return value.toString();
+    }
+
+    @ExportMessage
+    public final long getHashSize(){
+        return optionsMap.size();
+    }
+
+    @ExportMessage
+    public final boolean isHashEntryReadable(Object key) {
+        return key instanceof String && this.optionsMap.containsKey(key);
+    }
+
+    @ExportMessage
+    public Object getHashEntriesIterator() {
+        return new EntriesIterator(optionsMap.entrySet().iterator());
+    }
+
+    @ExportLibrary(InteropLibrary.class)
+    public static final class EntriesIterator implements TruffleObject {
+        private final Iterator<Map.Entry<String, Object>> iterator;
+
+        private EntriesIterator(Iterator<Map.Entry<String, Object>> iterator) {
+            this.iterator = iterator;
+        }
+
+        @SuppressWarnings("static-method")
+        @ExportMessage
+        public boolean isIterator() {
+            return true;
+        }
+
+        @ExportMessage
+        public boolean hasIteratorNextElement() {
+            try {
+                return iterator.hasNext();
+            } catch(NoSuchElementException e) {
+                return false;
+            }
+        }
+
+        @ExportMessage
+        public GrCUDAOptionTuple getIteratorNextElement() throws StopIterationException {
+            if (hasIteratorNextElement()) {
+                Map.Entry<String, Object> entry = iterator.next();
+                return new GrCUDAOptionTuple(entry.getKey(), entry.getValue().toString());
+            } else {
+                throw StopIterationException.create();
+            }
+        }
+    }
+
+    @ExportLibrary(InteropLibrary.class)
+    public static class GrCUDAOptionTuple implements TruffleObject {
+
+        private final int SIZE = 2;
+        private final String[] entry = new String[SIZE];
+
+        public GrCUDAOptionTuple(String key, String value) {
+            entry[0] = key;
+            entry[1] = value;
+        }
+
+        @ExportMessage
+        static boolean hasArrayElements(GrCUDAOptionTuple tuple) {
+            return true;
+        }
+
+        @ExportMessage
+        public final boolean isArrayElementReadable(long index) {
+            return index == 0 || index == 1;
+        }
+
+        @ExportMessage
+        public final Object readArrayElement(long index) throws InvalidArrayIndexException {
+            if (index == 0 || index == 1) {
+                return entry[(int)index];
+            }
+            else {
+                throw InvalidArrayIndexException.create(index);
+            }
+        }
+
+        @ExportMessage
+        public final long getArraySize() {
+            return SIZE;
+        }
+    }
+
+}
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAOptions.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAOptions.java
index 906c240e..49180f24 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAOptions.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/GrCUDAOptions.java
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
  * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -13,6 +14,12 @@
  *  * Neither the name of NVIDIA CORPORATION nor the names of its
  *    contributors may be used to endorse or promote products derived
  *    from this software without specific prior written permission.
+ *  * Neither the name of NECSTLab nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *  * Neither the name of Politecnico di Milano nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -32,33 +39,77 @@
 import org.graalvm.options.OptionKey;
 import org.graalvm.options.OptionStability;
 
-import com.nvidia.grcuda.cublas.CUBLASRegistry;
-import com.nvidia.grcuda.cuml.CUMLRegistry;
-import com.nvidia.grcuda.tensorrt.TensorRTRegistry;
+import com.nvidia.grcuda.cudalibraries.cublas.CUBLASRegistry;
+import com.nvidia.grcuda.cudalibraries.cuml.CUMLRegistry;
+import com.nvidia.grcuda.cudalibraries.tensorrt.TensorRTRegistry;
+import com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry;
 
 import com.oracle.truffle.api.Option;
 
 @Option.Group(GrCUDALanguage.ID)
 public final class GrCUDAOptions {
-    private GrCUDAOptions() {
-        // no instances
-    }
 
     @Option(category = OptionCategory.USER, help = "Enable cuBLAS support.", stability = OptionStability.STABLE) //
     public static final OptionKey<Boolean> CuBLASEnabled = new OptionKey<>(true);
 
-    @Option(category = OptionCategory.USER, help = "Set the location of the cublas library.", stability = OptionStability.STABLE) //
+    @Option(category = OptionCategory.USER, help = "Set the location of the cuBLAS library.", stability = OptionStability.STABLE) //
     public static final OptionKey<String> CuBLASLibrary = new OptionKey<>(CUBLASRegistry.DEFAULT_LIBRARY);
 
     @Option(category = OptionCategory.USER, help = "Enable cuML support.", stability = OptionStability.STABLE) //
     public static final OptionKey<Boolean> CuMLEnabled = new OptionKey<>(true);
 
-    @Option(category = OptionCategory.USER, help = "Set the location of the cuml library.", stability = OptionStability.STABLE) //
+    @Option(category = OptionCategory.USER, help = "Set the location of the cuML library.", stability = OptionStability.STABLE) //
     public static final OptionKey<String> CuMLLibrary = new OptionKey<>(CUMLRegistry.DEFAULT_LIBRARY);
+
+    @Option(category = OptionCategory.USER, help = "Enable cuSPARSE support.", stability = OptionStability.STABLE) //
+    public static final OptionKey<Boolean> CuSPARSEEnabled = new OptionKey<>(true);
+
+    @Option(category = OptionCategory.USER, help = "Set the location of the cuSPARSE library.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> CuSPARSELibrary = new OptionKey<>(CUSPARSERegistry.DEFAULT_LIBRARY);
 
     @Option(category = OptionCategory.USER, help = "Enable TensorRT support.", stability = OptionStability.STABLE) //
-    public static final OptionKey<Boolean> TensorRTEnabled = new OptionKey<>(true);
+    public static final OptionKey<Boolean> TensorRTEnabled = new OptionKey<>(GrCUDAOptionMap.DEFAULT_TENSORRT_ENABLED);
 
     @Option(category = OptionCategory.USER, help = "Set the location of the TensorRT library.", stability = OptionStability.STABLE) //
     public static final OptionKey<String> TensorRTLibrary = new OptionKey<>(TensorRTRegistry.DEFAULT_LIBRARY);
 
+    @Option(category = OptionCategory.USER, help = "Log the execution time of GrCUDA computations using timers.", stability = OptionStability.STABLE) //
+    public static final OptionKey<Boolean> EnableComputationTimers = new OptionKey<>(GrCUDAOptionMap.DEFAULT_ENABLE_COMPUTATION_TIMERS);
+
+    @Option(category = OptionCategory.USER, help = "Choose the scheduling policy of GrCUDA computations.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> ExecutionPolicy = new OptionKey<>(GrCUDAOptionMap.DEFAULT_EXECUTION_POLICY.toString());
+
+    @Option(category = OptionCategory.USER, help = "Choose how data dependencies between GrCUDA computations are computed.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> DependencyPolicy = new OptionKey<>(GrCUDAOptionMap.DEFAULT_DEPENDENCY_POLICY.toString());
+
+    @Option(category = OptionCategory.USER, help = "Choose how streams for new GrCUDA computations are created.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> RetrieveNewStreamPolicy = new OptionKey<>(GrCUDAOptionMap.DEFAULT_RETRIEVE_STREAM_POLICY.toString());
+
+    @Option(category = OptionCategory.USER, help = "Choose how streams for new GrCUDA computations are obtained from parent computations.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> RetrieveParentStreamPolicy = new OptionKey<>(GrCUDAOptionMap.DEFAULT_PARENT_STREAM_POLICY.toString());
+
+    @Option(category = OptionCategory.USER, help = "Force the use of array stream attaching even when not required (e.g. post-Pascal GPUs).", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<Boolean> ForceStreamAttach = new OptionKey<>(GrCUDAOptionMap.DEFAULT_FORCE_STREAM_ATTACH);
+
+    @Option(category = OptionCategory.USER, help = "Always prefetch input arrays to GPU if possible (e.g. post-Pascal GPUs).", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<Boolean> InputPrefetch = new OptionKey<>(GrCUDAOptionMap.DEFAULT_INPUT_PREFETCH);
+
+    @Option(category = OptionCategory.USER, help = "Set how many GPUs can be used during computation. It must be at least 1, and if > 1 more than 1 GPUs are used (if available).", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<Integer> NumberOfGPUs = new OptionKey<>(GrCUDAOptionMap.DEFAULT_NUMBER_OF_GPUs);
+
+    @Option(category = OptionCategory.USER, help = "Choose the heuristic that manages how GPU computations are mapped to devices, if multiple GPUs are available.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> DeviceSelectionPolicy = new OptionKey<>(GrCUDAOptionMap.DEFAULT_DEVICE_SELECTION_POLICY.toString());
+
+    @Option(category = OptionCategory.USER, help = "Select a managed memory memAdvise flag, if multiple GPUs are available.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> MemAdvisePolicy = new OptionKey<>(GrCUDAOptionMap.DEFAULT_MEM_ADVISE_POLICY.toString());
+
+    @Option(category = OptionCategory.USER, help = "Set the location of the CSV file that contains the estimated bandwidth between each CPU and GPU in the system.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> BandwidthMatrix = new OptionKey<>(GrCUDAOptionMap.DEFAULT_BANDWIDTH_MATRIX);
+
+    @Option(category = OptionCategory.USER, help = "When selecting a device, do not give priority to devices that have less than this percentage of data already available, " +
+                    "in some DeviceSelectionPolicies such as min-transfer-size. A lower percentage favors exploitation, a high percentage favors exploration.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<Double> DataThreshold = new OptionKey<>(GrCUDAOptionMap.DEFAULT_DATA_THRESHOLD);
+
+    @Option(category = OptionCategory.USER, help = "Add this option to dump scheduling DAG. Specify the destination path and the file name as value of the option (e.g. ../../../ExecutionDAG). File will be saved with .dot extension.", stability = OptionStability.EXPERIMENTAL) //
+    public static final OptionKey<String> ExportDAG = new OptionKey<>(GrCUDAOptionMap.DEFAULT_EXPORT_DAG);
 }
+
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/KernelBinding.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/KernelBinding.java
index cf2b048e..2ad67c35 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/KernelBinding.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/KernelBinding.java
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
  * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -13,6 +14,12 @@
  *  * Neither the name of NVIDIA CORPORATION nor the names of its
  *    contributors may be used to endorse or promote products derived
  *    from this software without specific prior written permission.
+ *  * Neither the name of NECSTLab nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *  * Neither the name of Politecnico di Milano nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -28,19 +35,21 @@
  */
 package com.nvidia.grcuda;
 
+import com.nvidia.grcuda.runtime.computation.ComputationArgument;
+
 import java.util.ArrayList;
 
 public final class KernelBinding extends Binding {
 
-    private KernelBinding(String name, ArrayList parameterList, boolean hasCxxMangledName) {
-        super(name, parameterList, hasCxxMangledName);
+    private KernelBinding(String name, ArrayList<ComputationArgument> computationArgumentList, boolean hasCxxMangledName) {
+        super(name, computationArgumentList, hasCxxMangledName);
     }
 
-    public static KernelBinding newCxxBinding(String name, ArrayList parameterList) {
-        return new KernelBinding(name, parameterList, true);
+    public static KernelBinding newCxxBinding(String name, ArrayList<ComputationArgument> computationArgumentList) {
+        return new KernelBinding(name, computationArgumentList, true);
     }
 
-    public static KernelBinding newCBinding(String name, ArrayList parameterList) {
-        return new KernelBinding(name, parameterList, false);
+    public static KernelBinding newCBinding(String name, ArrayList<ComputationArgument> computationArgumentList) {
+        return new KernelBinding(name, computationArgumentList, false);
     }
 }
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/ConfiguredKernel.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MemberSet.java
similarity index 50%
rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/ConfiguredKernel.java
rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MemberSet.java
index f26750f1..9b4bdba4 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/ConfiguredKernel.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MemberSet.java
@@ -1,6 +1,5 @@
 /*
- * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
- * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano.
All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -10,7 +9,10 @@
  *  * Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
- *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *  * Neither the name of NECSTLab nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *  * Neither the name of Politecnico di Milano nor the names of its
  *    contributors may be used to endorse or promote products derived
  *    from this software without specific prior written permission.
  *
@@ -26,52 +28,53 @@
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
-package com.nvidia.grcuda.gpu;
+package com.nvidia.grcuda;
 
-import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
-import com.oracle.truffle.api.interop.ArityException;
+import com.oracle.truffle.api.CompilerDirectives;
 import com.oracle.truffle.api.interop.InteropLibrary;
+import com.oracle.truffle.api.interop.InvalidArrayIndexException;
 import com.oracle.truffle.api.interop.TruffleObject;
-import com.oracle.truffle.api.interop.UnsupportedTypeException;
-import com.oracle.truffle.api.library.CachedLibrary;
 import com.oracle.truffle.api.library.ExportLibrary;
 import com.oracle.truffle.api.library.ExportMessage;
 
+import java.util.Arrays;
+
 @ExportLibrary(InteropLibrary.class)
-public class ConfiguredKernel implements TruffleObject {
+public final class MemberSet implements TruffleObject {
 
-    private final Kernel kernel;
-    private final KernelConfig config;
+    @CompilerDirectives.CompilationFinal(dimensions = 1) private final String[] values;
 
-    public ConfiguredKernel(Kernel kernel, KernelConfig config) {
-        this.kernel = kernel;
-        this.config = config;
+    public MemberSet(String... values) {
+        this.values = values;
     }
 
     @ExportMessage
-    boolean isExecutable() {
+    @SuppressWarnings("static-method")
+    public boolean hasArrayElements() {
         return true;
     }
 
     @ExportMessage
-    @TruffleBoundary
-    Object execute(Object[] arguments,
-                    @CachedLibrary(limit = "3") InteropLibrary boolAccess,
-                    @CachedLibrary(limit = "3") InteropLibrary int8Access,
-                    @CachedLibrary(limit = "3") InteropLibrary int16Access,
-                    @CachedLibrary(limit = "3") InteropLibrary int32Access,
-                    @CachedLibrary(limit = "3") InteropLibrary int64Access,
-                    @CachedLibrary(limit = "3") InteropLibrary doubleAccess) throws UnsupportedTypeException, ArityException {
-        kernel.incrementLaunchCount();
-        try (KernelArguments args = kernel.createKernelArguments(arguments, boolAccess, int8Access, int16Access,
-                        int32Access, int64Access, doubleAccess)) {
-            kernel.getCudaRuntime().cuLaunchKernel(kernel, config, args);
+    public long getArraySize() {
+        return values.length;
+    }
+
+    @ExportMessage
+    public boolean isArrayElementReadable(long index) {
+        return index >= 0 && index < values.length;
+    }
+
+    @ExportMessage
+    public Object readArrayElement(long index) throws InvalidArrayIndexException {
+        if ((index < 0) || (index >= values.length)) {
+            CompilerDirectives.transferToInterpreter();
+            throw InvalidArrayIndexException.create(index);
         }
-        return this;
+        return values[(int) index];
     }
 
-    @Override
-    public String toString() {
-        return "ConfiguredKernel(config=" + config + ", kernel=" + kernel + ')';
+    @CompilerDirectives.TruffleBoundary
+    public boolean constainsValue(String name) {
+        return Arrays.asList(values).contains(name);
     }
-}
+}
\ No newline at end of file
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MultiDimDeviceArrayView.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MultiDimDeviceArrayView.java
deleted file mode 100644
index 124a2bf9..00000000
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MultiDimDeviceArrayView.java
+++ /dev/null
@@ -1,181 +0,0 @@
-/*
- * Copyright
(c) 2019, NVIDIA CORPORATION. All rights reserved. - * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of NVIDIA CORPORATION nor the names of its - * contributors may be used to endorse or promote products derived - * from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY - * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR - * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR - * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, - * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, - * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR - * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY - * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- */ -package com.nvidia.grcuda; - -import com.oracle.truffle.api.CompilerDirectives; -import com.oracle.truffle.api.dsl.Cached; -import com.oracle.truffle.api.dsl.Cached.Shared; -import com.oracle.truffle.api.interop.InteropLibrary; -import com.oracle.truffle.api.interop.InvalidArrayIndexException; -import com.oracle.truffle.api.interop.TruffleObject; -import com.oracle.truffle.api.interop.UnsupportedMessageException; -import com.oracle.truffle.api.interop.UnsupportedTypeException; -import com.oracle.truffle.api.library.CachedLibrary; -import com.oracle.truffle.api.library.ExportLibrary; -import com.oracle.truffle.api.library.ExportMessage; -import com.oracle.truffle.api.profiles.ValueProfile; - -@ExportLibrary(InteropLibrary.class) -public final class MultiDimDeviceArrayView implements TruffleObject { - - private final MultiDimDeviceArray mdDeviceArray; - private final int thisDimension; - private final long offset; - private final long stride; - - MultiDimDeviceArrayView(MultiDimDeviceArray mdDeviceArray, int dim, long offset, long stride) { - this.mdDeviceArray = mdDeviceArray; - this.thisDimension = dim; - this.offset = offset; - this.stride = stride; - } - - public int getDimension() { - return thisDimension; - } - - public long getOffset() { - return offset; - } - - public long getStride() { - return stride; - } - - @Override - public String toString() { - return String.format("MultiDimDeviceArrayView(dim=%d, offset=%d, stride=%d)\n", - thisDimension, offset, stride); - } - - // - // Implementation of Interop Library - // - - @ExportMessage - @SuppressWarnings("static-method") - boolean hasArrayElements() { - return true; - } - - @ExportMessage - long getArraySize() { - return mdDeviceArray.getElementsInDimension(thisDimension); - } - - @ExportMessage - boolean isArrayElementReadable(long index) { - return index >= 0 && index < mdDeviceArray.getElementsInDimension(thisDimension); - } - - @ExportMessage - boolean isArrayElementModifiable(long index) { - 
return (thisDimension + 1) == mdDeviceArray.getNumberDimensions() && - index >= 0 && index < mdDeviceArray.getElementsInDimension(thisDimension); - } - - @ExportMessage - @SuppressWarnings("static-method") - boolean isArrayElementInsertable(@SuppressWarnings("unused") long index) { - return false; - } - - @ExportMessage - Object readArrayElement(long index, - @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws InvalidArrayIndexException { - if ((index < 0) || (index >= mdDeviceArray.getElementsInDimension(thisDimension))) { - CompilerDirectives.transferToInterpreter(); - throw InvalidArrayIndexException.create(index); - } - if ((thisDimension + 1) == mdDeviceArray.getNumberDimensions()) { - long flatIndex = offset + index * stride; - switch (elementTypeProfile.profile(mdDeviceArray.getElementType())) { - case CHAR: - return mdDeviceArray.getNativeView().getByte(flatIndex); - case SINT16: - return mdDeviceArray.getNativeView().getShort(flatIndex); - case SINT32: - return mdDeviceArray.getNativeView().getInt(flatIndex); - case SINT64: - return mdDeviceArray.getNativeView().getLong(flatIndex); - case FLOAT: - return mdDeviceArray.getNativeView().getFloat(flatIndex); - case DOUBLE: - return mdDeviceArray.getNativeView().getDouble(flatIndex); - } - return null; - } else { - long off = offset + index * stride; - long newStride = mdDeviceArray.getStrideInDimension(thisDimension + 1); - return new MultiDimDeviceArrayView(mdDeviceArray, thisDimension + 1, off, newStride); - } - } - - @ExportMessage - void writeArrayElement(long index, Object value, - @CachedLibrary(limit = "3") InteropLibrary valueLibrary, - @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException, InvalidArrayIndexException { - if ((index < 0) || (index >= mdDeviceArray.getElementsInDimension(thisDimension))) { - CompilerDirectives.transferToInterpreter(); - throw 
InvalidArrayIndexException.create(index); - } - if ((thisDimension + 1) == mdDeviceArray.getNumberDimensions()) { - long flatIndex = offset + index * stride; - try { - switch (elementTypeProfile.profile(mdDeviceArray.getElementType())) { - case CHAR: - mdDeviceArray.getNativeView().setByte(flatIndex, valueLibrary.asByte(value)); - break; - case SINT16: - mdDeviceArray.getNativeView().setShort(flatIndex, valueLibrary.asShort(value)); - break; - case SINT32: - mdDeviceArray.getNativeView().setInt(flatIndex, valueLibrary.asInt(value)); - break; - case SINT64: - mdDeviceArray.getNativeView().setLong(flatIndex, valueLibrary.asLong(value)); - break; - case FLOAT: - // InteropLibrary does not downcast Double to Float due loss of precision - mdDeviceArray.getNativeView().setFloat(flatIndex, (float) valueLibrary.asDouble(value)); - break; - case DOUBLE: - mdDeviceArray.getNativeView().setDouble(flatIndex, valueLibrary.asDouble(value)); - break; - } - } catch (UnsupportedMessageException e) { - CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{value}, "value cannot be coerced to " + mdDeviceArray.getElementType()); - } - } else { - CompilerDirectives.transferToInterpreter(); - throw new IllegalStateException("tried to write non-last dimension in MultiDimDeviceArrayView"); - } - } -} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Namespace.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Namespace.java index c8d8b16f..009b5a2a 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Namespace.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Namespace.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -31,9 +38,8 @@ import java.util.Optional; import java.util.TreeMap; -import com.nvidia.grcuda.DeviceArray.MemberSet; import com.nvidia.grcuda.functions.Function; -import com.nvidia.grcuda.gpu.LazyKernel; +import com.nvidia.grcuda.runtime.LazyKernel; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.InteropLibrary; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/NoneValue.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/NoneValue.java index c78502f0..7a911023 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/NoneValue.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/NoneValue.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Type.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Type.java index 418d3777..9aa500d1 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Type.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Type.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -183,7 +190,7 @@ public static Type fromNIDLTypeString(String type) throws TypeException { } } - String getMangled() { + public String getMangled() { return this.cxxEncoding; } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/TypeException.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/TypeException.java index 2219e8db..ce3b6b27 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/TypeException.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/TypeException.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -34,7 +41,7 @@ public class TypeException extends Exception { private static final long serialVersionUID = -7313402629647154160L; - TypeException(String message) { + public TypeException(String message) { super(message); CompilerAsserts.neverPartOfCompilation(); } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/CUDALibraryFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/CUDALibraryFunction.java new file mode 100644 index 00000000..e69495d8 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/CUDALibraryFunction.java @@ -0,0 +1,87 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.cudalibraries; + +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.TypeException; +import com.nvidia.grcuda.functions.Function; +import com.oracle.truffle.api.CompilerDirectives; + +import java.util.ArrayList; +import java.util.List; + +/** + * Wrapper class to CUDA library functions. 
It holds the signature of the function being wrapped, + * and creates {@link ComputationArgument} for the signature and inputs; + */ +public abstract class CUDALibraryFunction extends Function { + + protected final List<ComputationArgument> computationArguments; + + /** + * Constructor, it takes the name of the wrapped function and its NFI signature, + * and creates a list of {@link ComputationArgument} from it; + * @param name name of the function + * @param nfiSignature NFI signature of the function + */ + protected CUDALibraryFunction(String name, String nfiSignature) { + super(name); + // Create the list of computation arguments; + try { + this.computationArguments = ComputationArgument.parseParameterSignature(nfiSignature); + } catch (TypeException e) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(e.getMessage()); + } + } + + /** + * Given a list of inputs, map each signature argument to the corresponding input. + * Assume that inputs are given in the same order as specified by the signature.
+ * In this case, provide a pointer to the CUDA library handle, which is stored as the first parameter of the argument list + * @param args list of inputs + * @param libraryHandle pointer to the native object used as CUDA library handle + * @return list of inputs mapped to signature elements, used to compute dependencies + */ + public List<ComputationArgumentWithValue> createComputationArgumentWithValueList(Object[] args, Long libraryHandle) { + ArrayList<ComputationArgumentWithValue> argumentsWithValue = new ArrayList<>(); + // Set the library handle; + argumentsWithValue.add(new ComputationArgumentWithValue(this.computationArguments.get(0), libraryHandle)); + // Set the other arguments; + for (int i = 0; i < args.length; i++) { + argumentsWithValue.add(new ComputationArgumentWithValue(this.computationArguments.get(i + 1), args[i])); + } + return argumentsWithValue; + } +} + diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cublas/CUBLASRegistry.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cublas/CUBLASRegistry.java similarity index 76% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cublas/CUBLASRegistry.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cublas/CUBLASRegistry.java index 8620ccc4..9f3a04aa 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cublas/CUBLASRegistry.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cublas/CUBLASRegistry.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,21 +33,25 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.cublas; +package com.nvidia.grcuda.cudalibraries.cublas; import static com.nvidia.grcuda.functions.Function.INTEROP; import static com.nvidia.grcuda.functions.Function.expectLong; import java.util.ArrayList; +import java.util.Arrays; import com.nvidia.grcuda.GrCUDAContext; import com.nvidia.grcuda.GrCUDAException; import com.nvidia.grcuda.GrCUDAInternalException; -import com.nvidia.grcuda.GrCUDAOptions; import com.nvidia.grcuda.Namespace; +import com.nvidia.grcuda.cudalibraries.CUDALibraryFunction; import com.nvidia.grcuda.functions.ExternalFunctionFactory; import com.nvidia.grcuda.functions.Function; -import com.nvidia.grcuda.gpu.UnsafeHelper; +import com.nvidia.grcuda.runtime.UnsafeHelper; +import com.nvidia.grcuda.runtime.stream.LibrarySetStreamCUBLAS; +import com.nvidia.grcuda.runtime.computation.CUDALibraryExecution; +import com.nvidia.grcuda.runtime.stream.LibrarySetStream; import com.oracle.truffle.api.CompilerAsserts; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.CompilationFinal; @@ -53,7 +64,7 @@ import com.oracle.truffle.api.interop.UnsupportedTypeException; public class CUBLASRegistry { - public static final String DEFAULT_LIBRARY = "libcublas.so"; + public static final String DEFAULT_LIBRARY = (System.getenv("LIBCUBLAS_DIR") != null ? System.getenv("LIBCUBLAS_DIR") : "") + "libcublas.so"; public static final String DEFAULT_LIBRARY_HINT = " (CuBLAS library location can be set via the --grcuda.CuBLASLibrary= option. 
" + "CuBLAS support can be disabled via --grcuda.CuBLASEnabled=false."; public static final String NAMESPACE = "BLAS"; @@ -61,19 +72,23 @@ public class CUBLASRegistry { private final GrCUDAContext context; private final String libraryPath; + private LibrarySetStream cublasLibrarySetStream; + @CompilationFinal private TruffleObject cublasCreateFunction; @CompilationFinal private TruffleObject cublasDestroyFunction; + @CompilationFinal private TruffleObject cublasSetStreamFunction; @CompilationFinal private TruffleObject cublasCreateFunctionNFI; @CompilationFinal private TruffleObject cublasDestroyFunctionNFI; + @CompilationFinal private TruffleObject cublasSetStreamFunctionNFI; private Long cublasHandle = null; public CUBLASRegistry(GrCUDAContext context) { this.context = context; - libraryPath = context.getOption(GrCUDAOptions.CuBLASLibrary); + libraryPath = context.getOptions().getCuBLASLibrary(); } - private void ensureInitialized() { + public void ensureInitialized() { if (cublasHandle == null) { CompilerDirectives.transferToInterpreterAndInvalidate(); @@ -81,6 +96,7 @@ private void ensureInitialized() { cublasCreateFunctionNFI = CUBLAS_CUBLASCREATE.makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); cublasDestroyFunctionNFI = CUBLAS_CUBLASDESTROY.makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); + cublasSetStreamFunctionNFI = CUBLAS_CUBLASSETSTREAM.makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); // create wrapper for cublasCreate: cublasError_t cublasCreate(long* handle) -> int // cublasCreate() @@ -99,7 +115,7 @@ public Object call(Object[] arguments) throws ArityException { } }; - // create wrapper for cublasDestroy: cublasError_t cublasDestroy(long handle) -> void + // create wrapper for cublasDestroy: cublasError_t cublasDestroy(long handle) // cublasDestroy(long handle) cublasDestroyFunction = new Function(CUBLAS_CUBLASDESTROY.getName()) { @Override @@ -117,6 +133,23 @@ public 
Object call(Object[] arguments) throws ArityException, UnsupportedTypeExc } }; + cublasSetStreamFunction = new Function(CUBLAS_CUBLASSETSTREAM.getName()) { + @Override + @TruffleBoundary + public Object call(Object[] arguments) throws ArityException { + checkArgumentLength(arguments, 2); + try { + long handle = expectLong(arguments[0]); + long streamID = expectLong(arguments[1]); + Object result = INTEROP.execute(cublasSetStreamFunctionNFI, handle, streamID); + checkCUBLASReturnCode(result, "cublasSetStream"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + try { Object result = INTEROP.execute(cublasCreateFunction); cublasHandle = expectLong(result); @@ -126,6 +159,9 @@ public Object call(Object[] arguments) throws ArityException, UnsupportedTypeExc throw new GrCUDAInternalException(e); } } + + cublasLibrarySetStream = new LibrarySetStreamCUBLAS((Function) cublasSetStreamFunctionNFI, cublasHandle); + } private void cuBLASShutdown() { @@ -143,9 +179,9 @@ private void cuBLASShutdown() { public void registerCUBLASFunctions(Namespace namespace) { // Create function wrappers (decorators for all functions except handle con- and - // destruction) + // destruction); for (ExternalFunctionFactory factory : functions) { - final Function wrapperFunction = new Function(factory.getName()) { + final Function wrapperFunction = new CUDALibraryFunction(factory.getName(), factory.getNFISignature()) { private Function nfiFunction; @@ -153,18 +189,13 @@ public void registerCUBLASFunctions(Namespace namespace) { @TruffleBoundary protected Object call(Object[] arguments) { ensureInitialized(); - - Object[] argsWithHandle = new Object[arguments.length + 1]; - System.arraycopy(arguments, 0, argsWithHandle, 1, arguments.length); - argsWithHandle[0] = cublasHandle; - try { if (nfiFunction == null) { CompilerDirectives.transferToInterpreterAndInvalidate(); nfiFunction = factory.makeFunction(context.getCUDARuntime(), libraryPath, 
DEFAULT_LIBRARY_HINT); } - Object result = INTEROP.execute(nfiFunction, argsWithHandle); - context.getCUDARuntime().cudaDeviceSynchronize(); + Object result = new CUDALibraryExecution(context.getGrCUDAExecutionContext(), nfiFunction, cublasLibrarySetStream, + this.createComputationArgumentWithValueList(arguments, cublasHandle)).schedule(); checkCUBLASReturnCode(result, nfiFunction.getName()); return result; } catch (InteropException e) { @@ -182,7 +213,7 @@ private static void checkCUBLASReturnCode(Object result, String... function) { try { returnCode = InteropLibrary.getFactory().getUncached().asInt(result); } catch (UnsupportedMessageException e) { - throw new GrCUDAInternalException("expected return code as Integer object in " + function + ", got " + result.getClass().getName()); + throw new GrCUDAInternalException("expected return code as Integer object in " + Arrays.toString(function) + ", got " + result.getClass().getName()); } if (returnCode != 0) { throw new GrCUDAException(returnCode, cublasReturnCodeToString(returnCode), function); @@ -218,6 +249,7 @@ private static String cublasReturnCodeToString(int returnCode) { private static final ExternalFunctionFactory CUBLAS_CUBLASCREATE = new ExternalFunctionFactory("cublasCreate", "cublasCreate_v2", "(pointer): sint32"); private static final ExternalFunctionFactory CUBLAS_CUBLASDESTROY = new ExternalFunctionFactory("cublasDestroy", "cublasDestroy_v2", "(sint64): sint32"); + private static final ExternalFunctionFactory CUBLAS_CUBLASSETSTREAM = new ExternalFunctionFactory("cublasSetStream", "cublasSetStream_v2", "(sint64, sint64): sint32"); private static final ArrayList functions = new ArrayList<>(); @@ -231,4 +263,5 @@ private static String cublasReturnCodeToString(int returnCode) { "(sint64, sint32, sint32, sint32, sint32, sint32, pointer, pointer, sint32, pointer, sint32, pointer, pointer, sint32): sint32")); } } + } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cuml/CUMLRegistry.java 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cuml/CUMLRegistry.java similarity index 72% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cuml/CUMLRegistry.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cuml/CUMLRegistry.java index 662d3faa..a73b80be 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cuml/CUMLRegistry.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cuml/CUMLRegistry.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,9 +33,9 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.cuml; +package com.nvidia.grcuda.cudalibraries.cuml; -import static com.nvidia.grcuda.functions.Function.expectInt; +import static com.nvidia.grcuda.functions.Function.expectLong; import java.util.Arrays; import java.util.EnumSet; @@ -37,11 +44,14 @@ import com.nvidia.grcuda.GrCUDAContext; import com.nvidia.grcuda.GrCUDAException; import com.nvidia.grcuda.GrCUDAInternalException; -import com.nvidia.grcuda.GrCUDAOptions; import com.nvidia.grcuda.Namespace; +import com.nvidia.grcuda.cudalibraries.CUDALibraryFunction; +import com.nvidia.grcuda.runtime.stream.LibrarySetStreamCUML; +import com.nvidia.grcuda.runtime.computation.CUDALibraryExecution; import com.nvidia.grcuda.functions.ExternalFunctionFactory; import com.nvidia.grcuda.functions.Function; -import com.nvidia.grcuda.gpu.UnsafeHelper; +import com.nvidia.grcuda.runtime.UnsafeHelper; +import com.nvidia.grcuda.runtime.stream.LibrarySetStream; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.CompilationFinal; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; @@ -56,7 +66,8 @@ public class CUMLRegistry { private static final InteropLibrary INTEROP = InteropLibrary.getFactory().getUncached(); - public static final String DEFAULT_LIBRARY = "libcuml.so"; + public static final String DEFAULT_LIBRARY = (System.getenv("LIBCUML_DIR") != null ? System.getenv("LIBCUML_DIR") : "") + "libcuml.so"; + public static final String DEFAULT_LIBRARY_HINT = " (CuML library location can be set via the --grcuda.CuMLLibrary= option. 
" + "CuML support can be disabled via --grcuda.CuMLEnabled=false."; public static final String NAMESPACE = "ML"; @@ -68,24 +79,31 @@ public class CUMLRegistry { @CompilationFinal private TruffleObject cumlDestroyFunction; + @CompilationFinal private TruffleObject cumlSetStreamFunction; + @CompilationFinal private TruffleObject cumlCreateFunctionNFI; @CompilationFinal private TruffleObject cumlDestroyFunctionNFI; - private Integer cumlHandle = null; + @CompilationFinal private TruffleObject cumlSetStreamFunctionNFI; + + private LibrarySetStream cumlLibrarySetStream; + + private Long cumlHandle = null; public CUMLRegistry(GrCUDAContext context) { this.context = context; - libraryPath = context.getOption(GrCUDAOptions.CuMLLibrary); + libraryPath = context.getOptions().getCuMLLibrary(); } - private void ensureInitialized() { + public void ensureInitialized() { if (cumlHandle == null) { CompilerDirectives.transferToInterpreterAndInvalidate(); // create NFI function objects for handle creation and destruction cumlCreateFunctionNFI = CUMLFunctionNFI.CUML_CUMLCREATE.getFunctionFactory().makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); cumlDestroyFunctionNFI = CUMLFunctionNFI.CUML_CUMLDESTROY.getFunctionFactory().makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); + cumlSetStreamFunctionNFI = CUMLFunctionNFI.CUML_CUMLSETSTREAM.getFunctionFactory().makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); // create wrapper for cumlCreate: cumlError_t cumlCreate(int* handle) -> int // cumlCreate() @@ -94,7 +112,7 @@ private void ensureInitialized() { @TruffleBoundary public Object call(Object[] arguments) throws ArityException { checkArgumentLength(arguments, 0); - try (UnsafeHelper.Integer32Object handle = UnsafeHelper.createInteger32Object()) { + try (UnsafeHelper.Integer64Object handle = UnsafeHelper.createInteger64Object()) { Object result = INTEROP.execute(cumlCreateFunctionNFI, handle.getAddress()); 
checkCUMLReturnCode(result, "cumlCreate"); return handle.getValue(); @@ -105,14 +123,33 @@ public Object call(Object[] arguments) throws ArityException { } }; - // create wrapper for cumlDestroy: cumlError_t cumlDestroy(int handle) -> void + cumlSetStreamFunction = new Function(CUMLFunctionNFI.CUML_CUMLSETSTREAM.getFunctionFactory().getName()) { + @Override + @TruffleBoundary + public Object call(Object[] arguments) throws ArityException { + checkArgumentLength(arguments, 2); + try { + long stream_handle = expectLong(arguments[0]); + long streamID = expectLong(arguments[1]); + Object result = INTEROP.execute(cumlSetStreamFunctionNFI, stream_handle, streamID); + checkCUMLReturnCode(result, "cumlSetStream"); + return result; + } catch (InteropException e) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAInternalException(e); + } + } + }; + + // create wrapper for cumlDestroy: cumlError_t cumlDestroy(int handle) -> void // must + // be long // cumlDestroy(int handle) cumlDestroyFunction = new Function(CUMLFunctionNFI.CUML_CUMLDESTROY.getFunctionFactory().getName()) { @Override @TruffleBoundary public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { checkArgumentLength(arguments, 1); - Object handle = expectInt(arguments[0]); + long handle = expectLong(arguments[0]); try { Object result = INTEROP.execute(cumlDestroyFunctionNFI, handle); checkCUMLReturnCode(result, "cumlDestroy"); @@ -126,13 +163,14 @@ public Object call(Object[] arguments) throws ArityException, UnsupportedTypeExc try { Object result = INTEROP.execute(cumlCreateFunction); - cumlHandle = expectInt(result); + cumlHandle = expectLong(result); context.addDisposable(this::cuMLShutdown); } catch (InteropException e) { CompilerDirectives.transferToInterpreter(); throw new GrCUDAInternalException(e); } } + cumlLibrarySetStream = new LibrarySetStreamCUML((Function) cumlSetStreamFunctionNFI, cumlHandle); } private void cuMLShutdown() { @@ -151,10 +189,10 @@ 
private void cuMLShutdown() { public void registerCUMLFunctions(Namespace namespace) { // Create function wrappers (decorators for all functions except handle con- and // destruction) - List<CUMLFunctionNFI> hiddenFunctions = Arrays.asList(CUMLFunctionNFI.CUML_CUMLCREATE, CUMLFunctionNFI.CUML_CUMLDESTROY); + List<CUMLFunctionNFI> hiddenFunctions = Arrays.asList(CUMLFunctionNFI.CUML_CUMLCREATE, CUMLFunctionNFI.CUML_CUMLDESTROY, CUMLFunctionNFI.CUML_CUMLSETSTREAM); EnumSet.allOf(CUMLFunctionNFI.class).stream().filter(func -> !hiddenFunctions.contains(func)).forEach(func -> { final ExternalFunctionFactory factory = func.getFunctionFactory(); - final Function wrapperFunction = new Function(factory.getName()) { + final Function wrapperFunction = new CUDALibraryFunction(factory.getName(), factory.getNFISignature()) { private Function nfiFunction; @@ -163,18 +201,13 @@ public void registerCUMLFunctions(Namespace namespace) { public Object call(Object[] arguments) { ensureInitialized(); - // Argument 0 is the function name in the frame, removing argument 0 and - // replacing - // it with the handle argument does not change the size of the argument array.
- Object[] argsWithHandle = new Object[arguments.length + 1]; - System.arraycopy(arguments, 0, argsWithHandle, 1, arguments.length); - argsWithHandle[0] = cumlHandle; try { if (nfiFunction == null) { CompilerDirectives.transferToInterpreterAndInvalidate(); nfiFunction = factory.makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); } - Object result = INTEROP.execute(nfiFunction, argsWithHandle); + Object result = new CUDALibraryExecution(context.getGrCUDAExecutionContext(), nfiFunction, cumlLibrarySetStream, + this.createComputationArgumentWithValueList(arguments, cumlHandle)).schedule(); checkCUMLReturnCode(result, nfiFunction.getName()); return result; } catch (InteropException e) { @@ -216,9 +249,10 @@ private static String cumlReturnCodeToString(int returnCode) { public enum CUMLFunctionNFI { CUML_CUMLCREATE(new ExternalFunctionFactory("cumlCreate", "cumlCreate", "(pointer): sint32")), - CUML_CUMLDESTROY(new ExternalFunctionFactory("cumlDestroy", "cumlDestroy", "(sint32): sint32")), - CUML_DBSCANFITDOUBLE(new ExternalFunctionFactory("cumlDpDbscanFit", "cumlDpDbscanFit", "(sint32, pointer, sint32, sint32, double, sint32, pointer, uint64, sint32): sint32")), - CUML_DBSCANFITFLOAT(new ExternalFunctionFactory("cumlSpDbscanFit", "cumlSpDbscanFit", "(sint32, pointer, sint32, sint32, float, sint32, pointer, uint64, sint32): sint32")); + CUML_CUMLDESTROY(new ExternalFunctionFactory("cumlDestroy", "cumlDestroy", "(sint64): sint32")), + CUML_CUMLSETSTREAM(new ExternalFunctionFactory("cumlSetStream", "cumlSetStream", "(sint64, sint64): sint32")), + CUML_DBSCANFITDOUBLE(new ExternalFunctionFactory("cumlDpDbscanFit", "cumlDpDbscanFit", "(sint64, pointer, sint32, sint32, double, sint32, pointer, uint64, sint32): sint32")), + CUML_DBSCANFITFLOAT(new ExternalFunctionFactory("cumlSpDbscanFit", "cumlSpDbscanFit", "(sint64, pointer, sint32, sint32, float, sint32, pointer, uint64, sint32): sint32")); private final ExternalFunctionFactory factory; diff --git 
a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/CUSPARSERegistry.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/CUSPARSERegistry.java new file mode 100644 index 00000000..9ebc2bf5 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/CUSPARSERegistry.java @@ -0,0 +1,348 @@ +/* + * Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.cudalibraries.cusparse; + +import static com.nvidia.grcuda.functions.Function.INTEROP; +import static com.nvidia.grcuda.functions.Function.expectLong; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import com.nvidia.grcuda.GrCUDAContext; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.GrCUDAInternalException; +import com.nvidia.grcuda.Namespace; +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.cudalibraries.CUDALibraryFunction; +import com.nvidia.grcuda.cudalibraries.cusparse.cusparseproxy.CUSPARSEProxy; +import com.nvidia.grcuda.cudalibraries.cusparse.cusparseproxy.CUSPARSEProxyGemvi; +import com.nvidia.grcuda.cudalibraries.cusparse.cusparseproxy.CUSPARSEProxySpMV; +import com.nvidia.grcuda.functions.ExternalFunctionFactory; +import com.nvidia.grcuda.functions.Function; +import com.nvidia.grcuda.runtime.UnsafeHelper; +import com.nvidia.grcuda.runtime.computation.CUDALibraryExecution; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.stream.CUSPARSESetStreamFunction; +import com.nvidia.grcuda.runtime.stream.LibrarySetStream; +import com.oracle.truffle.api.CompilerAsserts; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.CompilerDirectives.CompilationFinal; +import 
com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.InteropException; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +public class CUSPARSERegistry { + public static final String DEFAULT_LIBRARY = (System.getenv("LIBCUSPARSE_DIR") != null ? System.getenv("LIBCUSPARSE_DIR") : "") + "libcusparse.so"; + public static final String DEFAULT_LIBRARY_HINT = " (CuSPARSE library location can be set via the --grcuda.CuSPARSELibrary= option. " + + "CuSPARSE support can be disabled via --grcuda.CuSPARSEEnabled=false."; + public static final String NAMESPACE = "SPARSE"; + + private final GrCUDAContext context; + private final String libraryPath; + + private LibrarySetStream cusparseLibrarySetStream; + + @CompilationFinal private TruffleObject cusparseCreateFunction; + @CompilationFinal private TruffleObject cusparseDestroyFunction; + @CompilationFinal private TruffleObject cusparseSetStreamFunction; + + @CompilationFinal private TruffleObject cusparseCreateFunctionNFI; + @CompilationFinal private TruffleObject cusparseDestroyFunctionNFI; + @CompilationFinal private TruffleObject cusparseSetStreamFunctionNFI; + + + private Long cusparseHandle = null; + + public enum CUSPARSEIndexType { + CUSPARSE_INDEX_UNUSED, + CUSPARSE_INDEX_16U, + CUSPARSE_INDEX_32I, + CUSPARSE_INDEX_64I; + } + + public enum CUSPARSEIndexBase { + CUSPARSE_INDEX_BASE_ZERO, + CUSPARSE_INDEX_BASE_ONE; + } + + public enum CUDADataType { + CUDA_R_32F, // 32 bit real + CUDA_R_64F, // 64 bit real + CUDA_R_16F, // 16 bit real + CUDA_R_8I, // 8 bit real as a signed integer + CUDA_C_32F, // 32 bit complex + CUDA_C_64F, // 64 bit complex + CUDA_C_16F, // 16 bit complex + CUDA_C_8I, // 8 bit complex as a pair of signed integers + 
CUDA_R_8U, // 8 bit real as an unsigned integer + CUDA_C_8U; // 8 bit complex as a pair of unsigned integers + } + + public enum CUSPARSEOperation { + CUSPARSE_OPERATION_NON_TRANSPOSE, + CUSPARSE_OPERATION_TRANSPOSE, + CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE; + } + + public enum CUSPARSESpMVAlg { + CUSPARSE_SPMV_ALG_DEFAULT, + CUSPARSE_SPMV_COO_ALG1, + CUSPARSE_SPMV_COO_ALG2, + CUSPARSE_SPMV_CSR_ALG1, + CUSPARSE_SPMV_CSR_ALG2; + } + + public CUSPARSERegistry(GrCUDAContext context) { + this.context = context; + // the library path is configurable via the corresponding field in GrCUDAOptions + libraryPath = context.getOptions().getCuSPARSELibrary(); + } + + public void ensureInitialized() { + if (cusparseHandle == null) { + + CUSPARSEProxy.setContext(context); + + CompilerDirectives.transferToInterpreterAndInvalidate(); + + // create NFI function objects for the handle management functions + + cusparseCreateFunctionNFI = CUSPARSE_CUSPARSECREATE.makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); + cusparseDestroyFunctionNFI = CUSPARSE_CUSPARSEDESTROY.makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); + cusparseSetStreamFunctionNFI = CUSPARSE_CUSPARSESETSTREAM.makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); + + // cusparseStatus_t cusparseCreate(cusparseHandle_t* handle) + + cusparseCreateFunction = new Function(CUSPARSE_CUSPARSECREATE.getName()) { + @Override + @TruffleBoundary + public Object call(Object[] arguments) throws ArityException { + checkArgumentLength(arguments, 0); + try (UnsafeHelper.Integer64Object handle = UnsafeHelper.createInteger64Object()) { + Object result = INTEROP.execute(cusparseCreateFunctionNFI, handle.getAddress()); + checkCUSPARSEReturnCode(result, "cusparseCreate"); + return handle.getValue(); + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + // cusparseStatus_t cusparseDestroy(cusparseHandle_t handle) + + cusparseDestroyFunction = new Function(CUSPARSE_CUSPARSEDESTROY.getName()) { + 
@Override + @TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 1); + long handle = expectLong(arguments[0]); + try { + Object result = INTEROP.execute(cusparseDestroyFunctionNFI, handle); + checkCUSPARSEReturnCode(result, "cusparseDestroy"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + // cusparseStatus_t cusparseSetStream(cusparseHandle_t handle, cudaStream_t streamId) + + cusparseSetStreamFunction = new Function(CUSPARSE_CUSPARSESETSTREAM.getName()) { + @Override + @TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 2); + long handle = expectLong(arguments[0]); + long streamId = expectLong(arguments[1]); + try { + Object result = INTEROP.execute(cusparseSetStreamFunctionNFI, handle, streamId); + checkCUSPARSEReturnCode(result, "cusparseSetStream"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + try { + Object result = INTEROP.execute(cusparseCreateFunction); + cusparseHandle = expectLong(result); + context.addDisposable(this::cuSPARSEShutdown); + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + + cusparseLibrarySetStream = new CUSPARSESetStreamFunction((Function) cusparseSetStreamFunctionNFI, cusparseHandle); + + } + + private void cuSPARSEShutdown() { + CompilerAsserts.neverPartOfCompilation(); + if (cusparseHandle != null) { + try { + Object result = InteropLibrary.getFactory().getUncached().execute(cusparseDestroyFunction, cusparseHandle); + checkCUSPARSEReturnCode(result, CUSPARSE_CUSPARSEDESTROY.getName()); + cusparseHandle = null; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + } + + public void registerCUSPARSEFunctions(Namespace namespace) { + // Create function wrappers + for (CUSPARSEProxy 
proxy : functions) { + final Function wrapperFunction = new CUDALibraryFunction(proxy.getExternalFunctionFactory().getName(), proxy.getExternalFunctionFactory().getNFISignature()) { + + private Function nfiFunction; + + @Override + public List<ComputationArgumentWithValue> createComputationArgumentWithValueList(Object[] args, Long libraryHandle) { + ArrayList<ComputationArgumentWithValue> argumentsWithValue = new ArrayList<>(); + // Set the library handle; + argumentsWithValue.add(new ComputationArgumentWithValue(this.computationArguments.get(0), libraryHandle)); + // Set the other arguments (size - 1 as we skip the handle, i.e. the argument 0); + for (int i = 0; i < this.computationArguments.size() - 1; i++) { + argumentsWithValue.add(new ComputationArgumentWithValue(this.computationArguments.get(i + 1), args[i])); + } + // Add extra arguments at the end: they are used to track input DeviceArrays; + int numExtraArrays = args.length - (this.computationArguments.size() - 1); + for (int i = 0; i < numExtraArrays; i++) { + argumentsWithValue.add(new ComputationArgumentWithValue( + "cusparse_extra_array_" + i, Type.NFI_POINTER, ComputationArgument.Kind.POINTER_INOUT, + args[this.computationArguments.size() - 1 + i])); + } + return argumentsWithValue; + } + + @Override + @TruffleBoundary + protected Object call(Object[] arguments) { + ensureInitialized(); + + try { + if (nfiFunction == null) { + CompilerDirectives.transferToInterpreterAndInvalidate(); + nfiFunction = proxy.getExternalFunctionFactory().makeFunction(context.getCUDARuntime(), libraryPath, DEFAULT_LIBRARY_HINT); + } + // This list of arguments might have extra arguments: the DeviceArrays that can cause dependencies but are not directly used by the cuSPARSE function, + // as these DeviceArrays might be wrapped using cuSparseMatrices/Vectors/Buffers. 
+ // We still need to pass these DeviceArrays to the GrCUDAComputationalElement so we track dependencies correctly, + // but they are removed from the final list of arguments passed to the cuSPARSE library; + Object[] formattedArguments = proxy.formatArguments(arguments, cusparseHandle); + List<ComputationArgumentWithValue> computationArgumentsWithValue = this.createComputationArgumentWithValueList(formattedArguments, cusparseHandle); + int extraArraysToTrack = computationArgumentsWithValue.size() - this.computationArguments.size(); // Both lists also contain the handle; + Object result = new CUDALibraryExecution(context.getGrCUDAExecutionContext(), nfiFunction, cusparseLibrarySetStream, + computationArgumentsWithValue, extraArraysToTrack).schedule(); + + checkCUSPARSEReturnCode(result, nfiFunction.getName()); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + namespace.addFunction(wrapperFunction); + } + } + + public static void checkCUSPARSEReturnCode(Object result, String... 
function) { + CompilerAsserts.neverPartOfCompilation(); + int returnCode; + try { + returnCode = InteropLibrary.getFactory().getUncached().asInt(result); + } catch (UnsupportedMessageException e) { + throw new GrCUDAInternalException("expected return code as Integer object in " + Arrays.toString(function) + ", got " + result.getClass().getName()); + } + if (returnCode != 0) { + throw new GrCUDAException(returnCode, cusparseReturnCodeToString(returnCode), function); + } + } + + public static String cusparseReturnCodeToString(int returnCode) { + // Mapping of the documented cusparseStatus_t values; + switch (returnCode) { + case 0: + return "CUSPARSE_STATUS_SUCCESS"; + case 1: + return "CUSPARSE_STATUS_NOT_INITIALIZED"; + case 2: + return "CUSPARSE_STATUS_ALLOC_FAILED"; + case 3: + return "CUSPARSE_STATUS_INVALID_VALUE"; + case 4: + return "CUSPARSE_STATUS_ARCH_MISMATCH"; + case 5: + return "CUSPARSE_STATUS_MAPPING_ERROR"; + case 6: + return "CUSPARSE_STATUS_EXECUTION_FAILED"; + case 7: + return "CUSPARSE_STATUS_INTERNAL_ERROR"; + case 8: + return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED"; + case 9: + return "CUSPARSE_STATUS_ZERO_PIVOT"; + case 10: + return "CUSPARSE_STATUS_NOT_SUPPORTED"; + case 11: + return "CUSPARSE_STATUS_INSUFFICIENT_RESOURCES"; + default: + return "unknown error code: " + returnCode; + } + } + + // functions exposed to the user + + private static final ExternalFunctionFactory CUSPARSE_CUSPARSECREATE = new ExternalFunctionFactory("cusparseCreate", "cusparseCreate", "(pointer): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSEDESTROY = new ExternalFunctionFactory("cusparseDestroy", "cusparseDestroy", "(sint64): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSESETSTREAM = new ExternalFunctionFactory("cusparseSetStream", "cusparseSetStream", "(sint64, sint64): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSESPMV = new ExternalFunctionFactory("cusparseSpMV", "cusparseSpMV", "(sint64, sint32, pointer, sint64, " + + "sint64, pointer, sint64, sint32, sint32, pointer): sint32"); + private static final ArrayList<CUSPARSEProxy> functions = new 
ArrayList<>(); + + static { + + for (char type : new char[]{'S', 'D', 'C', 'Z'}) { + // the NFI symbol must be the per-type entry point, e.g. cusparseDgemvi for type 'D'; + final ExternalFunctionFactory CUSPARSE_CUSPARSEGEMVI = new ExternalFunctionFactory("cusparse" + type + "gemvi", "cusparse" + type + "gemvi", "(sint64, sint32, sint32, sint32," + "pointer, pointer, sint32, sint32, pointer, pointer, pointer, pointer, sint32, pointer): sint32"); + functions.add(new CUSPARSEProxyGemvi(CUSPARSE_CUSPARSEGEMVI)); + } + + functions.add(new CUSPARSEProxySpMV(CUSPARSE_CUSPARSESPMV)); + } + +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxy.java new file mode 100644 index 00000000..75cfcc59 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxy.java @@ -0,0 +1,358 @@ +/* + * Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.cudalibraries.cusparse.cusparseproxy; + +import static com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry.checkCUSPARSEReturnCode; +import static com.nvidia.grcuda.functions.Function.INTEROP; +import static com.nvidia.grcuda.functions.Function.checkArgumentLength; + +import com.nvidia.grcuda.GrCUDAContext; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.GrCUDAInternalException; +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.GrCUDAOptions; +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry; +import com.nvidia.grcuda.functions.CUDAFunction; +import com.nvidia.grcuda.functions.ExternalFunctionFactory; +import com.nvidia.grcuda.functions.Function; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.interop.ArityException; +import 
com.oracle.truffle.api.interop.InteropException; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnknownIdentifierException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +public abstract class CUSPARSEProxy { + + @CompilerDirectives.CompilationFinal private TruffleObject cusparseSetStreamFunctionNFI; + @CompilerDirectives.CompilationFinal private TruffleObject cusparseCreateCooFunctionNFI; + @CompilerDirectives.CompilationFinal private TruffleObject cusparseCreateCsrFunctionNFI; + @CompilerDirectives.CompilationFinal private TruffleObject cusparseCreateDnVecFunctionNFI; + @CompilerDirectives.CompilationFinal private TruffleObject cusparseSpMV_bufferSizeFunctionNFI; + @CompilerDirectives.CompilationFinal private TruffleObject cusparseSgemvi_bufferSizeFunctionNFI; + @CompilerDirectives.CompilationFinal private TruffleObject cusparseCgemvi_bufferSizeFunctionNFI; + @CompilerDirectives.CompilationFinal private TruffleObject cusparseDgemvi_bufferSizeFunctionNFI; + @CompilerDirectives.CompilationFinal private TruffleObject cusparseZgemvi_bufferSizeFunctionNFI; + @CompilerDirectives.CompilationFinal protected TruffleObject cusparseCreateCooFunction; + @CompilerDirectives.CompilationFinal protected TruffleObject cusparseCreateCsrFunction; + @CompilerDirectives.CompilationFinal protected TruffleObject cusparseCreateDnVecFunction; + @CompilerDirectives.CompilationFinal protected TruffleObject cusparseSpMV_bufferSizeFunction; + @CompilerDirectives.CompilationFinal protected TruffleObject cusparseSgemvi_bufferSizeFunction; + @CompilerDirectives.CompilationFinal protected TruffleObject cusparseCgemvi_bufferSizeFunction; + @CompilerDirectives.CompilationFinal protected TruffleObject cusparseDgemvi_bufferSizeFunction; + @CompilerDirectives.CompilationFinal protected TruffleObject cusparseZgemvi_bufferSizeFunction; + + private final 
ExternalFunctionFactory externalFunctionFactory; + protected Object[] args; + private static GrCUDAContext context = null; + + public CUSPARSEProxy(ExternalFunctionFactory externalFunctionFactory) { + this.externalFunctionFactory = externalFunctionFactory; + } + + // the proxies need a reference to the GrCUDA context to resolve the NFI functions in initializeNfi() + public static void setContext(GrCUDAContext context) { + CUSPARSEProxy.context = context; + } + + protected void initializeNfi() { + + assert (context != null); + + String libraryPath = context.getOptions().getCuSPARSELibrary(); + + cusparseSetStreamFunctionNFI = CUSPARSE_CUSPARSESETSTREAM.makeFunction(context.getCUDARuntime(), libraryPath, CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + cusparseCreateCooFunctionNFI = CUSPARSE_CUSPARSECREATECOO.makeFunction(context.getCUDARuntime(), libraryPath, CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + cusparseCreateCsrFunctionNFI = CUSPARSE_CUSPARSECREATECSR.makeFunction(context.getCUDARuntime(), libraryPath, CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + cusparseCreateDnVecFunctionNFI = CUSPARSE_CUSPARSECREATEDNVEC.makeFunction(context.getCUDARuntime(), libraryPath, CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + cusparseSpMV_bufferSizeFunctionNFI = CUSPARSE_CUSPARSESPMV_BUFFERSIZE.makeFunction(context.getCUDARuntime(), libraryPath, CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + cusparseSgemvi_bufferSizeFunctionNFI = CUSPARSE_CUSPARSESGEMVI_BUFFERSIZE.makeFunction(context.getCUDARuntime(), libraryPath, CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + cusparseCgemvi_bufferSizeFunctionNFI = CUSPARSE_CUSPARSECGEMVI_BUFFERSIZE.makeFunction(context.getCUDARuntime(), libraryPath, CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + cusparseDgemvi_bufferSizeFunctionNFI = CUSPARSE_CUSPARSEDGEMVI_BUFFERSIZE.makeFunction(context.getCUDARuntime(), libraryPath, CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + cusparseZgemvi_bufferSizeFunctionNFI = CUSPARSE_CUSPARSEZGEMVI_BUFFERSIZE.makeFunction(context.getCUDARuntime(), libraryPath, 
CUSPARSERegistry.DEFAULT_LIBRARY_HINT); + + // cusparseStatus_t cusparseCreateCoo(cusparseSpMatDescr_t* spMatDescr, + // int64_t rows, + // int64_t cols, + // int64_t nnz, + // void* cooRowInd, + // void* cooColInd, + // void* cooValues, + // cusparseIndexType_t cooIdxType, + // cusparseIndexBase_t idxBase, + // cudaDataType valueType) + + cusparseCreateCooFunction = new Function(CUSPARSE_CUSPARSECREATECOO.getName()) { + @Override + @CompilerDirectives.TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 10); + Long cusparseSpMatDescr = expectLong(arguments[0]); + long rows = expectLong(arguments[1]); + long cols = expectLong(arguments[2]); + long nnz = expectLong(arguments[3]); + DeviceArray cooRowIdx = (DeviceArray) arguments[4]; + DeviceArray cooColIdx = (DeviceArray) arguments[5]; + DeviceArray cooValues = (DeviceArray) arguments[6]; + CUSPARSERegistry.CUSPARSEIndexType cooIdxType = CUSPARSERegistry.CUSPARSEIndexType.values()[expectInt(arguments[7])]; + CUSPARSERegistry.CUSPARSEIndexBase cooIdxBase = CUSPARSERegistry.CUSPARSEIndexBase.values()[expectInt(arguments[8])]; + CUSPARSERegistry.CUDADataType valueType = CUSPARSERegistry.CUDADataType.values()[expectInt(arguments[9])]; + try { + Object result = INTEROP.execute(cusparseCreateCooFunctionNFI, cusparseSpMatDescr, rows, cols, nnz, cooRowIdx, cooColIdx, cooValues, + cooIdxType.ordinal(), cooIdxBase.ordinal(), valueType.ordinal()); + checkCUSPARSEReturnCode(result, "cusparseCreateCoo"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + // cusparseStatus_t cusparseCreateCsr(cusparseSpMatDescr_t* spMatDescr, + // int64_t rows, + // int64_t cols, + // int64_t nnz, + // void* csrRowOffsets, + // void* csrColInd, + // void* csrValues, + // cusparseIndexType_t csrRowOffsetsType, + // cusparseIndexType_t csrColIndType, + // cusparseIndexBase_t idxBase, + // cudaDataType 
valueType) + cusparseCreateCsrFunction = new Function(CUSPARSE_CUSPARSECREATECSR.getName()) { + @Override + @CompilerDirectives.TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 11); + Long cusparseSpMatDescr = expectLong(arguments[0]); + long rows = expectLong(arguments[1]); + long cols = expectLong(arguments[2]); + long nnz = expectLong(arguments[3]); + DeviceArray csrRowOffsets = (DeviceArray) arguments[4]; + DeviceArray csrColIdx = (DeviceArray) arguments[5]; + DeviceArray csrValues = (DeviceArray) arguments[6]; + CUSPARSERegistry.CUSPARSEIndexType csrRowOffsetsType = CUSPARSERegistry.CUSPARSEIndexType.values()[expectInt(arguments[7])]; + CUSPARSERegistry.CUSPARSEIndexType csrColIdxType = CUSPARSERegistry.CUSPARSEIndexType.values()[expectInt(arguments[8])]; + CUSPARSERegistry.CUSPARSEIndexBase csrIdxBase = CUSPARSERegistry.CUSPARSEIndexBase.values()[expectInt(arguments[9])]; + CUSPARSERegistry.CUDADataType valueType = CUSPARSERegistry.CUDADataType.values()[expectInt(arguments[10])]; + try { + Object result = INTEROP.execute(cusparseCreateCsrFunctionNFI, cusparseSpMatDescr, rows, cols, nnz, csrRowOffsets, csrColIdx, csrValues, + csrRowOffsetsType.ordinal(), csrColIdxType.ordinal(), csrIdxBase.ordinal(), valueType.ordinal()); + checkCUSPARSEReturnCode(result, "cusparseCreateCsr"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + // cusparseStatus_t cusparseCreateDnVec(cusparseDnVecDescr_t* dnVecDescr, + // int64_t size, + // void* values, + // cudaDataType valueType) + cusparseCreateDnVecFunction = new Function(CUSPARSE_CUSPARSECREATEDNVEC.getName()) { + @Override + @CompilerDirectives.TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 4); + Long cusparseDnVecDescr = expectLong(arguments[0]); + long size = 
expectLong(arguments[1]); + DeviceArray values = (DeviceArray) arguments[2]; + CUSPARSERegistry.CUDADataType valueType = CUSPARSERegistry.CUDADataType.values()[expectInt(arguments[3])]; + try { + Object result = INTEROP.execute(cusparseCreateDnVecFunctionNFI, cusparseDnVecDescr, size, values, valueType.ordinal()); + checkCUSPARSEReturnCode(result, "cusparseCreateDnVec"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + + }; + + // cusparseStatus_t cusparseSpMV_bufferSize(cusparseHandle_t handle, + // cusparseOperation_t opA, + // const void* alpha, + // cusparseSpMatDescr_t matA, + // cusparseDnVecDescr_t vecX, + // const void* beta, + // cusparseDnVecDescr_t vecY, + // cudaDataType computeType, + // cusparseSpMVAlg_t alg, + // size_t* bufferSize) + cusparseSpMV_bufferSizeFunction = new Function(CUSPARSE_CUSPARSESPMV_BUFFERSIZE.getName()) { + @Override + @CompilerDirectives.TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 10); + long handle = expectLong(arguments[0]); + CUSPARSERegistry.CUSPARSEOperation opA = CUSPARSERegistry.CUSPARSEOperation.values()[expectInt(arguments[1])]; + DeviceArray alpha = (DeviceArray) arguments[2]; + long cusparseSpMatDesc = expectLong(arguments[3]); + long vecX = expectLong(arguments[4]); + DeviceArray beta = (DeviceArray) arguments[5]; + long vecY = expectLong(arguments[6]); + CUSPARSERegistry.CUDADataType computeType = CUSPARSERegistry.CUDADataType.values()[expectInt(arguments[7])]; + CUSPARSERegistry.CUSPARSESpMVAlg alg = CUSPARSERegistry.CUSPARSESpMVAlg.values()[expectInt(arguments[8])]; + long bufferSize = expectLong(arguments[9]); + try { + Object result = INTEROP.execute(cusparseSpMV_bufferSizeFunctionNFI, handle, opA.ordinal(), alpha, cusparseSpMatDesc, vecX, beta, vecY, computeType.ordinal(), alg.ordinal(), + bufferSize); + checkCUSPARSEReturnCode(result, "cusparseSpMV_bufferSize"); + 
return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + // cusparseStatus_t cusparseSgemvi_bufferSize(cusparseHandle_t handle, + // cusparseOperation_t transA, + // int m, + // int n, + // int nnz, + // int* pBufferSize) + cusparseSgemvi_bufferSizeFunction = new Function(CUSPARSE_CUSPARSESGEMVI_BUFFERSIZE.getName()) { + @Override + @CompilerDirectives.TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 6); + long handle = expectLong(arguments[0]); + CUSPARSERegistry.CUSPARSEOperation transA = CUSPARSERegistry.CUSPARSEOperation.values()[expectInt(arguments[1])]; + int rows = expectInt(arguments[2]); + int cols = expectInt(arguments[3]); + int nnz = expectInt(arguments[4]); + long pBufferSize = expectLong(arguments[5]); + try { + Object result = INTEROP.execute(cusparseSgemvi_bufferSizeFunctionNFI, handle, transA.ordinal(), rows, cols, nnz, pBufferSize); + checkCUSPARSEReturnCode(result, "cusparseSgemvi_bufferSize"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + cusparseCgemvi_bufferSizeFunction = new Function(CUSPARSE_CUSPARSECGEMVI_BUFFERSIZE.getName()) { + @Override + @CompilerDirectives.TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 6); + long handle = expectLong(arguments[0]); + CUSPARSERegistry.CUSPARSEOperation transA = CUSPARSERegistry.CUSPARSEOperation.values()[expectInt(arguments[1])]; + int rows = expectInt(arguments[2]); + int cols = expectInt(arguments[3]); + int nnz = expectInt(arguments[4]); + long pBufferSize = expectLong(arguments[5]); + try { + Object result = INTEROP.execute(cusparseCgemvi_bufferSizeFunctionNFI, handle, transA.ordinal(), rows, cols, nnz, pBufferSize); + checkCUSPARSEReturnCode(result, "cusparseCgemvi_bufferSize"); + return result; + } catch 
(InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + cusparseDgemvi_bufferSizeFunction = new Function(CUSPARSE_CUSPARSEDGEMVI_BUFFERSIZE.getName()) { + @Override + @CompilerDirectives.TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 6); + long handle = expectLong(arguments[0]); + CUSPARSERegistry.CUSPARSEOperation transA = CUSPARSERegistry.CUSPARSEOperation.values()[expectInt(arguments[1])]; + int rows = expectInt(arguments[2]); + int cols = expectInt(arguments[3]); + int nnz = expectInt(arguments[4]); + long pBufferSize = expectLong(arguments[5]); + try { + Object result = INTEROP.execute(cusparseDgemvi_bufferSizeFunctionNFI, handle, transA.ordinal(), rows, cols, nnz, pBufferSize); + checkCUSPARSEReturnCode(result, "cusparseDgemvi_bufferSize"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + + cusparseZgemvi_bufferSizeFunction = new Function(CUSPARSE_CUSPARSEZGEMVI_BUFFERSIZE.getName()) { + @Override + @CompilerDirectives.TruffleBoundary + public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { + checkArgumentLength(arguments, 6); + long handle = expectLong(arguments[0]); + CUSPARSERegistry.CUSPARSEOperation transA = CUSPARSERegistry.CUSPARSEOperation.values()[expectInt(arguments[1])]; + int rows = expectInt(arguments[2]); + int cols = expectInt(arguments[3]); + int nnz = expectInt(arguments[4]); + long pBufferSize = expectLong(arguments[5]); + try { + Object result = INTEROP.execute(cusparseZgemvi_bufferSizeFunctionNFI, handle, transA.ordinal(), rows, cols, nnz, pBufferSize); + checkCUSPARSEReturnCode(result, "cusparseZgemvi_bufferSize"); + return result; + } catch (InteropException e) { + throw new GrCUDAInternalException(e); + } + } + }; + } + + public ExternalFunctionFactory getExternalFunctionFactory() { + return externalFunctionFactory; + } + + public 
abstract Object[] formatArguments(Object[] rawArgs, long handle) throws UnsupportedTypeException, UnsupportedMessageException, ArityException; + + private static final ExternalFunctionFactory CUSPARSE_CUSPARSESETSTREAM = new ExternalFunctionFactory("cusparseSetStream", "cusparseSetStream", "(sint64, sint64): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSECREATECOO = new ExternalFunctionFactory("cusparseCreateCoo", "cusparseCreateCoo", "(pointer, sint64, " + + "sint64, sint64, pointer, pointer, pointer, sint32, sint32, sint32): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSECREATECSR = new ExternalFunctionFactory("cusparseCreateCsr", "cusparseCreateCsr", "(pointer, sint64, sint64, sint64," + + "pointer, pointer, pointer, sint32, sint32, sint32, sint32): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSECREATEDNVEC = new ExternalFunctionFactory("cusparseCreateDnVec", "cusparseCreateDnVec", "(pointer, sint64, pointer, " + + "sint32): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSESPMV_BUFFERSIZE = new ExternalFunctionFactory("cusparseSpMV_bufferSize", "cusparseSpMV_bufferSize", "(sint64, sint32," + + "pointer, sint64, sint64, pointer, sint64, sint32, sint32, pointer): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSESGEMVI_BUFFERSIZE = new ExternalFunctionFactory("cusparseSgemvi_bufferSize", "cusparseSgemvi_bufferSize", "(sint64, sint32, " + + "sint64, sint64, sint64, pointer): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSECGEMVI_BUFFERSIZE = new ExternalFunctionFactory("cusparseCgemvi_bufferSize", "cusparseCgemvi_bufferSize", "(sint64, sint32, " + + "sint64, sint64, sint64, pointer): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSEDGEMVI_BUFFERSIZE = new ExternalFunctionFactory("cusparseDgemvi_bufferSize", "cusparseDgemvi_bufferSize", "(sint64, sint32, " + + "sint64, sint64, sint64, 
pointer): sint32"); + private static final ExternalFunctionFactory CUSPARSE_CUSPARSEZGEMVI_BUFFERSIZE = new ExternalFunctionFactory("cusparseZgemvi_bufferSize", "cusparseZgemvi_bufferSize", "(sint64, sint32, " + + "sint64, sint64, sint64, pointer): sint32"); +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxyGemvi.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxyGemvi.java new file mode 100644 index 00000000..788a2dfb --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxyGemvi.java @@ -0,0 +1,138 @@ +/* + * Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.cudalibraries.cusparse.cusparseproxy; + +import static com.nvidia.grcuda.functions.Function.INTEROP; +import static com.nvidia.grcuda.functions.Function.expectInt; + +import com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry; +import com.nvidia.grcuda.functions.ExternalFunctionFactory; +import com.nvidia.grcuda.runtime.UnsafeHelper; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +public class CUSPARSEProxyGemvi extends CUSPARSEProxy { + + private final int N_ARGS_RAW = 13; // args for library function + + public CUSPARSEProxyGemvi(ExternalFunctionFactory externalFunctionFactory) { + super(externalFunctionFactory); + } + + @Override + public Object[] formatArguments(Object[] rawArgs, long handle) throws UnsupportedTypeException, UnsupportedMessageException, ArityException { + this.initializeNfi(); + if (rawArgs.length == 0) { + return rawArgs; + } else { + + args = new Object[N_ARGS_RAW]; + + UnsafeHelper.Integer64Object bufferSize = 
UnsafeHelper.createInteger64Object(); + + bufferSize.setValue(0); + + CUSPARSERegistry.CUSPARSEOperation transA = CUSPARSERegistry.CUSPARSEOperation.values()[expectInt(rawArgs[0])]; + int rows = expectInt(rawArgs[1]); + int cols = expectInt(rawArgs[2]); + DeviceArray alpha = (DeviceArray) rawArgs[3]; + DeviceArray matA = (DeviceArray) rawArgs[4]; + int lda = expectInt(rawArgs[5]); + int nnz = expectInt(rawArgs[6]); + DeviceArray x = (DeviceArray) rawArgs[7]; + DeviceArray xInd = (DeviceArray) rawArgs[8]; + DeviceArray beta = (DeviceArray) rawArgs[9]; + DeviceArray outVec = (DeviceArray) rawArgs[10]; + CUSPARSERegistry.CUSPARSEIndexBase idxBase = CUSPARSERegistry.CUSPARSEIndexBase.values()[expectInt(rawArgs[11])]; + char type = (char) rawArgs[12]; + + // create buffer + + switch (type) { + case 'S': { + Object resultBufferSize = INTEROP.execute(cusparseSgemvi_bufferSizeFunction, handle, transA.ordinal(), rows, cols, nnz, bufferSize.getAddress()); + CUSPARSERegistry.checkCUSPARSEReturnCode(resultBufferSize, cusparseSgemvi_bufferSizeFunction.toString()); + break; + } + case 'D': { + Object resultBufferSize = INTEROP.execute(cusparseDgemvi_bufferSizeFunction, handle, transA.ordinal(), rows, cols, nnz, bufferSize.getAddress()); + CUSPARSERegistry.checkCUSPARSEReturnCode(resultBufferSize, cusparseDgemvi_bufferSizeFunction.toString()); + break; + } + case 'C': { + Object resultBufferSize = INTEROP.execute(cusparseCgemvi_bufferSizeFunction, handle, transA.ordinal(), rows, cols, nnz, bufferSize.getAddress()); + CUSPARSERegistry.checkCUSPARSEReturnCode(resultBufferSize, cusparseCgemvi_bufferSizeFunction.toString()); + break; + } + case 'Z': { + Object resultBufferSize = INTEROP.execute(cusparseZgemvi_bufferSizeFunction, handle, transA.ordinal(), rows, cols, nnz, bufferSize.getAddress()); + CUSPARSERegistry.checkCUSPARSEReturnCode(resultBufferSize, cusparseZgemvi_bufferSizeFunction.toString()); + break; + } + } + + + long numElements; + + if (bufferSize.getValue() == 0) { + 
numElements = 1; + } else { + numElements = (long) bufferSize.getValue(); + } + + DeviceArray buffer = new DeviceArray(alpha.getGrCUDAExecutionContext(), numElements, alpha.getElementType()); + + // FIXME: getting the runtime from an argument is not very clean, the proxy should maybe hold a direct reference to the runtime; + alpha.getGrCUDAExecutionContext().getCudaRuntime().cudaDeviceSynchronize(); + + args[0] = transA.ordinal(); + args[1] = rows; + args[2] = cols; + args[3] = alpha; + args[4] = matA; + args[5] = lda; + args[6] = nnz; + args[7] = x; + args[8] = xInd; + args[9] = beta; + args[10] = outVec; + args[11] = idxBase.ordinal(); + args[12] = buffer; + + return args; + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxySpMV.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxySpMV.java new file mode 100644 index 00000000..1cff430e --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/cusparse/cusparseproxy/CUSPARSEProxySpMV.java @@ -0,0 +1,162 @@ +/* + * Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.cudalibraries.cusparse.cusparseproxy; + +import static com.nvidia.grcuda.functions.Function.INTEROP; +import static com.nvidia.grcuda.functions.Function.expectInt; +import static com.nvidia.grcuda.functions.Function.expectLong; + +import static com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry.CUSPARSEIndexType; +import static com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry.CUSPARSEIndexBase; +import static com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry.CUDADataType; +import static com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry.CUSPARSESpMVAlg; + +import com.nvidia.grcuda.cudalibraries.cusparse.CUSPARSERegistry; +import com.nvidia.grcuda.functions.ExternalFunctionFactory; +import com.nvidia.grcuda.runtime.UnsafeHelper; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +import com.sun.jdi.Value; + +public class CUSPARSEProxySpMV extends CUSPARSEProxy { + + public enum CUSPARSESpMVMatrixType { + SPMV_MATRIX_TYPE_COO, + SPMV_MATRIX_TYPE_CSR + } + + // Number of arguments expected to directly call the original SpMV function in cuSPARSE; + private final int NUM_SPMV_ARGS_READ = 9; + // Number of arguments expected to call SpMV by automatically wrapping input arrays; + private final int NUM_SPMV_ARGS_WRAPPED = 14; + + public CUSPARSEProxySpMV(ExternalFunctionFactory externalFunctionFactory) { + super(externalFunctionFactory); + } + + @Override + public Object[] formatArguments(Object[] rawArgs, long handle) throws UnsupportedTypeException, UnsupportedMessageException, ArityException { + this.initializeNfi(); + if (rawArgs.length == NUM_SPMV_ARGS_READ) { + return rawArgs; + } else { + args = new Object[NUM_SPMV_ARGS_WRAPPED]; + + // v1 and v2 can be X, Y, rowPtr + DeviceArray v1 = (DeviceArray) rawArgs[5]; + 
DeviceArray v2 = (DeviceArray) rawArgs[6]; + DeviceArray values = (DeviceArray) rawArgs[7]; + + // Last argument is the matrix type + CUSPARSESpMVMatrixType matrixType = CUSPARSESpMVMatrixType.values()[expectInt(rawArgs[rawArgs.length - 1])]; + + UnsafeHelper.Integer64Object dnVecXDescr = UnsafeHelper.createInteger64Object(); + UnsafeHelper.Integer64Object dnVecYDescr = UnsafeHelper.createInteger64Object(); + UnsafeHelper.Integer64Object matDescr = UnsafeHelper.createInteger64Object(); + UnsafeHelper.Integer64Object bufferSize = UnsafeHelper.createInteger64Object(); + + CUSPARSERegistry.CUSPARSEOperation opA = CUSPARSERegistry.CUSPARSEOperation.values()[expectInt(rawArgs[0])]; + DeviceArray alpha = (DeviceArray) rawArgs[1]; + long rows = expectLong(rawArgs[2]); + long cols = expectLong(rawArgs[3]); + long nnz = expectLong(rawArgs[4]); + CUSPARSEIndexType idxType = CUSPARSEIndexType.values()[expectInt(rawArgs[8])]; + CUSPARSEIndexBase idxBase = CUSPARSEIndexBase.values()[expectInt(rawArgs[9])]; + CUDADataType valueType = CUDADataType.values()[expectInt(rawArgs[10])]; + DeviceArray valuesX = (DeviceArray) rawArgs[11]; + CUDADataType valueTypeVec = CUDADataType.values()[expectInt(rawArgs[12])]; + DeviceArray beta = (DeviceArray) rawArgs[13]; + DeviceArray valuesY = (DeviceArray) rawArgs[14]; + CUSPARSESpMVAlg alg = CUSPARSESpMVAlg.values()[expectInt(rawArgs[15])]; + + switch (matrixType){ + case SPMV_MATRIX_TYPE_COO: { + Object resultCoo = INTEROP.execute(cusparseCreateCooFunction, matDescr.getAddress(), rows, cols, nnz, v1, v2, values, idxType.ordinal(), idxBase.ordinal(), + valueType.ordinal()); + break; + } + case SPMV_MATRIX_TYPE_CSR: { + Object resultCsr = INTEROP.execute(cusparseCreateCsrFunction, matDescr.getAddress(), rows, cols, nnz, v1, v2, values, idxType.ordinal(), + idxType.ordinal(), idxBase.ordinal(), valueType.ordinal()); + break; + } + + } + + // create dense vectors X and Y descriptors + Object resultX = INTEROP.execute(cusparseCreateDnVecFunction, 
dnVecXDescr.getAddress(), cols, valuesX, valueTypeVec.ordinal()); + Object resultY = INTEROP.execute(cusparseCreateDnVecFunction, dnVecYDescr.getAddress(), cols, valuesY, valueTypeVec.ordinal()); + + // create buffer + Object resultBufferSize = INTEROP.execute(cusparseSpMV_bufferSizeFunction, handle, opA.ordinal(), alpha, matDescr.getValue(), dnVecXDescr.getValue(), beta, + dnVecYDescr.getValue(), valueType.ordinal(), alg.ordinal(), bufferSize.getAddress()); + + long numElements; + + if (bufferSize.getValue() == 0) { + numElements = 1; + } else { + numElements = bufferSize.getValue() / 4; + } + + DeviceArray buffer = new DeviceArray(alpha.getGrCUDAExecutionContext(), numElements, alpha.getElementType()); + + // FIXME: getting the runtime from an argument is not very clean, the proxy should maybe hold a direct reference to the runtime; + alpha.getGrCUDAExecutionContext().getCudaRuntime().cudaDeviceSynchronize(); + + // format new arguments + args[0] = opA.ordinal(); + args[1] = alpha; + args[2] = matDescr.getValue(); + args[3] = dnVecXDescr.getValue(); + args[4] = beta; + args[5] = dnVecYDescr.getValue(); + args[6] = valueType.ordinal(); + args[7] = alg.ordinal(); + args[8] = buffer; + + // Extra arguments, required to track dependencies on the original input arrays; + args[9] = v1; + args[10] = v2; + args[11] = values; + args[12] = valuesX; + args[13] = valuesY; + + return args; + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/tensorrt/TensorRTRegistry.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/tensorrt/TensorRTRegistry.java similarity index 94% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/tensorrt/TensorRTRegistry.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/tensorrt/TensorRTRegistry.java index e7de274d..2413bff9 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/tensorrt/TensorRTRegistry.java +++ 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/cudalibraries/tensorrt/TensorRTRegistry.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -25,20 +32,20 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.tensorrt; +package com.nvidia.grcuda.cudalibraries.tensorrt; import java.util.Arrays; import java.util.EnumSet; import java.util.List; -import com.nvidia.grcuda.DeviceArray; +import com.nvidia.grcuda.runtime.array.DeviceArray; import com.nvidia.grcuda.GPUPointer; import com.nvidia.grcuda.GrCUDAContext; import com.nvidia.grcuda.GrCUDAOptions; import com.nvidia.grcuda.Namespace; import com.nvidia.grcuda.functions.ExternalFunctionFactory; import com.nvidia.grcuda.functions.Function; -import com.nvidia.grcuda.gpu.UnsafeHelper; +import com.nvidia.grcuda.runtime.UnsafeHelper; import com.nvidia.grcuda.GrCUDAException; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; @@ -63,7 +70,7 @@ public class TensorRTRegistry { public TensorRTRegistry(GrCUDAContext context) { this.context = context; - libraryPath = context.getOption(GrCUDAOptions.TensorRTLibrary); + libraryPath = context.getOptions().getTensorRTLibrary(); context.addDisposable(this::shutdown); } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindAllFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindAllFunction.java index f4df3c29..4c71810d 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindAllFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindAllFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -38,8 +45,8 @@ import com.nvidia.grcuda.GrCUDAException; import com.nvidia.grcuda.KernelBinding; import com.nvidia.grcuda.Namespace; -import com.nvidia.grcuda.gpu.CUDARuntime; -import com.nvidia.grcuda.gpu.LazyKernel; +import com.nvidia.grcuda.runtime.LazyKernel; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.nvidia.grcuda.parser.antlr.NIDLParser; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; @@ -47,13 +54,13 @@ public class BindAllFunction extends Function { - @SuppressWarnings("unused") private final CUDARuntime cudaRuntime; + @SuppressWarnings("unused") private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; private final GrCUDAContext context; public BindAllFunction(GrCUDAContext context) { super("bindall"); - this.cudaRuntime = context.getCUDARuntime(); + this.grCUDAExecutionContext = context.getGrCUDAExecutionContext(); this.context = context; } @@ -87,10 +94,10 @@ public Object call(Object[] arguments) throws UnsupportedTypeException, ArityExc throw new 
GrCUDAException("kernel and host function binding specified, can either bind kernel or host function"); } if (kernelBindingPresent) { - bindings.forEach(binding -> namespaceTriple.leafNamespace.addKernel(new LazyKernel((KernelBinding) binding, cudaRuntime))); + bindings.forEach(binding -> namespaceTriple.leafNamespace.addKernel(new LazyKernel((KernelBinding) binding, grCUDAExecutionContext))); } if (functionBindingPresent) { - bindings.forEach(binding -> namespaceTriple.leafNamespace.addFunction(new HostFunction((FunctionBinding) binding, cudaRuntime))); + bindings.forEach(binding -> namespaceTriple.leafNamespace.addFunction(new HostFunction((FunctionBinding) binding, grCUDAExecutionContext.getCudaRuntime()))); } if (namespaceTriple.childNamespace == null) { diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindFunction.java index 9422fb70..1a2232b0 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -30,10 +37,10 @@ import java.util.ArrayList; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; import com.nvidia.grcuda.FunctionBinding; import com.nvidia.grcuda.GrCUDAException; import com.nvidia.grcuda.GrCUDALanguage; -import com.nvidia.grcuda.Parameter; import com.nvidia.grcuda.Type; import com.nvidia.grcuda.TypeException; import com.oracle.truffle.api.CompilerDirectives; @@ -84,7 +91,7 @@ public HostFunction call(Object[] arguments) throws UnsupportedTypeException, Ar FunctionBinding binding = parseSignature(symbolName + signature); binding.setLibraryFileName(libraryFile); - HostFunction hf = new HostFunction(binding, GrCUDALanguage.getCurrentLanguage().getContextReference().get().getCUDARuntime()); + HostFunction hf = new HostFunction(binding, GrCUDALanguage.getCurrentContext().getCUDARuntime()); try { hf.resolveSymbol(); } catch (UnknownIdentifierException e) { @@ -124,7 +131,7 @@ private static FunctionBinding parseSignature(String signature) { String returnTypeString = s.substring(typeColonPos + 1).trim(); try { Type returnType = Type.fromNIDLTypeString(returnTypeString); - ArrayList paramList = Parameter.parseParameterSignature(parenSignature); + ArrayList paramList = ComputationArgument.parseParameterSignature(parenSignature); if (isCxxSymbol) { return FunctionBinding.newCxxBinding(name, paramList, returnType); } else { diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindKernelFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindKernelFunction.java index 7b0e4675..b8d29a4b 100644 --- 
a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindKernelFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BindKernelFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,25 +35,25 @@ */ package com.nvidia.grcuda.functions; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import java.util.ArrayList; import com.nvidia.grcuda.GrCUDAException; import com.nvidia.grcuda.KernelBinding; -import com.nvidia.grcuda.Parameter; import com.nvidia.grcuda.TypeException; -import com.nvidia.grcuda.gpu.CUDARuntime; -import com.nvidia.grcuda.gpu.Kernel; +import com.nvidia.grcuda.runtime.Kernel; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.UnsupportedTypeException; public final class BindKernelFunction extends Function { - private final CUDARuntime cudaRuntime; + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; - public BindKernelFunction(CUDARuntime cudaRuntime) { + public BindKernelFunction(AbstractGrCUDAExecutionContext grCUDAExecutionContext) { super("bindkernel"); - this.cudaRuntime = cudaRuntime; + this.grCUDAExecutionContext = grCUDAExecutionContext; } @Override @@ -85,7 +92,7 @@ public Kernel call(Object[] arguments) throws UnsupportedTypeException, ArityExc String symbolName = expectString(arguments[1], "argument 2 of bindkernel must be string (symbol name)").trim(); String signature = expectString(arguments[2], "argument 3 of bind must be string (signature of kernel)").trim(); try { - ArrayList paramList = Parameter.parseParameterSignature(signature); + ArrayList paramList = ComputationArgument.parseParameterSignature(signature); binding = KernelBinding.newCBinding(symbolName, paramList); } catch (TypeException e) { throw new GrCUDAException("invalid type: " + e.getMessage()); @@ -102,7 +109,7 @@ public Kernel call(Object[] arguments) throws UnsupportedTypeException, 
ArityExc binding = parseSignature(signature); } binding.setLibraryFileName(fileName); - return cudaRuntime.loadKernel(binding); + return grCUDAExecutionContext.loadKernel(binding); } private static KernelBinding parseSignature(String signature) { @@ -131,7 +138,7 @@ private static KernelBinding parseSignature(String signature) { String parenSignature = s.substring(firstLParenPos + 1, lastRParenPos).trim(); try { - ArrayList paramList = Parameter.parseParameterSignature(parenSignature); + ArrayList paramList = ComputationArgument.parseParameterSignature(parenSignature); if (isCxxSymbol) { return KernelBinding.newCxxBinding(name, paramList); } else { diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BuildKernelFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BuildKernelFunction.java index 3a7b24a4..f7f20d80 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BuildKernelFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/BuildKernelFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,18 +35,18 @@ */ package com.nvidia.grcuda.functions; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.nvidia.grcuda.GrCUDAException; -import com.nvidia.grcuda.gpu.CUDARuntime; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.UnsupportedTypeException; public class BuildKernelFunction extends Function { - private final CUDARuntime cudaRuntime; + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; - public BuildKernelFunction(CUDARuntime cudaRuntime) { + public BuildKernelFunction(AbstractGrCUDAExecutionContext grCUDAExecutionContext) { super("buildkernel"); - this.cudaRuntime = cudaRuntime; + this.grCUDAExecutionContext = grCUDAExecutionContext; } @Override @@ -82,7 +89,6 @@ public Object call(Object[] arguments) throws UnsupportedTypeException, ArityExc // (comma-separated NFI types) kernelName = expectString(arguments[1], "argument 2 of buildkernel must be string (kernel name)").trim(); parameterSignature = expectString(arguments[2], "argument 3 of build must be string (signature of kernel)").trim(); - } else { // parse NIDL kernel signature // @@ -99,7 +105,7 @@ public Object call(Object[] arguments) throws UnsupportedTypeException, ArityExc kernelName = kernelNameSignaturePair.getKernelName(); parameterSignature = kernelNameSignaturePair.getParameterSignature(); } - return this.cudaRuntime.buildKernel(code, kernelName, parameterSignature); + return grCUDAExecutionContext.buildKernel(code, kernelName, parameterSignature); } private static 
KernelNameSignaturePair parseSignature(String signature) { diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/CUDAFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/CUDAFunction.java index eb23a200..01e509a9 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/CUDAFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/CUDAFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -29,7 +36,7 @@ package com.nvidia.grcuda.functions; import com.nvidia.grcuda.GrCUDAException; -import com.nvidia.grcuda.gpu.CUDARuntime; +import com.nvidia.grcuda.runtime.CUDARuntime; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.InteropException; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/DeviceArrayCopyFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/DeviceArrayCopyFunction.java index 3b287aed..08bb86f9 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/DeviceArrayCopyFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/DeviceArrayCopyFunction.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -27,7 +34,13 @@ */ package com.nvidia.grcuda.functions; -import com.nvidia.grcuda.DeviceArray; +import com.nvidia.grcuda.GrCUDALanguage; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArrayView; +import com.nvidia.grcuda.runtime.computation.arraycomputation.ArrayCopyFunctionExecutionDefault; +import com.nvidia.grcuda.runtime.computation.arraycomputation.ArrayCopyFunctionExecutionInitializer; +import com.nvidia.grcuda.runtime.computation.arraycomputation.ArrayCopyFunctionExecutionMemcpy; +import com.nvidia.grcuda.runtime.computation.arraycomputation.DeviceArrayCopyException; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.InteropLibrary; @@ -46,11 +59,11 @@ public enum CopyDirection { TO_POINTER } - private final DeviceArray deviceArray; + private final AbstractArray array; private final CopyDirection direction; - public DeviceArrayCopyFunction(DeviceArray deviceArray, CopyDirection direction) { - this.deviceArray = deviceArray; + public DeviceArrayCopyFunction(AbstractArray array, CopyDirection direction) { + this.array = array; this.direction = direction; } @@ -60,52 +73,94 @@ boolean isExecutable() { return true; } - private static long extractPointer(Object valueObj, String argumentName, InteropLibrary access) throws UnsupportedTypeException { - try { - if (access.isPointer(valueObj)) { - return access.asPointer(valueObj); - } + private static long extractPointer(Object valueObj, InteropLibrary access) throws UnsupportedMessageException { + if (access.isPointer(valueObj)) { + return access.asPointer(valueObj); + } else { return access.asLong(valueObj); - } catch (UnsupportedMessageException e) { - CompilerDirectives.transferToInterpreter(); - throw 
UnsupportedTypeException.create(new Object[]{valueObj}, "integer expected for " + argumentName); } } - private static int extractNumber(Object valueObj, String argumentName, InteropLibrary access) throws UnsupportedTypeException { + private static int extractNumber(Object valueObj, InteropLibrary access) throws UnsupportedTypeException { try { return access.asInt(valueObj); } catch (UnsupportedMessageException e) { CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{valueObj}, "integer expected for " + argumentName); + throw UnsupportedTypeException.create(new Object[]{valueObj}, "integer expected for numElements"); } } @ExportMessage Object execute(Object[] arguments, @CachedLibrary(limit = "3") InteropLibrary pointerAccess, - @CachedLibrary(limit = "3") InteropLibrary numElementsAccess) throws UnsupportedTypeException, ArityException, IndexOutOfBoundsException { + @CachedLibrary(limit = "3") InteropLibrary numElementsAccess) throws UnsupportedTypeException, ArityException, IndexOutOfBoundsException, DeviceArrayCopyException { + // Obtain the number of elements to copy; long numElements; if (arguments.length == 1) { - numElements = deviceArray.getArraySize(); + numElements = array.getArraySize(); } else if (arguments.length == 2) { - numElements = extractNumber(arguments[1], "numElements", numElementsAccess); + numElements = extractNumber(arguments[1], numElementsAccess); } else { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(1, arguments.length); + throw ArityException.create(1, 2, arguments.length); + } + // Obtain what kind of copy (pointer or array) should be executed. + // By default, see if we can use the fast CUDA memcpy: we cannot use it if the source and target arrays + // are stored with incompatible memory layouts. 
+ ArrayCopyFunctionExecutionInitializer dependencyInitializer = new ArrayCopyFunctionExecutionInitializer(array, arguments[0], direction); + if (canUseMemcpy(arguments[0])) { + try { + // Try using the native pointer implementation; + long pointer = extractPointer(arguments[0], pointerAccess); + // Fast memcpy path; + return new ArrayCopyFunctionExecutionMemcpy(array, direction, numElements, pointer, dependencyInitializer).schedule(); + } catch (UnsupportedMessageException e) { + GrCUDALanguage.LOGGER.info("cannot extract a native pointer; falling back to slow copy"); + } } - long pointer = extractPointer(arguments[0], "fromPointer", pointerAccess); - if (direction == CopyDirection.FROM_POINTER) { - deviceArray.copyFrom(pointer, numElements); + // Perform the slow copy if no other option is available; + return slowCopyPath(pointerAccess, arguments[0], numElements, dependencyInitializer); + } + + private Object slowCopyPath(@CachedLibrary(limit = "3") InteropLibrary pointerAccess, Object otherArray, + long numElements, ArrayCopyFunctionExecutionInitializer dependencyInitializer) throws UnsupportedTypeException { + // Slow array copy, suitable for generic arrays or incompatible memory layouts; + if (pointerAccess.hasArrayElements(otherArray)) { + return new ArrayCopyFunctionExecutionDefault(array, direction, numElements, pointerAccess, otherArray, dependencyInitializer).schedule(); + } else { + // The target object is not an array; + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{otherArray}, "array or pointer expected for " + (direction.equals(CopyDirection.FROM_POINTER) ? "fromPointer" : "toPointer")); } - if (direction == CopyDirection.TO_POINTER) { - deviceArray.copyTo(pointer, numElements); + } + + /** + * We can use fast memcpy only if both arrays are stored with the same memory layout.
This also + * holds true if the other array is not an AbstractArray but some other kind of native + * memory; in that case, the user is responsible for providing meaningful data to + * copy. + * + * @param other the other object involved in the copy + * @return whether we can use the fast CUDA memcpy, under the assumption that "other" contains an accessible pointer + */ + public boolean canUseMemcpy(Object other) { + if (other instanceof AbstractArray) { + boolean coherentMemoryLayout = array.isColumnMajorFormat() == ((AbstractArray) other).isColumnMajorFormat(); + if (!coherentMemoryLayout) { + GrCUDALanguage.LOGGER.warning("source and destination arrays must both be row-major or both be column-major; falling back to slow copy"); + return false; + } + if ((array instanceof MultiDimDeviceArrayView && array.isColumnMajorFormat()) || + (other instanceof MultiDimDeviceArrayView && ((MultiDimDeviceArrayView) other).isColumnMajorFormat())) { + GrCUDALanguage.LOGGER.warning("fast copy from/to column-major array views is not supported; falling back to slow copy"); + return false; + } } - return deviceArray; + return true; } @Override public String toString() { - return "DeviceArrayCopyFunction(deviceArray=" + deviceArray + ", direction=" + direction.name() + ")"; + return "DeviceArrayCopyFunction(deviceArray=" + array + ", direction=" + direction.name() + ")"; } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/DeviceArrayFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/DeviceArrayFunction.java index dec75e04..4ece83d0 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/DeviceArrayFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/DeviceArrayFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -31,13 +38,13 @@ import java.util.ArrayList; import java.util.Optional; -import com.nvidia.grcuda.DeviceArray; +import com.nvidia.grcuda.runtime.array.DeviceArray; import com.nvidia.grcuda.Type; import com.nvidia.grcuda.GrCUDAException; -import com.nvidia.grcuda.MultiDimDeviceArray; import com.nvidia.grcuda.TypeException; -import com.nvidia.grcuda.DeviceArray.MemberSet; -import com.nvidia.grcuda.gpu.CUDARuntime; +import com.nvidia.grcuda.MemberSet; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArray; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.dsl.Cached; @@ -59,18 +66,18 @@ public final class DeviceArrayFunction extends Function { private static final MemberSet MEMBERS = new MemberSet(MAP); - private final CUDARuntime runtime; + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; - public 
DeviceArrayFunction(CUDARuntime runtime) { + public DeviceArrayFunction(AbstractGrCUDAExecutionContext grCUDAExecutionContext) { super("DeviceArray"); - this.runtime = runtime; + this.grCUDAExecutionContext = grCUDAExecutionContext; } @Override @TruffleBoundary public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { if (arguments.length < 1) { - throw ArityException.create(1, arguments.length); + throw ArityException.create(1, 2, arguments.length); } String typeName = expectString(arguments[0], "first argument of DeviceArray must be string (type name)"); Type elementType; @@ -80,13 +87,13 @@ public Object call(Object[] arguments) throws ArityException, UnsupportedTypeExc throw new GrCUDAException(e.getMessage()); } if (arguments.length == 1) { - return new TypedDeviceArrayFunction(runtime, elementType); + return new TypedDeviceArrayFunction(grCUDAExecutionContext, elementType); } else { - return createArray(arguments, 1, elementType, runtime); + return createArray(arguments, 1, elementType, grCUDAExecutionContext); } } - static Object createArray(Object[] arguments, int start, Type elementType, CUDARuntime runtime) throws UnsupportedTypeException { + static Object createArray(Object[] arguments, int start, Type elementType, AbstractGrCUDAExecutionContext grCUDAExecutionContext) throws UnsupportedTypeException { ArrayList elementsPerDim = new ArrayList<>(); Optional useColumnMajor = Optional.empty(); for (int i = start; i < arguments.length; ++i) { @@ -113,10 +120,10 @@ static Object createArray(Object[] arguments, int start, Type elementType, CUDAR } } if (elementsPerDim.size() == 1) { - return new DeviceArray(runtime, elementsPerDim.get(0), elementType); + return new DeviceArray(grCUDAExecutionContext, elementsPerDim.get(0), elementType); } long[] dimensions = elementsPerDim.stream().mapToLong(l -> l).toArray(); - return new MultiDimDeviceArray(runtime, elementType, dimensions, useColumnMajor.orElse(false)); + return new 
MultiDimDeviceArray(grCUDAExecutionContext, elementType, dimensions, useColumnMajor.orElse(false)); } @ExportMessage @@ -144,7 +151,7 @@ boolean isMemberExisting(String memberName, Object readMember(String memberName, @Shared("memberName") @Cached("createIdentityProfile()") ValueProfile memberProfile) throws UnknownIdentifierException { if (MAP.equals(memberProfile.profile(memberName))) { - return new MapDeviceArrayFunction(runtime); + return new MapDeviceArrayFunction(grCUDAExecutionContext); } CompilerDirectives.transferToInterpreter(); throw UnknownIdentifierException.create(memberName); diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/ExternalFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/ExternalFunction.java index 282f8be7..2f868ae8 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/ExternalFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/ExternalFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/ExternalFunctionFactory.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/ExternalFunctionFactory.java index d7c9a191..f46ba065 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/ExternalFunctionFactory.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/ExternalFunctionFactory.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -29,7 +36,7 @@ package com.nvidia.grcuda.functions; import com.nvidia.grcuda.GrCUDAException; -import com.nvidia.grcuda.gpu.CUDARuntime; +import com.nvidia.grcuda.runtime.CUDARuntime; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.interop.InteropException; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/Function.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/Function.java index 0bce259c..00c817b4 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/Function.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/Function.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -105,7 +112,14 @@ public static long expectPositiveLong(Object number) throws UnsupportedTypeExcep public static void checkArgumentLength(Object[] arguments, int expected) throws ArityException { if (arguments.length != expected) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(expected, arguments.length); + throw ArityException.create(expected, expected, arguments.length); + } + } + + public static void checkArgumentLength(Object[] arguments, int minExpected, int maxExpected) throws ArityException { + if (arguments.length < minExpected || arguments.length > maxExpected) { + CompilerDirectives.transferToInterpreter(); + throw ArityException.create(minExpected, maxExpected, arguments.length); } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetDeviceFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetDeviceFunction.java index 2d3b7fee..0bcb0749 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetDeviceFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetDeviceFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,19 +35,20 @@ */ package com.nvidia.grcuda.functions; -import com.nvidia.grcuda.gpu.CUDARuntime; -import com.nvidia.grcuda.gpu.Device; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.UnsupportedTypeException; public class GetDeviceFunction extends Function { - private final CUDARuntime runtime; + private final AbstractGrCUDAExecutionContext context; - public GetDeviceFunction(CUDARuntime runtime) { + public GetDeviceFunction(AbstractGrCUDAExecutionContext context) { super("getdevice"); - this.runtime = runtime; + this.context = context; } @Override @@ -48,6 +56,6 @@ public GetDeviceFunction(CUDARuntime runtime) { public Object call(Object[] arguments) throws UnsupportedTypeException, ArityException { checkArgumentLength(arguments, 1); int deviceId = expectPositiveInt(arguments[0]); - return new Device(deviceId, runtime); + return context.getDevice(deviceId); } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetDevicesFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetDevicesFunction.java index e1586673..b28957e2 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetDevicesFunction.java 
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetDevicesFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,25 +35,23 @@ */ package com.nvidia.grcuda.functions; -import com.nvidia.grcuda.gpu.CUDARuntime; -import com.nvidia.grcuda.gpu.DeviceList; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.UnsupportedTypeException; public class GetDevicesFunction extends Function { - private final CUDARuntime runtime; + private final AbstractGrCUDAExecutionContext context; - public GetDevicesFunction(CUDARuntime runtime) { + public GetDevicesFunction(AbstractGrCUDAExecutionContext context) { super("getdevices"); - this.runtime = runtime; + this.context = context; } @Override @TruffleBoundary public Object call(Object[] arguments) throws UnsupportedTypeException, ArityException { checkArgumentLength(arguments, 0); - int numDevices = runtime.cudaGetDeviceCount(); - return new DeviceList(numDevices, runtime); + return context.getDeviceList(); } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetOptionsFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetOptionsFunction.java new file mode 100644 index 00000000..2c269dbd --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/GetOptionsFunction.java @@ -0,0 +1,53 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ + +package com.nvidia.grcuda.functions; + +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +public class GetOptionsFunction extends Function { + private final GrCUDAOptionMap optionMap; + + public GetOptionsFunction(GrCUDAOptionMap optionMap) { + super("getoptions"); + this.optionMap = optionMap; + } + + @Override + @TruffleBoundary + public Object call(Object[] arguments) throws UnsupportedTypeException, ArityException { + checkArgumentLength(arguments, 0); + return optionMap; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/HostFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/HostFunction.java index 1ef6ed59..ac88e877 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/HostFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/HostFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -30,7 +37,7 @@ import com.nvidia.grcuda.FunctionBinding; import com.nvidia.grcuda.GrCUDAException; -import com.nvidia.grcuda.gpu.CUDARuntime; +import com.nvidia.grcuda.runtime.CUDARuntime; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.UnknownIdentifierException; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/MapDeviceArrayFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/MapDeviceArrayFunction.java index 31121366..993fbad4 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/MapDeviceArrayFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/MapDeviceArrayFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -30,7 +37,7 @@ import java.util.concurrent.ConcurrentHashMap; -import com.nvidia.grcuda.DeviceArray; +import com.nvidia.grcuda.runtime.array.DeviceArray; import com.nvidia.grcuda.Type; import com.nvidia.grcuda.GrCUDAContext; import com.nvidia.grcuda.GrCUDAException; @@ -38,7 +45,7 @@ import com.nvidia.grcuda.GrCUDALanguage; import com.nvidia.grcuda.NoneValue; import com.nvidia.grcuda.TypeException; -import com.nvidia.grcuda.gpu.CUDARuntime; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.oracle.truffle.api.CallTarget; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.Truffle; @@ -75,7 +82,7 @@ @GenerateUncached abstract class MapArrayNode extends Node { - abstract Object execute(Object source, Type elementType, CUDARuntime runtime); + abstract Object execute(Object source, Type elementType, AbstractGrCUDAExecutionContext grCUDAExecutionContext); private static final FrameDescriptor DESCRIPTOR = new FrameDescriptor(); private static final FrameSlot SIZE_SLOT = DESCRIPTOR.addFrameSlot("size", FrameSlotKind.Long); @@ -150,7 +157,7 @@ public Object execute(VirtualFrame frame) { frame.setLong(INDEX_SLOT, 0); frame.setObject(SOURCE_SLOT, frame.getArguments()[1]); frame.setObject(RESULT_SLOT, frame.getArguments()[2]); - loop.executeLoop(frame); + loop.execute(frame); return NoneValue.get(); } } @@ -170,7 +177,7 @@ protected CallTarget createUncachedLoop(Object source, GrCUDAContext context) { } @Specialization(limit = "3") - Object doMap(Object source, Type elementType, CUDARuntime runtime, + Object doMap(Object source, Type elementType, 
AbstractGrCUDAExecutionContext grCUDAExecutionContext, @CachedLibrary("source") InteropLibrary interop, @CachedContext(GrCUDALanguage.class) @SuppressWarnings("unused") GrCUDAContext context, @Cached(value = "createLoop(source)", uncached = "createUncachedLoop(source, context)") CallTarget loop) { @@ -191,7 +198,7 @@ Object doMap(Object source, Type elementType, CUDARuntime runtime, CompilerDirectives.transferToInterpreter(); throw new GrCUDAException("cannot read array size"); } - DeviceArray result = new DeviceArray(runtime, size, elementType); + DeviceArray result = new DeviceArray(grCUDAExecutionContext, size, elementType); loop.call(size, source, result); return result; } @@ -205,11 +212,11 @@ Object doMap(Object source, Type elementType, CUDARuntime runtime, @ExportLibrary(InteropLibrary.class) public final class MapDeviceArrayFunction extends Function { - private final CUDARuntime runtime; + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; - public MapDeviceArrayFunction(CUDARuntime runtime) { + public MapDeviceArrayFunction(AbstractGrCUDAExecutionContext grCUDAExecutionContext) { super("MapDeviceArray"); - this.runtime = runtime; + this.grCUDAExecutionContext = grCUDAExecutionContext; } @ExportMessage @@ -220,7 +227,7 @@ public Object execute(Object[] arguments, @Cached MapArrayNode mapNode) throws ArityException, UnsupportedTypeException { if (arguments.length < 1) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(1, arguments.length); + throw ArityException.create(1, 2, arguments.length); } String typeName; try { @@ -236,13 +243,12 @@ public Object execute(Object[] arguments, throw new GrCUDAInternalException(e.getMessage()); } if (arguments.length == 1) { - return new TypedMapDeviceArrayFunction(runtime, elementType); + return new TypedMapDeviceArrayFunction(grCUDAExecutionContext, elementType); + } else if (arguments.length == 2) { + return mapNode.execute(arguments[1], elementType, grCUDAExecutionContext); 
} else { - if (arguments.length != 2) { - CompilerDirectives.transferToInterpreter(); - throw ArityException.create(2, arguments.length); - } - return mapNode.execute(arguments[1], elementType, runtime); + CompilerDirectives.transferToInterpreter(); + throw ArityException.create(1, 2, arguments.length); } } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/TypedDeviceArrayFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/TypedDeviceArrayFunction.java index 04d6d226..10b28ff7 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/TypedDeviceArrayFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/TypedDeviceArrayFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,8 +35,8 @@ */ package com.nvidia.grcuda.functions; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.nvidia.grcuda.Type; -import com.nvidia.grcuda.gpu.CUDARuntime; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.ArityException; @@ -40,12 +47,12 @@ */ public final class TypedDeviceArrayFunction extends Function { - private final CUDARuntime runtime; + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; private final Type elementType; - public TypedDeviceArrayFunction(CUDARuntime runtime, Type elementType) { + public TypedDeviceArrayFunction(AbstractGrCUDAExecutionContext grCUDAExecutionContext, Type elementType) { super("TypedDeviceArray"); - this.runtime = runtime; + this.grCUDAExecutionContext = grCUDAExecutionContext; this.elementType = elementType; } @@ -54,8 +61,10 @@ public TypedDeviceArrayFunction(CUDARuntime runtime, Type elementType) { public Object call(Object[] arguments) throws ArityException, UnsupportedTypeException { if (arguments.length < 1) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(1, arguments.length); + // FIXME: the maximum number of arguments is unbounded (as each argument is a dimension of an N-dimensional tensor).
+ // Truffle currently uses -1 to handle an unbounded number of arguments. + throw ArityException.create(1, -1, arguments.length); } - return DeviceArrayFunction.createArray(arguments, 0, elementType, runtime); + return DeviceArrayFunction.createArray(arguments, 0, elementType, grCUDAExecutionContext); } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/TypedMapDeviceArrayFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/TypedMapDeviceArrayFunction.java index 14f21bed..9b795bbd 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/TypedMapDeviceArrayFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/TypedMapDeviceArrayFunction.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -28,8 +35,8 @@ */ package com.nvidia.grcuda.functions; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.nvidia.grcuda.Type; -import com.nvidia.grcuda.gpu.CUDARuntime; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.dsl.Cached; import com.oracle.truffle.api.interop.ArityException; @@ -43,12 +50,12 @@ @ExportLibrary(InteropLibrary.class) public final class TypedMapDeviceArrayFunction extends Function { - private final CUDARuntime runtime; + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; private final Type elementType; - public TypedMapDeviceArrayFunction(CUDARuntime runtime, Type elementType) { + public TypedMapDeviceArrayFunction(AbstractGrCUDAExecutionContext grCUDAExecutionContext, Type elementType) { super("TypedMapDeviceArray"); - this.runtime = runtime; + this.grCUDAExecutionContext = grCUDAExecutionContext; this.elementType = elementType; } @@ -57,8 +64,8 @@ public Object execute(Object[] arguments, @Cached MapArrayNode mapNode) throws ArityException { if (arguments.length != 1) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(1, arguments.length); + throw ArityException.create(1, 1, arguments.length); } - return mapNode.execute(arguments[0], elementType, runtime); + return mapNode.execute(arguments[0], elementType, grCUDAExecutionContext); } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ArgumentSet.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ArgumentSet.java index 000c2a30..7c3830cf 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ArgumentSet.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ArgumentSet.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, Oracle and/or its affiliates. 
All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -29,7 +36,7 @@ import org.graalvm.collections.EconomicMap; -import com.nvidia.grcuda.DeviceArray.MemberSet; +import com.nvidia.grcuda.MemberSet; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.InteropLibrary; import com.oracle.truffle.api.interop.TruffleObject; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MapArgObject.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MapArgObject.java index eff3d35c..0dc9b6d9 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MapArgObject.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MapArgObject.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -29,7 +36,7 @@ import static com.nvidia.grcuda.functions.map.MapFunction.checkArity; -import com.nvidia.grcuda.DeviceArray.MemberSet; +import com.nvidia.grcuda.MemberSet; import com.oracle.truffle.api.CompilerAsserts; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MapFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MapFunction.java index 7c0252cf..bbaf9bea 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MapFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MapFunction.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -29,7 +36,7 @@ import java.util.Arrays; -import com.nvidia.grcuda.DeviceArray.MemberSet; +import com.nvidia.grcuda.MemberSet; import com.nvidia.grcuda.GrCUDAInternalException; import com.oracle.truffle.api.CompilerAsserts; import com.oracle.truffle.api.CompilerDirectives; @@ -113,7 +120,7 @@ public MappedFunction map(Object function, Object... arguments) throws ArityExce static void checkArity(Object[] arguments, int expectedArity) throws ArityException { if (arguments.length != expectedArity) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(expectedArity, arguments.length); + throw ArityException.create(expectedArity, expectedArity, arguments.length); } } @@ -168,7 +175,9 @@ static MapFunctionBase readMemberArg(MapFunction receiver, String member) { static MapFunctionBase readMemberSize(MapFunction receiver, String member) { return new MapFunctionBase(arguments -> { if (arguments.length == 0) { - throw ArityException.create(1, 0); + // FIXME: the maximum number of arguments is unbounded.
+ // Truffle currently uses -1 to handle an unbounded number of arguments. + throw ArityException.create(1, -1, 0); } try { return receiver.size((MapArgObject) arguments[0], Arrays.copyOfRange(arguments, 1, arguments.length, MapArgObject[].class)); @@ -182,7 +191,9 @@ static MapFunctionBase readMemberSize(MapFunction receiver, String member) { static MapFunctionBase readMemberValue(MapFunction receiver, String member) { return new MapFunctionBase(arguments -> { if (arguments.length < 1) { - throw ArityException.create(1, arguments.length); + // FIXME: the maximum number of arguments is unbounded. + // Truffle currently uses -1 to handle an unbounded number of arguments. + throw ArityException.create(1, -1, arguments.length); } String name = checkString(arguments[0], "name of created value expected"); if (arguments.length == 1) { @@ -218,7 +229,9 @@ Object invokeMember(String member, Object[] arguments, @TruffleBoundary public MappedFunction execute(Object[] arguments) throws ArityException, UnsupportedTypeException { if (arguments.length == 0) { - throw ArityException.create(1, 0); + // FIXME: the maximum number of arguments is unbounded. + // Truffle currently uses -1 to handle an unbounded number of arguments. + throw ArityException.create(1, -1, 0); } Object function = arguments[0]; if (!INTEROP.isExecutable(function)) { diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MappedFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MappedFunction.java index 84ef302b..59ed039d 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MappedFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/MappedFunction.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -33,9 +40,9 @@ import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.CompilationFinal; -import com.oracle.truffle.api.TruffleException; import com.oracle.truffle.api.dsl.Cached; import com.oracle.truffle.api.dsl.Specialization; +import com.oracle.truffle.api.exception.AbstractTruffleException; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.InteropLibrary; import com.oracle.truffle.api.interop.InvalidArrayIndexException; @@ -47,27 +54,16 @@ import com.oracle.truffle.api.library.ExportLibrary; import com.oracle.truffle.api.library.ExportMessage; import com.oracle.truffle.api.nodes.ExplodeLoop; -import com.oracle.truffle.api.nodes.Node; import com.oracle.truffle.api.profiles.ConditionProfile; import com.oracle.truffle.api.profiles.PrimitiveValueProfile; -final class MapException extends RuntimeException implements TruffleException { +final class MapException extends AbstractTruffleException { private static final long serialVersionUID = -1472390370115466332L; MapException(String message) { super(message); } 
- - public Node getLocation() { - return null; - } - - @SuppressWarnings("sync-override") - @Override - public Throwable fillInStackTrace() { - return this; - } } /** @@ -240,7 +236,7 @@ long execute(Object[] arguments, @Cached("createEqualityProfile()") PrimitiveValueProfile lengthProfile) throws ArityException { if (arguments.length != 3) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(3, arguments.length); + throw ArityException.create(3, 3, arguments.length); } long size; try { @@ -359,7 +355,7 @@ Object execute(Object[] arguments, @Cached("createEqualityProfile()") PrimitiveValueProfile lengthProfile) throws ArityException { if (arguments.length != 3) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(3, arguments.length); + throw ArityException.create(3, 3, arguments.length); } int length = lengthProfile.profile(args.length); Object[] mappedArgs = new Object[length]; @@ -447,7 +443,7 @@ Object execute(Object[] arguments, @CachedLibrary(limit = "2") InteropLibrary memberInterop) throws ArityException, UnsupportedTypeException, UnsupportedMessageException { if (arguments.length != 3) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(3, arguments.length); + throw ArityException.create(3, 3, arguments.length); } Object value = parentInterop.execute(parent, arguments); try { @@ -507,7 +503,7 @@ Object execute(Object[] arguments, @CachedLibrary(limit = "2") InteropLibrary elementInterop) throws ArityException, UnsupportedTypeException, UnsupportedMessageException { if (arguments.length != 3) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(3, arguments.length); + throw ArityException.create(3, 3, arguments.length); } Object value = parentInterop.execute(parent, arguments); try { @@ -519,7 +515,6 @@ Object execute(Object[] arguments, } } -@ExportLibrary(InteropLibrary.class) final class MapArgObjectMap extends MapArgObjectBase { final MapArgObjectBase parent; 
final Object function; @@ -568,7 +563,7 @@ Object execute(Object[] arguments, @CachedLibrary("this.function") InteropLibrary mapInterop) throws UnsupportedTypeException, ArityException, UnsupportedMessageException { if (arguments.length != 3) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(3, arguments.length); + throw ArityException.create(3, 3, arguments.length); } Object value = parentInterop.execute(parent, arguments); try { @@ -644,7 +639,7 @@ Object execute(Object[] arguments, @CachedLibrary("this.parent") InteropLibrary parentInterop) throws UnsupportedTypeException, ArityException, UnsupportedMessageException { if (arguments.length != 3) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(3, arguments.length); + throw ArityException.create(3, 3, arguments.length); } return new ShreddedObject(parentInterop.execute(parent, arguments)); } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ShredFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ShredFunction.java index 9eb3d291..8265030f 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ShredFunction.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ShredFunction.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ShreddedObject.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ShreddedObject.java index 3a828aef..1fae527a 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ShreddedObject.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/functions/map/ShreddedObject.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -27,7 +34,7 @@ */ package com.nvidia.grcuda.functions.map; -import com.nvidia.grcuda.DeviceArray.MemberSet; +import com.nvidia.grcuda.MemberSet; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.InteropLibrary; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/CUDARuntime.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/CUDARuntime.java deleted file mode 100644 index 90517a0b..00000000 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/CUDARuntime.java +++ /dev/null @@ -1,980 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. - * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of NVIDIA CORPORATION nor the names of its - * contributors may be used to endorse or promote products derived - * from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY - * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR - * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR - * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, - * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, - * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR - * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY - * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - */ -package com.nvidia.grcuda.gpu; - -import static com.nvidia.grcuda.functions.Function.checkArgumentLength; -import static com.nvidia.grcuda.functions.Function.expectInt; -import static com.nvidia.grcuda.functions.Function.expectLong; -import static com.nvidia.grcuda.functions.Function.expectPositiveLong; - -import java.util.HashMap; -import org.graalvm.collections.Pair; - -import com.nvidia.grcuda.Binding; -import com.nvidia.grcuda.FunctionBinding; -import com.nvidia.grcuda.GPUPointer; -import com.nvidia.grcuda.GrCUDAContext; -import com.nvidia.grcuda.GrCUDAException; -import com.nvidia.grcuda.Namespace; -import com.nvidia.grcuda.NoneValue; -import com.nvidia.grcuda.functions.CUDAFunction; -import com.nvidia.grcuda.gpu.UnsafeHelper.Integer32Object; -import com.nvidia.grcuda.gpu.UnsafeHelper.Integer64Object; -import com.oracle.truffle.api.CompilerAsserts; -import com.oracle.truffle.api.CompilerDirectives; -import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; -import com.oracle.truffle.api.TruffleLanguage.Env; -import com.oracle.truffle.api.interop.ArityException; -import com.oracle.truffle.api.interop.InteropException; -import com.oracle.truffle.api.interop.InteropLibrary; -import com.oracle.truffle.api.interop.TruffleObject; -import com.oracle.truffle.api.interop.UnknownIdentifierException; -import com.oracle.truffle.api.interop.UnsupportedMessageException; -import 
com.oracle.truffle.api.interop.UnsupportedTypeException; -import com.oracle.truffle.api.source.Source; - -public final class CUDARuntime { - - public static final String CUDA_RUNTIME_LIBRARY_NAME = "cudart"; - public static final String CUDA_LIBRARY_NAME = "cuda"; - static final String NVRTC_LIBRARY_NAME = "nvrtc"; - - private final GrCUDAContext context; - private final NVRuntimeCompiler nvrtc; - - /** - * Map from library-path to NFI library. - */ - private final HashMap loadedLibraries = new HashMap<>(); - - /** - * Map of (library-path, symbol-name) to callable. - */ - private final HashMap, Object> boundFunctions = new HashMap<>(); - - public CUDARuntime(GrCUDAContext context, Env env) { - this.context = context; - try { - TruffleObject libcudart = (TruffleObject) env.parseInternal( - Source.newBuilder("nfi", "load " + "lib" + CUDA_RUNTIME_LIBRARY_NAME + ".so", "cudaruntime").build()).call(); - TruffleObject libcuda = (TruffleObject) env.parseInternal( - Source.newBuilder("nfi", "load " + "lib" + CUDA_LIBRARY_NAME + ".so", "cuda").build()).call(); - TruffleObject libnvrtc = (TruffleObject) env.parseInternal( - Source.newBuilder("nfi", "load " + "lib" + NVRTC_LIBRARY_NAME + ".so", "nvrtc").build()).call(); - loadedLibraries.put(CUDA_RUNTIME_LIBRARY_NAME, libcudart); - loadedLibraries.put(CUDA_LIBRARY_NAME, libcuda); - loadedLibraries.put(NVRTC_LIBRARY_NAME, libnvrtc); - } catch (UnsatisfiedLinkError e) { - throw new GrCUDAException(e.getMessage()); - } - - nvrtc = new NVRuntimeCompiler(this); - context.addDisposable(this::shutdown); - } - - // using this slow/uncached instance since all calls are non-critical - private static final InteropLibrary INTEROP = InteropLibrary.getFactory().getUncached(); - - interface CallSupport { - String getName(); - - Object getSymbol(CUDARuntime runtime) throws UnknownIdentifierException; - - default void callSymbol(CUDARuntime runtime, Object... 
arguments) throws UnsupportedTypeException, ArityException, UnsupportedMessageException, UnknownIdentifierException { - CompilerAsserts.neverPartOfCompilation(); - Object result = INTEROP.execute(getSymbol(runtime), arguments); - runtime.checkCUDAReturnCode(result, getName()); - } - } - - @TruffleBoundary - public GPUPointer cudaMalloc(long numBytes) { - try (UnsafeHelper.PointerObject outPointer = UnsafeHelper.createPointerObject()) { - Object callable = CUDARuntimeFunction.CUDA_MALLOC.getSymbol(this); - Object result = INTEROP.execute(callable, outPointer.getAddress(), numBytes); - checkCUDAReturnCode(result, "cudaMalloc"); - long addressAllocatedMemory = outPointer.getValueOfPointer(); - return new GPUPointer(addressAllocatedMemory); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public LittleEndianNativeArrayView cudaMallocManaged(long numBytes) { - final int cudaMemAttachGlobal = 0x01; - try (UnsafeHelper.PointerObject outPointer = UnsafeHelper.createPointerObject()) { - Object callable = CUDARuntimeFunction.CUDA_MALLOCMANAGED.getSymbol(this); - Object result = INTEROP.execute(callable, outPointer.getAddress(), numBytes, cudaMemAttachGlobal); - checkCUDAReturnCode(result, "cudaMallocManaged"); - long addressAllocatedMemory = outPointer.getValueOfPointer(); - return new LittleEndianNativeArrayView(addressAllocatedMemory, numBytes); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cudaFree(LittleEndianNativeArrayView memory) { - try { - Object callable = CUDARuntimeFunction.CUDA_FREE.getSymbol(this); - Object result = INTEROP.execute(callable, memory.getStartAddress()); - checkCUDAReturnCode(result, "cudaFree"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cudaFree(GPUPointer pointer) { - try { - Object callable = CUDARuntimeFunction.CUDA_FREE.getSymbol(this); - Object result = 
INTEROP.execute(callable, pointer.getRawPointer()); - checkCUDAReturnCode(result, "cudaFree"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cudaDeviceSynchronize() { - try { - Object callable = CUDARuntimeFunction.CUDA_DEVICESYNCHRONIZE.getSymbol(this); - Object result = INTEROP.execute(callable); - checkCUDAReturnCode(result, "cudaDeviceSynchronize"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cudaMemcpy(long destPointer, long fromPointer, long numBytesToCopy) { - try { - Object callable = CUDARuntimeFunction.CUDA_MEMCPY.getSymbol(this); - if (numBytesToCopy < 0) { - throw new IllegalArgumentException("requested negative number of bytes to copy " + numBytesToCopy); - } - // cudaMemcpyKind from driver_types.h (default: direction of transfer is inferred - // from the pointer values, uses virtual addressing) - final long cudaMemcpyDefault = 4; - Object result = INTEROP.execute(callable, destPointer, fromPointer, numBytesToCopy, cudaMemcpyDefault); - checkCUDAReturnCode(result, "cudaMemcpy"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public DeviceMemoryInfo cudaMemGetInfo() { - final String symbol = "cudaMemGetInfo"; - final String nfiSignature = "(pointer, pointer): sint32"; - try (Integer64Object freeBytes = UnsafeHelper.createInteger64Object(); - Integer64Object totalBytes = UnsafeHelper.createInteger64Object()) { - Object callable = getSymbol(CUDA_RUNTIME_LIBRARY_NAME, symbol, nfiSignature); - Object result = INTEROP.execute(callable, freeBytes.getAddress(), totalBytes.getAddress()); - checkCUDAReturnCode(result, symbol); - return new DeviceMemoryInfo(freeBytes.getValue(), totalBytes.getValue()); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cudaDeviceReset() { - try { - Object callable = 
CUDARuntimeFunction.CUDA_DEVICERESET.getSymbol(this); - Object result = INTEROP.execute(callable); - checkCUDAReturnCode(result, "cudaDeviceReset"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public int cudaGetDeviceCount() { - try (UnsafeHelper.Integer32Object deviceCount = UnsafeHelper.createInteger32Object()) { - Object callable = CUDARuntimeFunction.CUDA_GETDEVICECOUNT.getSymbol(this); - Object result = INTEROP.execute(callable, deviceCount.getAddress()); - checkCUDAReturnCode(result, "cudaGetDeviceCount"); - return deviceCount.getValue(); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cudaSetDevice(int device) { - try { - Object callable = CUDARuntimeFunction.CUDA_SETDEVICE.getSymbol(this); - Object result = INTEROP.execute(callable, device); - checkCUDAReturnCode(result, "cudaSetDevice"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public int cudaGetDevice() { - try (Integer32Object deviceId = UnsafeHelper.createInteger32Object()) { - Object callable = CUDARuntimeFunction.CUDA_GETDEVICE.getSymbol(this); - Object result = INTEROP.execute(callable, deviceId.getAddress()); - checkCUDAReturnCode(result, "cudaGetDevice"); - return deviceId.getValue(); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public int cudaDeviceGetAttribute(CUDADeviceAttribute attribute, int deviceId) { - try (Integer32Object value = UnsafeHelper.createInteger32Object()) { - Object callable = CUDARuntimeFunction.CUDA_DEVICEGETATTRIBUTE.getSymbol(this); - Object result = INTEROP.execute(callable, value.getAddress(), attribute.getAttributeCode(), deviceId); - checkCUDAReturnCode(result, "cudaDeviceGetAttribute"); - return value.getValue(); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public Object getDeviceName(int deviceOrdinal) { 
-        return cuDeviceGetName(cuDeviceGet(deviceOrdinal));
-    }
-
-    @TruffleBoundary
-    public String cudaGetErrorString(int errorCode) {
-        try {
-            Object callable = CUDARuntimeFunction.CUDA_GETERRORSTRING.getSymbol(this);
-            Object result = INTEROP.execute(callable, errorCode);
-            return INTEROP.asString(result);
-        } catch (InteropException e) {
-            throw new GrCUDAException(e);
-        }
-    }
-
-    /**
-     * Get function as callable from native library.
-     *
-     * @param binding function binding
-     * @return a callable as a TruffleObject
-     */
-    @TruffleBoundary
-    public Object getSymbol(FunctionBinding binding) throws UnknownIdentifierException {
-        return getSymbol(binding.getLibraryFileName(), binding.getSymbolName(), binding.toNFISignature(), "");
-    }
-
-    /**
-     * Get function as callable from native library.
-     *
-     * @param libraryPath path to library (.so file)
-     * @param symbolName name of the function (symbol) to look up
-     * @param nfiSignature NFI signature of the function
-     * @return a callable as a TruffleObject
-     */
-    @TruffleBoundary
-    public Object getSymbol(String libraryPath, String symbolName, String nfiSignature) throws UnknownIdentifierException {
-        return getSymbol(libraryPath, symbolName, nfiSignature, "");
-    }
-
-    /**
-     * Get function as callable from native library.
-     *
-     * @param libraryPath path to library (.so file)
-     * @param symbolName name of the function (symbol) to look up
-     * @param nfiSignature NFI signature of the function
-     * @param hint additional string shown to user when symbol cannot be loaded
-     * @return a callable as a TruffleObject
-     */
-    @TruffleBoundary
-    public Object getSymbol(String libraryPath, String symbolName, String nfiSignature, String hint) throws UnknownIdentifierException {
-
-        Pair<String, String> functionKey = Pair.create(libraryPath, symbolName);
-        Object callable = boundFunctions.get(functionKey);
-        if (callable == null) {
-            // symbol does not exist or not yet bound
-            TruffleObject library = loadedLibraries.get(libraryPath);
-            if (library == null) {
-                try {
-                    // library does not exist or is not loaded yet
-                    library = (TruffleObject) context.getEnv().parseInternal(
-                                    Source.newBuilder("nfi", "load \"" + libraryPath + "\"", libraryPath).build()).call();
-                } catch (UnsatisfiedLinkError e) {
-                    throw new GrCUDAException("unable to load shared library '" + libraryPath + "': " + e.getMessage() + hint);
-                }
-
-                loadedLibraries.put(libraryPath, library);
-            }
-            try {
-                Object symbol = INTEROP.readMember(library, symbolName);
-                callable = INTEROP.invokeMember(symbol, "bind", nfiSignature);
-            } catch (UnsatisfiedLinkError | UnsupportedMessageException | ArityException | UnsupportedTypeException e) {
-                throw new GrCUDAException("unexpected behavior: " + e.getMessage());
-            }
-            boundFunctions.put(functionKey, callable);
-        }
-        return callable;
-    }
-
-    private void checkCUDAReturnCode(Object result, String...
function) { - if (!(result instanceof Integer)) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException("expected return code as Integer object in " + GrCUDAException.format(function) + ", got " + result.getClass().getName()); - } - Integer returnCode = (Integer) result; - if (returnCode != 0) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(returnCode, cudaGetErrorString(returnCode), function); - } - } - - public void registerCUDAFunctions(Namespace rootNamespace) { - for (CUDARuntimeFunction function : CUDARuntimeFunction.values()) { - rootNamespace.addFunction(new CUDAFunction(function, this)); - } - } - - public enum CUDARuntimeFunction implements CUDAFunction.Spec, CallSupport { - CUDA_DEVICEGETATTRIBUTE("cudaDeviceGetAttribute", "(pointer, sint32, sint32): sint32") { - @Override - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, UnsupportedTypeException, InteropException { - checkArgumentLength(args, 2); - int attributeCode = expectInt(args[0]); - int deviceId = expectInt(args[1]); - try (UnsafeHelper.Integer32Object value = UnsafeHelper.createInteger32Object()) { - callSymbol(cudaRuntime, value.getAddress(), attributeCode, deviceId); - return value.getValue(); - } - } - }, - CUDA_DEVICERESET("cudaDeviceReset", "(): sint32") { - @Override - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, InteropException { - checkArgumentLength(args, 0); - callSymbol(cudaRuntime); - return NoneValue.get(); - } - }, - CUDA_DEVICESYNCHRONIZE("cudaDeviceSynchronize", "(): sint32") { - @Override - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, InteropException, UnsupportedMessageException { - checkArgumentLength(args, 0); - callSymbol(cudaRuntime); - return NoneValue.get(); - } - }, - CUDA_FREE("cudaFree", "(pointer): sint32") { - @Override - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, 
UnsupportedTypeException, InteropException { - checkArgumentLength(args, 1); - Object pointerObj = args[0]; - long addr; - if (pointerObj instanceof GPUPointer) { - addr = ((GPUPointer) pointerObj).getRawPointer(); - } else if (pointerObj instanceof LittleEndianNativeArrayView) { - addr = ((LittleEndianNativeArrayView) pointerObj).getStartAddress(); - } else { - throw new GrCUDAException("expected GPUPointer or LittleEndianNativeArrayView"); - } - callSymbol(cudaRuntime, addr); - return NoneValue.get(); - } - }, - CUDA_GETDEVICE("cudaGetDevice", "(pointer): sint32") { - @Override - @TruffleBoundary - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, InteropException { - checkArgumentLength(args, 0); - try (UnsafeHelper.Integer32Object deviceId = UnsafeHelper.createInteger32Object()) { - callSymbol(cudaRuntime, deviceId.getAddress()); - return deviceId.getValue(); - } - } - }, - CUDA_GETDEVICECOUNT("cudaGetDeviceCount", "(pointer): sint32") { - @Override - @TruffleBoundary - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, InteropException { - checkArgumentLength(args, 0); - try (UnsafeHelper.Integer32Object deviceCount = UnsafeHelper.createInteger32Object()) { - callSymbol(cudaRuntime, deviceCount.getAddress()); - return deviceCount.getValue(); - } - } - }, - CUDA_GETERRORSTRING("cudaGetErrorString", "(sint32): string") { - @Override - @TruffleBoundary - public String call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, UnsupportedTypeException, InteropException { - checkArgumentLength(args, 1); - int errorCode = expectInt(args[0]); - Object result = INTEROP.execute(getSymbol(cudaRuntime), errorCode); - return INTEROP.asString(result); - } - }, - CUDA_MALLOC("cudaMalloc", "(pointer, uint64): sint32") { - @Override - @TruffleBoundary - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, UnsupportedTypeException, InteropException { - checkArgumentLength(args, 
1); - long numBytes = expectLong(args[0]); - try (UnsafeHelper.PointerObject outPointer = UnsafeHelper.createPointerObject()) { - callSymbol(cudaRuntime, outPointer.getAddress(), numBytes); - long addressAllocatedMemory = outPointer.getValueOfPointer(); - return new GPUPointer(addressAllocatedMemory); - } - } - }, - CUDA_MALLOCMANAGED("cudaMallocManaged", "(pointer, uint64, uint32): sint32") { - @Override - @TruffleBoundary - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, UnsupportedTypeException, InteropException { - checkArgumentLength(args, 1); - final int cudaMemAttachGlobal = 0x01; - long numBytes = expectLong(args[0]); - try (UnsafeHelper.PointerObject outPointer = UnsafeHelper.createPointerObject()) { - callSymbol(cudaRuntime, outPointer.getAddress(), numBytes, cudaMemAttachGlobal); - long addressAllocatedMemory = outPointer.getValueOfPointer(); - return new GPUPointer(addressAllocatedMemory); - } - } - }, - CUDA_SETDEVICE("cudaSetDevice", "(sint32): sint32") { - @Override - @TruffleBoundary - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, UnsupportedTypeException, InteropException { - checkArgumentLength(args, 1); - int device = expectInt(args[0]); - callSymbol(cudaRuntime, device); - return NoneValue.get(); - } - }, - CUDA_MEMCPY("cudaMemcpy", "(pointer, pointer, uint64, sint32): sint32") { - @Override - @TruffleBoundary - public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, UnsupportedTypeException, InteropException { - checkArgumentLength(args, 3); - long destPointer = expectLong(args[0]); - long fromPointer = expectLong(args[1]); - long numBytesToCopy = expectPositiveLong(args[2]); - // cudaMemcpyKind from driver_types.h (default: direction of transfer is - // inferred from the pointer values, uses virtual addressing) - final long cudaMemcpyDefault = 4; - callSymbol(cudaRuntime, destPointer, fromPointer, numBytesToCopy, cudaMemcpyDefault); - return 
NoneValue.get(); - } - }; - - private final String name; - private final String nfiSignature; - - CUDARuntimeFunction(String name, String nfiSignature) { - this.name = name; - this.nfiSignature = nfiSignature; - } - - public String getName() { - return name; - } - - public Object getSymbol(CUDARuntime runtime) throws UnknownIdentifierException { - return runtime.getSymbol(CUDA_RUNTIME_LIBRARY_NAME, name, nfiSignature); - } - } - - private HashMap loadedModules = new HashMap<>(); - - @TruffleBoundary - public Kernel loadKernel(Binding binding) { - return loadKernel(binding.getLibraryFileName(), binding.getName(), binding.getSymbolName(), binding.getNIDLParameterSignature()); - } - - @TruffleBoundary - public Kernel loadKernel(String cubinFile, String kernelName, String symbolName, String signature) { - CUModule module = loadedModules.get(cubinFile); - if (module == null) { - // load module as it is not yet loaded - module = cuModuleLoad(cubinFile); - loadedModules.put(cubinFile, module); - } - long kernelFunction = cuModuleGetFunction(module, symbolName); - return new Kernel(this, kernelName, symbolName, kernelFunction, signature, module); - } - - @TruffleBoundary - public Kernel buildKernel(String code, String kernelName, String signature) { - String moduleName = "truffle" + context.getNextModuleId(); - PTXKernel ptx = nvrtc.compileKernel(code, kernelName, moduleName, "--std=c++14"); - CUModule module = cuModuleLoadData(ptx.getPtxSource(), moduleName); - loadedModules.put(moduleName, module); - long kernelFunctionHandle = cuModuleGetFunction(module, ptx.getLoweredKernelName()); - return new Kernel(this, kernelName, ptx.getLoweredKernelName(), kernelFunctionHandle, - signature, module, ptx.getPtxSource()); - } - - @TruffleBoundary - public CUModule cuModuleLoad(String cubinName) { - assertCUDAInitialized(); - if (loadedModules.containsKey(cubinName)) { - throw new GrCUDAException("A module for " + cubinName + " was already loaded."); - } - try 
(UnsafeHelper.Integer64Object modulePtr = UnsafeHelper.createInteger64Object()) { - Object callable = CUDADriverFunction.CU_MODULELOAD.getSymbol(this); - Object result = INTEROP.execute(callable, modulePtr.getAddress(), cubinName); - checkCUReturnCode(result, "cuModuleLoad"); - return new CUModule(cubinName, modulePtr.getValue()); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public CUModule cuModuleLoadData(String ptx, String moduleName) { - assertCUDAInitialized(); - if (loadedModules.containsKey(moduleName)) { - throw new GrCUDAException("A module for " + moduleName + " was already loaded."); - } - try (UnsafeHelper.Integer64Object modulePtr = UnsafeHelper.createInteger64Object()) { - Object callable = CUDADriverFunction.CU_MODULELOADDATA.getSymbol(this); - Object result = INTEROP.execute(callable, - modulePtr.getAddress(), ptx); - checkCUReturnCode(result, "cuModuleLoadData"); - return new CUModule(moduleName, modulePtr.getValue()); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cuModuleUnload(CUModule module) { - try { - Object callable = CUDADriverFunction.CU_MODULEUNLOAD.getSymbol(this); - Object result = INTEROP.execute(callable, module.modulePointer); - checkCUReturnCode(result, "cuModuleUnload"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - /** - * Get function handle to kernel in module. 
- * - * @param kernelModule CUmodule containing the kernel function - * @param kernelName - * @return native CUfunction function handle - */ - public long cuModuleGetFunction(CUModule kernelModule, String kernelName) { - try (UnsafeHelper.Integer64Object functionPtr = UnsafeHelper.createInteger64Object()) { - Object callable = CUDADriverFunction.CU_MODULEGETFUNCTION.getSymbol(this); - Object result = INTEROP.execute(callable, - functionPtr.getAddress(), kernelModule.getModulePointer(), kernelName); - checkCUReturnCode(result, "cuModuleGetFunction"); - return functionPtr.getValue(); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cuCtxSynchronize() { - assertCUDAInitialized(); - try { - Object callable = CUDADriverFunction.CU_CTXSYNCHRONIZE.getSymbol(this); - Object result = INTEROP.execute(callable); - checkCUReturnCode(result, "cuCtxSynchronize"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - public void cuLaunchKernel(Kernel kernel, KernelConfig config, KernelArguments args) { - try { - Object callable = CUDADriverFunction.CU_LAUNCHKERNEL.getSymbol(this); - Dim3 gridSize = config.getGridSize(); - Dim3 blockSize = config.getBlockSize(); - Object result = INTEROP.execute(callable, - kernel.getKernelFunctionHandle(), - gridSize.getX(), - gridSize.getY(), - gridSize.getZ(), - blockSize.getX(), - blockSize.getY(), - blockSize.getZ(), - config.getDynamicSharedMemoryBytes(), - config.getStream(), - args.getPointer(), // pointer to kernel arguments array - 0 // extra args - ); - checkCUReturnCode(result, "cuLaunchKernel"); - cudaDeviceSynchronize(); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - private void cuInit() { - try { - Object callable = CUDADriverFunction.CU_INIT.getSymbol(this); - int flags = 0; // must be zero as per CUDA Driver API documentation - Object result = INTEROP.execute(callable, flags); - 
checkCUReturnCode(result, "cuInit"); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - private int cuDeviceGetCount() { - try (UnsafeHelper.Integer32Object devCount = UnsafeHelper.createInteger32Object()) { - Object callable = CUDADriverFunction.CU_DEVICEGETCOUNT.getSymbol(this); - Object result = INTEROP.execute(callable, devCount.getAddress()); - checkCUReturnCode(result, "cuDeviceGetCount"); - return devCount.getValue(); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - private int cuDeviceGet(int deviceOrdinal) { - assertCUDAInitialized(); - try (UnsafeHelper.Integer32Object deviceObj = UnsafeHelper.createInteger32Object()) { - Object callable = CUDADriverFunction.CU_DEVICEGET.getSymbol(this); - Object result = INTEROP.execute(callable, deviceObj.getAddress(), deviceOrdinal); - checkCUReturnCode(result, "cuDeviceGet"); - return deviceObj.getValue(); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - private String cuDeviceGetName(int cuDeviceId) { - final int maxLength = 256; - try (UnsafeHelper.StringObject nameString = new UnsafeHelper.StringObject(maxLength)) { - Object callable = CUDADriverFunction.CU_DEVICEGETNAME.getSymbol(this); - Object result = INTEROP.execute(callable, nameString.getAddress(), maxLength, cuDeviceId); - checkCUReturnCode(result, "cuDeviceGetName"); - return nameString.getZeroTerminatedString(); - } catch (InteropException e) { - throw new GrCUDAException(e); - } - } - - @TruffleBoundary - private long cuCtxCreate(int flags, int cudevice) { - try (UnsafeHelper.PointerObject pctx = UnsafeHelper.createPointerObject()) { - Object callable = CUDADriverFunction.CU_CTXCREATE.getSymbol(this); - Object result = INTEROP.execute(callable, pctx.getAddress(), flags, cudevice); - checkCUReturnCode(result, "cuCtxCreate"); - return pctx.getValueOfPointer(); - } catch (InteropException e) { - throw new 
GrCUDAException(e);
-        }
-    }
-
-    @TruffleBoundary
-    private long cuDevicePrimaryCtxRetain(int cudevice) {
-        try (UnsafeHelper.PointerObject pctx = UnsafeHelper.createPointerObject()) {
-            Object callable = CUDADriverFunction.CU_DEVICEPRIMARYCTXRETAIN.getSymbol(this);
-            Object result = INTEROP.execute(callable, pctx.getAddress(), cudevice);
-            checkCUReturnCode(result, "cuDevicePrimaryCtxRetain");
-            return pctx.getValueOfPointer();
-        } catch (InteropException e) {
-            throw new GrCUDAException(e);
-        }
-    }
-
-    @TruffleBoundary
-    private void cuCtxDestroy(long ctx) {
-        try {
-            Object callable = CUDADriverFunction.CU_CTXDESTROY.getSymbol(this);
-            Object result = INTEROP.execute(callable, ctx);
-            checkCUReturnCode(result, "cuCtxDestroy");
-        } catch (InteropException e) {
-            throw new GrCUDAException(e);
-        }
-    }
-
-    @TruffleBoundary
-    private void assertCUDAInitialized() {
-        if (!context.isCUDAInitialized()) {
-            // a simple way to create the device context in the driver is to call a CUDA runtime function
-            cudaDeviceSynchronize();
-            context.setCUDAInitialized();
-        }
-    }
-
-    @SuppressWarnings("static-method")
-    private static void checkCUReturnCode(Object result, String...
function) {
-        int returnCode;
-        try {
-            returnCode = INTEROP.asInt(result);
-        } catch (UnsupportedMessageException e) {
-            CompilerDirectives.transferToInterpreter();
-            throw new GrCUDAException(
-                            "expected return code as Integer object in " + GrCUDAException.format(function) + ", got " +
-                                            result.getClass().getName());
-        }
-        if (returnCode != 0) {
-            throw new GrCUDAException(returnCode, DriverAPIErrorMessages.getString(returnCode), function);
-        }
-    }
-
-    private void shutdown() {
-        // unload all modules
-        for (CUModule module : loadedModules.values()) {
-            try {
-                module.close();
-            } catch (Exception e) {
-                /* ignore exception */
-            }
-        }
-        loadedModules.clear();
-    }
-
-    public enum CUDADriverFunction {
-        CU_CTXCREATE("cuCtxCreate", "(pointer, uint32, sint32): sint32"),
-        CU_CTXDESTROY("cuCtxDestroy", "(pointer): sint32"),
-        CU_CTXSYNCHRONIZE("cuCtxSynchronize", "(): sint32"),
-        CU_DEVICEGETCOUNT("cuDeviceGetCount", "(pointer): sint32"),
-        CU_DEVICEGET("cuDeviceGet", "(pointer, sint32): sint32"),
-        CU_DEVICEGETNAME("cuDeviceGetName", "(pointer, sint32, sint32): sint32"),
-        CU_DEVICEPRIMARYCTXRETAIN("cuDevicePrimaryCtxRetain", "(pointer, sint32): sint32"),
-        CU_INIT("cuInit", "(uint32): sint32"),
-        CU_LAUNCHKERNEL("cuLaunchKernel", "(uint64, uint32, uint32, uint32, uint32, uint32, uint32, uint32, uint64, pointer, pointer): sint32"),
-        CU_MODULELOAD("cuModuleLoad", "(pointer, string): sint32"),
-        CU_MODULELOADDATA("cuModuleLoadData", "(pointer, string): sint32"),
-        CU_MODULEUNLOAD("cuModuleUnload", "(uint64): sint32"),
-        CU_MODULEGETFUNCTION("cuModuleGetFunction", "(pointer, uint64, string): sint32");
-
-        private final String symbolName;
-        private final String signature;
-
-        CUDADriverFunction(String symbolName, String nfiSignature) {
-            this.symbolName = symbolName;
-            this.signature = nfiSignature;
-        }
-
-        public Object getSymbol(CUDARuntime runtime) throws UnknownIdentifierException {
-            return runtime.getSymbol(CUDA_LIBRARY_NAME, symbolName, signature);
-        }
-    }
-
-    /** CUDA device
attributes from driver_types.h CUDA header. */ - public enum CUDADeviceAttribute { - MAX_THREADS_PER_BLOCK("maxThreadsPerBlock", 1), - MAX_BLOCK_DIMX("maxBlockDimX", 2), - MAX_BLOCK_DIMY("maxBlockDimY", 3), - MAX_BLOCK_DIMZ("maxBlockDimZ", 4), - MAX_GRID_DIMX("maxGridDimX", 5), - MAX_GRID_DIMY("maxGridDimY", 6), - MAX_GRID_DIMZ("maxGridDimZ", 7), - MAX_SHARED_MEMORY_PER_BLOCK("maxSharedMemoryPerBlock", 8), - TOTAL_CONSTANT_MEMORY("totalConstantMemory", 9), - WARPSIZE("warpSize", 10), - MAX_PITCH("maxPitch", 11), - MAX_REGISTERS_PER_BLOCK("maxRegistersPerBlock", 12), - CLOCK_RATE("clockRate", 13), - TEXTURE_ALIGNMENT("textureAlignment", 14), - GPU_OVERLAP("gpuOverlap", 15), - MULTI_PROCESSOR_COUNT("multiProcessorCount", 16), - KERNEL_EXEC_TIMEOUT("kernelExecTimeout", 17), - INTEGRATED("integrated", 18), - CAN_MAP_HOST_MEMORY("canMapHostMemory", 19), - COMPUTE_MODE("computeMode", 20), - MAX_TEXTURE1D_WIDTH("maxTexture1DWidth", 21), - MAX_TEXTURE2D_WIDTH("maxTexture2DWidth", 22), - MAX_TEXTURE2D_HEIGHT("maxTexture2DHeight", 23), - MAX_TEXTURE3D_WIDTH("maxTexture3DWidth", 24), - MAX_TEXTURE3D_HEIGHT("maxTexture3DHeight", 25), - MAX_TEXTURE3D_DEPTH("maxTexture3DDepth", 26), - MAX_TEXTURE2D_LAYERED_WIDTH("maxTexture2DLayeredWidth", 27), - MAX_TEXTURE2D_LAYERED_HEIGHT("maxTexture2DLayeredHeight", 28), - MAX_TEXTURE2D_LAYERED_LAYERS("maxTexture2DLayeredLayers", 29), - SURFACE_ALIGNMENT("surfaceAlignment", 30), - CONCURRENT_KERNELS("concurrentKernels", 31), - ECC_ENABLED("eccEnabled", 32), - PCI_BUS_ID("pciBusId", 33), - PCI_DEVICE_ID("pciDeviceId", 34), - TCC_DRIVER("tccDriver", 35), - MEMORY_CLOCK_RATE("memoryClockRate", 36), - GLOBAL_MEMORY_BUS_WIDTH("globalMemoryBusWidth", 37), - L2_CACHE_SIZE("l2CacheSize", 38), - MAX_THREADS_PER_MULTIPROCESSOR("maxThreadsPerMultiProcessor", 39), - ASYNC_ENGINE_COUNT("asyncEngineCount", 40), - UNIFIED_ADDRESSING("unifiedAddressing", 41), - MAX_TEXTURE1D_LAYERED_WIDTH("maxTexture1DLayeredWidth", 42), - 
MAX_TEXTURE1D_LAYERED_LAYERS("maxTexture1DLayeredLayers", 43), - MAX_TEXTURE2D_GATHER_WIDTH("maxTexture2DGatherWidth", 45), - MAX_TEXTURE2D_GATHER_HEIGHT("maxTexture2DGatherHeight", 46), - MAX_TEXTURE3D_WIDTH_ALT("maxTexture3DWidthAlt", 47), - MAX_TEXTURE3D_HEIGHT_ALT("maxTexture3DHeightAlt", 48), - MAX_TEXTURE3D_DEPTH_ALT("maxTexture3DDepthAlt", 49), - PCI_DOMAIN_ID("pciDomainId", 50), - TEXTURE_PITCH_ALIGNMENT("texturePitchAlignment", 51), - MAX_TEXTURE_CUBEMAP_WIDTH("maxTextureCubemapWidth", 52), - MAX_TEXTURE_CUBEMAP_LAYERED_WIDTH("maxTextureCubemapLayeredWidth", 53), - MAX_TEXTURE_CUBEMAP_LAYERED_LAYERS("maxTextureCubemapLayeredLayers", 54), - MAX_SURFACE1D_WIDTH("maxSurface1DWidth", 55), - MAX_SURFACE2D_WIDTH("maxSurface2DWidth", 56), - MAX_SURFACE2D_HEIGHT("maxSurface2DHeight", 57), - MAX_SURFACE3D_WIDTH("maxSurface3DWidth", 58), - MAX_SURFACE3D_HEIGHT("maxSurface3DHeight", 59), - MAX_SURFACE3D_DEPTH("maxSurface3DDepth", 60), - MAX_SURFACE1D_LAYERED_WIDTH("maxSurface1DLayeredWidth", 61), - MAX_SURFACE1D_LAYERED_LAYERS("maxSurface1DLayeredLayers", 62), - MAX_SURFACE2D_LAYERED_WIDTH("maxSurface2DLayeredWidth", 63), - MAX_SURFACE2D_LAYERED_HEIGHT("maxSurface2DLayeredHeight", 64), - MAX_SURFACE2D_LAYERED_LAYERS("maxSurface2DLayeredLayers", 65), - MAX_SURFACE_CUBEMAP_WIDTH("maxSurfaceCubemapWidth", 66), - MAX_SURFACE_CUBEMAP_LAYERED_WIDTH("maxSurfaceCubemapLayeredWidth", 67), - MAX_SURFACE_CUBEMAP_LAYERED_LAYERS("maxSurfaceCubemapLayeredLayers", 68), - MAX_TEXTURE1D_LINEAR_WIDTH("maxTexture1DLinearWidth", 69), - MAX_TEXTURE2D_LINEAR_WIDTH("maxTexture2DLinearWidth", 70), - MAX_TEXTURE2D_LINEAR_HEIGHT("maxTexture2DLinearHeight", 71), - MAX_TEXTURE2D_LINEAR_PITCH("maxTexture2DLinearPitch", 72), - MAX_TEXTURE2D_MIPMAPPED_WIDTH("maxTexture2DMipmappedWidth", 73), - MAX_TEXTURE2D_MIPMAPPED_HEIGHT("maxTexture2DMipmappedHeight", 74), - COMPUTE_CAPABILITY_MAJOR("computeCapabilityMajor", 75), - COMPUTE_CAPABILITY_MINOR("computeCapabilityMinor", 76), - 
MAX_TEXTURE1D_MIPMAPPED_WIDTH("maxTexture1DMipmappedWidth", 77), - STREAM_PRIORITIES_SUPPORTED("streamPrioritiesSupported", 78), - GLOBAL_L1_CACHE_SUPPORTED("globalL1CacheSupported", 79), - LOCAL_L1_CACHE_SUPPORTED("localL1CacheSupported", 80), - MAX_SHARED_MEMORY_PER_MULTIPROCESSOR("maxSharedMemoryPerMultiprocessor", 81), - MAX_REGISTERS_PER_MULTIPROCESSOR("maxRegistersPerMultiprocessor", 82), - MANAGED_MEMORY("managedMemory", 83), - IS_MULTI_GPU_BOARD("isMultiGpuBoard", 84), - MULTI_GPU_BOARD_GROUP_ID("multiGpuBoardGroupID", 85), - HOST_NATIVE_ATOMIC_SUPPORTED("hostNativeAtomicSupported", 86), - SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO("singleToDoublePrecisionPerfRatio", 87), - PAGEABLE_MEMORY_ACCESS("pageableMemoryAccess", 88), - CONCURRENT_MANAGED_ACCESS("concurrentManagedAccess", 89), - COMPUTE_PREEMPTION_SUPPORTED("computePreemptionSupported", 90), - CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM("canUseHostPointerForRegisteredMem", 91), - COOPERATIVE_LAUNCH("cooperativeLaunch", 95), - COOPERATIVE_MULTI_DEVICE_LAUNCH("cooperativeMultiDeviceLaunch", 96), - MAX_SHARED_MEMORY_PER_BLOCK_OPTIN("maxSharedMemoryPerBlockOptin", 97), - CAN_FLUSH_REMOTE_WRITES("canFlushRemoteWrites", 98), - HOST_REGISTER_SUPPORTED("hostRegisterSupported", 99), - PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES("pageableMemoryAccessUsesHostPageTables", 100), - DIRECT_MANAGED_MEM_ACCESS_FROM_HOST("directManagedMemAccessFromHost", 101); - - final String attributeName; - final int attributeCode; - - String getAttributeName() { - return attributeName; - } - - int getAttributeCode() { - return attributeCode; - } - - CUDADeviceAttribute(String name, int code) { - this.attributeName = name; - this.attributeCode = code; - } - } - - final class CUModule implements AutoCloseable { - final String cubinFile; - /** Pointer to the native CUmodule object. 
*/
-        final long modulePointer;
-        boolean closed = false;
-
-        CUModule(String cubinFile, long modulePointer) {
-            this.cubinFile = cubinFile;
-            this.modulePointer = modulePointer;
-            this.closed = false;
-        }
-
-        public long getModulePointer() {
-            if (closed) {
-                CompilerDirectives.transferToInterpreter();
-                throw new GrCUDAException(String.format("cannot get module pointer, module (%016x) already closed", modulePointer));
-            }
-            return modulePointer;
-        }
-
-        public boolean isClosed() {
-            return closed;
-        }
-
-        @Override
-        public boolean equals(Object other) {
-            if (other instanceof CUModule) {
-                CUModule otherModule = (CUModule) other;
-                return otherModule.cubinFile.equals(cubinFile) && otherModule.closed == closed;
-            } else {
-                return false;
-            }
-        }
-
-        @Override
-        public int hashCode() {
-            return cubinFile.hashCode();
-        }
-
-        @Override
-        public void close() {
-            if (!closed) {
-                cuModuleUnload(this);
-                closed = true;
-            }
-        }
-    }
-}
-
-final class DeviceMemoryInfo {
-    private final long freeBytes;
-    private final long totalBytes;
-
-    DeviceMemoryInfo(long freeBytes, long totalBytes) {
-        this.freeBytes = freeBytes;
-        this.totalBytes = totalBytes;
-    }
-
-    public long getFreeBytes() {
-        return freeBytes;
-    }
-
-    public long getTotalBytes() {
-        return totalBytes;
-    }
-
-    @Override
-    public String toString() {
-        return String.format("DeviceMemoryInfo(freeBytes=%d bytes, totalBytes=%d bytes)", freeBytes, totalBytes);
-    }
-}
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/Kernel.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/Kernel.java
deleted file mode 100644
index 03ee2310..00000000
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/Kernel.java
+++ /dev/null
@@ -1,459 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
- * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
- * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of NVIDIA CORPORATION nor the names of its - * contributors may be used to endorse or promote products derived - * from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY - * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR - * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR - * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, - * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, - * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR - * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY - * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- */ -package com.nvidia.grcuda.gpu; - -import java.io.Closeable; -import java.io.IOException; -import java.util.ArrayList; -import java.util.Arrays; - -import com.nvidia.grcuda.Parameter; -import com.nvidia.grcuda.DeviceArray; -import com.nvidia.grcuda.DeviceArray.MemberSet; -import com.nvidia.grcuda.GrCUDAException; -import com.nvidia.grcuda.GrCUDAInternalException; -import com.nvidia.grcuda.MultiDimDeviceArray; -import com.nvidia.grcuda.Type; -import com.nvidia.grcuda.TypeException; -import com.nvidia.grcuda.gpu.CUDARuntime.CUModule; -import com.nvidia.grcuda.gpu.UnsafeHelper.MemoryObject; -import com.oracle.truffle.api.CompilerDirectives; -import com.oracle.truffle.api.dsl.Fallback; -import com.oracle.truffle.api.dsl.Specialization; -import com.oracle.truffle.api.interop.ArityException; -import com.oracle.truffle.api.interop.InteropLibrary; -import com.oracle.truffle.api.interop.InvalidArrayIndexException; -import com.oracle.truffle.api.interop.TruffleObject; -import com.oracle.truffle.api.interop.UnknownIdentifierException; -import com.oracle.truffle.api.interop.UnsupportedMessageException; -import com.oracle.truffle.api.interop.UnsupportedTypeException; -import com.oracle.truffle.api.library.CachedLibrary; -import com.oracle.truffle.api.library.ExportLibrary; -import com.oracle.truffle.api.library.ExportMessage; - -@ExportLibrary(InteropLibrary.class) -public final class Kernel implements TruffleObject { - - private final CUDARuntime cudaRuntime; - private final String kernelName; - private final String kernelSymbol; - private final long nativeKernelFunctionHandle; - private final CUModule module; - private final Parameter[] kernelParameters; - private int launchCount = 0; - private String ptxCode; - - /** - * Create a kernel without PTX code. 
-     *
-     * @param cudaRuntime captured reference to the CUDA runtime instance
-     * @param kernelName name of the kernel as exposed through Truffle
-     * @param kernelSymbol name of the kernel symbol
-     * @param kernelFunction native pointer to the kernel function (CUfunction)
-     * @param kernelSignature signature string of the kernel (NFI or NIDL)
-     * @param module CUmodule that contains the kernel function
-     */
-    public Kernel(CUDARuntime cudaRuntime, String kernelName,
-                    String kernelSymbol, long kernelFunction,
-                    String kernelSignature, CUModule module) {
-        this(cudaRuntime, kernelName, kernelSymbol, kernelFunction, kernelSignature, module, "");
-    }
-
-    /**
-     * Create a kernel and hold on to the PTX code.
-     *
-     * @param cudaRuntime captured reference to the CUDA runtime instance
-     * @param kernelName name of kernel as exposed through Truffle
-     * @param kernelSymbol name of the kernel symbol
-     * @param kernelFunction native pointer to the kernel function (CUfunction)
-     * @param kernelSignature signature string of the kernel (NFI or NIDL)
-     * @param module CUmodule that contains the kernel function
-     * @param ptx PTX source code for the kernel.
- */ - public Kernel(CUDARuntime cudaRuntime, String kernelName, String kernelSymbol, - long kernelFunction, String kernelSignature, CUModule module, String ptx) { - try { - ArrayList paramList = Parameter.parseParameterSignature(kernelSignature); - Parameter[] params = new Parameter[paramList.size()]; - this.kernelParameters = paramList.toArray(params); - } catch (TypeException e) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException(e.getMessage()); - } - this.cudaRuntime = cudaRuntime; - this.kernelName = kernelName; - this.kernelSymbol = kernelSymbol; - this.nativeKernelFunctionHandle = kernelFunction; - this.module = module; - this.ptxCode = ptx; - } - - public void incrementLaunchCount() { - launchCount++; - } - - public CUDARuntime getCudaRuntime() { - return cudaRuntime; - } - - public Parameter[] getKernelParameters() { - return kernelParameters; - } - - KernelArguments createKernelArguments(Object[] args, InteropLibrary booleanAccess, - InteropLibrary int8Access, InteropLibrary int16Access, - InteropLibrary int32Access, InteropLibrary int64Access, InteropLibrary doubleAccess) - throws UnsupportedTypeException, ArityException { - if (args.length != kernelParameters.length) { - CompilerDirectives.transferToInterpreter(); - throw ArityException.create(kernelParameters.length, args.length); - } - KernelArguments kernelArgs = new KernelArguments(args.length); - for (int paramIdx = 0; paramIdx < kernelParameters.length; paramIdx++) { - Object arg = args[paramIdx]; - Parameter param = kernelParameters[paramIdx]; - Type paramType = param.getType(); - try { - if (param.isPointer()) { - if (arg instanceof DeviceArray) { - DeviceArray deviceArray = (DeviceArray) arg; - if (!param.isSynonymousWithPointerTo(deviceArray.getElementType())) { - throw new GrCUDAException("device array of " + deviceArray.getElementType() + " cannot be used as pointer argument " + paramType); - } - UnsafeHelper.PointerObject pointer = 
UnsafeHelper.createPointerObject(); - pointer.setValueOfPointer(deviceArray.getPointer()); - kernelArgs.setArgument(paramIdx, pointer); - } else if (arg instanceof MultiDimDeviceArray) { - MultiDimDeviceArray deviceArray = (MultiDimDeviceArray) arg; - if (!param.isSynonymousWithPointerTo(deviceArray.getElementType())) { - throw new GrCUDAException("multi-dimensional device array of " + - deviceArray.getElementType() + " cannot be used as pointer argument " + paramType); - } - UnsafeHelper.PointerObject pointer = UnsafeHelper.createPointerObject(); - pointer.setValueOfPointer(deviceArray.getPointer()); - kernelArgs.setArgument(paramIdx, pointer); - } else { - CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{arg}, "expected DeviceArray type"); - } - } else { - // by-value argument - switch (paramType) { - case BOOLEAN: { - UnsafeHelper.Integer8Object int8 = UnsafeHelper.createInteger8Object(); - int8.setValue(booleanAccess.asBoolean(arg) ? ((byte) 1) : ((byte) 0)); - kernelArgs.setArgument(paramIdx, int8); - break; - } - case SINT8: - case CHAR: { - UnsafeHelper.Integer8Object int8 = UnsafeHelper.createInteger8Object(); - int8.setValue(int8Access.asByte(arg)); - kernelArgs.setArgument(paramIdx, int8); - break; - } - case SINT16: { - UnsafeHelper.Integer16Object int16 = UnsafeHelper.createInteger16Object(); - int16.setValue(int16Access.asShort(arg)); - kernelArgs.setArgument(paramIdx, int16); - break; - } - case SINT32: - case WCHAR: { - UnsafeHelper.Integer32Object int32 = UnsafeHelper.createInteger32Object(); - int32.setValue(int32Access.asInt(arg)); - kernelArgs.setArgument(paramIdx, int32); - break; - } - case SINT64: - case SLL64: - // no larger primitive type than long -> interpret long as unsigned - case UINT64: - case ULL64: { - UnsafeHelper.Integer64Object int64 = UnsafeHelper.createInteger64Object(); - int64.setValue(int64Access.asLong(arg)); - kernelArgs.setArgument(paramIdx, int64); - break; - } - case UINT8: 
- case CHAR8: { - int uint8 = int16Access.asShort(arg); - if (uint8 < 0 || uint8 > 0xff) { - CompilerDirectives.transferToInterpreter(); - throw createExceptionValueOutOfRange(paramType, uint8); - } - UnsafeHelper.Integer8Object int8 = UnsafeHelper.createInteger8Object(); - int8.setValue((byte) (0xff & uint8)); - kernelArgs.setArgument(paramIdx, int8); - break; - } - case UINT16: - case CHAR16: { - int uint16 = int32Access.asInt(arg); - if (uint16 < 0 || uint16 > 0xffff) { - CompilerDirectives.transferToInterpreter(); - throw createExceptionValueOutOfRange(paramType, uint16); - } - UnsafeHelper.Integer16Object int16 = UnsafeHelper.createInteger16Object(); - int16.setValue((short) (0xffff & uint16)); - kernelArgs.setArgument(paramIdx, int16); - break; - } - case UINT32: { - long uint32 = int64Access.asLong(arg); - if (uint32 < 0 || uint32 > 0xffffffffL) { - CompilerDirectives.transferToInterpreter(); - throw createExceptionValueOutOfRange(paramType, uint32); - } - UnsafeHelper.Integer32Object int32 = UnsafeHelper.createInteger32Object(); - int32 = UnsafeHelper.createInteger32Object(); - int32.setValue((int) (0xffffffffL & uint32)); - kernelArgs.setArgument(paramIdx, int32); - break; - } - case FLOAT: { - UnsafeHelper.Float32Object fp32 = UnsafeHelper.createFloat32Object(); - // going via "double" to allow floats to be initialized with doubles - fp32.setValue((float) doubleAccess.asDouble(arg)); - kernelArgs.setArgument(paramIdx, fp32); - break; - } - case DOUBLE: { - UnsafeHelper.Float64Object fp64 = UnsafeHelper.createFloat64Object(); - fp64.setValue(doubleAccess.asDouble(arg)); - kernelArgs.setArgument(paramIdx, fp64); - break; - } - default: - CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{arg}, - "unsupported by-value parameter type: " + paramType); - } - } - } catch (UnsupportedMessageException e) { - CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{arg}, - 
"expected type " + paramType + " in argument " + arg); - } - } - return kernelArgs; - - } - - private static GrCUDAException createExceptionValueOutOfRange(Type type, long value) { - return new GrCUDAException("value " + value + " is out of range for type " + type); - } - - public long getKernelFunctionHandle() { - if (module.isClosed()) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAException("CUmodule containing kernel " + kernelName + " is already closed"); - } - return nativeKernelFunctionHandle; - } - - @Override - public String toString() { - return "Kernel(" + kernelName + ", " + Arrays.toString(kernelParameters) + ", launchCount=" + launchCount + ")"; - } - - public String getPTX() { - return ptxCode; - } - - public String getKernelName() { - return kernelName; - } - - public String getSymbolName() { - return kernelSymbol; - } - - public int getLaunchCount() { - return launchCount; - } - - // implementation of InteropLibrary - - protected static final String PTX = "ptx"; - protected static final String NAME = "name"; - protected static final String LAUNCH_COUNT = "launchCount"; - static final MemberSet MEMBERS = new MemberSet(PTX, NAME, LAUNCH_COUNT); - - @ExportMessage - @SuppressWarnings("static-method") - boolean hasMembers() { - return true; - } - - @ExportMessage - @SuppressWarnings("static-method") - Object getMembers(@SuppressWarnings("unused") boolean includeInternal) { - return MEMBERS; - } - - @ExportMessage - @SuppressWarnings("static-method") - boolean isMemberReadable(String member) { - return PTX.equals(member) || NAME.equals(member) || LAUNCH_COUNT.equals(member); - } - - @ExportMessage - @SuppressWarnings("unused") - abstract static class ReadMember { - @Specialization(guards = "PTX.equals(member)") - public static String readMemberPtx(Kernel receiver, String member) { - String ptx = receiver.getPTX(); - if (ptx == null) { - return ""; - } else { - return ptx; - } - } - - @Specialization(guards = "NAME.equals(member)") - 
public static String readMemberName(Kernel receiver, String member) { - return receiver.getKernelName(); - } - - @Specialization(guards = "LAUNCH_COUNT.equals(member)") - public static int readMemberLaunchCount(Kernel receiver, String member) { - return receiver.getLaunchCount(); - } - - @Fallback - public static Object readMemberOther(Kernel receiver, String member) throws UnknownIdentifierException { - throw UnknownIdentifierException.create(member); - } - } - - private static int extractNumber(Object valueObj, String argumentName, InteropLibrary access) throws UnsupportedTypeException { - try { - return access.asInt(valueObj); - } catch (UnsupportedMessageException e) { - CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{valueObj}, "integer expected for " + argumentName); - } - } - - private static Dim3 extractDim3(Object valueObj, String argumentName, InteropLibrary access, InteropLibrary elementAccess) throws UnsupportedTypeException { - if (access.hasArrayElements(valueObj)) { - long size; - try { - size = access.getArraySize(valueObj); - } catch (UnsupportedMessageException e) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAInternalException("unexpected behavior"); - } - if (size < 1 || size > 3) { - CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{valueObj}, argumentName + " needs to have between 1 and 3 elements"); - } - int[] dim3 = new int[]{1, 1, 1}; - final char[] suffix = {'x', 'y', 'z'}; - for (int i = 0; i < size; i++) { - Object elementObj; - try { - elementObj = access.readArrayElement(valueObj, i); - } catch (UnsupportedMessageException e) { - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAInternalException("unexpected behavior"); - } catch (InvalidArrayIndexException e) { - CompilerDirectives.transferToInterpreter(); - throw UnsupportedTypeException.create(new Object[]{valueObj}, argumentName + " needs to have between 1 
and 3 elements"); - } - dim3[i] = extractNumber(elementObj, "dim3." + suffix[i], elementAccess); - } - return new Dim3(dim3[0], dim3[1], dim3[2]); - } - return new Dim3(extractNumber(valueObj, argumentName, access)); - } - - @ExportMessage - @SuppressWarnings("static-method") - boolean isExecutable() { - return true; - } - - @ExportMessage - Object execute(Object[] arguments, - @CachedLibrary(limit = "3") InteropLibrary gridSizeAccess, - @CachedLibrary(limit = "3") InteropLibrary gridSizeElementAccess, - @CachedLibrary(limit = "3") InteropLibrary blockSizeAccess, - @CachedLibrary(limit = "3") InteropLibrary blockSizeElementAccess, - @CachedLibrary(limit = "3") InteropLibrary sharedMemoryAccess) throws UnsupportedTypeException, ArityException { - int dynamicSharedMemoryBytes; - if (arguments.length == 2) { - dynamicSharedMemoryBytes = 0; - } else if (arguments.length == 3) { - // dynamic shared memory specified - dynamicSharedMemoryBytes = extractNumber(arguments[2], "dynamicSharedMemory", sharedMemoryAccess); - } else { - CompilerDirectives.transferToInterpreter(); - throw ArityException.create(2, arguments.length); - } - - Dim3 gridSize = extractDim3(arguments[0], "gridSize", gridSizeAccess, gridSizeElementAccess); - Dim3 blockSize = extractDim3(arguments[1], "blockSize", blockSizeAccess, blockSizeElementAccess); - KernelConfig config = new KernelConfig(gridSize, blockSize, dynamicSharedMemoryBytes); - - return new ConfiguredKernel(this, config); - } -} - -final class KernelArguments implements Closeable { - - private final UnsafeHelper.PointerArray argumentArray; - private final ArrayList argumentValues = new ArrayList<>(); - - KernelArguments(int numArgs) { - this.argumentArray = UnsafeHelper.createPointerArray(numArgs); - } - - public void setArgument(int argIdx, MemoryObject obj) { - argumentArray.setValueAt(argIdx, obj.getAddress()); - argumentValues.add(obj); - } - - long getPointer() { - return argumentArray.getAddress(); - } - - @Override - public void 
close() { - this.argumentArray.close(); - for (Closeable c : argumentValues) { - try { - c.close(); - } catch (IOException e) { - /* ignored */ - } - } - } -} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ArithmeticNode.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ArithmeticNode.java index 3986f6c6..7fde8576 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ArithmeticNode.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ArithmeticNode.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ArrayNode.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ArrayNode.java
index d1fde58e..1f2ce73f 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ArrayNode.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ArrayNode.java
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
  * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -13,6 +14,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -30,13 +37,14 @@
 import java.util.ArrayList;
 
-import com.nvidia.grcuda.DeviceArray;
 import com.nvidia.grcuda.Type;
 import com.nvidia.grcuda.GrCUDAContext;
 import com.nvidia.grcuda.GrCUDAInternalException;
 import com.nvidia.grcuda.GrCUDALanguage;
-import com.nvidia.grcuda.MultiDimDeviceArray;
-import com.nvidia.grcuda.gpu.CUDARuntime;
+import com.nvidia.grcuda.runtime.array.AbstractArray;
+import com.nvidia.grcuda.runtime.array.DeviceArray;
+import com.nvidia.grcuda.runtime.array.MultiDimDeviceArray;
+import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext;
 import com.oracle.truffle.api.CompilerDirectives;
 import com.oracle.truffle.api.dsl.CachedContext;
 import com.oracle.truffle.api.dsl.Specialization;
@@ -55,9 +63,8 @@ public abstract class ArrayNode extends ExpressionNode {
     }
 
     @Specialization
-    Object doDefault(VirtualFrame frame,
-                    @CachedContext(GrCUDALanguage.class) GrCUDAContext context) {
-        final CUDARuntime runtime = context.getCUDARuntime();
+    AbstractArray doDefault(VirtualFrame frame, @CachedContext(GrCUDALanguage.class) GrCUDAContext context) {
+        final AbstractGrCUDAExecutionContext grCUDAExecutionContext = context.getGrCUDAExecutionContext();
         long[] elementsPerDim = new long[sizeNodes.length];
         int dim = 0;
         for (ExpressionNode sizeNode : sizeNodes) {
@@ -70,10 +77,10 @@ Object doDefault(VirtualFrame frame,
             dim += 1;
         }
         if (sizeNodes.length == 1) {
-            return new DeviceArray(runtime, elementsPerDim[0], elementType);
+            return new DeviceArray(grCUDAExecutionContext, elementsPerDim[0], elementType);
         } else {
             final boolean columnMajorOrder = false;
-            return new MultiDimDeviceArray(runtime, elementType, elementsPerDim, columnMajorOrder);
+            return new MultiDimDeviceArray(grCUDAExecutionContext, elementType, elementsPerDim, columnMajorOrder);
         }
     }
 }
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/BinaryNode.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/BinaryNode.java
index 5a32f425..b624fd9d 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/BinaryNode.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/BinaryNode.java
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -12,6 +13,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -25,7 +32,6 @@
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
-
 package com.nvidia.grcuda.nodes;
 
 public abstract class BinaryNode extends ExpressionNode {
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/CallNode.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/CallNode.java
index f452be57..3c306e5a 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/CallNode.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/CallNode.java
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
  * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -13,6 +14,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -61,7 +68,7 @@ Object doDefault(VirtualFrame frame,
         String[] functionName = identifier.getIdentifierName();
         Namespace namespace = context.getRootNamespace();
         Optional maybeFunction = namespace.lookup(functionName);
-        if (!maybeFunction.isPresent() || !(maybeFunction.get() instanceof Function)) {
+        if (maybeFunction.isEmpty() || !(maybeFunction.get() instanceof Function)) {
             CompilerDirectives.transferToInterpreter();
             throw new GrCUDAException("function '" + GrCUDAException.format(functionName) + "' not found", this);
         }
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ExpressionNode.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ExpressionNode.java
index b5ff75b9..0c44600a 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ExpressionNode.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/ExpressionNode.java
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -12,6 +13,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/GrCUDARootNode.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/GrCUDARootNode.java
index 2b4816dc..01a8795b 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/GrCUDARootNode.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/GrCUDARootNode.java
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -12,6 +13,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/IdentifierNode.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/IdentifierNode.java
index 4a5f397c..36187f2e 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/IdentifierNode.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/IdentifierNode.java
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
  * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -13,6 +14,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/IntegerLiteral.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/IntegerLiteral.java
index a57399c0..b22f2b1e 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/IntegerLiteral.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/IntegerLiteral.java
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -12,6 +13,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/RootNamespaceNode.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/RootNamespaceNode.java
index fa034924..8b1b0c57 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/RootNamespaceNode.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/RootNamespaceNode.java
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
  * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -13,6 +14,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/StringLiteral.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/StringLiteral.java
index 993f4f03..56ef1f14 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/StringLiteral.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/nodes/StringLiteral.java
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -12,6 +13,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/GrCUDAParserException.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/GrCUDAParserException.java
index fdb4ef8b..78d835e5 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/GrCUDAParserException.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/GrCUDAParserException.java
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -12,6 +13,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -27,12 +34,17 @@
  */
 package com.nvidia.grcuda.parser;
 
-import com.oracle.truffle.api.TruffleException;
-import com.oracle.truffle.api.nodes.Node;
+import com.oracle.truffle.api.exception.AbstractTruffleException;
+import com.oracle.truffle.api.interop.ExceptionType;
+import com.oracle.truffle.api.interop.InteropLibrary;
+import com.oracle.truffle.api.interop.UnsupportedMessageException;
+import com.oracle.truffle.api.library.ExportLibrary;
+import com.oracle.truffle.api.library.ExportMessage;
 import com.oracle.truffle.api.source.Source;
 import com.oracle.truffle.api.source.SourceSection;
 
-public class GrCUDAParserException extends RuntimeException implements TruffleException {
+@ExportLibrary(InteropLibrary.class)
+public class GrCUDAParserException extends AbstractTruffleException {
 
     private static final long serialVersionUID = -6653370806148433373L;
     private final Source source;
@@ -48,19 +60,21 @@ public GrCUDAParserException(String message, Source source, int line, int charPo
         this.length = length;
     }
 
-    @Override
-    public SourceSection getSourceLocation() {
-        return source.createSection(line, column, length);
+    @ExportMessage
+    ExceptionType getExceptionType() {
+        return ExceptionType.PARSE_ERROR;
     }
 
-    @Override
-    public Node getLocation() {
-        return null;
+    @ExportMessage
+    boolean hasSourceLocation() {
+        return source != null;
     }
 
-    @Override
-    public boolean isSyntaxError() {
-        return true;
+    @ExportMessage(name = "getSourceLocation")
+    SourceSection getSourceSection() throws UnsupportedMessageException {
+        if (source == null) {
+            throw UnsupportedMessageException.create();
+        }
+        return source.createSection(line, column, length);
     }
-
 }
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/NIDLParserException.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/NIDLParserException.java
index 512d538c..f56991d2 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/NIDLParserException.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/NIDLParserException.java
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -12,6 +13,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -27,10 +34,14 @@
  */
 package com.nvidia.grcuda.parser;
 
-import com.oracle.truffle.api.TruffleException;
-import com.oracle.truffle.api.nodes.Node;
+import com.oracle.truffle.api.exception.AbstractTruffleException;
+import com.oracle.truffle.api.interop.ExceptionType;
+import com.oracle.truffle.api.interop.InteropLibrary;
+import com.oracle.truffle.api.library.ExportLibrary;
+import com.oracle.truffle.api.library.ExportMessage;
 
-public class NIDLParserException extends RuntimeException implements TruffleException {
+@ExportLibrary(InteropLibrary.class)
+public class NIDLParserException extends AbstractTruffleException {
 
     private static final long serialVersionUID = -7520277230665801341L;
     private final String message;
@@ -46,19 +57,13 @@ public NIDLParserException(String message, String filename, int line, int charPo
         this.column = charPositionInLine;
     }
 
-    @Override
-    public String getMessage() {
-        return "NIDL parse error: [" + filename + " " + line + ":" + column + "] " + message;
-    }
-
-    @Override
-    public Node getLocation() {
-        // null = location not available
-        return null;
+    @ExportMessage
+    ExceptionType getExceptionType() {
+        return ExceptionType.PARSE_ERROR;
     }
 
     @Override
-    public boolean isSyntaxError() {
-        return true;
+    public String getMessage() {
+        return "NIDL parse error: [" + filename + " " + line + ":" + column + "] " + message;
     }
 }
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/NodeFactory.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/NodeFactory.java
index a79cbe7d..6d2eacba 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/NodeFactory.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/NodeFactory.java
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
  * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -13,6 +14,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/ParserAntlr.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/ParserAntlr.java
index 08594c29..aeb11fab 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/ParserAntlr.java
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/ParserAntlr.java
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
  * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -13,6 +14,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDA.g4 b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDA.g4
index c1c00e91..add3aec2 100644
--- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDA.g4
+++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDA.g4
@@ -1,5 +1,7 @@
 /*
  * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -12,6 +14,12 @@
  * * Neither the name of NVIDIA CORPORATION nor the names of its
  * contributors may be used to endorse or promote products derived
  * from this software without specific prior written permission.
+ * * Neither the name of NECSTLab nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.g4 b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.g4 index b4d4d9ca..f7dafa55 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.g4 +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.g4 @@ -1,5 +1,6 @@ /* * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -37,7 +44,7 @@ import java.util.Optional; import com.nvidia.grcuda.Binding; import com.nvidia.grcuda.FunctionBinding; import com.nvidia.grcuda.KernelBinding; -import com.nvidia.grcuda.Parameter; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; import com.nvidia.grcuda.Type; import com.nvidia.grcuda.TypeException; import com.nvidia.grcuda.parser.NIDLParserException; @@ -136,7 +143,7 @@ kernels returns [ArrayList result] ; kernel returns [Binding result] - : n=Identifier '(' pl=parameterList ')' + : n=Identifier '(' pl=computationArgumentList ')' { $result = KernelBinding.newCxxBinding($n.getText(), $pl.result); } ; @@ -147,7 +154,7 @@ ckernels returns [ArrayList result] ; ckernel returns [Binding result] - : n=Identifier '(' pl=parameterList ')' + : n=Identifier '(' pl=computationArgumentList ')' { $result = KernelBinding.newCBinding($n.getText(), $pl.result); } ; @@ -162,7 +169,7 @@ hostFunctions returns [ArrayList result] ; hostFunction returns [Binding result] - : n=Identifier '(' pl=parameterList ')' ':' rt=Identifier + : n=Identifier '(' pl=computationArgumentList ')' ':' rt=Identifier { try { $result = FunctionBinding.newCxxBinding($n.getText(), $pl.result, Type.fromNIDLTypeString($rt.getText())); @@ -179,7 +186,7 @@ chostFunctions returns [ArrayList result] ; chostFunction returns [Binding result] - : n=Identifier '(' pl=parameterList ')' ':' rt=Identifier + : n=Identifier '(' pl=computationArgumentList ')' ':' rt=Identifier { try { $result = FunctionBinding.newCBinding($n.getText(), $pl.result, Type.fromNIDLTypeString($rt.getText())); @@ -189,8 +196,8 @@ chostFunction returns [Binding result] } ; -parameterList returns [ArrayList result] - : pl=parameterList ',' paramExpr +computationArgumentList returns [ArrayList result] + : pl=computationArgumentList ',' paramExpr { $result = $pl.result; if 
($paramExpr.result != null) { // avoids NPE during parser error @@ -199,34 +206,34 @@ parameterList returns [ArrayList result] } } | paramExpr - { $result = new ArrayList(); + { $result = new ArrayList(); $result.add($paramExpr.result); } | - { $result = new ArrayList(); } + { $result = new ArrayList(); } ; -paramExpr returns [Parameter result] +paramExpr returns [ComputationArgument result] : n=Identifier ':' t=Identifier { try { - $result = Parameter.createByValueParameter($n.getText(), Type.fromNIDLTypeString($t.getText())); + $result = ComputationArgument.createByValueComputationArgument($n.getText(), Type.fromNIDLTypeString($t.getText())); } catch(TypeException e) { throw new NIDLParserException(e.getMessage(), filename, $t.getLine(), $t.getCharPositionInLine()); } } | n=Identifier ':' direction 'pointer' t=Identifier { try { - $result = Parameter.createPointerParameter($n.getText(), Type.fromNIDLTypeString($t.getText()), $direction.result); + $result = ComputationArgument.createPointerComputationArgument($n.getText(), Type.fromNIDLTypeString($t.getText()), $direction.result); } catch(TypeException e) { throw new NIDLParserException(e.getMessage(), filename, $t.getLine(), $t.getCharPositionInLine()); } } ; -direction returns [Parameter.Kind result] - : 'in' { $result = Parameter.Kind.POINTER_IN; } - | 'out' { $result = Parameter.Kind.POINTER_OUT; } - | 'inout' { $result = Parameter.Kind.POINTER_INOUT; } +direction returns [ComputationArgument.Kind result] + : 'in' { $result = ComputationArgument.Kind.POINTER_IN; } + | 'out' { $result = ComputationArgument.Kind.POINTER_OUT; } + | 'inout' { $result = ComputationArgument.Kind.POINTER_INOUT; } ; namespaceIdentifier returns [ArrayList result] diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/AbstractDevice.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/AbstractDevice.java new file mode 100644 index 00000000..a9f6ba23 --- /dev/null +++ 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/AbstractDevice.java @@ -0,0 +1,36 @@ +package com.nvidia.grcuda.runtime; + +import java.util.Objects; + +/** + * Abstract device representation, used to distinguish between CPU and GPU devices inside the GrCUDA scheduler. + */ +public abstract class AbstractDevice { + protected final int deviceId; + + public AbstractDevice(int deviceId) { + this.deviceId = deviceId; + } + + public int getDeviceId() { + return deviceId; + } + + @Override + public String toString() { + return "Device(id=" + deviceId + ")"; + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + AbstractDevice that = (AbstractDevice) o; + return deviceId == that.deviceId; + } + + @Override + public int hashCode() { + return Objects.hash(deviceId); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/CPUDevice.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/CPUDevice.java new file mode 100644 index 00000000..ba8b88d4 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/CPUDevice.java @@ -0,0 +1,22 @@ +package com.nvidia.grcuda.runtime; + +public class CPUDevice extends AbstractDevice { + public static final int CPU_DEVICE_ID = -1; + + public CPUDevice() { + super(CPU_DEVICE_ID); + } + + @Override + public String toString() { + return "CPU(id=" + deviceId + ")"; + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + CPUDevice that = (CPUDevice) o; + return deviceId == that.deviceId; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/CUDARuntime.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/CUDARuntime.java new file mode 100644 index 00000000..732159cc --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/CUDARuntime.java @@ -0,0 +1,1979 
@@ +/* + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime; + +import com.nvidia.grcuda.CUDAEvent; +import static com.nvidia.grcuda.functions.Function.checkArgumentLength; +import static com.nvidia.grcuda.functions.Function.expectInt; +import static com.nvidia.grcuda.functions.Function.expectLong; +import static com.nvidia.grcuda.functions.Function.expectPositiveLong; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashMap; +import java.util.List; + +import com.nvidia.grcuda.GrCUDALogger; +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.oracle.truffle.api.TruffleLogger; +import org.graalvm.collections.Pair; + +import com.nvidia.grcuda.Binding; +import com.nvidia.grcuda.FunctionBinding; +import com.nvidia.grcuda.GPUPointer; +import com.nvidia.grcuda.GrCUDAContext; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.Namespace; +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.functions.CUDAFunction; +import com.nvidia.grcuda.runtime.UnsafeHelper.Integer32Object; +import com.nvidia.grcuda.runtime.UnsafeHelper.Integer64Object; +import com.nvidia.grcuda.runtime.computation.streamattach.StreamAttachArchitecturePolicy; +import com.nvidia.grcuda.runtime.computation.streamattach.PostPascalStreamAttachPolicy; +import com.nvidia.grcuda.runtime.computation.streamattach.PrePascalStreamAttachPolicy; +import 
com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.stream.DefaultStream; +import com.oracle.truffle.api.CompilerAsserts; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; +import com.oracle.truffle.api.TruffleLanguage.Env; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.InteropException; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnknownIdentifierException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.source.Source; + +public final class CUDARuntime { + + public static final String CUDA_RUNTIME_LIBRARY_NAME = "cudart"; + public static final String CUDA_LIBRARY_NAME = "cuda"; + static final String NVRTC_LIBRARY_NAME = "nvrtc"; + + public static final int DEFAULT_DEVICE = 0; + + private final GrCUDAContext context; + private final NVRuntimeCompiler nvrtc; + + private final List<GPUPointer> innerCudaContexts = new ArrayList<>(); + + /** + * Total number of GPUs available in the system, even if they are not used. It must be > 0; + */ + private int numberOfAvailableGPUs = GrCUDAOptionMap.DEFAULT_NUMBER_OF_GPUs; + /** + * How many GPUs are actually used by GrCUDA.
It must hold 1 <= numberOfGPUsToUse <= numberOfAvailableGPUs; + */ + private int numberOfGPUsToUse = GrCUDAOptionMap.DEFAULT_NUMBER_OF_GPUs; + + public int getNumberOfAvailableGPUs() { + return numberOfAvailableGPUs; + } + + public int getNumberOfGPUsToUse() { + return numberOfGPUsToUse; + } + + /** + * Identifier of the GPU that is currently active; + */ + private int currentGPU = DEFAULT_DEVICE; + + public boolean isMultiGPUEnabled() { + return this.numberOfGPUsToUse > 1; + } + + public static final TruffleLogger RUNTIME_LOGGER = GrCUDALogger.getLogger(GrCUDALogger.RUNTIME_LOGGER); + + /** + * Users can manually create streams that are not managed directly by a {@link com.nvidia.grcuda.runtime.stream.GrCUDAStreamManager}. + * We keep track of how many of these streams have been created; + */ + private int numUserAllocatedStreams = 0; + + public void incrementNumStreams() { + numUserAllocatedStreams++; + } + + public int getNumStreams() { + return numUserAllocatedStreams; + } + + /** + * CUDA events are used to synchronize stream computations, and guarantee that a computation + * starts only when all computations it depends on are completed. Keep track of the + * number of events created; + */ + private long numEvents = 0; + + public void incrementNumEvents() { + numEvents++; + } + + public long getNumEvents() { + return numEvents; + } + + /** + * Map from library-path to NFI library. + */ + private final HashMap<String, TruffleObject> loadedLibraries = new HashMap<>(); + + /** + * Store one map between loaded functions and CUModules for every device; + */ + private final List<HashMap<String, CUModule>> loadedModules = new ArrayList<>(); + + /** + * Map of (library-path, symbol-name) to callable.
+ */ + private final HashMap<Pair<String, String>, Object> boundFunctions = new HashMap<>(); + + /** + * Depending on the available GPU, use a different policy to associate managed memory arrays to streams, + * as specified in {@link StreamAttachArchitecturePolicy} + */ + private final StreamAttachArchitecturePolicy streamAttachArchitecturePolicy; + + /** + * True if the GPU architecture is Pascal or newer; + */ + private final boolean architectureIsPascalOrNewer; + + /** + * Interface used to load and build GPU kernels, optimized for single or multi-GPU systems; + */ + private final KernelManagementInterface kernelManagement; + + public CUDARuntime(GrCUDAContext context, Env env) { + this.context = context; + try { + TruffleObject libcudart = (TruffleObject) env.parseInternal( + Source.newBuilder("nfi", "load " + "lib" + CUDA_RUNTIME_LIBRARY_NAME + ".so", "cudaruntime").build()).call(); + TruffleObject libcuda = (TruffleObject) env.parseInternal( + Source.newBuilder("nfi", "load " + "lib" + CUDA_LIBRARY_NAME + ".so", "cuda").build()).call(); + TruffleObject libnvrtc = (TruffleObject) env.parseInternal( + Source.newBuilder("nfi", "load " + "lib" + NVRTC_LIBRARY_NAME + ".so", "nvrtc").build()).call(); + this.loadedLibraries.put(CUDA_RUNTIME_LIBRARY_NAME, libcudart); + this.loadedLibraries.put(CUDA_LIBRARY_NAME, libcuda); + this.loadedLibraries.put(NVRTC_LIBRARY_NAME, libnvrtc); + + // Initialize support for multiple GPUs in GrCUDA; + setupSupportForMultiGPU(); + // Setup the right interface for loading and building kernels; + if (isMultiGPUEnabled()) { + this.kernelManagement = new KernelManagementMultiGPU(); + } else { + this.kernelManagement = new KernelManagementSingleGPU(); + } + } catch (UnsatisfiedLinkError e) { + throw new GrCUDAException(e.getMessage()); + } + + this.nvrtc = new NVRuntimeCompiler(this); + context.addDisposable(this::shutdown); + + // Check if the GPU available in the system has Compute Capability >= 6.0 (Pascal architecture); + this.architectureIsPascalOrNewer =
cudaDeviceGetAttribute(CUDADeviceAttribute.COMPUTE_CAPABILITY_MAJOR, 0) >= 6; + + // Use pre-Pascal stream attachment policy if the CC is < 6 or if the attachment is forced by options; + this.streamAttachArchitecturePolicy = (!this.architectureIsPascalOrNewer || context.getOptions().isForceStreamAttach()) ? new PrePascalStreamAttachPolicy() : new PostPascalStreamAttachPolicy(); + } + + /** + * Initialize support for multiple GPUs. Validate that the selected number of GPUs is coherent (1 <= numberOfGPUsToUse <= numberOfAvailableGPUs), + * then initialize the map that stores CUModules on every device used by GrCUDA; + */ + private void setupSupportForMultiGPU() { + // Find how many GPUs are available on this system; + this.numberOfAvailableGPUs = cudaGetDeviceCount(); + RUNTIME_LOGGER.fine(() -> "identified " + numberOfAvailableGPUs + " GPUs available on this machine"); + this.numberOfGPUsToUse = numberOfAvailableGPUs; + if (numberOfAvailableGPUs <= 0) { + RUNTIME_LOGGER.severe(() -> "GrCUDA initialization failed, no GPU device is available (devices count = " + numberOfAvailableGPUs + ")"); + throw new GrCUDAException("GrCUDA initialization failed, no GPU device is available"); + } + // Validate and update the number of GPUs used in the context; + int numberOfSelectedGPUs = context.getOptions().getNumberOfGPUs(); + if (numberOfSelectedGPUs <= 0) { + RUNTIME_LOGGER.warning(() -> "non-positive number of GPUs selected (" + numberOfSelectedGPUs + "), defaulting to 1"); + numberOfGPUsToUse = 1; + context.getOptions().setNumberOfGPUs(numberOfGPUsToUse); // Update the option value; + } else if (numberOfSelectedGPUs > numberOfAvailableGPUs) { + RUNTIME_LOGGER.warning(() -> "the number of GPUs selected is greater than what's available (selected=" + numberOfSelectedGPUs + ", available=" + numberOfAvailableGPUs + "), using all the available GPUs (" + numberOfAvailableGPUs + ")"); + numberOfGPUsToUse = numberOfAvailableGPUs; +
context.getOptions().setNumberOfGPUs(numberOfGPUsToUse); // Update the option value; + } else { + // Select how many GPUs to use; + numberOfGPUsToUse = numberOfSelectedGPUs; + } + for (int i = 0; i < this.numberOfGPUsToUse; i++) { + this.loadedModules.add(new HashMap<String, CUModule>()); + } + RUNTIME_LOGGER.info(() -> "initialized GrCUDA to use " + this.numberOfGPUsToUse + "/" + numberOfAvailableGPUs + " GPUs"); + } + + // using this slow/uncached instance since all calls are non-critical + private static final InteropLibrary INTEROP = InteropLibrary.getFactory().getUncached(); + + public GrCUDAContext getContext() { + return context; + } + + public boolean isArchitectureIsPascalOrNewer() { + return architectureIsPascalOrNewer; + } + + public int getCurrentGPU() { + return currentGPU; + } + + private void setCurrentGPU(int currentGPU) { + this.currentGPU = currentGPU; + } + + interface CallSupport { + String getName(); + + Object getSymbol(CUDARuntime runtime) throws UnknownIdentifierException; + + default void callSymbol(CUDARuntime runtime, Object...
arguments) throws UnsupportedTypeException, ArityException, UnsupportedMessageException, UnknownIdentifierException { + CompilerAsserts.neverPartOfCompilation(); + Object result = INTEROP.execute(getSymbol(runtime), arguments); + runtime.checkCUDAReturnCode(result, getName()); + } + } + + /************************************************************** + ************************************************************** + * Implementation of CUDA runtime API available within GrCUDA * + * (not exposed to the host language); * + ************************************************************** + **************************************************************/ + + @TruffleBoundary + public GPUPointer cudaMalloc(long numBytes) { + try (UnsafeHelper.PointerObject outPointer = UnsafeHelper.createPointerObject()) { + Object callable = CUDARuntimeFunction.CUDA_MALLOC.getSymbol(this); + Object result = INTEROP.execute(callable, outPointer.getAddress(), numBytes); + checkCUDAReturnCode(result, "cudaMalloc"); + long addressAllocatedMemory = outPointer.getValueOfPointer(); + return new GPUPointer(addressAllocatedMemory); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public LittleEndianNativeArrayView cudaMallocManaged(long numBytes) { + final int cudaMemAttachGlobal = 0x01; + try (UnsafeHelper.PointerObject outPointer = UnsafeHelper.createPointerObject()) { + Object callable = CUDARuntimeFunction.CUDA_MALLOCMANAGED.getSymbol(this); + Object result = INTEROP.execute(callable, outPointer.getAddress(), numBytes, cudaMemAttachGlobal); + checkCUDAReturnCode(result, "cudaMallocManaged"); + long addressAllocatedMemory = outPointer.getValueOfPointer(); + return new LittleEndianNativeArrayView(addressAllocatedMemory, numBytes); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaFree(LittleEndianNativeArrayView memory) { + try { + Object callable = 
CUDARuntimeFunction.CUDA_FREE.getSymbol(this); + Object result = INTEROP.execute(callable, memory.getStartAddress()); + checkCUDAReturnCode(result, "cudaFree"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaFree(GPUPointer pointer) { + try { + Object callable = CUDARuntimeFunction.CUDA_FREE.getSymbol(this); + Object result = INTEROP.execute(callable, pointer.getRawPointer()); + checkCUDAReturnCode(result, "cudaFree"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaDeviceSynchronize() { + try { + Object callable = CUDARuntimeFunction.CUDA_DEVICESYNCHRONIZE.getSymbol(this); + Object result = INTEROP.execute(callable); + checkCUDAReturnCode(result, "cudaDeviceSynchronize"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaMemcpy(long destPointer, long fromPointer, long numBytesToCopy) { + try { + Object callable = CUDARuntimeFunction.CUDA_MEMCPY.getSymbol(this); + if (numBytesToCopy < 0) { + throw new IllegalArgumentException("requested negative number of bytes to copy " + numBytesToCopy); + } + // cudaMemcpyKind from driver_types.h (default: direction of transfer is inferred + // from the pointer values, uses virtual addressing) + final long cudaMemcpyDefault = 4; + Object result = INTEROP.execute(callable, destPointer, fromPointer, numBytesToCopy, cudaMemcpyDefault); + checkCUDAReturnCode(result, "cudaMemcpy"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaMemcpy(long destPointer, long fromPointer, long numBytesToCopy, CUDAStream stream) { + try { + Object callable = CUDARuntimeFunction.CUDA_MEMCPYASYNC.getSymbol(this); + if (numBytesToCopy < 0) { + throw new IllegalArgumentException("requested negative number of bytes to copy " + numBytesToCopy); + } + // cudaMemcpyKind from driver_types.h (default: direction 
of transfer is inferred + // from the pointer values, uses virtual addressing) + final long cudaMemcpyDefault = 4; + Object result = INTEROP.execute(callable, destPointer, fromPointer, numBytesToCopy, cudaMemcpyDefault, stream.getRawPointer()); + cudaStreamSynchronize(stream); + checkCUDAReturnCode(result, "cudaMemcpyAsync"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public DeviceMemoryInfo cudaMemGetInfo() { + final String symbol = "cudaMemGetInfo"; + final String nfiSignature = "(pointer, pointer): sint32"; + try (Integer64Object freeBytes = UnsafeHelper.createInteger64Object(); + Integer64Object totalBytes = UnsafeHelper.createInteger64Object()) { + Object callable = getSymbol(CUDA_RUNTIME_LIBRARY_NAME, symbol, nfiSignature); + Object result = INTEROP.execute(callable, freeBytes.getAddress(), totalBytes.getAddress()); + checkCUDAReturnCode(result, symbol); + return new DeviceMemoryInfo(freeBytes.getValue(), totalBytes.getValue()); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaDeviceReset() { + try { + Object callable = CUDARuntimeFunction.CUDA_DEVICERESET.getSymbol(this); + Object result = INTEROP.execute(callable); + checkCUDAReturnCode(result, "cudaDeviceReset"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public int cudaGetDeviceCount() { + try (UnsafeHelper.Integer32Object deviceCount = UnsafeHelper.createInteger32Object()) { + Object callable = CUDARuntimeFunction.CUDA_GETDEVICECOUNT.getSymbol(this); + Object result = INTEROP.execute(callable, deviceCount.getAddress()); + checkCUDAReturnCode(result, "cudaGetDeviceCount"); + return deviceCount.getValue(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaSetDevice(int device) { + try { + if (device < 0 || device >= this.numberOfGPUsToUse) { + throw new GrCUDAException("the
selected GPU is not valid (" + device + "), it should be 0 <= x < " + this.numberOfGPUsToUse); + } + Object callable = CUDARuntimeFunction.CUDA_SETDEVICE.getSymbol(this); + Object result = INTEROP.execute(callable, device); + RUNTIME_LOGGER.finest(() -> "selected current GPU = " + device); + checkCUDAReturnCode(result, "cudaSetDevice"); + setCurrentGPU(device); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private int cudaGetDevice() { + try (Integer32Object deviceId = UnsafeHelper.createInteger32Object()) { + Object callable = CUDARuntimeFunction.CUDA_GETDEVICE.getSymbol(this); + Object result = INTEROP.execute(callable, deviceId.getAddress()); + checkCUDAReturnCode(result, "cudaGetDevice"); + return deviceId.getValue(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public int cudaDeviceGetAttribute(CUDADeviceAttribute attribute, int deviceId) { + try (Integer32Object value = UnsafeHelper.createInteger32Object()) { + Object callable = CUDARuntimeFunction.CUDA_DEVICEGETATTRIBUTE.getSymbol(this); + Object result = INTEROP.execute(callable, value.getAddress(), attribute.getAttributeCode(), deviceId); + checkCUDAReturnCode(result, "cudaDeviceGetAttribute"); + return value.getValue(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public Object getDeviceName(int deviceOrdinal) { + return cuDeviceGetName(cuDeviceGet(deviceOrdinal)); + } + + @TruffleBoundary + public String cudaGetErrorString(int errorCode) { + try { + Object callable = CUDARuntimeFunction.CUDA_GETERRORSTRING.getSymbol(this); + Object result = INTEROP.execute(callable, errorCode); + return INTEROP.asString(result); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public CUDAStream cudaStreamCreate(int streamId) { + try (UnsafeHelper.PointerObject streamPointer = UnsafeHelper.createPointerObject()) { + Object 
callable = CUDARuntimeFunction.CUDA_STREAMCREATE.getSymbol(this); + Object result = INTEROP.execute(callable, streamPointer.getAddress()); + checkCUDAReturnCode(result, "cudaStreamCreate"); + return new CUDAStream(streamPointer.getValueOfPointer(), streamId, getCurrentGPU()); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaStreamSynchronize(CUDAStream stream) { + try { + Object callable = CUDARuntimeFunction.CUDA_STREAMSYNCHRONIZE.getSymbol(this); + Object result = INTEROP.execute(callable, stream.getRawPointer()); + checkCUDAReturnCode(result, "cudaStreamSynchronize"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cudaStreamDestroy(CUDAStream stream) { + try { + Object callable = CUDARuntimeFunction.CUDA_STREAMDESTROY.getSymbol(this); + Object result = INTEROP.execute(callable, stream.getRawPointer()); + checkCUDAReturnCode(result, "cudaStreamDestroy"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Limit the visibility of a managed memory array to the specified stream; + * + * @param stream the stream to which we attach the array + * @param array an array that should be assigned exclusively to a stream + */ + @TruffleBoundary + public void cudaStreamAttachMemAsync(CUDAStream stream, AbstractArray array) { + + final int MEM_ATTACH_SINGLE = 0x04; + final int MEM_ATTACH_GLOBAL = 0x01; + try { + Object callable = CUDARuntimeFunction.CUDA_STREAMATTACHMEMASYNC.getSymbol(this); + int flag = stream.isDefaultStream() ? 
MEM_ATTACH_GLOBAL : MEM_ATTACH_SINGLE; + RUNTIME_LOGGER.finest(() -> "\t* attach array=" + System.identityHashCode(array) + " to " + stream + "; flag=" + flag); + + // Book-keeping of the stream attachment within the array; + array.setStreamMapping(stream); + // FIXME: might be required for multi-GPU; +// array.setArrayLocation(stream.getStreamDeviceId()); +// array.addArrayLocation(stream.getStreamDeviceId()); + + Object result = INTEROP.execute(callable, stream.getRawPointer(), array.getFullArrayPointer(), array.getFullArraySizeBytes(), flag); + checkCUDAReturnCode(result, "cudaStreamAttachMemAsync"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Synchronous version of "cudaStreamAttachMemAsync". This function doesn't exist in the CUDA + * API, but it is useful to have; + * + * @param stream the stream to which we attach the array + * @param array an array that should be assigned exclusively to a stream + */ + @TruffleBoundary + public void cudaStreamAttachMem(CUDAStream stream, AbstractArray array) { + cudaStreamAttachMemAsync(stream, array); + cudaStreamSynchronize(stream); + } + + @TruffleBoundary + public void cudaMemPrefetchAsync(AbstractArray array, CUDAStream stream) { + try { + Object callable = CUDARuntimeFunction.CUDA_MEMPREFETCHASYNC.getSymbol(this); + Object result = INTEROP.execute(callable, array.getFullArrayPointer(), array.getFullArraySizeBytes(), stream.getStreamDeviceId(), stream.getRawPointer()); + checkCUDAReturnCode(result, "cudaMemPrefetchAsync"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public List getInnerCudaContexts() { + if (this.innerCudaContexts.size() == 0) { + assertCUDAInitialized(); + } + return this.innerCudaContexts; + } + + @TruffleBoundary + public GPUPointer initializeInnerCudaContext(int deviceId) { + int CU_CTX_SCHED_YIELD = 0x02; // Optimal multi-threaded host flag; + return new GPUPointer(cuDevicePrimaryCtxRetain(deviceId)); + } 
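The visibility flag chosen by `cudaStreamAttachMemAsync` above can be sketched in isolation: the default stream makes a managed array visible to every stream (`cudaMemAttachGlobal` = 0x01), while a non-default stream claims exclusive visibility (`cudaMemAttachSingle` = 0x04). The class and method names below are made up for illustration; only the two flag values come from the CUDA runtime.

```java
// Standalone sketch of the attach-flag selection; not part of GrCUDA itself.
public class AttachFlagSketch {
    static final int MEM_ATTACH_GLOBAL = 0x01; // cudaMemAttachGlobal: visible to all streams
    static final int MEM_ATTACH_SINGLE = 0x04; // cudaMemAttachSingle: visible only to one stream

    // Same decision as the runtime wrapper: global visibility on the default stream,
    // exclusive visibility otherwise.
    static int attachFlag(boolean isDefaultStream) {
        return isDefaultStream ? MEM_ATTACH_GLOBAL : MEM_ATTACH_SINGLE;
    }

    public static void main(String[] args) {
        System.out.println(attachFlag(true));  // 1
        System.out.println(attachFlag(false)); // 4
    }
}
```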
+ + /** + * Create a new {@link CUDAEvent} and keep track of it; + * + * @return a new CUDA event + */ + @TruffleBoundary + public CUDAEvent cudaEventCreate() { + try (UnsafeHelper.PointerObject eventPointer = UnsafeHelper.createPointerObject()) { + Object callable = CUDARuntimeFunction.CUDA_EVENTCREATE.getSymbol(this); + Object result = INTEROP.execute(callable, eventPointer.getAddress()); + checkCUDAReturnCode(result, "cudaEventCreate"); + CUDAEvent event = new CUDAEvent(eventPointer.getValueOfPointer(), getNumEvents()); + incrementNumEvents(); + return event; + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Destroy a given CUDA event; + * + * @param event a CUDA Event to destroy + */ + @TruffleBoundary + public void cudaEventDestroy(CUDAEvent event) { + if (!event.isAlive()) { + throw new RuntimeException("CUDA event=" + event + " has already been destroyed"); + } + try { + Object callable = CUDARuntimeFunction.CUDA_EVENTDESTROY.getSymbol(this); + Object result = INTEROP.execute(callable, event.getRawPointer()); + checkCUDAReturnCode(result, "cudaEventDestroy"); + event.setDead(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Computes the elapsed time between two CUDA events, return the time in milliseconds; + * @param start starting event + * @param end ending event + */ + @TruffleBoundary + public float cudaEventElapsedTime(CUDAEvent start, CUDAEvent end) { + try(UnsafeHelper.Float32Object outPointer = UnsafeHelper.createFloat32Object()) { + Object callable = CUDARuntimeFunction.CUDA_EVENTELAPSEDTIME.getSymbol(this); + Object result = INTEROP.execute(callable, outPointer.getAddress(), start.getRawPointer(), end.getRawPointer()); + checkCUDAReturnCode(result, "cudaEventElapsedTime"); + return outPointer.getValue(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Add a given event to a stream. 
The event is a stream-ordered checkpoint on which we can + * perform synchronization, or force another stream to wait for the event to occur before + * executing any other scheduled operation queued on that stream; + * + * @param event a CUDA event + * @param stream a CUDA stream to which the event is associated + */ + @TruffleBoundary + public void cudaEventRecord(CUDAEvent event, CUDAStream stream) { + if (!event.isAlive()) { + throw new RuntimeException("CUDA event=" + event + " has already been destroyed"); + } + try { + // Make sure that the stream is on the right device, otherwise we cannot record the event; + assert stream.getStreamDeviceId() == getCurrentGPU(); + + Object callable = CUDARuntimeFunction.CUDA_EVENTRECORD.getSymbol(this); + Object result = INTEROP.execute(callable, event.getRawPointer(), stream.getRawPointer()); + checkCUDAReturnCode(result, "cudaEventRecord"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Tell a stream to wait for a given event to occur on another stream before executing any other + * computation; + * + * @param stream a CUDA stream to which the event is associated + * @param event a CUDA event that the stream should wait for + */ + @TruffleBoundary + public void cudaStreamWaitEvent(CUDAStream stream, CUDAEvent event) { + if (!event.isAlive()) { + throw new RuntimeException("CUDA event=" + event + " has already been destroyed"); + } + try { + final int FLAGS = 0x0; // Must be 0 according to CUDA documentation; + Object callable = CUDARuntimeFunction.CUDA_STREAMWAITEVENT.getSymbol(this); + Object result = INTEROP.execute(callable, stream.getRawPointer(), event.getRawPointer(), FLAGS); + checkCUDAReturnCode(result, "cudaStreamWaitEvent"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Set the cudaMemAdvise flag for a given array and a specified device, + * for example that a given array is exclusively read but not written by a given device; + * + * 
@param array array for which we set the advice + * @param device device for which we set the advice + * @param cudaMemoryAdvise advice flag to be applied + */ + @TruffleBoundary + public void cudaMemAdvise(AbstractArray array, Device device, MemAdviseFlagEnum cudaMemoryAdvise) { + try { + Object callable = CUDARuntimeFunction.CUDA_MEM_ADVISE.getSymbol(this); + Object result = INTEROP.execute(callable, array.getPointer(), array.getSizeBytes(), cudaMemoryAdvise.id, device.getDeviceId()); + checkCUDAReturnCode(result, "cudaMemAdvise"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + public enum MemAdviseFlagEnum { + CUDA_MEM_ADVISE_SET_READ_MOSTLY(1), + CUDA_MEM_ADVISE_UNSET_READ_MOSTLY(2), + CUDA_MEM_ADVISE_SET_PREFERRED_LOCATION(3), + CUDA_MEM_ADVISE_UNSET_PREFERRED_LOCATION(4), + CUDA_MEM_ADVISE_SET_ACCESSED_BY(5), + CUDA_MEM_ADVISE_UNSET_ACCESSED_BY(6); + + private final int id; + + MemAdviseFlagEnum(int id) { + this.id = id; + } + + @Override + public String toString() { + return String.valueOf(id); + } + } + + /** + * Queries if a device may directly access a peer device's memory. + * @param device Device from which allocations on peerDevice are to be directly accessed. + * @param peerDevice Device on which the allocations to be directly accessed by device reside. + * @return 1 if device is capable of directly accessing memory from peerDevice, and 0 otherwise. + * If direct access of peerDevice from device is possible, then access may be enabled by calling cudaDeviceEnablePeerAccess(). 
+ */ + @TruffleBoundary + public int cudaDeviceCanAccessPeer(int device, int peerDevice) { + + try(UnsafeHelper.Integer32Object canAccessPeer = UnsafeHelper.createInteger32Object()) { + Object callable = CUDARuntimeFunction.CUDA_DEVICE_CAN_ACCESS_PEER.getSymbol(this); + Object result = INTEROP.execute(callable, canAccessPeer.getAddress(), device, peerDevice); + checkCUDAReturnCode(result, "cudaDeviceCanAccessPeer"); + return canAccessPeer.getValue(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Enable the current device to transfer memory from/to the specified peerDevice, + * using a fast communication channel (e.g. NVLink) if available. + * By default, p2p communication should be already enabled, + * there should be no need to call this function at the startup of GrCUDA. + * + * @param peerDevice Device for which we enable direct access from the current device + */ + @TruffleBoundary + public void cudaDeviceEnablePeerAccess(Device peerDevice) { + // flag is reserved for future use and must be set to 0. + final int flag = 0; + try { + Object callable = CUDARuntimeFunction.CUDA_DEVICE_ENABLE_PEER_ACCESS.getSymbol(this); + Object result = INTEROP.execute(callable, peerDevice.getDeviceId(), flag); + checkCUDAReturnCode(result, "cudaDeviceEnablePeerAccess"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Disable the current device from transferring memory from/to the specified peerDevice, + * and force utilization of PCIe. 
+ + * @param peerDevice Device for which we disable direct access from the current device + */ + @TruffleBoundary + public void cudaDeviceDisablePeerAccess(Device peerDevice) { + try { + Object callable = CUDARuntimeFunction.CUDA_DEVICE_DISABLE_PEER_ACCESS.getSymbol(this); + Object result = INTEROP.execute(callable, peerDevice.getDeviceId()); + checkCUDAReturnCode(result, "cudaDeviceDisablePeerAccess"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Get function as callable from native library. + * + * @param binding function binding + * @return a callable as a TruffleObject + */ + @TruffleBoundary + public Object getSymbol(FunctionBinding binding) throws UnknownIdentifierException { + return getSymbol(binding.getLibraryFileName(), binding.getSymbolName(), binding.toNFISignature(), ""); + } + + /** + * Get function as callable from native library. + * + * @param libraryPath path to library (.so file) + * @param symbolName name of the function (symbol) to look up + * @param nfiSignature NFI signature of the function + * @return a callable as a TruffleObject + */ + @TruffleBoundary + public Object getSymbol(String libraryPath, String symbolName, String nfiSignature) throws UnknownIdentifierException { + return getSymbol(libraryPath, symbolName, nfiSignature, ""); + } + + /** + * Get function as callable from native library. 
+ + * @param libraryPath path to library (.so file) + * @param symbolName name of the function (symbol) to look up + * @param nfiSignature NFI signature of the function + * @param hint additional string shown to user when symbol cannot be loaded + * @return a callable as a TruffleObject + */ + @TruffleBoundary + public Object getSymbol(String libraryPath, String symbolName, String nfiSignature, String hint) throws UnknownIdentifierException { + + Pair functionKey = Pair.create(libraryPath, symbolName); + Object callable = boundFunctions.get(functionKey); + if (callable == null) { + // symbol does not exist or is not yet bound + TruffleObject library = loadedLibraries.get(libraryPath); + if (library == null) { + try { + // library does not exist or is not loaded yet + library = (TruffleObject) context.getEnv().parseInternal( + Source.newBuilder("nfi", "load \"" + libraryPath + "\"", libraryPath).build()).call(); + } catch (UnsatisfiedLinkError e) { + throw new GrCUDAException("unable to load shared library '" + libraryPath + "': " + e.getMessage() + hint); + } + + loadedLibraries.put(libraryPath, library); + } + try { + Object symbol = INTEROP.readMember(library, symbolName); + callable = INTEROP.invokeMember(symbol, "bind", nfiSignature); + } catch (UnsatisfiedLinkError | UnsupportedMessageException | ArityException | UnsupportedTypeException e) { + throw new GrCUDAException("unexpected behavior: " + e.getMessage()); + } + boundFunctions.put(functionKey, callable); + } + return callable; + } + + private void checkCUDAReturnCode(Object result, String... 
function) { + if (!(result instanceof Integer)) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException("expected return code as Integer object in " + GrCUDAException.format(function) + ", got " + result.getClass().getName()); + } + Integer returnCode = (Integer) result; + if (returnCode != 0) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(returnCode, cudaGetErrorString(returnCode), function); + } + } + + public void registerCUDAFunctions(Namespace rootNamespace) { + for (CUDARuntimeFunction function : CUDARuntimeFunction.values()) { + rootNamespace.addFunction(new CUDAFunction(function, this)); + } + } + + public StreamAttachArchitecturePolicy getArrayStreamArchitecturePolicy() { + return streamAttachArchitecturePolicy; + } + + /***************************************************************** + ***************************************************************** + * Implementation of CUDA runtime API exposed to host languages; * + ***************************************************************** + *****************************************************************/ + + public enum CUDARuntimeFunction implements CUDAFunction.Spec, CallSupport { + CUDA_DEVICEGETATTRIBUTE("cudaDeviceGetAttribute", "(pointer, sint32, sint32): sint32") { + @Override + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 2); + int attributeCode = expectInt(args[0]); + int deviceId = expectInt(args[1]); + try (UnsafeHelper.Integer32Object value = UnsafeHelper.createInteger32Object()) { + callSymbol(cudaRuntime, value.getAddress(), attributeCode, deviceId); + return value.getValue(); + } + } + }, + CUDA_DEVICERESET("cudaDeviceReset", "(): sint32") { + @Override + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, 
UnsupportedTypeException, ArityException { + checkArgumentLength(args, 0); + callSymbol(cudaRuntime); + return NoneValue.get(); + } + }, + CUDA_DEVICESYNCHRONIZE("cudaDeviceSynchronize", "(): sint32") { + @Override + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 0); + callSymbol(cudaRuntime); + return NoneValue.get(); + } + }, + CUDA_FREE("cudaFree", "(pointer): sint32") { + @Override + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + Object pointerObj = args[0]; + long addr; + if (pointerObj instanceof GPUPointer) { + addr = ((GPUPointer) pointerObj).getRawPointer(); + } else if (pointerObj instanceof LittleEndianNativeArrayView) { + addr = ((LittleEndianNativeArrayView) pointerObj).getStartAddress(); + } else { + throw new GrCUDAException("expected GPUPointer or LittleEndianNativeArrayView"); + } + callSymbol(cudaRuntime, addr); + return NoneValue.get(); + } + }, + CUDA_GETDEVICE("cudaGetDevice", "(pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 0); + try (UnsafeHelper.Integer32Object deviceId = UnsafeHelper.createInteger32Object()) { + callSymbol(cudaRuntime, deviceId.getAddress()); + return deviceId.getValue(); + } + } + }, + CUDA_GETDEVICECOUNT("cudaGetDeviceCount", "(pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 0); + try (UnsafeHelper.Integer32Object deviceCount = 
UnsafeHelper.createInteger32Object()) { + callSymbol(cudaRuntime, deviceCount.getAddress()); + return deviceCount.getValue(); + } + } + }, + CUDA_GETERRORSTRING("cudaGetErrorString", "(sint32): string") { + @Override + @TruffleBoundary + public String call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + int errorCode = expectInt(args[0]); + Object result = INTEROP.execute(getSymbol(cudaRuntime), errorCode); + return INTEROP.asString(result); + } + }, + CUDA_MALLOC("cudaMalloc", "(pointer, uint64): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + long numBytes = expectLong(args[0]); + try (UnsafeHelper.PointerObject outPointer = UnsafeHelper.createPointerObject()) { + callSymbol(cudaRuntime, outPointer.getAddress(), numBytes); + long addressAllocatedMemory = outPointer.getValueOfPointer(); + return new GPUPointer(addressAllocatedMemory); + } + } + }, + CUDA_MALLOCMANAGED("cudaMallocManaged", "(pointer, uint64, uint32): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + final int cudaMemAttachGlobal = 0x01; + long numBytes = expectLong(args[0]); + try (UnsafeHelper.PointerObject outPointer = UnsafeHelper.createPointerObject()) { + callSymbol(cudaRuntime, outPointer.getAddress(), numBytes, cudaMemAttachGlobal); + long addressAllocatedMemory = outPointer.getValueOfPointer(); + return new GPUPointer(addressAllocatedMemory); + } + } + }, + CUDA_SETDEVICE("cudaSetDevice", "(sint32): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime 
cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + int device = expectInt(args[0]); + if (cudaRuntime.isMultiGPUEnabled() && device >= 0 && device < cudaRuntime.numberOfGPUsToUse) { + callSymbol(cudaRuntime, device); + cudaRuntime.setCurrentGPU(device); + } + return NoneValue.get(); + } + }, + CUDA_MEMCPY("cudaMemcpy", "(pointer, pointer, uint64, sint32): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 3); + long destPointer = expectLong(args[0]); + long fromPointer = expectLong(args[1]); + long numBytesToCopy = expectPositiveLong(args[2]); + // cudaMemcpyKind from driver_types.h (default: direction of transfer is + // inferred from the pointer values, uses virtual addressing) + final long cudaMemcpyDefault = 4; + callSymbol(cudaRuntime, destPointer, fromPointer, numBytesToCopy, cudaMemcpyDefault); + return NoneValue.get(); + } + }, + CUDA_MEMCPYASYNC("cudaMemcpyAsync", "(pointer, pointer, uint64, sint32, pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 4); + long destPointer = expectLong(args[0]); + long fromPointer = expectLong(args[1]); + long numBytesToCopy = expectPositiveLong(args[2]); + long streamPointer = expectLong(args[3]); + // cudaMemcpyKind from driver_types.h (default: direction of transfer is + // inferred from the pointer values, uses virtual addressing) + final long cudaMemcpyDefault = 4; + callSymbol(cudaRuntime, destPointer, fromPointer, numBytesToCopy, cudaMemcpyDefault, streamPointer); + return NoneValue.get(); + } + }, + 
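Each of the exposed wrappers above follows the same pattern: validate the arity, coerce each host-language argument, then forward to the native symbol with any internal constants (such as `cudaMemcpyDefault = 4`) appended. A minimal standalone sketch of that pattern, with hypothetical helper names that only mimic GrCUDA's `checkArgumentLength`/`expectPositiveLong`:

```java
import java.util.Arrays;

// Hypothetical stand-ins for GrCUDA's argument-validation helpers; not the real classes.
public class ArgCheckSketch {
    static void checkArgumentLength(Object[] args, int expected) {
        if (args.length != expected) {
            // GrCUDA throws ArityException here; a plain exception suffices for the sketch.
            throw new IllegalArgumentException("expected " + expected + " arguments, got " + args.length);
        }
    }

    static long expectPositiveLong(Object value) {
        if (!(value instanceof Long) || (Long) value < 0) {
            throw new IllegalArgumentException("expected non-negative Long, got " + value);
        }
        return (Long) value;
    }

    // Mirrors the cudaMemcpyAsync wrapper: 4 user-facing arguments
    // (dst, src, numBytes, stream), with cudaMemcpyDefault appended internally.
    static long[] buildMemcpyAsyncCall(Object[] args) {
        checkArgumentLength(args, 4);
        final long cudaMemcpyDefault = 4; // direction inferred from the pointer values
        return new long[]{(Long) args[0], (Long) args[1],
                expectPositiveLong(args[2]), cudaMemcpyDefault, (Long) args[3]};
    }

    public static void main(String[] args) {
        long[] call = buildMemcpyAsyncCall(new Object[]{1L, 2L, 64L, 3L});
        System.out.println(Arrays.toString(call)); // [1, 2, 64, 4, 3]
    }
}
```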
CUDA_STREAMCREATE("cudaStreamCreate", "(pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 0); + try (UnsafeHelper.PointerObject streamPointer = UnsafeHelper.createPointerObject()) { + callSymbol(cudaRuntime, streamPointer.getAddress()); + CUDAStream stream = new CUDAStream(streamPointer.getValueOfPointer(), cudaRuntime.getNumStreams(), cudaRuntime.getCurrentGPU()); + cudaRuntime.incrementNumStreams(); + return stream; + } + } + }, + CUDA_STREAMSYNCHRONIZE("cudaStreamSynchronize", "(pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + Object pointerObj = args[0]; + long addr; + if (pointerObj instanceof CUDAStream) { + addr = ((CUDAStream) pointerObj).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAStream object"); + } + callSymbol(cudaRuntime, addr); + return NoneValue.get(); + } + }, + CUDA_STREAMDESTROY("cudaStreamDestroy", "(pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + Object pointerObj = args[0]; + long addr; + if (pointerObj instanceof CUDAStream) { + addr = ((CUDAStream) pointerObj).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAStream object"); + } + callSymbol(cudaRuntime, addr); + return NoneValue.get(); + } + }, + CUDA_STREAMATTACHMEMASYNC("cudaStreamAttachMemAsync", "(pointer, pointer, uint64, uint32): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws 
UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + + Object streamObj; + Object arrayObj; + final int MEM_ATTACH_SINGLE = 0x04; + final int MEM_ATTACH_GLOBAL = 0x01; + int flag = MEM_ATTACH_SINGLE; + + if (args.length == 1) { + arrayObj = args[0]; + streamObj = DefaultStream.get(); + flag = MEM_ATTACH_GLOBAL; + } else if (args.length == 2) { + streamObj = args[0]; + arrayObj = args[1]; + } else if (args.length == 3) { + streamObj = args[0]; + arrayObj = args[1]; + if (args[2] instanceof Integer) { + flag = ((Integer) args[2]); + } else { + throw new GrCUDAException("expected Integer object"); + } + } else { + CompilerDirectives.transferToInterpreter(); + throw ArityException.create(1, 3, args.length); + } + + // Extract pointers; + long streamAddr; + long arrayAddr; + streamAddr = extractStreamPointer(streamObj); + arrayAddr = extractArrayPointer(arrayObj); + + // If using the default stream (0 address) use the "cudaMemAttachGlobal" flag; + if (streamAddr == 0) { + flag = MEM_ATTACH_GLOBAL; + } + + // Track the association between the stream and the array, if possible; + if (streamObj instanceof CUDAStream) { + if (arrayObj instanceof AbstractArray) { + ((AbstractArray) arrayObj).setStreamMapping((CUDAStream) streamObj); + } + } + + // Always set "size" to 0 to cover the entire array; + callSymbol(cudaRuntime, streamAddr, arrayAddr, 0, flag); + return NoneValue.get(); + } + }, + CUDA_MEMPREFETCHASYNC("cudaMemPrefetchAsync", "(pointer, uint64, sint32, pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + + Object streamObj; + Object arrayObj; + long size; + int destinationDevice; + + if (args.length == 3) { + arrayObj = args[0]; + streamObj = DefaultStream.get(); + } else if (args.length == 4) { + arrayObj = args[0]; + streamObj = args[3]; + } else { 
+ CompilerDirectives.transferToInterpreter(); + throw ArityException.create(3, 4, args.length); + } + + if (args[1] instanceof Long) { + size = ((Long) args[1]); + } else { + throw new GrCUDAException("expected Long object for array size"); + } + + if (args[2] instanceof Integer) { + destinationDevice = ((Integer) args[2]); + } else { + throw new GrCUDAException("expected Integer object for destination device"); + } + + // Extract pointers; + long streamAddr; + long arrayAddr; + streamAddr = extractStreamPointer(streamObj); + arrayAddr = extractArrayPointer(arrayObj); + + // Prefetch "size" bytes of the array to the destination device; + callSymbol(cudaRuntime, arrayAddr, size, destinationDevice, streamAddr); + return NoneValue.get(); + } + }, + CUDA_EVENTCREATE("cudaEventCreate", "(pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 0); + try (UnsafeHelper.PointerObject eventPointer = UnsafeHelper.createPointerObject()) { + callSymbol(cudaRuntime, eventPointer.getAddress()); + CUDAEvent event = new CUDAEvent(eventPointer.getValueOfPointer(), cudaRuntime.getNumEvents()); + cudaRuntime.incrementNumEvents(); + return event; + } + } + }, + CUDA_EVENTDESTROY("cudaEventDestroy", "(pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + Object pointerObj = args[0]; + long addr; + if (pointerObj instanceof CUDAEvent) { + addr = ((CUDAEvent) pointerObj).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAEvent object"); + } + callSymbol(cudaRuntime, addr); + return NoneValue.get(); + } + }, + CUDA_EVENTELAPSEDTIME("cudaEventElapsedTime", "(pointer, pointer, pointer): sint32") { + 
@Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, UnsupportedTypeException, UnsupportedMessageException, UnknownIdentifierException { + checkArgumentLength(args, 2); + + Object pointerStartEvent = args[0]; + Object pointerEndEvent = args[1]; + long addrStart; + long addrEnd; + + if (pointerStartEvent instanceof CUDAEvent) { + addrStart = ((CUDAEvent) pointerStartEvent).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAEvent object"); + } + if (pointerEndEvent instanceof CUDAEvent) { + addrEnd = ((CUDAEvent) pointerEndEvent).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAEvent object"); + } + try (UnsafeHelper.Float32Object elapsedTimePointer = UnsafeHelper.createFloat32Object()) { + callSymbol(cudaRuntime, elapsedTimePointer.getAddress(), addrStart, addrEnd); + return elapsedTimePointer.getValue(); + } + } + }, + CUDA_EVENTRECORD("cudaEventRecord", "(pointer, pointer): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 2); + Object eventObj = args[0]; + Object streamObj = args[1]; + long eventAddr, streamAddr; + if (eventObj instanceof CUDAEvent) { + eventAddr = ((CUDAEvent) eventObj).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAEvent object"); + } + if (streamObj instanceof CUDAStream) { + streamAddr = ((CUDAStream) streamObj).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAStream object"); + } + callSymbol(cudaRuntime, eventAddr, streamAddr); + return NoneValue.get(); + } + }, + CUDA_STREAMWAITEVENT("cudaStreamWaitEvent", "(pointer, pointer, uint32): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, 
UnsupportedTypeException, ArityException { + checkArgumentLength(args, 2); + Object streamObj = args[0]; + Object eventObj = args[1]; + long streamAddr, eventAddr; + final int FLAGS = 0x0; // Flags must be zero according to CUDA documentation; + + if (streamObj instanceof CUDAStream) { + streamAddr = ((CUDAStream) streamObj).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAStream object"); + } + if (eventObj instanceof CUDAEvent) { + eventAddr = ((CUDAEvent) eventObj).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAEvent object"); + } + callSymbol(cudaRuntime, streamAddr, eventAddr, FLAGS); + return NoneValue.get(); + } + }, + CUDA_PROFILERSTART("cudaProfilerStart", "(): sint32") { + @Override + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 0); + callSymbol(cudaRuntime); + return NoneValue.get(); + } + }, + CUDA_PROFILERSTOP("cudaProfilerStop", "(): sint32") { + @Override + public Object call(CUDARuntime cudaRuntime, Object[] args) throws ArityException, UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException { + checkArgumentLength(args, 0); + callSymbol(cudaRuntime); + return NoneValue.get(); + } + }, + CUDA_MEM_ADVISE("cudaMemAdvise", "(pointer, uint64, uint64, uint32): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 4); + Object arrayObj = args[0]; + long arrayAddr = extractArrayPointer(arrayObj); + long numBytes = expectPositiveLong(args[1]); + long advise = expectLong(args[2]); + int deviceId = expectInt(args[3]); + callSymbol(cudaRuntime, arrayAddr, numBytes, advise, deviceId); + return NoneValue.get(); + } + }, + 
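The peer-access wrappers (`cudaDeviceCanAccessPeer` / `cudaDeviceEnablePeerAccess`) are typically used pairwise at startup: query every ordered device pair, then enable access where the query returns 1. A minimal sketch of that discovery loop, with the CUDA query mocked out (here every distinct device pair is assumed reachable; in a real system the answer depends on the interconnect topology):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of pairwise peer-access discovery; the CUDA call is mocked.
public class PeerAccessSketch {
    // Stand-in for cudaDeviceCanAccessPeer(device, peerDevice): 1 if reachable, 0 otherwise.
    static int canAccessPeer(int device, int peerDevice) {
        return device != peerDevice ? 1 : 0;
    }

    // Collect every ordered pair (device, peer) for which peer access
    // could then be enabled via cudaDeviceEnablePeerAccess.
    static List<int[]> discoverPeerPairs(int numDevices) {
        List<int[]> pairs = new ArrayList<>();
        for (int device = 0; device < numDevices; device++) {
            for (int peer = 0; peer < numDevices; peer++) {
                if (device != peer && canAccessPeer(device, peer) == 1) {
                    pairs.add(new int[]{device, peer});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(discoverPeerPairs(3).size()); // 6 ordered pairs for 3 devices
    }
}
```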
CUDA_DEVICE_CAN_ACCESS_PEER("cudaDeviceCanAccessPeer","(pointer, sint32, sint32): sint32") { + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 2); + int device = expectInt(args[0]); + int peerDevice = expectInt(args[1]); + try (UnsafeHelper.Integer32Object value = UnsafeHelper.createInteger32Object()) { + callSymbol(cudaRuntime, value.getAddress(), device, peerDevice); + return value.getValue(); + } + } + }, + CUDA_DEVICE_ENABLE_PEER_ACCESS("cudaDeviceEnablePeerAccess","(sint32, uint32): sint32"){ + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + int peerDevice = expectInt(args[0]); + final int flag = 0; // Must be 0 according to CUDA documentation; + callSymbol(cudaRuntime, peerDevice, flag); + return NoneValue.get(); + } + }, + CUDA_DEVICE_DISABLE_PEER_ACCESS("cudaDeviceDisablePeerAccess","(sint32): sint32"){ + @Override + @TruffleBoundary + public Object call(CUDARuntime cudaRuntime, Object[] args) throws UnsupportedMessageException, UnknownIdentifierException, UnsupportedTypeException, ArityException { + checkArgumentLength(args, 1); + int device = expectInt(args[0]); + callSymbol(cudaRuntime, device); + return NoneValue.get(); + } + }; + + private final String name; + private final String nfiSignature; + + CUDARuntimeFunction(String name, String nfiSignature) { + this.name = name; + this.nfiSignature = nfiSignature; + } + + public String getName() { + return name; + } + + public Object getSymbol(CUDARuntime runtime) throws UnknownIdentifierException { + return runtime.getSymbol(CUDA_RUNTIME_LIBRARY_NAME, name, nfiSignature); + } + + long extractArrayPointer(Object array) { + if (array instanceof GPUPointer) { + return ((GPUPointer) 
array).getRawPointer(); + } else if (array instanceof LittleEndianNativeArrayView) { + return ((LittleEndianNativeArrayView) array).getStartAddress(); + } else if (array instanceof AbstractArray) { + return ((AbstractArray) array).getFullArrayPointer(); + } else { + throw new GrCUDAException("expected GPUPointer or LittleEndianNativeArrayView or DeviceArray"); + } + } + + long extractStreamPointer(Object stream) { + if (stream instanceof CUDAStream) { + return ((CUDAStream) stream).getRawPointer(); + } else { + throw new GrCUDAException("expected CUDAStream object"); + } + } + } + + /************************************************************ + ************************************************************* + * Implementation of CUDA driver API available within GrCUDA * + * (not exposed to the host language); * + ************************************************************* + *************************************************************/ + + /** + * Provide optimized interfaces to load, build and launch GPU kernels on single and multi-GPU systems; + */ + interface KernelManagementInterface { + Kernel loadKernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String cubinFile, String kernelName, String symbolName, String signature); + + Kernel buildKernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String kernelName, String signature, String moduleName, PTXKernel ptx); + + void launchKernel(CUDARuntime runtime, Kernel kernel, KernelConfig config, KernelArguments args, CUDAStream stream); + + @TruffleBoundary + default void launchKernelInternal(CUDARuntime runtime, KernelConfig config, KernelArguments args, CUDAStream stream, long kernelFunctionHandle) { + try { + Object callable = CUDADriverFunction.CU_LAUNCHKERNEL.getSymbol(runtime); + Dim3 gridSize = config.getGridSize(); + Dim3 blockSize = config.getBlockSize(); + Object result = INTEROP.execute( + callable, + kernelFunctionHandle, + gridSize.getX(), + gridSize.getY(), + gridSize.getZ(), + 
blockSize.getX(), + blockSize.getY(), + blockSize.getZ(), + config.getDynamicSharedMemoryBytes(), + stream.getRawPointer(), + args.getPointer(), // pointer to kernel arguments array + 0 // extra args + ); + checkCUReturnCode(result, "cuLaunchKernel"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + } + + class KernelManagementSingleGPU implements KernelManagementInterface { + + // TODO: we might want to support single-GPU system on systems with multiple GPUs, + // without having to enable all GPUs. In this case, specify a custom deviceId instead of "0"; + + @TruffleBoundary + @Override + public Kernel loadKernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String cubinFile, String kernelName, String symbolName, String signature) { + // Load module from GPU 0; + CUModule module = loadedModules.get(DEFAULT_DEVICE).get(cubinFile); + if (module == null) { + // Load module as it is not yet loaded; + module = cuModuleLoad(cubinFile); + loadedModules.get(DEFAULT_DEVICE).put(cubinFile, module); + } + long kernelFunctionHandle = cuModuleGetFunction(module, symbolName); + return new Kernel(grCUDAExecutionContext, kernelName, symbolName, List.of(kernelFunctionHandle), signature, List.of(module)); + } + + @TruffleBoundary + @Override + public Kernel buildKernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String kernelName, String signature, String moduleName, PTXKernel ptx) { + CUModule module = cuModuleLoadData(ptx.getPtxSource(), moduleName); + loadedModules.get(DEFAULT_DEVICE).put(moduleName, module); + long kernelFunctionHandle = cuModuleGetFunction(module, ptx.getLoweredKernelName()); + return new Kernel(grCUDAExecutionContext, kernelName, ptx.getLoweredKernelName(), List.of(kernelFunctionHandle), + signature, List.of(module), ptx.getPtxSource()); + } + + @Override + public void launchKernel(CUDARuntime runtime, Kernel kernel, KernelConfig config, KernelArguments args, CUDAStream stream) { + long kernelFunctionHandle = 
kernel.getKernelFunctionHandle(DEFAULT_DEVICE); + launchKernelInternal(runtime, config, args, stream, kernelFunctionHandle); + } + } + + class KernelManagementMultiGPU implements KernelManagementInterface { + @TruffleBoundary + @Override + public Kernel loadKernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String cubinFile, String kernelName, String symbolName, String signature) { + ArrayList<Long> kernelFunctionHandles = new ArrayList<>(); + ArrayList<CUModule> modules = new ArrayList<>(); + int currentDevice = getCurrentGPU(); + + for (int i = 0; i < numberOfGPUsToUse; i++) { + // Load the kernel on each GPU; + CUModule module = loadedModules.get(i).get(cubinFile); + cudaSetDevice(i); + if (module == null) { + module = cuModuleLoad(cubinFile); + loadedModules.get(i).put(cubinFile, module); + } + modules.add(module); + + long kernelFunctionHandle = cuModuleGetFunction(module, symbolName); + kernelFunctionHandles.add(kernelFunctionHandle); + } + // Restore the device previously active; + cudaSetDevice(currentDevice); + + return new Kernel(grCUDAExecutionContext, kernelName, symbolName, kernelFunctionHandles, signature, modules); + } + + @TruffleBoundary + @Override + public Kernel buildKernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String kernelName, String signature, String moduleName, PTXKernel ptx) { + ArrayList<Long> kernelFunctionHandles = new ArrayList<>(); + ArrayList<CUModule> modules = new ArrayList<>(); + int currentDevice = getCurrentGPU(); + + for (int i = 0; i < numberOfGPUsToUse; i++) { + // Load the kernel on each GPU; + cudaSetDevice(i); + CUModule module = cuModuleLoadData(ptx.getPtxSource(), moduleName); + long kernelFunctionHandle = cuModuleGetFunction(module, ptx.getLoweredKernelName()); + kernelFunctionHandles.add(kernelFunctionHandle); + modules.add(module); + + loadedModules.get(i).put(moduleName, module); + } + // Restore the device previously active; + cudaSetDevice(currentDevice); + return new Kernel(grCUDAExecutionContext, kernelName, 
ptx.getLoweredKernelName(), kernelFunctionHandles, + signature, modules, ptx.getPtxSource()); + } + + @Override + public void launchKernel(CUDARuntime runtime, Kernel kernel, KernelConfig config, KernelArguments args, CUDAStream stream) { + // Set the device where the kernel is executed; + cudaSetDevice(stream.getStreamDeviceId()); + long kernelFunctionHandle = kernel.getKernelFunctionHandle(stream.getStreamDeviceId()); + launchKernelInternal(runtime, config, args, stream, kernelFunctionHandle); + } + } + + public Kernel loadKernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, Binding binding) { + return kernelManagement.loadKernel(grCUDAExecutionContext, binding.getLibraryFileName(), binding.getName(), binding.getSymbolName(), binding.getNIDLParameterSignature()); + } + + public Kernel buildKernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String code, String kernelName, String signature) { + RUNTIME_LOGGER.finest(() -> "buildKernel device:" + getCurrentGPU()); + String moduleName = "truffle" + context.getNextModuleId(); + PTXKernel ptx = nvrtc.compileKernel(code, kernelName, moduleName, "--std=c++14"); + return kernelManagement.buildKernel(grCUDAExecutionContext, kernelName, signature, moduleName, ptx); + } + + public void cuLaunchKernel(Kernel kernel, KernelConfig config, KernelArguments args, CUDAStream stream) { + this.kernelManagement.launchKernel(this, kernel, config, args, stream); + } + + @TruffleBoundary + public CUModule cuModuleLoad(String cubinName) { + assertCUDAInitialized(); + int currDevice = getCurrentGPU(); + if (this.loadedModules.get(currDevice).containsKey(cubinName)) { + throw new GrCUDAException("A module for " + cubinName + " was already loaded."); + } + try (UnsafeHelper.Integer64Object modulePtr = UnsafeHelper.createInteger64Object()) { + Object callable = CUDADriverFunction.CU_MODULELOAD.getSymbol(this); + Object result = INTEROP.execute(callable, modulePtr.getAddress(), cubinName); + checkCUReturnCode(result, 
"cuModuleLoad"); + return new CUModule(cubinName, modulePtr.getValue()); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public CUModule cuModuleLoadData(String ptx, String moduleName) { + assertCUDAInitialized(); + int currDevice = getCurrentGPU(); + if (this.loadedModules.get(currDevice).containsKey(moduleName)) { + throw new GrCUDAException("A module for " + moduleName + " was already loaded."); + } + try (UnsafeHelper.Integer64Object modulePtr = UnsafeHelper.createInteger64Object()) { + Object callable = CUDADriverFunction.CU_MODULELOADDATA.getSymbol(this); + Object result = INTEROP.execute(callable, + modulePtr.getAddress(), ptx); + checkCUReturnCode(result, "cuModuleLoadData"); + return new CUModule(moduleName, modulePtr.getValue()); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cuModuleUnload(CUModule module) { + try { + Object callable = CUDADriverFunction.CU_MODULEUNLOAD.getSymbol(this); + Object result = INTEROP.execute(callable, module.modulePointer); + checkCUReturnCode(result, "cuModuleUnload"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + /** + * Get function handle to kernel in module. 
+ * + * @param kernelModule CUmodule containing the kernel function + * @param kernelName name of the kernel to load from the module + * @return native CUfunction function handle + */ + @TruffleBoundary + public long cuModuleGetFunction(CUModule kernelModule, String kernelName) { + try (UnsafeHelper.Integer64Object functionPtr = UnsafeHelper.createInteger64Object()) { + Object callable = CUDADriverFunction.CU_MODULEGETFUNCTION.getSymbol(this); + Object result = INTEROP.execute(callable, + functionPtr.getAddress(), kernelModule.getModulePointer(), kernelName); + checkCUReturnCode(result, "cuModuleGetFunction"); + return functionPtr.getValue(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cuCtxSynchronize() { + assertCUDAInitialized(); + try { + Object callable = CUDADriverFunction.CU_CTXSYNCHRONIZE.getSymbol(this); + Object result = INTEROP.execute(callable); + checkCUReturnCode(result, "cuCtxSynchronize"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private void cuInit() { + try { + Object callable = CUDADriverFunction.CU_INIT.getSymbol(this); + int flags = 0; // must be zero as per CUDA Driver API documentation + Object result = INTEROP.execute(callable, flags); + checkCUReturnCode(result, "cuInit"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private int cuDeviceGetCount() { + try (UnsafeHelper.Integer32Object devCount = UnsafeHelper.createInteger32Object()) { + Object callable = CUDADriverFunction.CU_DEVICEGETCOUNT.getSymbol(this); + Object result = INTEROP.execute(callable, devCount.getAddress()); + checkCUReturnCode(result, "cuDeviceGetCount"); + return devCount.getValue(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private int cuDeviceGet(int deviceOrdinal) { + assertCUDAInitialized(); + try (UnsafeHelper.Integer32Object deviceObj = 
UnsafeHelper.createInteger32Object()) { + Object callable = CUDADriverFunction.CU_DEVICEGET.getSymbol(this); + Object result = INTEROP.execute(callable, deviceObj.getAddress(), deviceOrdinal); + checkCUReturnCode(result, "cuDeviceGet"); + return deviceObj.getValue(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private String cuDeviceGetName(int cuDeviceId) { + final int maxLength = 256; + try (UnsafeHelper.StringObject nameString = new UnsafeHelper.StringObject(maxLength)) { + Object callable = CUDADriverFunction.CU_DEVICEGETNAME.getSymbol(this); + Object result = INTEROP.execute(callable, nameString.getAddress(), maxLength, cuDeviceId); + checkCUReturnCode(result, "cuDeviceGetName"); + return nameString.getZeroTerminatedString(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private long cuCtxCreate(int flags, int cudevice) { + try (UnsafeHelper.PointerObject pctx = UnsafeHelper.createPointerObject()) { + Object callable = CUDADriverFunction.CU_CTXCREATE.getSymbol(this); + Object result = INTEROP.execute(callable, pctx.getAddress(), flags, cudevice); + checkCUReturnCode(result, "cuCtxCreate"); + return pctx.getValueOfPointer(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private long cuDevicePrimaryCtxRetain(int cudevice) { + try (UnsafeHelper.PointerObject pctx = UnsafeHelper.createPointerObject()) { + Object callable = CUDADriverFunction.CU_DEVICEPRIMARYCTXRETAIN.getSymbol(this); + Object result = INTEROP.execute(callable, pctx.getAddress(), cudevice); + checkCUReturnCode(result, "cuDevicePrimaryCtxRetain"); + return pctx.getValueOfPointer(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private void cuCtxDestroy(long ctx) { + try { + Object callable = CUDADriverFunction.CU_CTXDESTROY.getSymbol(this); + Object result = INTEROP.execute(callable, ctx); + 
checkCUReturnCode(result, "cuCtxDestroy"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private long cuCtxGetCurrent() { + try (UnsafeHelper.PointerObject ctxPointer = UnsafeHelper.createPointerObject()) { + Object callable = CUDADriverFunction.CU_CTXGETCURRENT.getSymbol(this); + Object result = INTEROP.execute(callable, ctxPointer.getAddress()); + checkCUReturnCode(result, "cuCtxGetCurrent"); + return ctxPointer.getValueOfPointer(); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + public void cuCtxSetCurrent(GPUPointer ctxPointer) { + try { + Object callable = CUDADriverFunction.CU_CTXSETCURRENT.getSymbol(this); + Object result = INTEROP.execute(callable, ctxPointer.getRawPointer()); + checkCUReturnCode(result, "cuCtxSetCurrent"); + } catch (InteropException e) { + throw new GrCUDAException(e); + } + } + + @TruffleBoundary + private void assertCUDAInitialized() { + if (!context.isCUDAInitialized()) { + int currentDevice = getCurrentGPU(); + for (int i = 0; i < numberOfGPUsToUse; i++) { + cudaSetDevice(i); + cuInit(); + // A simple way to create the device context in the driver is to call any CUDA + // API function; + cudaDeviceSynchronize(); + this.innerCudaContexts.add(initializeInnerCudaContext(i)); + } + cudaSetDevice(currentDevice); + context.setCUDAInitialized(); + } + } + + @SuppressWarnings("static-method") + private static void checkCUReturnCode(Object result, String... 
function) { + int returnCode; + try { + returnCode = INTEROP.asInt(result); + } catch (UnsupportedMessageException e) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException( + "expected return code as Integer object in " + Arrays.toString(function) + ", got " + + result.getClass().getName()); + } + if (returnCode != 0) { + RUNTIME_LOGGER.severe(() -> "ERROR CODE=" + returnCode); + throw new GrCUDAException(returnCode, DriverAPIErrorMessages.getString(returnCode), function); + + } + } + + private void shutdown() { + // unload all modules + for (int i = 0; i < numberOfGPUsToUse; i++) { + for (CUModule module : loadedModules.get(i).values()) { + try { + module.close(); + } catch (Exception e) { + /* ignore exception */ + } + } + loadedModules.get(i).clear(); + } + } + + /**************************************************************** + **************************************************************** + * Implementation of CUDA driver API exposed to host languages; * + **************************************************************** + ****************************************************************/ + + public enum CUDADriverFunction { + CU_CTXCREATE("cuCtxCreate", "(pointer, uint32, sint32) :sint32"), + CU_CTXDESTROY("cuCtxDestroy", "(pointer): sint32"), + CU_CTXGETCURRENT("cuCtxGetCurrent", "(pointer) :sint32"), + CU_CTXSETCURRENT("cuCtxSetCurrent", "(pointer) :sint32"), + CU_CTXSYNCHRONIZE("cuCtxSynchronize", "(): sint32"), + CU_DEVICEGETCOUNT("cuDeviceGetCount", "(pointer): sint32"), + CU_DEVICEGET("cuDeviceGet", "(pointer, sint32): sint32"), + CU_DEVICEGETNAME("cuDeviceGetName", "(pointer, sint32, sint32): sint32"), + CU_DEVICEPRIMARYCTXRETAIN("cuDevicePrimaryCtxRetain", "(pointer, sint32): sint32"), + CU_INIT("cuInit", "(uint32): sint32"), + CU_LAUNCHKERNEL("cuLaunchKernel", "(uint64, uint32, uint32, uint32, uint32, uint32, uint32, uint32, uint64, pointer, pointer): sint32"), + CU_MODULELOAD("cuModuleLoad", "(pointer, string): sint32"), + 
CU_MODULELOADDATA("cuModuleLoadData", "(pointer, string): sint32"), + CU_MODULEUNLOAD("cuModuleUnload", "(uint64): sint32"), + CU_MODULEGETFUNCTION("cuModuleGetFunction", "(pointer, uint64, string): sint32"); + + private final String symbolName; + private final String signature; + + CUDADriverFunction(String symbolName, String nfiSignature) { + this.symbolName = symbolName; + this.signature = nfiSignature; + } + + public Object getSymbol(CUDARuntime runtime) throws UnknownIdentifierException { + return runtime.getSymbol(CUDA_LIBRARY_NAME, symbolName, signature); + } + } + + /** CUDA device attributes from driver_types.h CUDA header. */ + public enum CUDADeviceAttribute { + MAX_THREADS_PER_BLOCK("maxThreadsPerBlock", 1), + MAX_BLOCK_DIMX("maxBlockDimX", 2), + MAX_BLOCK_DIMY("maxBlockDimY", 3), + MAX_BLOCK_DIMZ("maxBlockDimZ", 4), + MAX_GRID_DIMX("maxGridDimX", 5), + MAX_GRID_DIMY("maxGridDimY", 6), + MAX_GRID_DIMZ("maxGridDimZ", 7), + MAX_SHARED_MEMORY_PER_BLOCK("maxSharedMemoryPerBlock", 8), + TOTAL_CONSTANT_MEMORY("totalConstantMemory", 9), + WARPSIZE("warpSize", 10), + MAX_PITCH("maxPitch", 11), + MAX_REGISTERS_PER_BLOCK("maxRegistersPerBlock", 12), + CLOCK_RATE("clockRate", 13), + TEXTURE_ALIGNMENT("textureAlignment", 14), + GPU_OVERLAP("gpuOverlap", 15), + MULTI_PROCESSOR_COUNT("multiProcessorCount", 16), + KERNEL_EXEC_TIMEOUT("kernelExecTimeout", 17), + INTEGRATED("integrated", 18), + CAN_MAP_HOST_MEMORY("canMapHostMemory", 19), + COMPUTE_MODE("computeMode", 20), + MAX_TEXTURE1D_WIDTH("maxTexture1DWidth", 21), + MAX_TEXTURE2D_WIDTH("maxTexture2DWidth", 22), + MAX_TEXTURE2D_HEIGHT("maxTexture2DHeight", 23), + MAX_TEXTURE3D_WIDTH("maxTexture3DWidth", 24), + MAX_TEXTURE3D_HEIGHT("maxTexture3DHeight", 25), + MAX_TEXTURE3D_DEPTH("maxTexture3DDepth", 26), + MAX_TEXTURE2D_LAYERED_WIDTH("maxTexture2DLayeredWidth", 27), + MAX_TEXTURE2D_LAYERED_HEIGHT("maxTexture2DLayeredHeight", 28), + MAX_TEXTURE2D_LAYERED_LAYERS("maxTexture2DLayeredLayers", 29), + 
SURFACE_ALIGNMENT("surfaceAlignment", 30), + CONCURRENT_KERNELS("concurrentKernels", 31), + ECC_ENABLED("eccEnabled", 32), + PCI_BUS_ID("pciBusId", 33), + PCI_DEVICE_ID("pciDeviceId", 34), + TCC_DRIVER("tccDriver", 35), + MEMORY_CLOCK_RATE("memoryClockRate", 36), + GLOBAL_MEMORY_BUS_WIDTH("globalMemoryBusWidth", 37), + L2_CACHE_SIZE("l2CacheSize", 38), + MAX_THREADS_PER_MULTIPROCESSOR("maxThreadsPerMultiProcessor", 39), + ASYNC_ENGINE_COUNT("asyncEngineCount", 40), + UNIFIED_ADDRESSING("unifiedAddressing", 41), + MAX_TEXTURE1D_LAYERED_WIDTH("maxTexture1DLayeredWidth", 42), + MAX_TEXTURE1D_LAYERED_LAYERS("maxTexture1DLayeredLayers", 43), + MAX_TEXTURE2D_GATHER_WIDTH("maxTexture2DGatherWidth", 45), + MAX_TEXTURE2D_GATHER_HEIGHT("maxTexture2DGatherHeight", 46), + MAX_TEXTURE3D_WIDTH_ALT("maxTexture3DWidthAlt", 47), + MAX_TEXTURE3D_HEIGHT_ALT("maxTexture3DHeightAlt", 48), + MAX_TEXTURE3D_DEPTH_ALT("maxTexture3DDepthAlt", 49), + PCI_DOMAIN_ID("pciDomainId", 50), + TEXTURE_PITCH_ALIGNMENT("texturePitchAlignment", 51), + MAX_TEXTURE_CUBEMAP_WIDTH("maxTextureCubemapWidth", 52), + MAX_TEXTURE_CUBEMAP_LAYERED_WIDTH("maxTextureCubemapLayeredWidth", 53), + MAX_TEXTURE_CUBEMAP_LAYERED_LAYERS("maxTextureCubemapLayeredLayers", 54), + MAX_SURFACE1D_WIDTH("maxSurface1DWidth", 55), + MAX_SURFACE2D_WIDTH("maxSurface2DWidth", 56), + MAX_SURFACE2D_HEIGHT("maxSurface2DHeight", 57), + MAX_SURFACE3D_WIDTH("maxSurface3DWidth", 58), + MAX_SURFACE3D_HEIGHT("maxSurface3DHeight", 59), + MAX_SURFACE3D_DEPTH("maxSurface3DDepth", 60), + MAX_SURFACE1D_LAYERED_WIDTH("maxSurface1DLayeredWidth", 61), + MAX_SURFACE1D_LAYERED_LAYERS("maxSurface1DLayeredLayers", 62), + MAX_SURFACE2D_LAYERED_WIDTH("maxSurface2DLayeredWidth", 63), + MAX_SURFACE2D_LAYERED_HEIGHT("maxSurface2DLayeredHeight", 64), + MAX_SURFACE2D_LAYERED_LAYERS("maxSurface2DLayeredLayers", 65), + MAX_SURFACE_CUBEMAP_WIDTH("maxSurfaceCubemapWidth", 66), + MAX_SURFACE_CUBEMAP_LAYERED_WIDTH("maxSurfaceCubemapLayeredWidth", 67), + 
MAX_SURFACE_CUBEMAP_LAYERED_LAYERS("maxSurfaceCubemapLayeredLayers", 68), + MAX_TEXTURE1D_LINEAR_WIDTH("maxTexture1DLinearWidth", 69), + MAX_TEXTURE2D_LINEAR_WIDTH("maxTexture2DLinearWidth", 70), + MAX_TEXTURE2D_LINEAR_HEIGHT("maxTexture2DLinearHeight", 71), + MAX_TEXTURE2D_LINEAR_PITCH("maxTexture2DLinearPitch", 72), + MAX_TEXTURE2D_MIPMAPPED_WIDTH("maxTexture2DMipmappedWidth", 73), + MAX_TEXTURE2D_MIPMAPPED_HEIGHT("maxTexture2DMipmappedHeight", 74), + COMPUTE_CAPABILITY_MAJOR("computeCapabilityMajor", 75), + COMPUTE_CAPABILITY_MINOR("computeCapabilityMinor", 76), + MAX_TEXTURE1D_MIPMAPPED_WIDTH("maxTexture1DMipmappedWidth", 77), + STREAM_PRIORITIES_SUPPORTED("streamPrioritiesSupported", 78), + GLOBAL_L1_CACHE_SUPPORTED("globalL1CacheSupported", 79), + LOCAL_L1_CACHE_SUPPORTED("localL1CacheSupported", 80), + MAX_SHARED_MEMORY_PER_MULTIPROCESSOR("maxSharedMemoryPerMultiprocessor", 81), + MAX_REGISTERS_PER_MULTIPROCESSOR("maxRegistersPerMultiprocessor", 82), + MANAGED_MEMORY("managedMemory", 83), + IS_MULTI_GPU_BOARD("isMultiGpuBoard", 84), + MULTI_GPU_BOARD_GROUP_ID("multiGpuBoardGroupID", 85), + HOST_NATIVE_ATOMIC_SUPPORTED("hostNativeAtomicSupported", 86), + SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO("singleToDoublePrecisionPerfRatio", 87), + PAGEABLE_MEMORY_ACCESS("pageableMemoryAccess", 88), + CONCURRENT_MANAGED_ACCESS("concurrentManagedAccess", 89), + COMPUTE_PREEMPTION_SUPPORTED("computePreemptionSupported", 90), + CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM("canUseHostPointerForRegisteredMem", 91), + COOPERATIVE_LAUNCH("cooperativeLaunch", 95), + COOPERATIVE_MULTI_DEVICE_LAUNCH("cooperativeMultiDeviceLaunch", 96), + MAX_SHARED_MEMORY_PER_BLOCK_OPTIN("maxSharedMemoryPerBlockOptin", 97), + CAN_FLUSH_REMOTE_WRITES("canFlushRemoteWrites", 98), + HOST_REGISTER_SUPPORTED("hostRegisterSupported", 99), + PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES("pageableMemoryAccessUsesHostPageTables", 100), + DIRECT_MANAGED_MEM_ACCESS_FROM_HOST("directManagedMemAccessFromHost", 101); + 
+ final String attributeName; + final int attributeCode; + + String getAttributeName() { + return attributeName; + } + + int getAttributeCode() { + return attributeCode; + } + + CUDADeviceAttribute(String name, int code) { + this.attributeName = name; + this.attributeCode = code; + } + } + + final class CUModule implements AutoCloseable { + final String cubinFile; + /** Pointer to the native CUmodule object. */ + final long modulePointer; + boolean closed = false; + + CUModule(String cubinFile, long modulePointer) { + this.cubinFile = cubinFile; + this.modulePointer = modulePointer; + this.closed = false; + } + + public long getModulePointer() { + if (closed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(String.format("cannot get module pointer, module (%016x) already closed", modulePointer)); + } + return modulePointer; + } + + public boolean isClosed() { + return closed; + } + + @Override + public boolean equals(Object other) { + if (other instanceof CUModule) { + CUModule otherModule = (CUModule) other; + return otherModule.cubinFile.equals(cubinFile) && otherModule.closed == closed; + } else { + return false; + } + } + + @Override + public int hashCode() { + return cubinFile.hashCode(); + } + + @Override + public void close() { + if (!closed) { + cuModuleUnload(this); + closed = true; + } + } + } +} + +final class DeviceMemoryInfo { + private final long freeBytes; + private final long totalBytes; + + DeviceMemoryInfo(long freeBytes, long totalBytes) { + this.freeBytes = freeBytes; + this.totalBytes = totalBytes; + } + + public long getFreeBytes() { + return freeBytes; + } + + public long getTotalBytes() { + return totalBytes; + } + + @Override + public String toString() { + return String.format("DeviceMemoryInfo(freeBytes=%d bytes, totalBytes=%d bytes)", freeBytes, totalBytes); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/ConfiguredKernel.java 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/ConfiguredKernel.java new file mode 100644 index 00000000..499c3bbe --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/ConfiguredKernel.java @@ -0,0 +1,262 @@ +/* + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime; + +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArray; +import com.nvidia.grcuda.runtime.computation.KernelExecution; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.library.CachedLibrary; +import com.oracle.truffle.api.library.ExportLibrary; +import com.oracle.truffle.api.library.ExportMessage; + +@ExportLibrary(InteropLibrary.class) +public class ConfiguredKernel extends ProfiledComputation implements TruffleObject { + + private final Kernel kernel; + + private final KernelConfig config; + + public ConfiguredKernel(Kernel kernel, KernelConfig config) { + this.kernel = kernel; + this.config = config; + } + + @ExportMessage + boolean isExecutable() { + return true; + } + + /** + * Parse the input arguments of the kernel and map them to the signature, making sure that the signature is respected + * @param 
args list of input arguments to the kernel + * @param booleanAccess used to parse boolean inputs + * @param int8Access used to parse char inputs + * @param int16Access used to parse short integer inputs + * @param int32Access used to parse integer inputs + * @param int64Access used to parse long integer inputs + * @param doubleAccess used to parse double and float inputs + * @return the object that wraps the kernel signature and arguments + * @throws UnsupportedTypeException if one of the inputs does not respect the signature + * @throws ArityException if the number of inputs does not respect the signature + */ + KernelArguments createKernelArguments(Object[] args, InteropLibrary booleanAccess, + InteropLibrary int8Access, InteropLibrary int16Access, + InteropLibrary int32Access, InteropLibrary int64Access, InteropLibrary doubleAccess) + throws UnsupportedTypeException, ArityException { + if (args.length != kernel.getKernelParameters().length) { + CompilerDirectives.transferToInterpreter(); + throw ArityException.create(kernel.getKernelParameters().length, kernel.getKernelParameters().length, args.length); + } + KernelArguments kernelArgs = new KernelArguments(args, this.kernel.getKernelParameters()); + for (int paramIdx = 0; paramIdx < kernel.getKernelParameters().length; paramIdx++) { + Object arg = args[paramIdx]; + ComputationArgument param = kernel.getKernelParameters()[paramIdx]; + Type paramType = param.getType(); + try { + if (param.isPointer()) { + if (arg instanceof DeviceArray) { + DeviceArray deviceArray = (DeviceArray) arg; + if (!param.isSynonymousWithPointerTo(deviceArray.getElementType())) { + throw new GrCUDAException("device array of " + deviceArray.getElementType() + " cannot be used as pointer argument " + paramType); + } + UnsafeHelper.PointerObject pointer = UnsafeHelper.createPointerObject(); + pointer.setValueOfPointer(deviceArray.getPointer()); + kernelArgs.setArgument(paramIdx, pointer); + } else if (arg instanceof MultiDimDeviceArray) { + 
MultiDimDeviceArray deviceArray = (MultiDimDeviceArray) arg; + if (!param.isSynonymousWithPointerTo(deviceArray.getElementType())) { + throw new GrCUDAException("multi-dimensional device array of " + + deviceArray.getElementType() + " cannot be used as pointer argument " + paramType); + } + UnsafeHelper.PointerObject pointer = UnsafeHelper.createPointerObject(); + pointer.setValueOfPointer(deviceArray.getPointer()); + kernelArgs.setArgument(paramIdx, pointer); + } else { + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{arg}, "expected DeviceArray type"); + } + } else { + // by-value argument + switch (paramType) { + case BOOLEAN: { + UnsafeHelper.Integer8Object int8 = UnsafeHelper.createInteger8Object(); + int8.setValue(booleanAccess.asBoolean(arg) ? ((byte) 1) : ((byte) 0)); + kernelArgs.setArgument(paramIdx, int8); + break; + } + case SINT8: + case CHAR: { + UnsafeHelper.Integer8Object int8 = UnsafeHelper.createInteger8Object(); + int8.setValue(int8Access.asByte(arg)); + kernelArgs.setArgument(paramIdx, int8); + break; + } + case SINT16: { + UnsafeHelper.Integer16Object int16 = UnsafeHelper.createInteger16Object(); + int16.setValue(int16Access.asShort(arg)); + kernelArgs.setArgument(paramIdx, int16); + break; + } + case SINT32: + case WCHAR: { + UnsafeHelper.Integer32Object int32 = UnsafeHelper.createInteger32Object(); + int32.setValue(int32Access.asInt(arg)); + kernelArgs.setArgument(paramIdx, int32); + break; + } + case SINT64: + case SLL64: + // no larger primitive type than long -> interpret long as unsigned + case UINT64: + case ULL64: { + UnsafeHelper.Integer64Object int64 = UnsafeHelper.createInteger64Object(); + int64.setValue(int64Access.asLong(arg)); + kernelArgs.setArgument(paramIdx, int64); + break; + } + case UINT8: + case CHAR8: { + int uint8 = int16Access.asShort(arg); + if (uint8 < 0 || uint8 > 0xff) { + CompilerDirectives.transferToInterpreter(); + throw createExceptionValueOutOfRange(paramType, 
uint8); + } + UnsafeHelper.Integer8Object int8 = UnsafeHelper.createInteger8Object(); + int8.setValue((byte) (0xff & uint8)); + kernelArgs.setArgument(paramIdx, int8); + break; + } + case UINT16: + case CHAR16: { + int uint16 = int32Access.asInt(arg); + if (uint16 < 0 || uint16 > 0xffff) { + CompilerDirectives.transferToInterpreter(); + throw createExceptionValueOutOfRange(paramType, uint16); + } + UnsafeHelper.Integer16Object int16 = UnsafeHelper.createInteger16Object(); + int16.setValue((short) (0xffff & uint16)); + kernelArgs.setArgument(paramIdx, int16); + break; + } + case UINT32: { + long uint32 = int64Access.asLong(arg); + if (uint32 < 0 || uint32 > 0xffffffffL) { + CompilerDirectives.transferToInterpreter(); + throw createExceptionValueOutOfRange(paramType, uint32); + } + UnsafeHelper.Integer32Object int32 = UnsafeHelper.createInteger32Object(); + int32.setValue((int) (0xffffffffL & uint32)); + kernelArgs.setArgument(paramIdx, int32); + break; + } + case FLOAT: { + UnsafeHelper.Float32Object fp32 = UnsafeHelper.createFloat32Object(); + // going via "double" to allow floats to be initialized with doubles + fp32.setValue((float) doubleAccess.asDouble(arg)); + kernelArgs.setArgument(paramIdx, fp32); + break; + } + case DOUBLE: { + UnsafeHelper.Float64Object fp64 = UnsafeHelper.createFloat64Object(); + fp64.setValue(doubleAccess.asDouble(arg)); + kernelArgs.setArgument(paramIdx, fp64); + break; + } + default: + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{arg}, + "unsupported by-value parameter type: " + paramType); + } + } + } catch (UnsupportedMessageException e) { + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{arg}, + "expected type " + paramType + " in argument " + arg); + } + } + return kernelArgs; + } + + private static GrCUDAException createExceptionValueOutOfRange(Type type, long value) { + return new 
GrCUDAException("value " + value + " is out of range for type " + type); + } + + @ExportMessage + @TruffleBoundary + Object execute(Object[] arguments, + @CachedLibrary(limit = "3") InteropLibrary boolAccess, + @CachedLibrary(limit = "3") InteropLibrary int8Access, + @CachedLibrary(limit = "3") InteropLibrary int16Access, + @CachedLibrary(limit = "3") InteropLibrary int32Access, + @CachedLibrary(limit = "3") InteropLibrary int64Access, + @CachedLibrary(limit = "3") InteropLibrary doubleAccess) throws UnsupportedTypeException, ArityException { + kernel.incrementLaunchCount(); + try (KernelArguments args = this.createKernelArguments(arguments, boolAccess, int8Access, int16Access, + int32Access, int64Access, doubleAccess)) { + // If using a manually specified stream, do not schedule it automatically, but execute it immediately; + if (!config.useCustomStream()) { + new KernelExecution(this, args).schedule(); + } else { + kernel.getGrCUDAExecutionContext().getCudaRuntime().cuLaunchKernel(kernel, config, args, config.getStream()); + } + } + return this; + } + + public Kernel getKernel() { + return kernel; + } + + public KernelConfig getConfig() { + return config; + } + + @Override + public String toString() { + return "ConfiguredKernel(" + kernel.toString() + "; " + config.toString() + ")"; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/Device.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/Device.java similarity index 61% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/Device.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/Device.java index ffd37e5f..21938183 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/Device.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/Device.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. 
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,10 +33,12 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; -import com.nvidia.grcuda.DeviceArray.MemberSet; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.MemberSet; import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.stream.CUDAStream; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.dsl.Cached; import com.oracle.truffle.api.dsl.Cached.Shared; @@ -44,41 +53,132 @@ import com.oracle.truffle.api.library.ExportMessage; import com.oracle.truffle.api.profiles.ValueProfile; +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + + @ExportLibrary(InteropLibrary.class) -public final class Device implements TruffleObject { +public class Device extends AbstractDevice implements TruffleObject { private static final String ID = "id"; private static final String PROPERTIES = "properties"; private static final String IS_CURRENT = "isCurrent"; private static final String SET_CURRENT = "setCurrent"; private static final MemberSet PUBLIC_MEMBERS = new MemberSet(ID, PROPERTIES, IS_CURRENT, SET_CURRENT); - - private final int deviceId; private final GPUDeviceProperties properties; private final CUDARuntime runtime; + /** + * List of streams associated with this device; + */ + private final List<CUDAStream> streams; + /** + * Keep a set of the free available streams; + */ + protected final Set<CUDAStream> freeStreams = new HashSet<>(); + public Device(int deviceId, CUDARuntime runtime) { - this.deviceId = deviceId; + super(deviceId); + if (deviceId < 0) { + throw new GrCUDAException("GPU device must have id >= 0, instead it is " + deviceId); + } this.runtime = runtime; + this.properties = new GPUDeviceProperties(deviceId, runtime); + this.streams = new ArrayList<>(); + } + + /** + * Return a stream (in no particular order) without any active computation on it; + * @return a stream with no active computation on it + */ + public CUDAStream
getFreeStream() { + // Get the first stream available, and remove it from the list of free streams; + if (!freeStreams.isEmpty()) { + CUDAStream stream = freeStreams.iterator().next(); + freeStreams.remove(stream); + return stream; + } else { + throw new GrCUDAException("no free CUDA stream is available on device id=" + deviceId); + } + } + + /** + * Create a new {@link CUDAStream} and add it to the list of streams associated with this device; + */ + public CUDAStream createStream() { + // To create a stream, we need to guarantee that this device is currently active; + if (this.runtime.getCurrentGPU() != this.deviceId) { + this.runtime.cudaSetDevice(this.deviceId); + } + // The stream is not added to the list of free streams: + // a new stream is created only when it is required for a computation, + // so it will be immediately "busy" anyway; + CUDAStream newStream = this.runtime.cudaStreamCreate(this.streams.size()); + this.streams.add(newStream); + return newStream; + } + + /** + * Set a specific CUDA stream as free, so it can be reused; + * @param stream a free CUDA stream + */ + public void updateFreeStreams(CUDAStream stream) { + freeStreams.add(stream); + } + + /** + * Set all streams on this device as free, so they can be reused; + */ + public void updateFreeStreams() { + freeStreams.addAll(streams); + } + + public int getNumberOfFreeStreams() { + return freeStreams.size(); + } + + public int getNumberOfBusyStreams() { + return this.streams.size() - freeStreams.size(); } public GPUDeviceProperties getProperties() { return properties; } + public int getDeviceId() { + return deviceId; + } + + /** + * @return the list of streams associated with this device; + */ + public List<CUDAStream> getStreams() { + return streams; + } + + /** + * Cleanup and deallocate the streams associated with this device; + */ + public void cleanup() { + this.streams.forEach(runtime::cudaStreamDestroy); + this.freeStreams.clear(); + this.streams.clear(); + } + @Override + public String toString() { -
return "Device(id=" + deviceId + ")"; + return "GPU(id=" + deviceId + ")"; } @Override - public boolean equals(Object other) { - if (other instanceof Device) { - Device otherDevice = (Device) other; - return otherDevice.deviceId == deviceId; - } - return false; + public boolean equals(Object o) { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + Device that = (Device) o; + return deviceId == that.deviceId; } @Override @@ -148,6 +248,9 @@ Object invokeMember(String memberName, } } +/** + * Find if the specified device is the one currently in use; + */ @ExportLibrary(InteropLibrary.class) final class IsCurrentFunction implements TruffleObject { private final int deviceId; @@ -169,15 +272,20 @@ boolean isExecutable() { public Object execute(Object[] arguments) throws ArityException { if (arguments.length != 0) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(0, arguments.length); + // FIXME: the maximum number of arguments is unbound (as each argument is a dimension of a N-dimensional tensor). 
+ // Truffle currently uses -1 to handle an unbound number of arguments; + throw ArityException.create(0, -1, arguments.length); } - return runtime.cudaGetDevice() == deviceId; + return runtime.getCurrentGPU() == deviceId; } } +/** + * Set the specified device as the one currently in use; + */ @ExportLibrary(InteropLibrary.class) class SetCurrentFunction implements TruffleObject { - private int deviceId; + private final int deviceId; private final CUDARuntime runtime; SetCurrentFunction(int deviceId, CUDARuntime runtime) { @@ -196,7 +304,7 @@ boolean isExecutable() { public Object execute(Object[] arguments) throws ArityException { if (arguments.length != 0) { CompilerDirectives.transferToInterpreter(); - throw ArityException.create(0, arguments.length); + throw ArityException.create(0, 0, arguments.length); } runtime.cudaSetDevice(deviceId); return NoneValue.get(); diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/DeviceList.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/DeviceList.java similarity index 67% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/DeviceList.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/DeviceList.java index 57e243f3..5e2a179a 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/DeviceList.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/DeviceList.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,9 +33,12 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; +import java.util.ArrayList; +import java.util.Arrays; import java.util.Iterator; +import java.util.List; import java.util.NoSuchElementException; import com.oracle.truffle.api.CompilerDirectives; @@ -39,17 +49,29 @@ import com.oracle.truffle.api.library.ExportMessage; @ExportLibrary(InteropLibrary.class) -public final class DeviceList implements TruffleObject, Iterable<Device> { +public class DeviceList implements TruffleObject, Iterable<Device> { + + protected final List<Device> devices; - private final Device[] devices; + public DeviceList(CUDARuntime runtime) { + this(runtime.getNumberOfAvailableGPUs(), runtime); + } public DeviceList(int numDevices, CUDARuntime runtime) { - devices = new Device[numDevices]; - for (int deviceOrdinal = 0; deviceOrdinal < numDevices; ++deviceOrdinal) { - devices[deviceOrdinal] = new Device(deviceOrdinal, runtime); + devices = Arrays.asList(new Device[numDevices]); + this.initializeDeviceList(numDevices, runtime); + } + + public void initializeDeviceList(int numDevices, CUDARuntime runtime) { + for (int deviceOrdinal = 0; deviceOrdinal < numDevices; deviceOrdinal++) { + devices.set(deviceOrdinal, new Device(deviceOrdinal, runtime)); } } + public List<Device> getDevices() { + return this.devices; + } + // Java
API public Iterator<Device> iterator() { @@ -57,12 +79,12 @@ public Iterator<Device> iterator() { int nextIndex = 0; public boolean hasNext() { - return nextIndex < devices.length; + return nextIndex < devices.size(); } public Device next() { - if (nextIndex < devices.length) { - return devices[nextIndex++]; + if (nextIndex < devices.size()) { + return devices.get(nextIndex++); } else { CompilerDirectives.transferToInterpreter(); throw new NoSuchElementException(); @@ -72,15 +94,22 @@ public Device next() { } public int size() { - return devices.length; + return devices.size(); } public Device getDevice(int deviceOrdinal) { - if ((deviceOrdinal < 0) || (deviceOrdinal >= devices.length)) { + if ((deviceOrdinal < 0) || (deviceOrdinal >= devices.size())) { CompilerDirectives.transferToInterpreter(); throw new IndexOutOfBoundsException(); } - return devices[deviceOrdinal]; + return devices.get(deviceOrdinal); + } + + /** + * Cleanup and deallocate the streams managed by each device; + */ + public void cleanup() { + this.devices.forEach(Device::cleanup); } @Override @@ -108,20 +137,20 @@ boolean hasArrayElements() { @ExportMessage public long getArraySize() { - return devices.length; + return devices.size(); } @ExportMessage boolean isArrayElementReadable(long index) { - return index >= 0 && index < devices.length; + return index >= 0 && index < devices.size(); } @ExportMessage Object readArrayElement(long index) throws InvalidArrayIndexException { - if ((index < 0) || (index >= devices.length)) { + if ((index < 0) || (index >= devices.size())) { CompilerDirectives.transferToInterpreter(); throw InvalidArrayIndexException.create(index); } - return devices[(int) index]; + return devices.get((int) index); } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/DriverAPIErrorMessages.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/DriverAPIErrorMessages.java similarity index 90% rename from
projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/DriverAPIErrorMessages.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/DriverAPIErrorMessages.java index 975a77f6..2ee630b8 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/DriverAPIErrorMessages.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/DriverAPIErrorMessages.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,7 +33,7 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; import java.util.EnumSet; import java.util.HashMap; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/GPUDeviceProperties.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/GPUDeviceProperties.java similarity index 93% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/GPUDeviceProperties.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/GPUDeviceProperties.java index ad97918b..195cba85 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/GPUDeviceProperties.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/GPUDeviceProperties.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,7 +33,7 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; import java.util.Arrays; import java.util.EnumSet; @@ -34,7 +41,7 @@ import java.util.List; import java.util.Optional; -import com.nvidia.grcuda.gpu.CUDARuntime.CUDADeviceAttribute; +import com.nvidia.grcuda.runtime.CUDARuntime.CUDADeviceAttribute; import com.oracle.truffle.api.CompilerDirectives.CompilationFinal; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.InteropLibrary; @@ -168,7 +175,6 @@ private interface DeviceProperty { String getName(); Object getValue(int deviceId, CUDARuntime runtime); - } private static class DeviceAttributeProperty implements DeviceProperty { @@ -197,7 +203,7 @@ private static class DeviceMemoryPropertyAccessor { private Optional info = Optional.empty(); private void getTotalAndFreeDeviceMemory(int deviceId, CUDARuntime runtime) { - int currentDevice = runtime.cudaGetDevice(); + int currentDevice = runtime.getCurrentGPU(); try { if (currentDevice != deviceId) { runtime.cudaSetDevice(deviceId); diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/Kernel.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/Kernel.java new file mode 100644 index 00000000..e8a0c221 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/Kernel.java @@ -0,0 +1,307 @@ +/* + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime; + +import com.nvidia.grcuda.GrCUDAInternalException; +import com.nvidia.grcuda.MemberSet; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.TypeException; +import com.nvidia.grcuda.runtime.CUDARuntime.CUModule; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.dsl.Fallback; +import com.oracle.truffle.api.dsl.Specialization; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.InvalidArrayIndexException; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnknownIdentifierException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.library.CachedLibrary; +import com.oracle.truffle.api.library.ExportLibrary; +import com.oracle.truffle.api.library.ExportMessage; + +@ExportLibrary(InteropLibrary.class) +public class Kernel implements TruffleObject { + + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; + private final String kernelName; + private final String kernelSymbol; + private final List<Long> nativeKernelFunctionHandle; + private final List<CUModule> modules; + private final ComputationArgument[] kernelComputationArguments; + private int launchCount = 0; + private final String ptxCode; + + /** + * Create a kernel without PTX code.
+ * + * @param grCUDAExecutionContext captured reference to the GrCUDA execution context + * @param kernelName name of the kernel as exposed through Truffle + * @param kernelSymbol name of the kernel symbol + * @param kernelFunction native pointer to the kernel function (CUfunction), one pointer for + * each device on which it is loaded + * @param kernelSignature signature string of the kernel (NFI or NIDL) + * @param modules CUmodules that contain the kernel function, one for each device + */ + public Kernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String kernelName, + String kernelSymbol, List<Long> kernelFunction, + String kernelSignature, List<CUModule> modules) { + this(grCUDAExecutionContext, kernelName, kernelSymbol, kernelFunction, kernelSignature, modules, ""); + } + + /** + * Create a kernel and hold on to the PTX code. + * + * @param grCUDAExecutionContext captured reference to the GrCUDA execution context + * @param kernelName name of the kernel as exposed through Truffle + * @param kernelSymbol name of the kernel symbol + * @param kernelFunction native pointer to the kernel function (CUfunction), one pointer for + * each device on which it is loaded + * @param kernelSignature signature string of the kernel (NFI or NIDL) + * @param modules CUmodules that contain the kernel function, one for each device + * @param ptx PTX source code for the kernel.
+ */ + public Kernel(AbstractGrCUDAExecutionContext grCUDAExecutionContext, String kernelName, String kernelSymbol, + List<Long> kernelFunction, String kernelSignature, List<CUModule> modules, String ptx) { + try { + List<ComputationArgument> paramList = ComputationArgument.parseParameterSignature(kernelSignature); + ComputationArgument[] params = new ComputationArgument[paramList.size()]; + this.kernelComputationArguments = paramList.toArray(params); + } catch (TypeException e) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(e.getMessage()); + } + this.grCUDAExecutionContext = grCUDAExecutionContext; + this.kernelName = kernelName; + this.kernelSymbol = kernelSymbol; + this.nativeKernelFunctionHandle = kernelFunction; + this.modules = modules; + this.ptxCode = ptx; + this.grCUDAExecutionContext.registerKernel(this); + } + + public void incrementLaunchCount() { + launchCount++; + } + + public AbstractGrCUDAExecutionContext getGrCUDAExecutionContext() { + return grCUDAExecutionContext; + } + + public ComputationArgument[] getKernelParameters() { + return kernelComputationArguments; + } + + public long getKernelFunctionHandle(int deviceId) { + if (modules.get(deviceId).isClosed()) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException("CUmodule containing kernel " + kernelName + " is already closed"); + } + return nativeKernelFunctionHandle.get(deviceId); + } + + @Override + public String toString() { + return "Kernel(" + kernelName + ", " + Arrays.toString(kernelComputationArguments) + ", launchCount=" + launchCount + ")"; + } + + public String getPTX() { + return ptxCode; + } + + public String getKernelName() { + return kernelName; + } + + public String getSymbolName() { + return kernelSymbol; + } + + public int getLaunchCount() { + return launchCount; + } + + // implementation of InteropLibrary + + protected static final String PTX = "ptx"; + protected static final String NAME = "name"; + protected static final String LAUNCH_COUNT = "launchCount"; +
static final MemberSet MEMBERS = new MemberSet(PTX, NAME, LAUNCH_COUNT); + + @ExportMessage + @SuppressWarnings("static-method") + boolean hasMembers() { + return true; + } + + @ExportMessage + @SuppressWarnings("static-method") + Object getMembers(@SuppressWarnings("unused") boolean includeInternal) { + return MEMBERS; + } + + @ExportMessage + @SuppressWarnings("static-method") + boolean isMemberReadable(String member) { + return PTX.equals(member) || NAME.equals(member) || LAUNCH_COUNT.equals(member); + } + + @ExportMessage + @SuppressWarnings("unused") + abstract static class ReadMember { + @Specialization(guards = "PTX.equals(member)") + public static String readMemberPtx(Kernel receiver, String member) { + String ptx = receiver.getPTX(); + if (ptx == null) { + return ""; + } else { + return ptx; + } + } + + @Specialization(guards = "NAME.equals(member)") + public static String readMemberName(Kernel receiver, String member) { + return receiver.getKernelName(); + } + + @Specialization(guards = "LAUNCH_COUNT.equals(member)") + public static int readMemberLaunchCount(Kernel receiver, String member) { + return receiver.getLaunchCount(); + } + + @Fallback + public static Object readMemberOther(Kernel receiver, String member) throws UnknownIdentifierException { + throw UnknownIdentifierException.create(member); + } + } + + private static int extractNumber(Object valueObj, String argumentName, InteropLibrary access) throws UnsupportedTypeException { + try { + return access.asInt(valueObj); + } catch (UnsupportedMessageException e) { + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{valueObj}, "integer expected for " + argumentName); + } + } + + private static Dim3 extractDim3(Object valueObj, String argumentName, InteropLibrary access, InteropLibrary elementAccess) throws UnsupportedTypeException { + if (access.hasArrayElements(valueObj)) { + long size; + try { + size = access.getArraySize(valueObj); + } catch 
(UnsupportedMessageException e) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAInternalException("unexpected behavior"); + } + if (size < 1 || size > 3) { + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{valueObj}, argumentName + " needs to have between 1 and 3 elements"); + } + int[] dim3 = new int[]{1, 1, 1}; + final char[] suffix = {'x', 'y', 'z'}; + for (int i = 0; i < size; i++) { + Object elementObj; + try { + elementObj = access.readArrayElement(valueObj, i); + } catch (UnsupportedMessageException e) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAInternalException("unexpected behavior"); + } catch (InvalidArrayIndexException e) { + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{valueObj}, argumentName + " needs to have between 1 and 3 elements"); + } + dim3[i] = extractNumber(elementObj, "dim3." + suffix[i], elementAccess); + } + return new Dim3(dim3[0], dim3[1], dim3[2]); + } + return new Dim3(extractNumber(valueObj, argumentName, access)); + } + + private static CUDAStream extractStream(Object streamObj) throws UnsupportedTypeException { + if (streamObj instanceof CUDAStream) { + return (CUDAStream) streamObj; + } else { + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{streamObj}, "expected CUDAStream type, received " + streamObj.getClass()); + } + } + + @ExportMessage + @SuppressWarnings("static-method") + boolean isExecutable() { + return true; + } + + @ExportMessage + Object execute(Object[] arguments, + @CachedLibrary(limit = "3") InteropLibrary gridSizeAccess, + @CachedLibrary(limit = "3") InteropLibrary gridSizeElementAccess, + @CachedLibrary(limit = "3") InteropLibrary blockSizeAccess, + @CachedLibrary(limit = "3") InteropLibrary blockSizeElementAccess, + @CachedLibrary(limit = "3") InteropLibrary sharedMemoryAccess) throws UnsupportedTypeException, ArityException 
{ + if (arguments.length < 2 || arguments.length > 4) { + CompilerDirectives.transferToInterpreter(); + throw ArityException.create(2, 4, arguments.length); + } + + Dim3 gridSize = extractDim3(arguments[0], "gridSize", gridSizeAccess, gridSizeElementAccess); + Dim3 blockSize = extractDim3(arguments[1], "blockSize", blockSizeAccess, blockSizeElementAccess); + KernelConfigBuilder configBuilder = new KernelConfigBuilder(gridSize, blockSize); + if (arguments.length == 3) { + if (sharedMemoryAccess.isNumber(arguments[2])) { + // Dynamic shared memory specified; + configBuilder.dynamicSharedMemoryBytes(extractNumber(arguments[2], "dynamicSharedMemory", sharedMemoryAccess)); + } else { + // Stream specified; + configBuilder.stream(extractStream(arguments[2])); + } + } else if (arguments.length == 4) { + configBuilder.dynamicSharedMemoryBytes(extractNumber(arguments[2], "dynamicSharedMemory", sharedMemoryAccess)); + // Stream specified; + configBuilder.stream(extractStream(arguments[3])); + } + return new ConfiguredKernel(this, configBuilder.build()); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelArguments.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelArguments.java new file mode 100644 index 00000000..d39a1750 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelArguments.java @@ -0,0 +1,99 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime; + +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; + +import java.io.Closeable; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +public final class KernelArguments implements Closeable { + + private final Object[] originalArgs; + /** + * Associate each input object with the characteristics of its argument, such as its type and whether it is constant; + */ + private final List<ComputationArgumentWithValue> kernelArgumentWithValues = new ArrayList<>(); + private final UnsafeHelper.PointerArray argumentArray; + private final ArrayList<Closeable> argumentValues = new ArrayList<>(); + + public KernelArguments(Object[] args, ComputationArgument[] kernelArgumentList) { + this.originalArgs = args; + this.argumentArray = UnsafeHelper.createPointerArray(args.length); + assert(args.length == kernelArgumentList.length); + // Initialize the list of arguments and object references; + for (int i = 0; i < args.length; i++) { + kernelArgumentWithValues.add(new ComputationArgumentWithValue(kernelArgumentList[i], args[i])); + } + } + + public void setArgument(int argIdx, UnsafeHelper.MemoryObject obj) { + argumentArray.setValueAt(argIdx, obj.getAddress()); + argumentValues.add(obj); + } + + long getPointer() { + return argumentArray.getAddress(); + } + + public Object[] getOriginalArgs() { + return originalArgs; + } + + public Object getOriginalArg(int index) { + return originalArgs[index]; + } + + public List<ComputationArgumentWithValue> getKernelArgumentWithValues() { + return kernelArgumentWithValues; + } + + @Override + public String toString() { + return "KernelArgs=" + Arrays.toString(originalArgs); + } + + @Override + public void close() { + this.argumentArray.close(); + for (Closeable c : argumentValues) { + try { + c.close(); + } catch (IOException e) { + /* ignored */ + } + } + } +} diff --git 
a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/KernelConfig.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelConfig.java similarity index 77% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/KernelConfig.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelConfig.java index f4c4a069..d0eca94a 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/KernelConfig.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelConfig.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,40 +33,33 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; + +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.oracle.truffle.api.CompilerAsserts; import java.util.Arrays; import java.util.Objects; -import com.oracle.truffle.api.CompilerAsserts; - public final class KernelConfig { private final Dim3 gridSize; private final Dim3 blockSize; private final int dynamicSharedMemoryBytes; + private CUDAStream stream; + private final boolean useCustomStream; - public KernelConfig(int numBlocks, int numThreadsPerBlock) { - gridSize = new Dim3(numBlocks); - blockSize = new Dim3(numThreadsPerBlock); - dynamicSharedMemoryBytes = 0; - } - - public KernelConfig(Dim3 gridSize, Dim3 blockSize) { - this.gridSize = gridSize; - this.blockSize = blockSize; - this.dynamicSharedMemoryBytes = 0; - } - - public KernelConfig(Dim3 gridSize, Dim3 blockSize, int sharedMemoryBytes) { + public KernelConfig(Dim3 gridSize, Dim3 blockSize, int sharedMemoryBytes, CUDAStream stream, boolean useCustomStream) { this.gridSize = gridSize; this.blockSize = blockSize; this.dynamicSharedMemoryBytes = sharedMemoryBytes; + this.stream = stream; + this.useCustomStream = useCustomStream; } @Override public String toString() { return "KernelConfig(gridSize=" + gridSize + ", blockSize=" + blockSize + - ", sharedMemoryBytes=" + dynamicSharedMemoryBytes + ", stream=" + getStream() + ')'; + ", sharedMemoryBytes=" + dynamicSharedMemoryBytes + (useCustomStream ? 
", stream=" + getStream() : "") + ')' ; } public Dim3 getGridSize() { @@ -75,8 +75,16 @@ public int getDynamicSharedMemoryBytes() { } @SuppressWarnings("static-method") - public int getStream() { - return 0; // default stream + public CUDAStream getStream() { + return stream; + } + + public void setStream(CUDAStream stream) { + this.stream = stream; + } + + public boolean useCustomStream() { + return useCustomStream; } @Override @@ -89,9 +97,9 @@ public boolean equals(Object o) { } KernelConfig that = (KernelConfig) o; return dynamicSharedMemoryBytes == that.dynamicSharedMemoryBytes && - getStream() == that.getStream() && + getStream().equals(that.getStream()) && gridSize.equals(that.gridSize) && - blockSize.equals(that.blockSize); + blockSize.equals(that.blockSize) && stream.equals(that.stream); } @Override diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelConfigBuilder.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelConfigBuilder.java new file mode 100644 index 00000000..28fb59ee --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/KernelConfigBuilder.java @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime; + +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.stream.DefaultStream; + +public class KernelConfigBuilder { + + private final Dim3 gridSize; + private final Dim3 blockSize; + private int dynamicSharedMemoryBytes = 0; + private CUDAStream stream = DefaultStream.get(); + private boolean useCustomStream = false; + + KernelConfigBuilder(Dim3 gridSize, Dim3 blockSize) { + this.gridSize = gridSize; + this.blockSize = blockSize; + } + + public static KernelConfigBuilder newBuilder(Dim3 gridSize, Dim3 blockSize) { + return new KernelConfigBuilder(gridSize, blockSize); + } + + public KernelConfigBuilder dynamicSharedMemoryBytes(int bytes) { + this.dynamicSharedMemoryBytes = bytes; + return this; + } + + public KernelConfigBuilder stream(CUDAStream stream) { + this.stream = stream; + this.useCustomStream = true; + return this; + } + + public KernelConfig build() { + return new KernelConfig(gridSize, blockSize, 
dynamicSharedMemoryBytes, stream, useCustomStream); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/LazyKernel.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/LazyKernel.java similarity index 82% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/LazyKernel.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/LazyKernel.java index 3a4d9041..cb2c2a15 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/LazyKernel.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/LazyKernel.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,10 +33,11 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
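`KernelConfigBuilder` above records whether the caller explicitly selected a stream via the `useCustomStream` flag, so downstream code (e.g. `KernelConfig.toString()`) can distinguish "default stream" from "user-chosen stream". A self-contained sketch of this default-tracking builder pattern, using stand-in types rather than the real GrCUDA classes:

```java
// Sketch of a builder that remembers whether an optional field was set
// explicitly, mirroring KernelConfigBuilder's useCustomStream flag.
final class ConfigBuilderSketch {
    private int sharedMemoryBytes = 0;
    private String stream = "DEFAULT";   // stand-in for DefaultStream.get()
    private boolean useCustomStream = false;

    ConfigBuilderSketch sharedMemory(int bytes) {
        this.sharedMemoryBytes = bytes;
        return this;
    }

    ConfigBuilderSketch stream(String s) {
        this.stream = s;
        this.useCustomStream = true;     // only explicit calls flip the flag
        return this;
    }

    String build() {
        // The stream is only reported when the caller picked one explicitly;
        return "cfg(shm=" + sharedMemoryBytes + (useCustomStream ? ", stream=" + stream : "") + ")";
    }
}
```

This keeps the two-argument "grid + block" launch path unchanged while letting later positional arguments opt into shared memory and custom streams.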
*/ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; import com.nvidia.grcuda.KernelBinding; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.InteropLibrary; import com.oracle.truffle.api.interop.TruffleObject; @@ -45,12 +53,12 @@ public final class LazyKernel implements TruffleObject { public static final InteropLibrary INTEROP = InteropLibrary.getFactory().getUncached(); private final KernelBinding binding; - private final CUDARuntime cudaRuntime; + private final AbstractGrCUDAExecutionContext grCUDAExecutionContext; private Kernel kernel; - public LazyKernel(KernelBinding binding, CUDARuntime runtime) { + public LazyKernel(KernelBinding binding, AbstractGrCUDAExecutionContext grCUDAExecutionContext) { this.binding = binding; - this.cudaRuntime = runtime; + this.grCUDAExecutionContext = grCUDAExecutionContext; } public String getKernelName() { @@ -102,7 +110,7 @@ Object execute(Object[] arguments, private void assertKernelLoaded() { synchronized (this) { if (kernel == null) { - kernel = cudaRuntime.loadKernel(binding); + kernel = grCUDAExecutionContext.loadKernel(binding); assert kernel != null : "Loaded kernel non-null"; } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/LittleEndianNativeArrayView.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/LittleEndianNativeArrayView.java similarity index 89% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/LittleEndianNativeArrayView.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/LittleEndianNativeArrayView.java index 3d11e8cf..d316aa81 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/LittleEndianNativeArrayView.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/LittleEndianNativeArrayView.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. 
All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,7 +33,7 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; import com.nvidia.grcuda.Type; import sun.misc.Unsafe; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/NVRTCException.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/NVRTCException.java similarity index 74% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/NVRTCException.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/NVRTCException.java index 199482c4..289ab369 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/NVRTCException.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/NVRTCException.java @@ -1,5 +1,6 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. 
+ * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -12,6 +13,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -25,12 +32,11 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; -import com.oracle.truffle.api.TruffleException; -import com.oracle.truffle.api.nodes.Node; +import com.oracle.truffle.api.exception.AbstractTruffleException; -public class NVRTCException extends RuntimeException implements TruffleException { +public class NVRTCException extends AbstractTruffleException { private static final long serialVersionUID = 7687673079396178282L; @@ -41,10 +47,4 @@ public NVRTCException(int errorCode, String functionName) { public NVRTCException(int errorCode, String message, String functionName) { super(message + '(' + errorCode + ") in " + functionName); } - - @Override - public Node getLocation() { - // null = location not available - return null; - } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/NVRuntimeCompiler.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/NVRuntimeCompiler.java similarity index 95% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/NVRuntimeCompiler.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/NVRuntimeCompiler.java index 4057b814..88baf0e3 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/NVRuntimeCompiler.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/NVRuntimeCompiler.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,15 +33,15 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; import java.io.PrintStream; import java.util.ArrayList; import com.nvidia.grcuda.GrCUDAInternalException; import com.nvidia.grcuda.GrCUDALanguage; -import com.nvidia.grcuda.gpu.UnsafeHelper.PointerArray; -import com.nvidia.grcuda.gpu.UnsafeHelper.StringObject; +import com.nvidia.grcuda.runtime.UnsafeHelper.PointerArray; +import com.nvidia.grcuda.runtime.UnsafeHelper.StringObject; import com.oracle.truffle.api.CompilerAsserts; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; import com.oracle.truffle.api.interop.InteropException; @@ -61,7 +68,7 @@ public PTXKernel compileKernel(String code, String kernelName, String moduleName NVRTCResult compileResult = nvrtcCompileProgram(program, compileOpts); if (compileResult != NVRTCResult.NVRTC_SUCCESS) { String compileLog = getProgramLog(program); - PrintStream err = new PrintStream(GrCUDALanguage.getCurrentLanguage().getContextReference().get().getEnv().err()); + PrintStream err = new PrintStream(GrCUDALanguage.getCurrentContext().getEnv().err()); err.println("compile result: " + compileResult); err.println("program log: " + compileLog); throw new NVRTCException(compileResult.errorCode, compileLog); @@ -108,13 +115,14 @@ public NVRTCResult 
nvrtcCompileProgram(NVRTCProgram program, String... opts) { if (opts.length == 0) { return nvrtcCompileProgramInternal(program, 0, 0L); } else { - ArrayList<StringObject> optCStrings = new ArrayList<>(opts.length); + ArrayList<StringObject> optCStrings = new ArrayList<>(); try (UnsafeHelper.PointerArray optCStringArr = new PointerArray(opts.length)) { int idx = 0; for (String optString : opts) { UnsafeHelper.StringObject cString = UnsafeHelper.StringObject.fromJavaString(optString); optCStrings.add(cString); optCStringArr.setValueAt(idx, cString.getAddress()); + idx++; } NVRTCResult result = nvrtcCompileProgramInternal(program, opts.length, optCStringArr.getAddress()); diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/OffheapMemory.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/OffheapMemory.java similarity index 84% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/OffheapMemory.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/OffheapMemory.java index e94de30f..4fd8bd93 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/OffheapMemory.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/OffheapMemory.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,7 +33,7 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; import java.lang.reflect.Field; import sun.misc.Unsafe; diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/ProfiledComputation.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/ProfiledComputation.java new file mode 100644 index 00000000..79a4bee2 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/ProfiledComputation.java @@ -0,0 +1,61 @@ +/* + * Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime; + +import java.util.HashMap; +import java.util.ArrayList; +import java.util.List; + +/** + * Abstract class that stores the historical execution data of a given computation (for example, a certain GPU kernel). + * Classes that need to store execution times or other profiling information should use this class. 
+ * How the execution time (or other information) is actually measured is not specified by this class, + * which simply defines how such information is stored for future utilization; + */ +public abstract class ProfiledComputation { + + // Track all the execution times associated with the GPU on which a kernel was executed; + HashMap<Integer, List<Float>> collectionOfExecutions; + + public ProfiledComputation() { + collectionOfExecutions = new HashMap<>(); + } + + public void addExecutionTime(int deviceId, float executionTime) { + collectionOfExecutions.putIfAbsent(deviceId, new ArrayList<>()); + collectionOfExecutions.get(deviceId).add(executionTime); + } + + public List<Float> getExecutionTimesOnDevice(int deviceId) throws RuntimeException { + if (collectionOfExecutions.containsKey(deviceId)) { + return collectionOfExecutions.get(deviceId); + } else { + throw new RuntimeException("Execution times for device=" + deviceId + " have not been collected"); + } + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/UnsafeHelper.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/UnsafeHelper.java similarity index 92% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/UnsafeHelper.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/UnsafeHelper.java index 784ae0b0..51988ea2 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/gpu/UnsafeHelper.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/UnsafeHelper.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
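`ProfiledComputation` keeps a per-device list of execution times; as the changelog notes, this information is intended to feed future history-based adaptive scheduling policies. A hypothetical sketch (illustrative only, not part of the patch) of how such a policy could consume the stored times, picking the device with the lowest observed mean:

```java
import java.util.List;
import java.util.Map;

// Hypothetical consumer of the per-device timing history: choose the device
// whose recorded kernel times have the lowest mean. Returns -1 if the map is empty.
final class HistoryBasedChoiceSketch {
    static int fastestDevice(Map<Integer, List<Float>> timesPerDevice) {
        int best = -1;
        double bestMean = Double.MAX_VALUE;
        for (Map.Entry<Integer, List<Float>> e : timesPerDevice.entrySet()) {
            // Average the recorded times for this device;
            double mean = e.getValue().stream()
                    .mapToDouble(Float::doubleValue)
                    .average()
                    .orElse(Double.MAX_VALUE);
            if (mean < bestMean) {
                bestMean = mean;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A real policy would likely weight recent executions more heavily and fall back to round-robin when a device has no history yet; this sketch only shows the shape of the lookup.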
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,11 +33,12 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ -package com.nvidia.grcuda.gpu; +package com.nvidia.grcuda.runtime; import java.io.ByteArrayOutputStream; import java.lang.reflect.Field; import java.nio.charset.Charset; +import java.nio.charset.StandardCharsets; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary; @@ -151,7 +159,7 @@ static final class StringObject extends MemoryObject { @TruffleBoundary static StringObject fromJavaString(String javaString) { - byte[] bytes = javaString.getBytes(Charset.forName("ISO-8859-1")); + byte[] bytes = javaString.getBytes(StandardCharsets.ISO_8859_1); StringObject so = new StringObject(bytes.length + 1); // + 1 for \NULL terminator for (int i = 0; i < bytes.length; i++) { unsafe.putByte(so.getAddress() + i, bytes[i]); @@ -242,7 +250,7 @@ public void setValue(int value) { public static final class Integer64Object extends MemoryObject { - Integer64Object() { + public Integer64Object() { super(unsafe.allocateMemory(8)); } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/AbstractArray.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/AbstractArray.java new file mode 100644 index 00000000..7bb8b66a --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/AbstractArray.java @@ -0,0 +1,514 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.array; + +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.MemberSet; +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.functions.DeviceArrayCopyFunction; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.LittleEndianNativeArrayView; +import com.nvidia.grcuda.runtime.computation.arraycomputation.ArrayAccessExecution; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.stream.DefaultStream; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.dsl.Cached; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.InvalidArrayIndexException; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnknownIdentifierException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.library.CachedLibrary; +import com.oracle.truffle.api.library.ExportLibrary; +import com.oracle.truffle.api.library.ExportMessage; +import com.oracle.truffle.api.profiles.ValueProfile; + +import java.util.HashSet; +import java.util.Set; + +/** + * Simple wrapper around each class that represents device arrays in GrCUDA. + * It can be used to keep track of generic arrays during execution, and monitor dependencies. 
+ */ +@ExportLibrary(InteropLibrary.class) +public abstract class AbstractArray implements TruffleObject { + + protected static final String POINTER = "pointer"; + protected static final String COPY_FROM = "copyFrom"; + protected static final String COPY_TO = "copyTo"; + protected static final String FREE = "free"; + protected static final String IS_MEMORY_FREED = "isMemoryFreed"; + protected static final String ACCESSED_FREED_MEMORY_MESSAGE = "memory of array freed"; + + protected static final MemberSet PUBLIC_MEMBERS = new MemberSet(COPY_FROM, COPY_TO, FREE, IS_MEMORY_FREED); + protected static final MemberSet MEMBERS = new MemberSet(POINTER, COPY_FROM, COPY_TO, FREE, IS_MEMORY_FREED); + + /** + * Reference to the underlying CUDA runtime that manages the array memory. + */ + protected final AbstractGrCUDAExecutionContext grCUDAExecutionContext; + + /** + * Data type of elements stored in the array. + */ + protected final Type elementType; + + /** + * True IFF the array has been registered in {@link AbstractGrCUDAExecutionContext}. + * Used to avoid multiple registration; + */ + private boolean registeredInContext = false; + + /** + * Keep track of whether this array is attached to a specific stream that limits its visibility. + * By default, every array is attached to the {@link DefaultStream}; + */ + protected CUDAStream streamMapping = DefaultStream.get(); + + /** + * Function used to compute if we can skip the scheduling of a computational element for a given array read; + */ + private final SkipSchedulingInterface skipScheduleRead; + /** + * Function used to compute if we can skip the scheduling of a computational element for a given array write; + */ + private final SkipSchedulingInterface skipScheduleWrite; + + /** Flag set when underlying off-heap memory has been freed. */ + protected boolean arrayFreed = false; + + /** + * List of devices where the array is currently UP-TO-DATE, i.e. it can be accessed without requiring any memory transfer. 
+ * On pre-Pascal GPUs, arrays are allocated on the currently active GPU. On devices since Pascal, arrays are allocated on the CPU. + * We identify devices using integers. CPU is -1 ({@link com.nvidia.grcuda.runtime.CPUDevice#CPU_DEVICE_ID}), GPUs start from 0; + */ + protected final Set<Integer> arrayUpToDateLocations = new HashSet<>(); + + public Type getElementType() { + return elementType; + } + + protected AbstractArray(AbstractGrCUDAExecutionContext grCUDAExecutionContext, Type elementType) { + this.grCUDAExecutionContext = grCUDAExecutionContext; + this.elementType = elementType; + + // Specify how we determine whether we can skip the scheduling of array accesses; + if (this.grCUDAExecutionContext.isArchitecturePascalOrNewer()) { + this.skipScheduleRead = this::isArrayUpdatedOnCPU; + this.skipScheduleWrite = this::isArrayUpdatedOnlyOnCPU; + } else { + // On pre-Pascal devices, we cannot access GPU memory while some other computation is active on the default stream; + this.skipScheduleRead = () -> isArrayUpdatedOnCPU() && !(streamMapping.isDefaultStream() && grCUDAExecutionContext.isAnyComputationActive()); + this.skipScheduleWrite = () -> isArrayUpdatedOnlyOnCPU() && !(streamMapping.isDefaultStream() && grCUDAExecutionContext.isAnyComputationActive()); + } + // Initialize the location of an abstract array. + // On pre-Pascal devices, the default location is the current GPU. Since Pascal, it is the CPU.
+ if (this.grCUDAExecutionContext.isArchitecturePascalOrNewer()) { + this.addArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } else { + this.addArrayUpToDateLocations(grCUDAExecutionContext.getCurrentGPU()); + } + } + + protected AbstractArray(AbstractArray otherArray) { + this.grCUDAExecutionContext = otherArray.grCUDAExecutionContext; + this.elementType = otherArray.elementType; + this.skipScheduleRead = otherArray.skipScheduleRead; + this.skipScheduleWrite = otherArray.skipScheduleWrite; + // Initialize the location of an abstract array, copying the ones specified in the input; + this.arrayUpToDateLocations.addAll(otherArray.getArrayUpToDateLocations()); + this.arrayFreed = otherArray.arrayFreed; + this.streamMapping = otherArray.streamMapping; + // Registration must be done afterwards; + this.registeredInContext = false; + } + + /** + * Register the array in {@link AbstractGrCUDAExecutionContext} so that operations on this array + * can be monitored by the runtime. Registration must be done with a separate function at the end of concrete Array classes. + * This is done to avoid leaving the context in an inconsistent state if the concrete constructor throws an exception and fails. + */ + protected void registerArray() { + if (!this.registeredInContext) { + this.grCUDAExecutionContext.registerArray(this); + this.registeredInContext = true; + } + } + + public AbstractGrCUDAExecutionContext getGrCUDAExecutionContext() { + return grCUDAExecutionContext; + } + + public CUDAStream getStreamMapping() { + return streamMapping; + } + + public void setStreamMapping(CUDAStream streamMapping) { + this.streamMapping = streamMapping; + } + + /** + * Tracks whether the array is up-to-date on CPU. + * This happens if the last operation done on the native memory underlying this array is a read/write operation + * handled by the CPU. 
If so, we can avoid creating {@link com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement} + * for array reads that are immediately following the last one, as they are performed synchronously and there is no + * reason to explicitly model them in the {@link ExecutionDAG}; + */ + // FIXME (check if fixed already): Possible error: Array A is up-to-date on CPU and GPU0. There's an ongoing kernel on GPU0 that uses A read-only. + // If we write A on the CPU, is the scheduling skipped? That's an error. + // In the case of a read, no problem (a kernel that modifies the data would take exclusive ownership), + // while in the case of a write we need to check that arrayUpToDateLocations == CPU + public boolean isArrayUpdatedOnCPU() { + return this.arrayUpToDateLocations.contains(CPUDevice.CPU_DEVICE_ID); + } + + /** + * Tracks whether the array is up-to-date only on CPU, and not on other devices. + * This happens if the last operation done on the native memory underlying this array is a read/write operation + * handled by the CPU. If so, we can avoid creating {@link com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement} + * for array accesses that are immediately following the last one, as they are performed synchronously and there is no + * reason to explicitly model them in the {@link ExecutionDAG}. + * To perform a write on the CPU, we need the array to be updated exclusively on the CPU; + */ + public boolean isArrayUpdatedOnlyOnCPU() { + return this.arrayUpToDateLocations.size() == 1 && this.arrayUpToDateLocations.contains(CPUDevice.CPU_DEVICE_ID); + } + + public Set<Integer> getArrayUpToDateLocations() { + return this.arrayUpToDateLocations; + } + + /** + * Reset the list of devices where the array is currently up-to-date, + * and specify a new device where the array is up-to-date.
+ * Used when the array is modified by some device: there should never be a situation where + * the array is not up-to-date on at least one device; + */ + public void resetArrayUpToDateLocations(int deviceId) { + this.arrayUpToDateLocations.clear(); + this.arrayUpToDateLocations.add(deviceId); + } + + public void addArrayUpToDateLocations(int deviceId) { + this.arrayUpToDateLocations.add(deviceId); + } + + /** + * True if this array is up-to-date for the input device; + * @param deviceId a device for which we want to check if this array is up-to-date; + * @return if this array is up-to-date with respect to the input device + */ + public boolean isArrayUpdatedInLocation(int deviceId) { + return this.arrayUpToDateLocations.contains(deviceId); + } + + public abstract long getPointer(); + public abstract long getSizeBytes(); + public abstract void freeMemory(); + + /** + * Access the underlying native memory of the array, as if it were a linear 1D array. + * It can be used to copy chunks of the array without having to perform repeated checks, + * and for the low-level implementation of array accesses + * @param index index used to access the array + * @param elementTypeProfile profiling of the element type, to speed up the native view access + * @return element of the array + */ + public abstract Object readNativeView(long index, @Cached.Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile); + + /** + * Static method to read the native view of an array. 
It can be used to implement the innermost access in {@link AbstractArray#readNativeView}; + * @param nativeView native array representation of the array + * @param index index used to access the array + * @param elementType type of the array, required to know the size of each element + * @param elementTypeProfile profiling of the element type, to speed up the native view access + * @return element of the array + */ + protected static Object readArrayElementNative(LittleEndianNativeArrayView nativeView, long index, Type elementType, + @Cached.Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) { + switch (elementTypeProfile.profile(elementType)) { + case CHAR: + return nativeView.getByte(index); + case SINT16: + return nativeView.getShort(index); + case SINT32: + return nativeView.getInt(index); + case SINT64: + return nativeView.getLong(index); + case FLOAT: + return nativeView.getFloat(index); + case DOUBLE: + return nativeView.getDouble(index); + } + return null; + } + + /** + * Access the underlying native memory of the array, as if it were a linear 1D array. + * It can be used to copy chunks of the array without having to perform repeated checks, + * and for the low-level implementation of array accesses + * @param index index used to access the array + * @param value value to write in the array + * @param valueLibrary interop access of the value, required to understand its type + * @param elementTypeProfile profiling of the element type, to speed up the native view access + * @throws UnsupportedTypeException if writing the wrong type in the array + */ + public abstract void writeNativeView(long index, Object value, @CachedLibrary(limit = "3") InteropLibrary valueLibrary, + @Cached.Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException; + + /** + * Static method to write the native view of an array. 
It can be used to implement the innermost access in {@link AbstractArray#writeNativeView}; + * @param nativeView native array representation of the array + * @param index index used to access the array + * @param value value to write in the array + * @param elementType type of the array, required to know the size of each element + * @param valueLibrary interop access of the value, required to understand its type + * @param elementTypeProfile profiling of the element type, to speed up the native view access + * @throws UnsupportedTypeException if writing the wrong type in the array + */ + public static void writeArrayElementNative(LittleEndianNativeArrayView nativeView, long index, Object value, Type elementType, + @CachedLibrary(limit = "3") InteropLibrary valueLibrary, + @Cached.Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException { + try { + switch (elementTypeProfile.profile(elementType)) { + case CHAR: + nativeView.setByte(index, valueLibrary.asByte(value)); + break; + case SINT16: + nativeView.setShort(index, valueLibrary.asShort(value)); + break; + case SINT32: + nativeView.setInt(index, valueLibrary.asInt(value)); + break; + case SINT64: + nativeView.setLong(index, valueLibrary.asLong(value)); + break; + case FLOAT: + // going via "double" to allow floats to be initialized with doubles + nativeView.setFloat(index, (float) valueLibrary.asDouble(value)); + break; + case DOUBLE: + nativeView.setDouble(index, valueLibrary.asDouble(value)); + break; + } + } catch (UnsupportedMessageException e) { + CompilerDirectives.transferToInterpreter(); + throw UnsupportedTypeException.create(new Object[]{value}, "value cannot be coerced to " + elementType); + } + } + + public boolean isMemoryFreed() { + return arrayFreed; + } + + /** + * Check if this array can be accessed by the host for a read without having to schedule a {@link ArrayAccessExecution}. 
+ * This is possible if the array is up-to-date on the CPU, + * and the array is not exposed on the default stream while other GPU computations are running (on pre-Pascal devices). + * @return if this array can be accessed by the host without scheduling a computation + */ + public boolean canSkipSchedulingRead() { + return this.skipScheduleRead.canSkipScheduling(); + } + + /** + * Check if this array can be accessed by the host for a write without having to schedule a {@link ArrayAccessExecution}. + * This is possible if the array is assumed up-to-date only on the CPU, + * and the array is not exposed on the default stream while other GPU computations are running (on pre-Pascal devices). + * @return if this array can be accessed by the host without scheduling a computation + */ + public boolean canSkipSchedulingWrite() { + return this.skipScheduleWrite.canSkipScheduling(); + } + + protected interface SkipSchedulingInterface { + boolean canSkipScheduling(); + } + + /** + * When working with array views, it is common to require both the pointer to the view, and the pointer + * to the whole array. This method always returns the pointer to the whole array, + * and should be used when dealing with low-level CUDA APIs that cannot handle just part of the array. + * By default, i.e. if the array is not a view of a larger array, + * this function is identical to {@link AbstractArray#getPointer()} + * @return the pointer to the whole array + */ + public long getFullArrayPointer() { + return this.getPointer(); + } + + /** + * When working with array views, it is common to require both the size of the view, and the size of + * the whole array. This method always returns the size (in bytes) of the whole array, + * and should be used when dealing with low-level CUDA APIs that cannot handle just part of the array. + * By default, i.e. 
if the array is not a view of a larger array, + * this function is identical to {@link AbstractArray#getSizeBytes()} + * @return the size, in bytes, of the whole array + */ + public long getFullArraySizeBytes() { + return this.getSizeBytes(); + } + + /** + * By default, we assume that arrays are stored in row-major format ("C" format). + * This holds true for {@link DeviceArray}s, which are 1D arrays where the storage order does not matter; + * @return if the array was stored in column-major format (i.e. "Fortran" or "F") + */ + public boolean isColumnMajorFormat() { + return false; + } + + // Implementation of InteropLibrary + + @ExportMessage + boolean isPointer() { return true; } + + @ExportMessage + long asPointer() { return this.getPointer(); } + + @ExportMessage + @SuppressWarnings("static-method") + boolean hasArrayElements() { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return true; + } + + @ExportMessage + Object readArrayElement(long index) throws UnsupportedMessageException, InvalidArrayIndexException { + return null; + } + + @ExportMessage + boolean isArrayElementReadable(long index) { + return false; + } + + @ExportMessage + @SuppressWarnings("static-method") + boolean hasMembers() { + return true; + } + + @ExportMessage + @SuppressWarnings("static-method") + Object getMembers(boolean includeInternal) { + return includeInternal ? 
MEMBERS : PUBLIC_MEMBERS; + } + + @ExportMessage + @SuppressWarnings("static-method") + boolean isMemberReadable(String memberName, + @Cached.Shared("memberName") @Cached("createIdentityProfile()") ValueProfile memberProfile) { + String name = memberProfile.profile(memberName); + return POINTER.equals(name) || COPY_FROM.equals(name) || COPY_TO.equals(name) || FREE.equals(name) || IS_MEMORY_FREED.equals(name); + } + + @ExportMessage + Object readMember(String memberName, + @Cached.Shared("memberName") @Cached("createIdentityProfile()") ValueProfile memberProfile) throws UnknownIdentifierException { + if (!isMemberReadable(memberName, memberProfile)) { + CompilerDirectives.transferToInterpreter(); + throw UnknownIdentifierException.create(memberName); + } + if (POINTER.equals(memberName)) { + return getPointer(); + } + if (COPY_FROM.equals(memberName)) { + return new DeviceArrayCopyFunction(this, DeviceArrayCopyFunction.CopyDirection.FROM_POINTER); + } + if (COPY_TO.equals(memberName)) { + return new DeviceArrayCopyFunction(this, DeviceArrayCopyFunction.CopyDirection.TO_POINTER); + } + if (FREE.equals(memberName)) { + return new DeviceArrayFreeFunction(); + } + if (IS_MEMORY_FREED.equals(memberName)) { + return isMemoryFreed(); + } + CompilerDirectives.transferToInterpreter(); + throw UnknownIdentifierException.create(memberName); + } + + @ExportMessage + @SuppressWarnings("static-method") + boolean isMemberInvocable(String memberName) { + return COPY_FROM.equals(memberName) || COPY_TO.equals(memberName) || FREE.equals(memberName); + } + + @ExportMessage + Object invokeMember(String memberName, + Object[] arguments, + @CachedLibrary("this") InteropLibrary interopRead, + @CachedLibrary(limit = "1") InteropLibrary interopExecute) + throws UnsupportedTypeException, ArityException, UnsupportedMessageException, UnknownIdentifierException { + return interopExecute.execute(interopRead.readMember(this, memberName), arguments); + } + + /** + * Retrieve the total
number of elements in the array, + * or the size of the current dimension for matrices and tensors + * + * @return the total number of elements in the array + */ + @ExportMessage + public abstract long getArraySize(); + + // TODO: equals must be smarter than checking memory address, as a MultiDimView should be considered as part of its parent, + // similarly to what "isLastComputationArrayAccess" is doing. + // The hash instead should be different. We might also not touch equals, and have another method "isPartOf" + + @ExportLibrary(InteropLibrary.class) + final class DeviceArrayFreeFunction implements TruffleObject { + @ExportMessage + @SuppressWarnings("static-method") + boolean isExecutable() { + return true; + } + + @ExportMessage + Object execute(Object[] arguments) throws ArityException { + if (arguments.length != 0) { + CompilerDirectives.transferToInterpreter(); + throw ArityException.create(0, 0, arguments.length); + } + freeMemory(); + return NoneValue.get(); + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/DeviceArray.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/DeviceArray.java new file mode 100644 index 00000000..9558bff2 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/DeviceArray.java @@ -0,0 +1,222 @@ +/* + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.array; + +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.LittleEndianNativeArrayView; +import com.nvidia.grcuda.runtime.computation.arraycomputation.DeviceArrayReadExecution; +import com.nvidia.grcuda.runtime.computation.arraycomputation.DeviceArrayWriteExecution; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.dsl.Cached; +import com.oracle.truffle.api.dsl.Cached.Shared; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.InvalidArrayIndexException; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.library.CachedLibrary; +import com.oracle.truffle.api.library.ExportLibrary; +import com.oracle.truffle.api.library.ExportMessage; +import com.oracle.truffle.api.profiles.ValueProfile; + +@ExportLibrary(InteropLibrary.class) +public class DeviceArray extends AbstractArray implements TruffleObject { + + /** Total number of elements stored in the array. */ + private final long numElements; + + /** + * Total number of bytes allocated and used to store the array data (includes padding). + */ + private final long sizeBytes; + + /** Mutable view onto the underlying memory buffer. */ + private final LittleEndianNativeArrayView nativeView; + + public DeviceArray(AbstractGrCUDAExecutionContext grCUDAExecutionContext, long numElements, Type elementType) { + super(grCUDAExecutionContext, elementType); + this.numElements = numElements; + this.sizeBytes = numElements * elementType.getSizeBytes(); + // Allocate the GPU memory; + this.nativeView = allocateMemory(); + // Register the array in the AsyncGrCUDAExecutionContext; + this.registerArray(); + } + + /** + * Allocate the GPU memory. 
It can be overridden to mock the array; + * + * @return a reference to the GPU memory + */ + protected LittleEndianNativeArrayView allocateMemory() { + return this.grCUDAExecutionContext.getCudaRuntime().cudaMallocManaged(getSizeBytes()); + } + + @Override + final public long getSizeBytes() { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return sizeBytes; + } + + @Override + public long getPointer() { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return nativeView.getStartAddress(); + } + + public Type getElementType() { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return elementType; + } + + @Override + public String toString() { + if (arrayFreed) { + return "DeviceArray(memory freed)"; + } else { + return "DeviceArray(elementType=" + elementType + ", numElements=" + numElements + ", nativeView=" + nativeView + ')'; + } + } + + @Override + protected void finalize() throws Throwable { + if (!arrayFreed) { + grCUDAExecutionContext.getCudaRuntime().cudaFree(nativeView); + } + super.finalize(); + } + + @Override + public void freeMemory() { + if (arrayFreed) { + throw new GrCUDAException("device array already freed"); + } + grCUDAExecutionContext.getCudaRuntime().cudaFree(nativeView); + arrayFreed = true; + } + + // Implementation of InteropLibrary + + @ExportMessage + public long getArraySize() { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return numElements; + } + + @ExportMessage + boolean isArrayElementReadable(long index) { + return !arrayFreed && index >= 0 && index < numElements; + } + + @ExportMessage + boolean isArrayElementModifiable(long index) { + return index >= 0 && index < numElements; + } + + 
@SuppressWarnings("static-method") + @ExportMessage + boolean isArrayElementInsertable(@SuppressWarnings("unused") long index) { + return false; + } + + @ExportMessage + Object readArrayElement(long index, + @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws InvalidArrayIndexException { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + if ((index < 0) || (index >= numElements)) { + CompilerDirectives.transferToInterpreter(); + throw InvalidArrayIndexException.create(index); + } + try { + if (this.canSkipSchedulingRead()) { + // Fast path, skip the DAG scheduling; + return AbstractArray.readArrayElementNative(this.nativeView, index, this.elementType, elementTypeProfile); + } else { + return new DeviceArrayReadExecution(this, index, elementTypeProfile).schedule(); + } + } catch (UnsupportedTypeException e) { + e.printStackTrace(); + return null; + } + } + + @Override + public Object readNativeView(long index, @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) { + return AbstractArray.readArrayElementNative(this.nativeView, index, this.elementType, elementTypeProfile); + } + + @ExportMessage + public void writeArrayElement(long index, Object value, + @CachedLibrary(limit = "3") InteropLibrary valueLibrary, + @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException, InvalidArrayIndexException { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + if ((index < 0) || (index >= numElements)) { + CompilerDirectives.transferToInterpreter(); + throw InvalidArrayIndexException.create(index); + } + if (this.canSkipSchedulingWrite()) { + // Fast path, skip the DAG scheduling; + AbstractArray.writeArrayElementNative(this.nativeView, index, value, elementType, 
valueLibrary, elementTypeProfile); + } else { + new DeviceArrayWriteExecution(this, index, value, valueLibrary, elementTypeProfile).schedule(); + } + } + + @Override + public void writeNativeView(long index, Object value, @CachedLibrary(limit = "3") InteropLibrary valueLibrary, + @Cached.Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException { + AbstractArray.writeArrayElementNative(this.nativeView, index, value, elementType, valueLibrary, elementTypeProfile); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MultiDimDeviceArray.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/MultiDimDeviceArray.java similarity index 57% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MultiDimDeviceArray.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/MultiDimDeviceArray.java index cf7052cd..778082a4 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/MultiDimDeviceArray.java +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/MultiDimDeviceArray.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,42 +33,27 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ -package com.nvidia.grcuda; +package com.nvidia.grcuda.runtime.array; -import java.util.Arrays; -import com.nvidia.grcuda.DeviceArray.MemberSet; -import com.nvidia.grcuda.gpu.CUDARuntime; -import com.nvidia.grcuda.gpu.LittleEndianNativeArrayView; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.LittleEndianNativeArrayView; import com.oracle.truffle.api.CompilerDirectives; import com.oracle.truffle.api.dsl.Cached; -import com.oracle.truffle.api.dsl.Cached.Shared; -import com.oracle.truffle.api.interop.ArityException; import com.oracle.truffle.api.interop.InteropLibrary; import com.oracle.truffle.api.interop.InvalidArrayIndexException; import com.oracle.truffle.api.interop.TruffleObject; -import com.oracle.truffle.api.interop.UnknownIdentifierException; -import com.oracle.truffle.api.interop.UnsupportedMessageException; import com.oracle.truffle.api.interop.UnsupportedTypeException; import com.oracle.truffle.api.library.CachedLibrary; import com.oracle.truffle.api.library.ExportLibrary; import com.oracle.truffle.api.library.ExportMessage; import com.oracle.truffle.api.profiles.ValueProfile; -@ExportLibrary(InteropLibrary.class) -public class MultiDimDeviceArray implements TruffleObject { - - private static final String POINTER = "pointer"; - private static final String FREE = "free"; - private static final String 
IS_MEMORY_FREED = "isMemoryFreed"; - private static final String ACCESSED_FREED_MEMORY_MESSAGE = "memory of array freed"; - - private static final MemberSet PUBLIC_MEMBERS = new MemberSet(FREE, IS_MEMORY_FREED); - private static final MemberSet MEMBERS = new MemberSet(POINTER, FREE, IS_MEMORY_FREED); - - private final CUDARuntime runtime; +import java.util.Arrays; - /** Data type of the elements stored in the array. */ - private final Type elementType; +@ExportLibrary(InteropLibrary.class) +public class MultiDimDeviceArray extends AbstractArray implements TruffleObject { /** Number of elements in each dimension. */ private final long[] elementsPerDimension; @@ -70,26 +62,49 @@ public class MultiDimDeviceArray implements TruffleObject { private final long[] stridePerDimension; /** Total number of elements stored in the array. */ - private final long totalElementCount; + private final long numElements; /** true if data is stored in column-major format (Fortran), false row-major (C). */ - private boolean columnMajor; + private final boolean columnMajor; /** Mutable view onto the underlying memory buffer. */ private final LittleEndianNativeArrayView nativeView; /** - * true if nativeView's off-heap memory has been freed and the view invalid flag. This flag is - * set in free(). + * If we modify the devices where this multi-dimensional array is updated, we also have to update its views. + * As this array does not track the views (but views track their parent), we do so lazily. + * When the location of this array is changed, we switch this flag. 
When we access the location of views, we update + their location and reset this flag; */ - private boolean arrayFreed = false; + private boolean isViewLocationUpdated = true; - public MultiDimDeviceArray(CUDARuntime runtime, Type elementType, long[] dimensions, - boolean useColumnMajor) { + public MultiDimDeviceArray(AbstractGrCUDAExecutionContext grCUDAExecutionContext, Type elementType, long[] dimensions, + boolean useColumnMajor) { + super(grCUDAExecutionContext, elementType); + this.numElements = obtainTotalSize(dimensions); + this.columnMajor = useColumnMajor; + this.elementsPerDimension = new long[dimensions.length]; + System.arraycopy(dimensions, 0, this.elementsPerDimension, 0, dimensions.length); + this.stridePerDimension = computeStride(dimensions, columnMajor); + // Allocate the GPU memory; + this.nativeView = allocateMemory(); + // Register the array in the AsyncGrCUDAExecutionContext; + this.registerArray(); + } + + /** + * Allocate the GPU memory. It can be overridden to mock the array; + * @return a reference to the GPU memory + */ + protected LittleEndianNativeArrayView allocateMemory() { + return this.grCUDAExecutionContext.getCudaRuntime().cudaMallocManaged(getSizeBytes()); + } + + private long obtainTotalSize(long[] dimensions) { if (dimensions.length < 2) { CompilerDirectives.transferToInterpreter(); throw new IllegalArgumentException( - "MultiDimDeviceArray requires at least two dimension, use DeviceArray instead"); + "MultiDimDeviceArray requires at least two dimensions, use DeviceArray instead"); } // check arguments long prod = 1; @@ -100,14 +115,7 @@ public MultiDimDeviceArray(CUDARuntime runtime, Type elementType, long[] dimensi } prod *= n; } - this.runtime = runtime; - this.elementType = elementType; - this.columnMajor = useColumnMajor; - this.elementsPerDimension = new long[dimensions.length]; - System.arraycopy(dimensions, 0, this.elementsPerDimension, 0, dimensions.length); - this.stridePerDimension = computeStride(dimensions, 
columnMajor); - this.totalElementCount = prod; - this.nativeView = runtime.cudaMallocManaged(getSizeBytes()); + return prod; } private static long[] computeStride(long[] dimensions, boolean columnMajor) { @@ -158,18 +166,51 @@ final boolean isIndexValidInDimension(long index, int dimension) { return (index > 0) && (index < numElementsInDim); } - final boolean isColumnMajorFormat() { + public boolean isViewLocationUpdated() { + return isViewLocationUpdated; + } + + public void resetViewLocationUpdated() { + isViewLocationUpdated = true; + } + + /** + * If we update the list of locations where this array is located, + * we flag this array so that its views will update their list of locations. + * The update is done lazily when we access (or update) the location list of a view; + */ + @Override + public void resetArrayUpToDateLocations(int deviceId) { + super.resetArrayUpToDateLocations(deviceId); + this.isViewLocationUpdated = false; + } + + /** + * If we update the list of locations where this array is located, + * we flag this array so that its views will update their list of locations. 
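The lazy view-location update described in the comments above can be condensed into a standalone sketch. Class and method names are simplified stand-ins for `MultiDimDeviceArray`/`MultiDimDeviceArrayView`; only the dirty-flag mechanism is reproduced:

```java
import java.util.HashSet;
import java.util.Set;

// The parent array does not track its views, so when its up-to-date location
// list changes it only flips a dirty flag; each view re-syncs its own list
// from the parent the next time the list is accessed.
class ParentArraySketch {
    final Set<Integer> upToDateLocations = new HashSet<>();
    private boolean viewLocationUpdated = true; // true -> views are in sync

    void addArrayUpToDateLocation(int deviceId) {
        upToDateLocations.add(deviceId);
        viewLocationUpdated = false; // views must lazily re-sync
    }

    boolean isViewLocationUpdated() {
        return viewLocationUpdated;
    }

    void resetViewLocationUpdated() {
        viewLocationUpdated = true;
    }
}

class ArrayViewSketch {
    private final ParentArraySketch parent;
    private final Set<Integer> upToDateLocations = new HashSet<>();

    ArrayViewSketch(ParentArraySketch parent) {
        this.parent = parent;
        this.upToDateLocations.addAll(parent.upToDateLocations);
    }

    // Mirrors ensureUpdatedListConsistencyWithParent in the diff;
    private void ensureConsistencyWithParent() {
        if (!parent.isViewLocationUpdated()) {
            upToDateLocations.clear();
            upToDateLocations.addAll(parent.upToDateLocations);
            parent.resetViewLocationUpdated();
        }
    }

    boolean isArrayUpdatedInLocation(int deviceId) {
        ensureConsistencyWithParent();
        return upToDateLocations.contains(deviceId);
    }
}
```

This keeps location updates O(1) on the parent regardless of how many temporary views exist, at the cost of a consistency check on each view access.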
+ * The update is done lazily when we access (or update) the location list of a view; + */ + @Override + public void addArrayUpToDateLocations(int deviceId) { + super.addArrayUpToDateLocations(deviceId); + this.isViewLocationUpdated = false; + } + + @Override + public final boolean isColumnMajorFormat() { return columnMajor; } - long getTotalElementCount() { - return totalElementCount; + long getNumElements() { + return numElements; } - final long getSizeBytes() { - return totalElementCount * elementType.getSizeBytes(); + @Override + final public long getSizeBytes() { + return numElements * elementType.getSizeBytes(); } + @Override public final long getPointer() { if (arrayFreed) { CompilerDirectives.transferToInterpreter(); @@ -178,10 +219,6 @@ public final long getPointer() { return nativeView.getStartAddress(); } - public final Type getElementType() { - return elementType; - } - final LittleEndianNativeArrayView getNativeView() { if (arrayFreed) { CompilerDirectives.transferToInterpreter(); @@ -194,7 +231,7 @@ final LittleEndianNativeArrayView getNativeView() { public String toString() { return "MultiDimDeviceArray(elementType=" + elementType + ", dims=" + Arrays.toString(elementsPerDimension) + - ", Elements=" + totalElementCount + + ", Elements=" + numElements + ", size=" + getSizeBytes() + " bytes" + ", nativeView=" + nativeView + ')'; } @@ -202,21 +239,45 @@ public String toString() { @Override protected void finalize() throws Throwable { if (!arrayFreed) { - runtime.cudaFree(nativeView); + grCUDAExecutionContext.getCudaRuntime().cudaFree(nativeView); } super.finalize(); } + @Override public void freeMemory() { if (arrayFreed) { throw new GrCUDAException("device array already freed"); } - runtime.cudaFree(nativeView); + grCUDAExecutionContext.getCudaRuntime().cudaFree(nativeView); arrayFreed = true; } - public boolean isMemoryFreed() { - return arrayFreed; + /** + * Direct access to the native view underlying the multidimensional array; + * @param index 
index used to access the array + * @param elementTypeProfile type of the array + * @return value read from the array + */ + @Override + public Object readNativeView(long index, + @Cached.Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) { + return AbstractArray.readArrayElementNative(this.nativeView, index, this.elementType, elementTypeProfile); + } + + /** + * Direct access to the native view underlying the multidimensional array; + * @param index index used to access the array + * @param value value to write in the array + * @param valueLibrary interop access of the value, required to understand its type + * @param elementTypeProfile profiling of the element type, to speed up the native view access + * @throws UnsupportedTypeException if writing the wrong type in the array + */ + @Override + public void writeNativeView(long index, Object value, + @CachedLibrary(limit = "3") InteropLibrary valueLibrary, + @Cached.Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException { + AbstractArray.writeArrayElementNative(this.nativeView, index, value, this.elementType, valueLibrary, elementTypeProfile); } // @@ -225,24 +286,29 @@ public boolean isMemoryFreed() { @ExportMessage @SuppressWarnings("static-method") - boolean hasArrayElements() { - return true; - } - - @ExportMessage - @SuppressWarnings("static-method") - long getArraySize() { + @Override + public long getArraySize() { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } return elementsPerDimension[0]; } @ExportMessage @SuppressWarnings("static-method") + @Override boolean isArrayElementReadable(long index) { return index >= 0 && index < elementsPerDimension[0]; } @ExportMessage + @Override Object readArrayElement(long index) throws InvalidArrayIndexException { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw 
new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } if ((index < 0) || (index >= elementsPerDimension[0])) { CompilerDirectives.transferToInterpreter(); throw InvalidArrayIndexException.create(index); @@ -250,84 +316,4 @@ Object readArrayElement(long index) throws InvalidArrayIndexException { long offset = index * stridePerDimension[0]; return new MultiDimDeviceArrayView(this, 1, offset, stridePerDimension[1]); } - - @ExportMessage - boolean hasMembers() { - return true; - } - - @ExportMessage - Object getMembers(boolean includeInternal) { - return includeInternal ? MEMBERS : PUBLIC_MEMBERS; - } - - @ExportMessage - boolean isMemberReadable(String member, - @Shared("member") @Cached("createIdentityProfile()") ValueProfile memberProfile) { - return POINTER.equals(memberProfile.profile(member)) || FREE.equals(memberProfile.profile(member)) || IS_MEMORY_FREED.contentEquals(memberProfile.profile(member)); - } - - @ExportMessage - Object readMember(String member, - @Shared("member") @Cached("createIdentityProfile()") ValueProfile memberProfile) throws UnknownIdentifierException { - if (!isMemberReadable(member, memberProfile)) { - CompilerDirectives.transferToInterpreter(); - throw UnknownIdentifierException.create(member); - } - if (POINTER.equals(memberProfile.profile(member))) { - return getPointer(); - } - if (IS_MEMORY_FREED.equals(memberProfile.profile(member))) { - return arrayFreed; - } - if (FREE.equals(memberProfile.profile(member))) { - return new MultiDimDeviceArrayFreeFunction(); - } - CompilerDirectives.transferToInterpreter(); - throw new GrCUDAInternalException("trying to read unknown member '" + member + "'"); - } - - @ExportMessage - @SuppressWarnings("static-method") - boolean isMemberInvocable(String memberName) { - return FREE.equals(memberName); - } - - @ExportMessage - Object invokeMember(String memberName, - Object[] arguments, - @CachedLibrary("this") InteropLibrary interopRead, - @CachedLibrary(limit = "1") InteropLibrary interopExecute) - 
throws UnsupportedTypeException, ArityException, UnsupportedMessageException, UnknownIdentifierException { - return interopExecute.execute(interopRead.readMember(this, memberName), arguments); - } - - @ExportMessage - boolean isPointer() { - return true; - } - - @ExportMessage - long asPointer() { - return getPointer(); - } - - @ExportLibrary(InteropLibrary.class) - final class MultiDimDeviceArrayFreeFunction implements TruffleObject { - @ExportMessage - @SuppressWarnings("static-method") - boolean isExecutable() { - return true; - } - - @ExportMessage - Object execute(Object[] arguments) throws ArityException { - if (arguments.length != 0) { - CompilerDirectives.transferToInterpreter(); - throw ArityException.create(0, arguments.length); - } - freeMemory(); - return NoneValue.get(); - } - } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/MultiDimDeviceArrayView.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/MultiDimDeviceArrayView.java new file mode 100644 index 00000000..f497c7ca --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/array/MultiDimDeviceArrayView.java @@ -0,0 +1,332 @@ +/* + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.array; + +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.runtime.computation.arraycomputation.MultiDimDeviceArrayViewReadExecution; +import com.nvidia.grcuda.runtime.computation.arraycomputation.MultiDimDeviceArrayViewWriteExecution; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.dsl.Cached; +import com.oracle.truffle.api.dsl.Cached.Shared; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.InvalidArrayIndexException; +import com.oracle.truffle.api.interop.TruffleObject; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.library.CachedLibrary; +import com.oracle.truffle.api.library.ExportLibrary; +import com.oracle.truffle.api.library.ExportMessage; +import com.oracle.truffle.api.profiles.ValueProfile; + +import java.util.Set; + +@ExportLibrary(InteropLibrary.class) +public class MultiDimDeviceArrayView extends AbstractArray implements TruffleObject { + + private final MultiDimDeviceArray mdDeviceArray; + private final int thisDimension; + private final long offset; + private final long stride; + + /** + * A (N - 1)-dimensional view of an N-dimensional dense array. + * From the host language perspective (i.e. from the user perspective), an array view should be no different than + * a standard (N - 1)-dimensional array. + * From the internal implementation of GrCUDA, it should always be considered that the view is part of a larger memory + * chunk managed (also) by the GPU. Some CUDA APIs do not allow operating on memory chunks, but require access to the full array. + * As such, array views also provide access to information about the full array they belong to. + * For example, let's assume that the original array has 4 dimensions. We can create 3, 2, 1 dimensional views from it (in this order). 
+ * Let's say that we are creating a 2-dimensional view. + * @param mdDeviceArray the full array from which this view is created (the 4-dimensional array, in the example) + * @param dim the dimension identifier of this view (e.g. 2, in the example) + * @param offset the index (in the full array) at which this array view starts + * @param stride value used to jump to consecutive values in the array, and determined by the slice that has been extracted + */ + public MultiDimDeviceArrayView(MultiDimDeviceArray mdDeviceArray, int dim, long offset, long stride) { + // The up-to-date locations of the view are the same as the parent's, along with the context and type; + super(mdDeviceArray); + this.mdDeviceArray = mdDeviceArray; + this.thisDimension = dim; + this.offset = offset; // Index at which this array view starts; + this.stride = stride; + // Register the array in the AsyncGrCUDAExecutionContext; + this.registerArray(); + } + + public int getDimension() { + return thisDimension; + } + + public long getOffset() { + return offset; + } + + public long getStride() { + return stride; + } + + @Override + public long getPointer() { + return mdDeviceArray.getPointer() + offset * elementType.getSizeBytes(); + } + + @Override + public long getFullArrayPointer() { + return mdDeviceArray.getFullArrayPointer(); + } + + @Override + public boolean isColumnMajorFormat() { + return mdDeviceArray.isColumnMajorFormat(); + } + + /** + * Propagate the array location to the parent array, so other temporary views are aware of this update; + * @param deviceId device with an updated view of this array; + */ + @Override + public void resetArrayUpToDateLocations(int deviceId) { + // No need to check the isViewLocationUpdated flag, we are clearing all the location lists; + this.arrayUpToDateLocations.clear(); + this.arrayUpToDateLocations.add(deviceId); + this.mdDeviceArray.resetArrayUpToDateLocations(deviceId); + } + + /** + * Propagate the array location to the parent array, so other temporary 
views are aware of this update; + * @param deviceId device with an updated view of this array; + */ + @Override + public void addArrayUpToDateLocations(int deviceId) { + // Guarantee that the parent list of updated locations is the same as the view; + ensureUpdatedListConsistencyWithParent(); + this.arrayUpToDateLocations.add(deviceId); + this.mdDeviceArray.addArrayUpToDateLocations(deviceId); + } + + @Override + public boolean isArrayUpdatedInLocation(int deviceId) { + // Guarantee that the parent list of updated locations is the same as the view; + ensureUpdatedListConsistencyWithParent(); + return super.isArrayUpdatedInLocation(deviceId); + } + + @Override + public boolean isArrayUpdatedOnCPU() { + // Guarantee that the parent list of updated locations is the same as the view; + ensureUpdatedListConsistencyWithParent(); + return super.isArrayUpdatedOnCPU(); + } + + @Override + public Set<Integer> getArrayUpToDateLocations() { + // Guarantee that the parent list of updated locations is the same as the view; + ensureUpdatedListConsistencyWithParent(); + return super.getArrayUpToDateLocations(); + } + + private void ensureUpdatedListConsistencyWithParent() { + if (!this.mdDeviceArray.isViewLocationUpdated()) { + this.arrayUpToDateLocations.clear(); + this.arrayUpToDateLocations.addAll(this.mdDeviceArray.getArrayUpToDateLocations()); + this.mdDeviceArray.resetViewLocationUpdated(); + } + } + + /** + * Propagate the stream mapping to the parent array, so other temporary views are aware of this mapping; + * @param streamMapping the stream to which this array is associated + */ + @Override + public void setStreamMapping(CUDAStream streamMapping) { + this.mdDeviceArray.setStreamMapping(streamMapping); + this.streamMapping = streamMapping; + } + + /** + * Return the parent stream mapping, to guarantee that all views have the same mapping; + * @return the stream to which this array is associated + */ + @Override + public CUDAStream getStreamMapping() { + return 
this.mdDeviceArray.getStreamMapping(); + } + + @Override + public long getSizeBytes() { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return mdDeviceArray.getElementsInDimension(thisDimension) * elementType.getSizeBytes(); + } + + @Override + public long getFullArraySizeBytes() { + return mdDeviceArray.getFullArraySizeBytes(); + } + + @Override + public void freeMemory() { + // This should not be called directly on a view; + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException("Freeing memory directly on a MultiDimDeviceArrayView is not allowed"); + } + + @Override + public String toString() { + return String.format("MultiDimDeviceArrayView(dim=%d, offset=%d, stride=%d)\n", + thisDimension, offset, stride); + } + + // + // Implementation of Interop Library + // + + @ExportMessage + @SuppressWarnings("static-method") + boolean hasArrayElements() { + return true; + } + + @ExportMessage + @Override + public long getArraySize() { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return mdDeviceArray.getElementsInDimension(thisDimension); + } + + @ExportMessage + boolean isArrayElementReadable(long index) { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return index >= 0 && index < mdDeviceArray.getElementsInDimension(thisDimension); + } + + @ExportMessage + boolean isArrayElementModifiable(long index) { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + return (thisDimension + 1) == mdDeviceArray.getNumberDimensions() && + index >= 0 && index < mdDeviceArray.getElementsInDimension(thisDimension); + } + + @ExportMessage + @SuppressWarnings("static-method") + boolean isArrayElementInsertable(@SuppressWarnings("unused") long 
index) { + return false; + } + + @ExportMessage + Object readArrayElement(long index, + @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws InvalidArrayIndexException { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + if ((index < 0) || (index >= mdDeviceArray.getElementsInDimension(thisDimension))) { + CompilerDirectives.transferToInterpreter(); + throw InvalidArrayIndexException.create(index); + } + try { + if (this.canSkipSchedulingRead()) { + // Fast path, skip the DAG scheduling; + return readNativeView(index, elementTypeProfile); + } else { + return new MultiDimDeviceArrayViewReadExecution(this, index, elementTypeProfile).schedule(); + } + } catch (UnsupportedTypeException e) { + e.printStackTrace(); + return null; + } + } + + @Override + public Object readNativeView(long index, @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) { + if ((thisDimension + 1) == mdDeviceArray.getNumberDimensions()) { + long flatIndex = offset + index * stride; + return AbstractArray.readArrayElementNative(this.mdDeviceArray.getNativeView(), flatIndex, this.mdDeviceArray.getElementType(), elementTypeProfile); + } else { + long off = offset + index * stride; + long newStride = mdDeviceArray.getStrideInDimension(thisDimension + 1); + return new MultiDimDeviceArrayView(mdDeviceArray, thisDimension + 1, off, newStride); + } + } + + @ExportMessage + void writeArrayElement(long index, Object value, + @CachedLibrary(limit = "3") InteropLibrary valueLibrary, + @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException, InvalidArrayIndexException { + if (arrayFreed) { + CompilerDirectives.transferToInterpreter(); + throw new GrCUDAException(ACCESSED_FREED_MEMORY_MESSAGE); + } + if ((index < 0) || (index >= mdDeviceArray.getElementsInDimension(thisDimension))) 
{ + CompilerDirectives.transferToInterpreter(); + throw InvalidArrayIndexException.create(index); + } + if (this.canSkipSchedulingWrite()) { + // Fast path, skip the DAG scheduling; + writeNativeView(index, value, valueLibrary, elementTypeProfile); + } else { + new MultiDimDeviceArrayViewWriteExecution(this, index, value, valueLibrary, elementTypeProfile).schedule(); + } + } + + @Override + public void writeNativeView(long index, Object value, + @CachedLibrary(limit = "3") InteropLibrary valueLibrary, + @Shared("elementType") @Cached("createIdentityProfile()") ValueProfile elementTypeProfile) throws UnsupportedTypeException { + if ((thisDimension + 1) == mdDeviceArray.getNumberDimensions()) { + long flatIndex = offset + index * stride; + AbstractArray.writeArrayElementNative(this.mdDeviceArray.getNativeView(), flatIndex, value, this.mdDeviceArray.getElementType(), valueLibrary, elementTypeProfile); + } else { + CompilerDirectives.transferToInterpreter(); + throw new IllegalStateException("tried to write non-last dimension in MultiDimDeviceArrayView"); + } + } + + public MultiDimDeviceArray getMdDeviceArray() { + return mdDeviceArray; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/CUDALibraryExecution.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/CUDALibraryExecution.java new file mode 100644 index 00000000..49c9e5f5 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/CUDALibraryExecution.java @@ -0,0 +1,109 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
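The flat-index arithmetic used by the view's `readNativeView`/`writeNativeView` above (`flatIndex = offset + index * stride`) relies on the dense-array stride convention. The sketch below is a hypothetical standalone reimplementation following the usual row-major (C) and column-major (Fortran) conventions, not the exact GrCUDA code:

```java
// In row-major order stride[i] is the product of the dimensions to the
// right of i; in column-major order, the product of those to the left.
// A view of slice `index` along the first dimension of a row-major array
// then starts at offset = index * stride[0].
final class ViewIndexSketch {
    private ViewIndexSketch() {
    }

    static long[] computeStride(long[] dims, boolean columnMajor) {
        long[] stride = new long[dims.length];
        long prod = 1;
        if (columnMajor) {
            for (int i = 0; i < dims.length; i++) {
                stride[i] = prod;
                prod *= dims[i];
            }
        } else {
            for (int i = dims.length - 1; i >= 0; i--) {
                stride[i] = prod;
                prod *= dims[i];
            }
        }
        return stride;
    }

    // Flat index of local element `index` in a view with the given offset
    // and stride, as in readNativeView/writeNativeView;
    static long flatIndex(long offset, long stride, long index) {
        return offset + index * stride;
    }
}
```

For a 3x4 row-major array, row 1 is a view with offset 4 and stride 1, so element (1, 2) lands at flat index 6.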
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.computation; + +import com.nvidia.grcuda.GrCUDALogger; +import com.nvidia.grcuda.functions.Function; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.LibrarySetStream; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +import java.util.Arrays; +import java.util.List; +import java.util.stream.Collectors; + +import static com.nvidia.grcuda.functions.Function.INTEROP; + +/** + * Computational element that wraps calls to CUDA libraries such as cuBLAS or cuML. + */ +public class CUDALibraryExecution extends GrCUDAComputationalElement { + + private final Function nfiFunction; + protected Object[] argsWithHandle; + private final LibrarySetStream setStreamFunctionNFI; + + public CUDALibraryExecution(AbstractGrCUDAExecutionContext context, Function nfiFunction, LibrarySetStream setStreamFunctionNFI, List<ComputationArgument> args) { + this(context, nfiFunction, setStreamFunctionNFI, args, 0); + } + + public CUDALibraryExecution(AbstractGrCUDAExecutionContext context, Function nfiFunction, LibrarySetStream setStreamFunctionNFI, List<ComputationArgument> args, int extraArguments) { + super(context, new CUDALibraryExecutionInitializer(args)); + this.nfiFunction = nfiFunction; + this.setStreamFunctionNFI = setStreamFunctionNFI; + + // Array of [libraryHandle + arguments], required by CUDA libraries for execution. + // Some libraries (such as cuSPARSE) wrap input arrays, making it not possible to directly track them. 
+ // We add those arrays at the end of "args" so they are tracked by CUDALibraryExecutionInitializer, + // but remove them here so the list of arguments passed to the final CUDA library function is correct; + this.argsWithHandle = new Object[args.size() - extraArguments]; + for (int i = 0; i < args.size() - extraArguments; i++) { + argsWithHandle[i] = args.get(i).getArgumentValue(); + } + } + + @Override + public boolean canUseStream() { + return true; + } + + @Override + public Object execute() throws UnsupportedTypeException { + // Set the stream associated with this computation on the library handle, then execute the function; + Object result = null; + try { + this.setStreamFunctionNFI.setStream(this.getStream()); + result = INTEROP.execute(this.nfiFunction, this.argsWithHandle); + } catch (ArityException | UnsupportedMessageException e) { + GrCUDALogger.getLogger(GrCUDALogger.COMPUTATION_LOGGER).severe("error in execution of the function"); + e.printStackTrace(); + } + return result; + } + + static class CUDALibraryExecutionInitializer implements InitializeDependencyList { + private final List<ComputationArgument> args; + + CUDALibraryExecutionInitializer(List<ComputationArgument> args) { + this.args = args; + } + + @Override + public List<ComputationArgument> initialize() { + // Consider only arrays as dependencies; + // The CUDA documentation is not clear on whether you can have concurrent computations + // with the same handle; + return this.args.stream().filter(ComputationArgument::isArray).collect(Collectors.toList()); + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Parameter.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/ComputationArgument.java similarity index 53% rename from projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Parameter.java rename to projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/ComputationArgument.java index a8a2d472..d09f0f5a 100644 --- a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/Parameter.java +++ 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/ComputationArgument.java @@ -1,6 +1,7 @@ /* * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -13,6 +14,12 @@ * * Neither the name of NVIDIA CORPORATION nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE @@ -26,13 +33,20 @@ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ -package com.nvidia.grcuda; +package com.nvidia.grcuda.runtime.computation; import java.util.ArrayList; +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.TypeException; import com.oracle.truffle.api.CompilerAsserts; -public class Parameter { +/** + * Defines a {@link GrCUDAComputationalElement} argument representing the elements of a NIDL/NFI signature. + * For each argument/parameter, store its type, if it's a pointer or a value, + * and if it's constant (i.e. 
its content cannot be modified in the computation); + */ +public class ComputationArgument { public enum Kind { BY_VALUE, @@ -41,47 +55,83 @@ public enum Kind { POINTER_INOUT, } - private int position; - private final String name; - private final Type type; - private final Kind kind; + // Used to denote default position, for cases in which position is not relevant; + private static final int DEFAULT_POSITION = 0; + protected int position; + protected final String name; + protected final Type type; + protected final Kind kind; + + protected final boolean isArray; + protected final boolean isConst; /** - * Create new parameter from its components. + * Create new argument/parameter from its components. * - * @param position zero-based position from the left of the parameter list - * @param name parameter name + * @param position zero-based position from the left of the parameter list. This is useful for debugging, but otherwise non-functional + * @param name parameter name. This is useful for debugging, but otherwise non-functional * @param type data type of the parameter * @param kind kind of the parameter (by-value, pointer with direction) */ - Parameter(int position, String name, Type type, Kind kind) { + ComputationArgument(int position, String name, Type type, Kind kind) { this.position = position; this.name = name; this.type = type; this.kind = kind; + this.isArray = kind.equals(Kind.POINTER_IN) || kind.equals(Kind.POINTER_INOUT) || kind.equals(Kind.POINTER_OUT); + this.isConst = kind.equals(Kind.POINTER_IN) || kind.equals(Kind.BY_VALUE); + } + + ComputationArgument(String name, Type type, Kind kind) { + this(DEFAULT_POSITION, name, type, kind); + } + + /** + * Create new pointer parameter from its components; + * + * @param name parameter name + * @param type data type of the value to which the pointer points + * @param kind direction of pointer parameter (allowed values `POINTER_IN`, `POINTER_OUT` and + * `POINTER_INOUT`, must not by `BY_VALUE`) + * @param 
position index of the parameter, used only for debugging purposes + */ + public static ComputationArgument createPointerComputationArgument(String name, Type type, Kind kind, int position) { + assert kind != Kind.BY_VALUE : "pointer parameter cannot be by-value"; + return new ComputationArgument(position, name, type, kind); } /** - * Create new pointer parameter from its components (with position 0). + * Create new pointer parameter from its components, with position 0; * * @param name parameter name * @param type data type of the value to which the pointer points * @param kind direction of pointer parameter (allowed values `POINTER_IN`, `POINTER_OUT` and * `POINTER_INOUT`, must not by `BY_VALUE`) */ - public static Parameter createPointerParameter(String name, Type type, Kind kind) { + public static ComputationArgument createPointerComputationArgument(String name, Type type, Kind kind) { assert kind != Kind.BY_VALUE : "pointer parameter cannot be by-value"; - return new Parameter(0, name, type, kind); + return new ComputationArgument(0, name, type, kind); } /** - * Create new by-value parameter from its components (with position 0). + * Create new by-value parameter from its components. 
* * @param name parameter name * @param type data type of the parameter + * @param position index of the parameter, used only for debugging purposes */ - public static Parameter createByValueParameter(String name, Type type) { - return new Parameter(0, name, type, Kind.BY_VALUE); + public static ComputationArgument createByValueComputationArgument(String name, Type type, int position) { + return new ComputationArgument(position, name, type, Kind.BY_VALUE); + } + + /** + * Create new by-value parameter from its components, with position 0; + * + * @param name parameter name + * @param type data type of the parameter + */ + public static ComputationArgument createByValueComputationArgument(String name, Type type) { + return new ComputationArgument(0, name, type, Kind.BY_VALUE); } /** @@ -101,7 +151,7 @@ public static Parameter createByValueParameter(String name, Type type) { * @throws TypeException if {@code param} string cannot be parsed successfully * @return Parameter */ - private static Parameter parseNIDLOrLegacyParameterString(int position, String param) throws TypeException { + private static ComputationArgument parseNIDLOrLegacyParameterString(int position, String param) throws TypeException { String paramStr = param.trim(); if (paramStr.indexOf(':') == -1) { // no colon found -> attempt parsing it as a legacy NFI signature @@ -127,7 +177,7 @@ private static Parameter parseNIDLOrLegacyParameterString(int position, String p // the void is not a legal by-value parameter type throw new TypeException("invalid type \"pointer\" of by-value parameter"); } - return createByValueParameter(name, type); + return createByValueComputationArgument(name, type, position); } else { if (dirPointerAndType[1].equals("pointer")) { Type type = Type.fromNIDLTypeString(dirPointerAndType[2]); @@ -137,11 +187,11 @@ private static Parameter parseNIDLOrLegacyParameterString(int position, String p } switch (dirPointerAndType[0]) { case "in": - return createPointerParameter(name, type, 
Kind.POINTER_IN); + return createPointerComputationArgument(name, type, Kind.POINTER_IN, position); case "inout": - return createPointerParameter(name, type, Kind.POINTER_INOUT); + return createPointerComputationArgument(name, type, Kind.POINTER_INOUT, position); case "out": - return createPointerParameter(name, type, Kind.POINTER_OUT); + return createPointerComputationArgument(name, type, Kind.POINTER_OUT, position); default: throw new TypeException("invalid direction: " + dirPointerAndType[0] + ", expected \"in\", \"inout\", or \"out\""); } @@ -159,11 +209,38 @@ private static Parameter parseNIDLOrLegacyParameterString(int position, String p * @throws TypeException if the specified type cannot be parsed * @return Parameter in which the names are "param1", "param2", ... */ - private static Parameter parseLegacyParameterString(int position, String param) throws TypeException { + private static ComputationArgument parseLegacyParameterString(int position, String param) throws TypeException { String name = "param" + (position + 1); - Type type = Type.fromNIDLTypeString(param.trim()); + + // Find if the type is const; + String[] typePieces = param.trim().split(" "); + String typeString; + boolean typeIsConst = false; + if (typePieces.length == 1) { + // If only 1 piece is found, the argument is not const; + typeString = typePieces[0].trim(); + } else if (typePieces.length == 2) { + // Const can be either before or after the type; + if (typePieces[0].trim().equals("const")) { + typeIsConst = true; + typeString = typePieces[1].trim(); + } else if (typePieces[1].trim().equals("const")) { + typeIsConst = true; + typeString = typePieces[0].trim(); + } else { + throw new IllegalArgumentException("invalid type identifier in kernel signature: " + param); + } + } else { + throw new IllegalArgumentException("invalid type identifier in kernel signature: " + param); + } + + Type type = Type.fromNIDLTypeString(typeString); assertNonVoidType(type, position, param); - return new
Parameter(position, name, type, type == Type.NFI_POINTER ? Kind.POINTER_INOUT : Kind.BY_VALUE); + Kind kind = type == Type.NFI_POINTER ? Kind.POINTER_INOUT : Kind.BY_VALUE; + if (typeIsConst && type == Type.NFI_POINTER) { + kind = Kind.POINTER_IN; + } + return new ComputationArgument(position, name, type, kind); } private static void assertNonVoidType(Type type, int position, String paramStr) throws TypeException { @@ -172,9 +249,45 @@ private static void assertNonVoidType(Type type, int position, String paramStr) } } - public static ArrayList parseParameterSignature(String parameterSignature) throws TypeException { + /** + * Count the occurrences of "c" in "s", and return the position of the first occurrence. + * If no occurrence is found, the position is -1; + * @param s the string to inspect + * @param c the character to search for + * @return a size-2 array with count and location + */ + private static int[] countCharAndReturnFirstOccurrence(String s, char c) { + int[] result = {0, -1}; // Store count and first occurrence location of "c" in "s"; + int count = 0; + for (int i = 0; i < s.length(); i++) { + if (s.charAt(i) == c) { + if (count == 0) result[1] = i; + count++; + } + } + result[0] = count; + return result; + } + + /** + * Parse the signature of the function, and create a list of {@link ComputationArgument} from it + * @param parameterSignature the signature of the function + * @return a list of {@link ComputationArgument} + * @throws TypeException if the signature is not well-formed + */ + public static ArrayList parseParameterSignature(String parameterSignature) throws TypeException { CompilerAsserts.neverPartOfCompilation(); - ArrayList params = new ArrayList<>(); + ArrayList params = new ArrayList<>(); + // If the function is wrapped in parentheses, remove them. 
It also means that the output is specified, + // but (currently) we don't care about it; + int[] leftParLoc = countCharAndReturnFirstOccurrence(parameterSignature, '('); + int[] rightParLoc = countCharAndReturnFirstOccurrence(parameterSignature, ')'); + // Check that we have exactly 0 or 1 "()", and split the string to retrieve the part inside (); + if ((leftParLoc[0] == 1 && rightParLoc[0] == 1) && (leftParLoc[1] <= rightParLoc[1])) { + parameterSignature = parameterSignature.split("\\(")[1].split("\\)")[0]; + } else if (leftParLoc[0] != 0 || rightParLoc[0] != 0) { + throw new TypeException("malformed parentheses in signature = " + parameterSignature); + } for (String s : parameterSignature.trim().split(",")) { params.add(parseNIDLOrLegacyParameterString(params.size(), s.trim())); } @@ -205,6 +318,14 @@ public int getPosition() { return position; } + public boolean isArray() { + return isArray; + } + + public boolean isConst() { + return isConst; + } + public String getMangledType() { // Simple substitution rule for GCC // Note that this does not implement the substitution rules @@ -271,6 +392,6 @@ public boolean isSynonymousWithPointerTo(Type elementType) { @Override public String toString() { - return "Parameter(position=" + position + ", name=" + name + ", type=" + type + ", kind=" + kind + ")"; + return name; } } diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/ComputationArgumentWithValue.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/ComputationArgumentWithValue.java new file mode 100644 index 00000000..6a3ee641 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/ComputationArgumentWithValue.java @@ -0,0 +1,75 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation; + +import com.nvidia.grcuda.Type; + +import java.util.Objects; + +/** + * Defines a {@link GrCUDAComputationalElement} argument representing the elements of a NFI signature. 
+ * For each argument, store its type, if it's a pointer, + * and if it's constant (i.e. its content cannot be modified in the computation). + * This class also holds a reference to the actual object associated to the argument; + */ +public class ComputationArgumentWithValue extends ComputationArgument { + private final Object argumentValue; + + public ComputationArgumentWithValue(String name, Type type, Kind kind, Object argumentValue) { + super(name, type, kind); + this.argumentValue = argumentValue; + } + + public ComputationArgumentWithValue(ComputationArgument computationArgument, Object argumentValue) { + super(computationArgument.getPosition(), computationArgument.getName(), computationArgument.getType(), computationArgument.getKind()); + this.argumentValue = argumentValue; + } + + public Object getArgumentValue() { return this.argumentValue; } + + @Override + public String toString() { + return name; + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + ComputationArgumentWithValue that = (ComputationArgumentWithValue) o; + return Objects.equals(argumentValue, that.argumentValue); + } + + @Override + public int hashCode() { + return Objects.hash(argumentValue); + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/GrCUDAComputationalElement.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/GrCUDAComputationalElement.java new file mode 100644 index 00000000..97e39176 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/GrCUDAComputationalElement.java @@ -0,0 +1,382 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.computation; + +import com.nvidia.grcuda.CUDAEvent; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.GrCUDALogger; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.computation.streamattach.StreamAttachArchitecturePolicy; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyComputation; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.stream.DefaultStream; +import com.oracle.truffle.api.TruffleLogger; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +import java.util.Collection; +import java.util.List; +import java.util.Optional; +import java.util.stream.Collectors; + +import static com.nvidia.grcuda.GrCUDALogger.COMPUTATION_LOGGER; + +/** + * Basic class that represents GrCUDA computations, + * and is used to model data dependencies between computations; + */ +public abstract class GrCUDAComputationalElement { + + private static final TruffleLogger LOGGER = GrCUDALogger.getLogger(COMPUTATION_LOGGER); + + /** + * This list contains the original set of input arguments that are used to compute dependencies; + */ + protected final List argumentsThatCanCreateDependencies; + /** + * Reference to the execution context where this computation is executed; + */ + protected final AbstractGrCUDAExecutionContext grCUDAExecutionContext; + /** + * Reference to the stream where this computation will be executed, + * if possible (i.e. if the computation can be executed on a custom stream). + * Subclasses can keep an internal reference to the stream, e.g. 
if it can be manually modified by the user, + * but it is required to keep that value consistent to this one if it is modified; + */ + private CUDAStream stream = DefaultStream.get(); + /** + * Reference to the event associated to this computation, and recorded on the stream where this computation is executed, + * before the computation is started. It is used in order to time the execution. + */ + private CUDAEvent eventStart; + /** + * Reference to the event associated to this computation, and recorded on the stream where this computation is executed, + * after the computation is started. It offers a precise synchronization point for children computations. + * If the computation is not executed on a stream, the event is null; + */ + private CUDAEvent eventStop; + /** + * Keep track of whether this computation has already been executed, and represents a "dead" vertex in the DAG. + * Computations that are already executed will not be considered when computing dependencies; + */ + private boolean computationFinished = false; + /** + * Keep track of whether this computation has already been started, to avoid performing the same computation multiple times; + */ + private boolean computationStarted = false; + /** + * Specify if this computational element represents a computation executed on the CPU, + * such as an array access (read or write) on an {@link com.nvidia.grcuda.runtime.array.AbstractArray}. + * CPU computations are assumed synchronous. By default it returns false; + */ + protected boolean isComputationDoneByCPU = false; + + private final DependencyComputation dependencyComputation; + + /** + * True IFF {@link GrCUDAComputationalElement#executionTimeMs} has been set; + */ + private boolean executionTimeMeasured = false; + + /** + * Execution time in milliseconds of this computation. 
The execution time is available only after the end of this computation, + and whether the time has been measured is given by {@link GrCUDAComputationalElement#executionTimeMeasured}. + * Whether the execution time is measurable (and measured) depends on the GrCUDAComputationalElement and on user-specified settings. + */ + private float executionTimeMs = 0; + + /** + * Constructor that takes an argument set initializer to build the set of arguments used in the dependency computation + * @param grCUDAExecutionContext execution context in which this computational element will be scheduled + * @param initializer the initializer used to build the internal set of arguments considered in the dependency computation + */ + public GrCUDAComputationalElement(AbstractGrCUDAExecutionContext grCUDAExecutionContext, InitializeDependencyList initializer) { + this.argumentsThatCanCreateDependencies = initializer.initialize(); + // Initialize by making a copy of the original set; + this.grCUDAExecutionContext = grCUDAExecutionContext; + this.dependencyComputation = grCUDAExecutionContext.getDependencyBuilder().initialize(this.argumentsThatCanCreateDependencies); + } + + /** + * Simplified constructor that takes a list of arguments, and considers all of them in the dependency computation + * @param grCUDAExecutionContext execution context in which this computational element will be scheduled + * @param args the list of arguments provided to the computation.
Arguments are expected to be {@link org.graalvm.polyglot.Value} + */ + public GrCUDAComputationalElement(AbstractGrCUDAExecutionContext grCUDAExecutionContext, List args) { + this(grCUDAExecutionContext, new DefaultExecutionInitializer(args)); + } + + public List getArgumentsThatCanCreateDependencies() { + return argumentsThatCanCreateDependencies; + } + + /** + * Store the execution time for this ComputationalElement (in milliseconds) + * @param executionTimeMs the execution time of this ComputationalElement + */ + public void setExecutionTime(float executionTimeMs) { + this.executionTimeMs = executionTimeMs; + this.executionTimeMeasured = true; + LOGGER.fine(() -> "computation (" + this + "), execution time: " + executionTimeMs + " ms"); + } + + public float getExecutionTime() { + if (this.executionTimeMeasured) { + return this.executionTimeMs; + } else { + throw new GrCUDAException("execution time for computation " + this + " has not been measured!"); + } + } + + /** + * Return if this computation could lead to dependencies with future computations. + * If not, this usually means that all of its arguments have already been superseded by other computations, + * or that the computation didn't have any arguments to begin with; + * @return if the computation could lead to future dependencies + */ + public boolean hasPossibleDependencies() { + return !this.dependencyComputation.getActiveArgumentSet().isEmpty(); + } + + /** + * Schedule this computation for future execution by the {@link AsyncGrCUDAExecutionContext}. + * The scheduling request is separate from the {@link GrCUDAComputationalElement} instantiation + * as we need to ensure that the computational element subclass has been completely instantiated; + */ + public Object schedule() throws UnsupportedTypeException { + return this.grCUDAExecutionContext.registerExecution(this); + } + + /** + * Generic interface to perform the execution of this {@link GrCUDAComputationalElement}.
+ * The actual execution implementation must be added by concrete computational elements. + * The execution request will be done by the {@link AsyncGrCUDAExecutionContext}, after this computation has been scheduled + * using {@link GrCUDAComputationalElement#schedule()} + */ + public abstract Object execute() throws UnsupportedTypeException; + + public CUDAStream getStream() { + return this.stream; + } + + public void setStream(CUDAStream stream) { + this.stream = stream; + } + + public boolean isComputationFinished() { + return computationFinished; + } + + public boolean isComputationStarted() { + return computationStarted; + } + + public void setComputationFinished() { + this.computationFinished = true; + } + + public void setComputationStarted() { + this.computationStarted = true; + } + + public Optional getEventStop() { + if (eventStop != null) { + return Optional.of(eventStop); + } else { + return Optional.empty(); + } + } + + public Optional getEventStart() { + if (eventStart != null) { + return Optional.of(eventStart); + } else { + return Optional.empty(); + } + } + + public void setEventStop(CUDAEvent eventStop) { + this.eventStop = eventStop; + } + + public void setEventStart(CUDAEvent eventStart) { + this.eventStart = eventStart; + } + + /** + * Find whether this computation should be done on a user-specified {@link com.nvidia.grcuda.runtime.stream.CUDAStream}; + * If not, the stream will be provided internally using the specified execution policy. By default, return false; + * @return if the computation is done on a custom CUDA stream; + */ + public boolean useManuallySpecifiedStream() { + return false; + } + + /** + * Some computational elements, like kernels, can be executed on different {@link CUDAStream} to provide + * parallel asynchronous execution. 
Other computations, such as array reads, do not require streams, or cannot be + * executed on streams different from the {@link DefaultStream}; + * @return if this computation can be executed on a customized stream + */ + public boolean canUseStream() { + return false; + } + + // TODO: currently not supported. It is not clear what the synchronization semantic for the default stream is. + // It is better to just always execute computations on the default stream synchronously. +// /** +// * Some computational elements, like some CUDA library functions, do not expose the option to use arbitrary streams. +// * In these cases, we still allow asynchronous execution using events etc., but the computation +// * is always executed on the default stream. +// * If this function returns true, {@link GrCUDAComputationalElement#canUseStream()} must also be true. +// * Otherwise, returning true has no effect; +// * @return if this computation must be executed on the default stream; +// */ +// public boolean mustUseDefaultStream() { return false; } + + /** + * Provide a way to associate input arrays allocated using managed memory to the stream + * on which this kernel is executed. This is required by pre-Pascal GPUs to allow the CPU to access + * managed memory belonging to arrays not used by kernels running on the GPU. 
By default, the implementation is empty, as {@link GrCUDAComputationalElement#canUseStream} is false; + */ + public final void associateArraysToStream() { + grCUDAExecutionContext.getArrayStreamArchitecturePolicy().execute(this::associateArraysToStreamImpl); + } + + /** + * Actual implementation of {@link GrCUDAComputationalElement#associateArraysToStream()}, + * to be modified by concrete computational elements; + */ + protected void associateArraysToStreamImpl() {} + + /** + * Retrieve how the dependency computations are computed; + */ + public DependencyComputation getDependencyComputation() { return dependencyComputation; } + + /** + * Update, for all the {@link com.nvidia.grcuda.runtime.array.AbstractArray} in the computation, the list of devices where each array is up-to-date. + * This implementation is meant for GPU computations that use streams, e.g. kernels and GPU libraries. + * CPU computations (e.g. array accesses) should re-implement this function to track the CPU. + * GPU computations don't use custom streams only if they are synchronized (e.g. when using the sync scheduler), + * and there's no benefit in tracking their location. + * Locations are updated BEFORE the start of the actual computation: if another computation is scheduled after + * the current one, it will be scheduled assuming that the data transfer for this computation has already taken place. + * This assumption can avoid duplicate data movements, e.g. with + * (Xr) -> ... + * (Xr) -> ... + * we can avoid transferring X twice, and schedule the second kernel on the GPU where X will be already present; + */ + public void updateLocationOfArrays() { + for (ComputationArgumentWithValue o : this.argumentsThatCanCreateDependencies) { + // Ignore non-array arguments.
Also, don't update locations if the ComputationalElement does not use streams; + if (o.getArgumentValue() instanceof AbstractArray && this.canUseStream()) { + AbstractArray a = (AbstractArray) o.getArgumentValue(); + // If the argument is read-only, add the location of this ComputationalElement to the array; + if (grCUDAExecutionContext.isConstAware() && o.isConst()) { + a.addArrayUpToDateLocations(this.stream.getStreamDeviceId()); + } else { + // Clear the list of up-to-date locations: only the current device has the updated array; + a.resetArrayUpToDateLocations(this.stream.getStreamDeviceId()); + } + } + } + } + + /** + * Obtain the list of input arguments for this computation that are arrays; + * @return a list of arrays that are inputs for this computation + */ + public List getArrayArguments(){ + // Note: "argumentsThatCanCreateDependencies" is a filter applied to the original inputs, + // so we have no guarantees that it contains all the input arrays. + // In practice, "argumentsThatCanCreateDependencies" is already a selection of the input arrays, + // making the filter below unnecessary. + // If for whatever reason we have an argumentsThatCanCreateDependencies list that does not contain all the input arrays, + // we need to store the original input list in this class as well, and apply the filter below to that list. + return this.argumentsThatCanCreateDependencies.stream() + .filter(ComputationArgument::isArray) + .map(a -> (AbstractArray) a.getArgumentValue()) + .collect(Collectors.toList()); + } + + /** + * Computes if the "other" GrCUDAComputationalElement has dependencies w.r.t. this kernel, + * such as requiring as input a value computed by this kernel; + * @param other kernel for which we want to check dependencies, w.r.t.
this kernel + * @return the list of arguments that the two kernels have in common + */ + public Collection computeDependencies(GrCUDAComputationalElement other) { + return this.dependencyComputation.computeDependencies(other); + } + + /** + * Compute and return an additional stream dependency used by this computation. + * This function is used by {@link com.nvidia.grcuda.runtime.stream.GrCUDAStreamManager} to synchronize streams + * that might not be directly used by this computation, but that have to be synchronized for this computation + * to take place correctly. For example, in pre-Pascal GPUs it is required to ensure that no kernel is running if + * the array accessed is visible to the global stream. + * The actual invocation is wrapped by a {@link StreamAttachArchitecturePolicy}, + * as the invocation depends on the GPU architecture; + * @return An additional stream to synchronize + */ + public final Optional additionalStreamDependency() { + return grCUDAExecutionContext.getArrayStreamArchitecturePolicy().execute(this::additionalStreamDependencyImpl); + } + + /** + * Actual implementation of {@link GrCUDAComputationalElement#additionalStreamDependency}, it can be overridden + * by concrete computations to provide additional streams for synchronization; + * @return An additional stream to synchronize + */ + protected Optional additionalStreamDependencyImpl() { + return Optional.empty(); + } + + /** + * The default initializer will simply store all the arguments, + * and consider each of them in the dependency computations; + */ + private static class DefaultExecutionInitializer implements InitializeDependencyList { + private final List args; + + DefaultExecutionInitializer(List args) { + this.args = args; + } + + @Override + public List initialize() { + return args; + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/InitializeDependencyList.java 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/InitializeDependencyList.java new file mode 100644 index 00000000..e5c96488 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/InitializeDependencyList.java @@ -0,0 +1,42 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation; + +import java.util.List; + +public interface InitializeDependencyList { + /** + * Used by different {@link GrCUDAComputationalElement} to initialize the list of arguments + * considered in the dependency evaluation. + * @return a list of arguments used in the dependency evaluation + */ + List<ComputationArgumentWithValue> initialize(); +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/KernelExecution.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/KernelExecution.java new file mode 100644 index 00000000..58d18bab --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/KernelExecution.java @@ -0,0 +1,156 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution.
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation; + +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.ConfiguredKernel; +import com.nvidia.grcuda.runtime.Kernel; +import com.nvidia.grcuda.runtime.KernelArguments; +import com.nvidia.grcuda.runtime.KernelConfig; +import com.nvidia.grcuda.runtime.executioncontext.AsyncGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.stream.DefaultStream; + +import java.util.Arrays; +import java.util.List; +import java.util.stream.Collectors; + +/** + * Class used to track the single execution of a {@link ConfiguredKernel}. + * The execution will be provided to the {@link AsyncGrCUDAExecutionContext} and scheduled accordingly. 
+ */ +public class KernelExecution extends GrCUDAComputationalElement { + + private final Kernel kernel; + private final ConfiguredKernel configuredKernel; + private final KernelConfig config; + private final KernelArguments args; + + public KernelExecution(ConfiguredKernel configuredKernel, KernelArguments args) { + super( + configuredKernel.getKernel().getGrCUDAExecutionContext(), + new KernelExecutionInitializer(args) + ); + this.configuredKernel = configuredKernel; + this.kernel = configuredKernel.getKernel(); + this.config = configuredKernel.getConfig(); + this.args = args; + } + + @Override + public void setExecutionTime(float executionTimeMs) { + // Store the execution time inside the ProfiledComputation storage; + this.configuredKernel.addExecutionTime(this.config.getStream().getStreamDeviceId(), executionTimeMs); + // Always store the execution time in the ComputationalElement as well; + super.setExecutionTime(executionTimeMs); + } + + @Override + public Object execute() { + grCUDAExecutionContext.getCudaRuntime().cuLaunchKernel(kernel, config, args, this.getStream()); + return NoneValue.get(); + } + + public KernelArguments getArgs() { + return args; + } + + /** + * Setting the stream must be done inside the {@link KernelConfig}; + * @param stream the stream where this computation will be executed + */ + @Override + public void setStream(CUDAStream stream) { + // Make sure that the internal reference is consistent; + super.setStream(stream); + } + + /** + * Retrieve the stream stored in the {@link KernelConfig} if it has been manually specified by the user, + * otherwise return the one automatically chosen by the {@link com.nvidia.grcuda.runtime.stream.GrCUDAStreamManager}; + * @return the stream where this computation will be executed + */ + @Override + public CUDAStream getStream() { + return config.useCustomStream() ? 
config.getStream() : super.getStream(); + } + + @Override + public boolean useManuallySpecifiedStream() { return config.useCustomStream(); } + + @Override + public boolean canUseStream() { return true; } + + @Override + public void associateArraysToStreamImpl() { + for (ComputationArgumentWithValue a : args.getKernelArgumentWithValues()) { + if (a.getArgumentValue() instanceof AbstractArray) { + AbstractArray array = (AbstractArray) a.getArgumentValue(); + if (getDependencyComputation().streamResetAttachFilter(a)) { + // If the array was attached to a stream, and now it is a const parameter, reset its visibility to the default stream; + if (!array.getStreamMapping().isDefaultStream()) { + grCUDAExecutionContext.getCudaRuntime().cudaStreamAttachMemAsync(DefaultStream.get(), array); + } + } else if (!array.getStreamMapping().equals(this.getStream())) { + // Attach the array to the stream if the array isn't already attached to this stream; + grCUDAExecutionContext.getCudaRuntime().cudaStreamAttachMemAsync(this.getStream(), array); + } + } + } + } + + @Override + public String toString() { + String event = this.getEventStop().isPresent() ? Long.toString(this.getEventStop().get().getEventNumber()) : "NULL"; + return "kernel=" + kernel.getKernelName() + "; args=[" + + Arrays.stream(args.getOriginalArgs()).map(a -> Integer.toString(System.identityHashCode(a))).collect(Collectors.joining(", ")) + + "]" + "; stream=" + this.getStream().getStreamNumber() + "; event=" + event; + } + + static class KernelExecutionInitializer implements InitializeDependencyList { + private final KernelArguments args; + + KernelExecutionInitializer(KernelArguments args) { + this.args = args; + } + + @Override + public List<ComputationArgumentWithValue> initialize() { + // TODO: what about scalars? We cannot treat them in the same way, as they are copied and not referenced + // There should be a semantic to manually specify scalar dependencies?
For now we have to skip them; + return this.args.getKernelArgumentWithValues().stream() + .filter(ComputationArgument::isArray).collect(Collectors.toList()); + } + } +} + + diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayAccessExecution.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayAccessExecution.java new file mode 100644 index 00000000..44210e31 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayAccessExecution.java @@ -0,0 +1,57 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.computation.InitializeDependencyList; +import com.nvidia.grcuda.runtime.executioncontext.AbstractGrCUDAExecutionContext; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +import java.util.Optional; + +/** + * Abstract class that wraps all computational elements representing accesses on managed memory by the CPU; + */ +public abstract class ArrayAccessExecution<T extends AbstractArray> extends GrCUDAComputationalElement { + + protected T array; + public static final boolean COMPUTATION_IS_DONE_BY_CPU = true; + + public ArrayAccessExecution(AbstractGrCUDAExecutionContext grCUDAExecutionContext, InitializeDependencyList initializer, T array) { + super(grCUDAExecutionContext, initializer); + this.array = array; + this.isComputationDoneByCPU = COMPUTATION_IS_DONE_BY_CPU; + } + + @Override + protected Optional<CUDAStream> additionalStreamDependencyImpl() { return Optional.of(array.getStreamMapping()); } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayAccessExecutionInitializer.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayAccessExecutionInitializer.java new file mode 100644 index 00000000..14c359c8 --- /dev/null +++
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayAccessExecutionInitializer.java @@ -0,0 +1,64 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.computation.InitializeDependencyList; + +import java.util.Collections; +import java.util.List; + +/** + * The only argument in {@link com.nvidia.grcuda.runtime.array.AbstractArray} computations is the array itself. + * Note that in {@link com.nvidia.grcuda.runtime.array.MultiDimDeviceArrayView} the array is the parent {@link com.nvidia.grcuda.runtime.array.MultiDimDeviceArray}, + * while in {@link com.nvidia.grcuda.runtime.array.MultiDimDeviceArray} there is currently no need to explicitly represent computations, + * as they cannot directly access the underlying memory; + */ +class ArrayAccessExecutionInitializer<T extends AbstractArray> implements InitializeDependencyList { + + private final T array; + private final boolean readOnly; + private final static String PARAMETER_NAME = "array_access"; + + ArrayAccessExecutionInitializer(T array, boolean readOnly) { + this.array = array; + this.readOnly = readOnly; + } + + @Override + public List<ComputationArgumentWithValue> initialize() { + return Collections.singletonList( + new ComputationArgumentWithValue(PARAMETER_NAME, Type.NFI_POINTER, this.readOnly ?
ComputationArgument.Kind.POINTER_IN : ComputationArgument.Kind.POINTER_INOUT, this.array)); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecution.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecution.java new file mode 100644 index 00000000..48bc4c66 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecution.java @@ -0,0 +1,114 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.functions.DeviceArrayCopyFunction; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.oracle.truffle.api.CompilerDirectives; + +import java.util.Optional; + +/** + * Computational element that represents a low-level memory copy from/to an {@link AbstractArray} + */ +public abstract class ArrayCopyFunctionExecution extends GrCUDAComputationalElement { + + /** + * The {@link AbstractArray} used in the copy; + */ + protected final AbstractArray array; + /** + * Whether this computation copies data from the array or writes to it; + */ + protected final DeviceArrayCopyFunction.CopyDirection direction; + /** + * Number of elements copied (expressed as number of elements, not as a size in bytes); + */ + protected final long numElements; + + public static final boolean COMPUTATION_IS_DONE_BY_CPU = true; + + public ArrayCopyFunctionExecution(AbstractArray array, DeviceArrayCopyFunction.CopyDirection direction, long numElements, ArrayCopyFunctionExecutionInitializer dependencyInitializer) { + super(array.getGrCUDAExecutionContext(), dependencyInitializer); + this.array = array; + this.direction = direction; + this.numElements
= numElements; + this.isComputationDoneByCPU = COMPUTATION_IS_DONE_BY_CPU; + } + + @Override + public Object execute() { + if (this.numElements * this.array.getElementType().getSizeBytes() > this.array.getSizeBytes()) { + CompilerDirectives.transferToInterpreter(); + throw new IndexOutOfBoundsException(); + } + this.executeInner(); + this.setComputationFinished(); + return NoneValue.get(); + } + + /** + * Provide different implementations of the copy execution, depending on whether we operate on pointers, arrays, etc. + */ + abstract void executeInner(); + + @Override + public void updateLocationOfArrays() { + // FIXME: we should also consider the other array: if it is a DeviceArray its location is also updated; + if (direction == DeviceArrayCopyFunction.CopyDirection.FROM_POINTER) { + // We are copying new data to the array, so reset its status to updated on CPU; + array.resetArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } else { + // We are copying new data from the array (on the CPU) to somewhere else, + // so the CPU must have updated data. 
It requires a sync if the context is not const-aware; + if (array.getGrCUDAExecutionContext().isConstAware()) { + array.addArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } else { + // Clear the list of up-to-date locations: only the CPU has the updated array; + array.resetArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } + } + } + @Override + protected Optional<CUDAStream> additionalStreamDependencyImpl() { + return Optional.of(array.getStreamMapping()); + } + + @Override + public String toString() { + return "array copy on " + System.identityHashCode(this.array) + "; direction=" + this.direction + "; size=" + this.numElements; + } +} + diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionDefault.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionDefault.java new file mode 100644 index 00000000..fbbf687c --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionDefault.java @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.functions.DeviceArrayCopyFunction; +import com.oracle.truffle.api.CompilerDirectives; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.InvalidArrayIndexException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.library.CachedLibrary; +import com.oracle.truffle.api.profiles.ValueProfile; + +/** + * Slow-path implementation of the copy function on {@link AbstractArray}; it copies data using a simple loop.
+ * It is not as fast as a memcpy, but it avoids some overheads of doing the copy in the host language; + */ +public class ArrayCopyFunctionExecutionDefault extends ArrayCopyFunctionExecution { + /** + * InteropLibrary object used to access the other array's elements; + */ + private final InteropLibrary pointerAccess; + /** + * Object that identifies the other array from/to which we copy data; + */ + private final Object otherArray; + + public ArrayCopyFunctionExecutionDefault(AbstractArray array, DeviceArrayCopyFunction.CopyDirection direction, long numElements, + @CachedLibrary(limit = "3") InteropLibrary pointerAccess, + Object otherArray, ArrayCopyFunctionExecutionInitializer dependencyInitializer) { + super(array, direction, numElements, dependencyInitializer); + this.pointerAccess = pointerAccess; + this.otherArray = otherArray; + } + + @Override + void executeInner() { + ValueProfile elementTypeProfile = ValueProfile.createIdentityProfile(); + try { + if (direction == DeviceArrayCopyFunction.CopyDirection.FROM_POINTER) { + InteropLibrary valueLibrary = InteropLibrary.getFactory().createDispatched(5); + for (long i = 0; i < this.numElements; i++) { + this.array.writeNativeView(i, this.pointerAccess.readArrayElement(this.otherArray, i), valueLibrary, elementTypeProfile); + } + } else if (direction == DeviceArrayCopyFunction.CopyDirection.TO_POINTER) { + for (long i = 0; i < this.numElements; i++) { + this.pointerAccess.writeArrayElement(this.otherArray, i, this.array.readNativeView(i, elementTypeProfile)); + } + } else { + CompilerDirectives.transferToInterpreter(); + throw new DeviceArrayCopyException("invalid direction for copy: " + direction); + } + } catch (InvalidArrayIndexException | UnsupportedMessageException | UnsupportedTypeException e) { + throw new DeviceArrayCopyException("invalid array copy: " + e); + } + } + + @Override + public String toString() { + try { + return "array copy on " + System.identityHashCode(array) + "; direction=" + direction + 
"; target=" + pointerAccess.asString(this.otherArray) + "; size=" + numElements; + } catch (UnsupportedMessageException e) { + return "array copy on " + System.identityHashCode(array) + "; direction=" + direction + "; size=" + numElements; + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionInitializer.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionInitializer.java new file mode 100644 index 00000000..25086ee6 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionInitializer.java @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.Type; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.functions.DeviceArrayCopyFunction; +import com.nvidia.grcuda.runtime.computation.ComputationArgument; +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.computation.InitializeDependencyList; + +import java.util.ArrayList; +import java.util.List; + +public class ArrayCopyFunctionExecutionInitializer implements InitializeDependencyList { + + private final AbstractArray array; + private final Object otherArray; + private final DeviceArrayCopyFunction.CopyDirection direction; + private final static String PARAMETER_NAME_1 = "array_copy_function_arg_1"; + private final static String PARAMETER_NAME_2 = "array_copy_function_arg_2"; + + public ArrayCopyFunctionExecutionInitializer(AbstractArray array, Object otherArray, DeviceArrayCopyFunction.CopyDirection direction) { + this.array = array; + this.direction = direction; + this.otherArray = otherArray; + } + + @Override + public List<ComputationArgumentWithValue> initialize() { + ArrayList<ComputationArgumentWithValue> dependencyList = new ArrayList<>(); + dependencyList.add(new ComputationArgumentWithValue(PARAMETER_NAME_1, Type.NFI_POINTER, + this.direction.equals(DeviceArrayCopyFunction.CopyDirection.FROM_POINTER) ?
ComputationArgument.Kind.POINTER_OUT : ComputationArgument.Kind.POINTER_IN, this.array)); + // If we are copying from/to another DeviceArray, that's also a dependency; + if (otherArray instanceof AbstractArray) { + dependencyList.add(new ComputationArgumentWithValue(PARAMETER_NAME_2, Type.NFI_POINTER, + this.direction.equals(DeviceArrayCopyFunction.CopyDirection.FROM_POINTER) ? ComputationArgument.Kind.POINTER_IN : ComputationArgument.Kind.POINTER_OUT, this.otherArray)); + } + return dependencyList; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionMemcpy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionMemcpy.java new file mode 100644 index 00000000..6c57ebab --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/ArrayCopyFunctionExecutionMemcpy.java @@ -0,0 +1,81 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.functions.DeviceArrayCopyFunction; +import com.oracle.truffle.api.CompilerDirectives; + +/** + * Fastest {@link AbstractArray} memcpy implementation; it operates using a cudaMemcpy directly on a native pointer.
+ * This implementation is used when copying data between AbstractArrays, or when copying data from/to an array backed + * by native memory, such as numpy arrays; + */ +public class ArrayCopyFunctionExecutionMemcpy extends ArrayCopyFunctionExecution { + /** + * A memory pointer from which data copied to the array are retrieved, or memory pointer to which data are written; + */ + private final long pointer; + + public ArrayCopyFunctionExecutionMemcpy(AbstractArray array, DeviceArrayCopyFunction.CopyDirection direction, long numElements, long pointer, ArrayCopyFunctionExecutionInitializer dependencyInitializer) { + super(array, direction, numElements, dependencyInitializer); + this.pointer = pointer; + } + + @Override + void executeInner() { + long numBytesToCopy = this.numElements * this.array.getElementType().getSizeBytes(); + long fromPointer; + long destPointer; + if (direction == DeviceArrayCopyFunction.CopyDirection.FROM_POINTER) { + fromPointer = pointer; + destPointer = array.getPointer(); + } else if (direction == DeviceArrayCopyFunction.CopyDirection.TO_POINTER) { + fromPointer = array.getPointer(); + destPointer = pointer; + } else { + CompilerDirectives.transferToInterpreter(); + throw new DeviceArrayCopyException("invalid direction for copy: " + direction); + } + // If the array visibility is restricted to a stream, provide the stream to memcpy; + if (array.getStreamMapping().isDefaultStream()) { + grCUDAExecutionContext.getCudaRuntime().cudaMemcpy(destPointer, fromPointer, numBytesToCopy); + } else { + grCUDAExecutionContext.getCudaRuntime().cudaMemcpy(destPointer, fromPointer, numBytesToCopy, array.getStreamMapping()); + } + } + + @Override + public String toString() { + return "array memcpy on " + System.identityHashCode(array) + "; direction=" + direction + "; target=" + pointer + "; size=" + numElements; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayCopyException.java 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayCopyException.java new file mode 100644 index 00000000..0dd285c6 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayCopyException.java @@ -0,0 +1,46 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.oracle.truffle.api.exception.AbstractTruffleException; +import com.oracle.truffle.api.nodes.Node; + +public final class DeviceArrayCopyException extends AbstractTruffleException { + private static final long serialVersionUID = 8614211550329856579L; + + public DeviceArrayCopyException(String message) { + this(message, null); + } + + public DeviceArrayCopyException(String message, Node node) { + super(message, node); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayReadExecution.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayReadExecution.java new file mode 100644 index 00000000..f39baba2 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayReadExecution.java @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.oracle.truffle.api.profiles.ValueProfile; + +public class DeviceArrayReadExecution extends ArrayAccessExecution { + + private final long index; + private final ValueProfile elementTypeProfile; + + public DeviceArrayReadExecution(DeviceArray array, + long index, + ValueProfile elementTypeProfile) { + super(array.getGrCUDAExecutionContext(), new ArrayAccessExecutionInitializer<>(array, true), array); + this.index = index; + this.elementTypeProfile = elementTypeProfile; + } + + @Override + public void updateLocationOfArrays() { + if (array.getGrCUDAExecutionContext().isConstAware()) { + array.addArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } else { + // Clear the list of up-to-date locations: only the CPU has the updated array; + array.resetArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } + } + + @Override + public Object execute() { + Object result = array.readNativeView(index, elementTypeProfile); + this.setComputationFinished(); + return result; + } + + @Override + public String toString() { +// return "DeviceArrayReadExecution(" + +// "array=" + array + +// ", index=" + index + ")"; + return "array read on " + System.identityHashCode(array) + "; index=" + index + "; stream=" + getStream().getStreamNumber(); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayWriteExecution.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayWriteExecution.java new file mode 100644 index 00000000..b78b3aeb --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/DeviceArrayWriteExecution.java @@ -0,0 +1,81 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.DeviceArray; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.profiles.ValueProfile; + +public class DeviceArrayWriteExecution extends ArrayAccessExecution { + + private final long index; + private final Object value; + private final InteropLibrary valueLibrary; + private final ValueProfile elementTypeProfile; + + public DeviceArrayWriteExecution(DeviceArray array, + long index, + Object value, + InteropLibrary valueLibrary, + ValueProfile elementTypeProfile) { + super(array.getGrCUDAExecutionContext(), new ArrayAccessExecutionInitializer<>(array, false), array); + this.index = index; + this.value = value; + this.valueLibrary = valueLibrary; + this.elementTypeProfile = elementTypeProfile; + } + + @Override + public void updateLocationOfArrays() { + // Clear the list of up-to-date locations: only the CPU has the updated array; + array.resetArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } + + @Override + public Object execute() throws UnsupportedTypeException { + array.writeNativeView(index, value, valueLibrary, elementTypeProfile); + this.setComputationFinished(); + return NoneValue.get(); + } + + @Override + public String toString() { +// return "DeviceArrayWriteExecution(" + +// "array=" + array + +// ", index=" + index + +// ", value=" + value + +// ")"; + return "array write on " + System.identityHashCode(array) + "; index=" + index + "; value=" + value + "; stream=" + getStream().getStreamNumber(); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/MultiDimDeviceArrayViewReadExecution.java 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/MultiDimDeviceArrayViewReadExecution.java new file mode 100644 index 00000000..f6b62264 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/MultiDimDeviceArrayViewReadExecution.java @@ -0,0 +1,73 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArrayView; +import com.oracle.truffle.api.profiles.ValueProfile; + +public class MultiDimDeviceArrayViewReadExecution extends ArrayAccessExecution { + + private final long index; + private final ValueProfile elementTypeProfile; + + public MultiDimDeviceArrayViewReadExecution(MultiDimDeviceArrayView array, + long index, + ValueProfile elementTypeProfile) { + super(array.getGrCUDAExecutionContext(), new ArrayAccessExecutionInitializer<>(array.getMdDeviceArray(), true), array); + this.index = index; + this.elementTypeProfile = elementTypeProfile; + } + + @Override + public void updateLocationOfArrays() { + if (array.getGrCUDAExecutionContext().isConstAware()) { + array.addArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } else { + // Clear the list of up-to-date locations: only the CPU has the updated array; + array.resetArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } + } + + @Override + public Object execute() { + Object result = array.readNativeView(index, elementTypeProfile); + this.setComputationFinished(); + return result; + } + + @Override + public String toString() { + return "MultiDimDeviceArrayViewReadExecution(" + + "array=" + array + + ", index=" + index + ")"; + } +} diff --git 
a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/MultiDimDeviceArrayViewWriteExecution.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/MultiDimDeviceArrayViewWriteExecution.java new file mode 100644 index 00000000..7fc5740b --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/arraycomputation/MultiDimDeviceArrayViewWriteExecution.java @@ -0,0 +1,84 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.arraycomputation; + +import com.nvidia.grcuda.NoneValue; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.MultiDimDeviceArrayView; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.interop.UnsupportedTypeException; +import com.oracle.truffle.api.profiles.ValueProfile; + +public class MultiDimDeviceArrayViewWriteExecution extends ArrayAccessExecution { + + private final long index; + private final Object value; + private final InteropLibrary valueLibrary; + private final ValueProfile elementTypeProfile; + + public MultiDimDeviceArrayViewWriteExecution(MultiDimDeviceArrayView array, + long index, + Object value, + InteropLibrary valueLibrary, + ValueProfile elementTypeProfile) { + super(array.getGrCUDAExecutionContext(), new ArrayAccessExecutionInitializer<>(array.getMdDeviceArray(), false), array); + this.index = index; + this.value = value; + this.valueLibrary = valueLibrary; + this.elementTypeProfile = elementTypeProfile; + } + + @Override + public void updateLocationOfArrays() { + if (array.getGrCUDAExecutionContext().isConstAware()) { + array.addArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } else { + // Clear the list of up-to-date locations: only the CPU has the updated array; + array.resetArrayUpToDateLocations(CPUDevice.CPU_DEVICE_ID); + } + } + + @Override + public Object execute() throws 
UnsupportedTypeException { + array.writeNativeView(index, value, valueLibrary, elementTypeProfile); + this.setComputationFinished(); + return NoneValue.get(); + } + + @Override + public String toString() { + return "MultiDimDeviceArrayViewWriteExecution(" + + "array=" + array + + ", index=" + index + + ", value=" + value + + ")"; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DefaultDependencyComputation.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DefaultDependencyComputation.java new file mode 100644 index 00000000..b4a1ede5 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DefaultDependencyComputation.java @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.dependency; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.computation.InitializeDependencyList; +import com.oracle.truffle.api.CompilerDirectives; + +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +/** + * By default, consider all dependencies in the active argument set, + * initially specified by the {@link InitializeDependencyList} interface. 
+ * Also update the active argument set, keeping only the arguments that were not involved in a dependency relation; + */ +public class DefaultDependencyComputation extends DependencyComputation { + + @CompilerDirectives.TruffleBoundary + DefaultDependencyComputation(List<ComputationArgumentWithValue> argumentList) { + activeArgumentSet = new HashSet<>(argumentList); + } + + @CompilerDirectives.TruffleBoundary + @Override + public List<ComputationArgumentWithValue> computeDependencies(GrCUDAComputationalElement other) { + Set<ComputationArgumentWithValue> dependencies = new HashSet<>(); + Set<ComputationArgumentWithValue> newArgumentSet = new HashSet<>(); + for (ComputationArgumentWithValue arg : activeArgumentSet) { + // The other computation requires the current argument, so we have found a new dependency; + if (other.getDependencyComputation().getActiveArgumentSet().contains(arg)) { + dependencies.add(arg); + } else { + // Otherwise, the current argument is still "active", and could enforce a dependency on a future computation; + newArgumentSet.add(arg); + } + } + // Arguments that are not leading to a new dependency could still create new dependencies later on! + activeArgumentSet = newArgumentSet; + // Return the list of arguments that created dependencies with the new computation; + return new ArrayList<>(dependencies); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DefaultDependencyComputationBuilder.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DefaultDependencyComputationBuilder.java new file mode 100644 index 00000000..49b19915 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DefaultDependencyComputationBuilder.java @@ -0,0 +1,43 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.computation.dependency; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; + +import java.util.List; + +public class DefaultDependencyComputationBuilder implements DependencyComputationBuilder { + + @Override + public DefaultDependencyComputation initialize(List<ComputationArgumentWithValue> argumentList) { + return new DefaultDependencyComputation(argumentList); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyComputation.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyComputation.java new file mode 100644 index 00000000..d610a780 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyComputation.java @@ -0,0 +1,78 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.dependency; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; + +import java.util.Collection; + +/** + * Defines how data dependencies between {@link GrCUDAComputationalElement} are found, + * e.g. if read-only or scalar arguments should be ignored. + * It returns the list of arguments that have been found to create side-effects. + * The function is not guaranteed to be pure, + * and is allowed to update information in the {@link GrCUDAComputationalElement}; + */ +public abstract class DependencyComputation { + + /** + * This set contains the input arguments that are considered, at each step, in the dependency computation. + * The set initially coincides with "argumentSet", then arguments are removed from this set once a new dependency is found. + * This is conceptually a set, in the sense that every element is unique. + * Concrete implementations might use other data structures, if required; + */ + protected Collection<ComputationArgumentWithValue> activeArgumentSet; + + /** + * Computes if the "other" GrCUDAComputationalElement has dependencies w.r.t.
this kernel, + * such as requiring as input a value computed by this kernel; + * @param other kernel for which we want to check dependencies, w.r.t. this kernel + * @return the list of arguments that the two kernels have in common + */ + public abstract Collection<ComputationArgumentWithValue> computeDependencies(GrCUDAComputationalElement other); + + public Collection<ComputationArgumentWithValue> getActiveArgumentSet() { + return activeArgumentSet; + } + + /** + * Provide an additional, optional filter used to determine + * if an array argument should have its visibility reset to the {@link com.nvidia.grcuda.runtime.stream.DefaultStream} + * through {@link GrCUDAComputationalElement#associateArraysToStream()}. + * For example, a filter might want to reset the visibility of const array arguments, and ignore the others; + * @param arg an argument to analyse + * @return if this argument's visibility should be reset or not + */ + public boolean streamResetAttachFilter(ComputationArgumentWithValue arg) { + return false; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyComputationBuilder.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyComputationBuilder.java new file mode 100644 index 00000000..1b02cce8 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyComputationBuilder.java @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.computation.dependency; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; + +import java.util.List; + +public interface DependencyComputationBuilder { + DependencyComputation initialize(List<ComputationArgumentWithValue> argumentList); +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyPolicyEnum.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyPolicyEnum.java new file mode 100644 index 00000000..8504e917 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/DependencyPolicyEnum.java @@ -0,0 +1,47 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.dependency; + +public enum DependencyPolicyEnum { + NO_CONST("no-const"), + WITH_CONST("with-const"); + + private final String name; + + DependencyPolicyEnum(String name) { + this.name = name; + } + + @Override + public String toString() { + return name; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/WithConstDependencyComputation.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/WithConstDependencyComputation.java new file mode 100644 index 00000000..2b25ef53 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/WithConstDependencyComputation.java @@ -0,0 +1,90 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.dependency; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; + +import java.util.ArrayList; +import java.util.List; + +/** + * If two computations have the same argument, but it is read-only in both cases (i.e. 
const), + * there is no reason to create a dependency between the two; + */ +public class WithConstDependencyComputation extends DependencyComputation { + + WithConstDependencyComputation(List<ComputationArgumentWithValue> argumentList) { + activeArgumentSet = new ArrayList<>(argumentList); + } + + @Override + public List<ComputationArgumentWithValue> computeDependencies(GrCUDAComputationalElement other) { + List<ComputationArgumentWithValue> dependencies = new ArrayList<>(); + List<ComputationArgumentWithValue> newArgumentSet = new ArrayList<>(); + for (ComputationArgumentWithValue arg : activeArgumentSet) { + boolean dependencyFound = false; + for (ComputationArgumentWithValue otherArg : other.getDependencyComputation().getActiveArgumentSet()) { + // If both arguments are const, we skip the dependency; + if (arg.equals(otherArg) && !(arg.isConst() && otherArg.isConst())) { + dependencies.add(arg); + dependencyFound = true; + // If the other argument is const, the current argument must be added to newArgumentSet + // as it could cause other dependencies in the future; + if (otherArg.isConst()) { + newArgumentSet.add(arg); + } + break; + } + } + if (!dependencyFound) { + // Otherwise, the current argument is still "active", and could enforce a dependency on a future computation; + newArgumentSet.add(arg); + } + } + // Arguments that are not leading to a new dependency could still create new dependencies later on! + activeArgumentSet = newArgumentSet; + // Return the list of arguments that created dependencies with the new computation; + return dependencies; + } + + /** + * If the array was attached to a stream, and now it is a const parameter, reset its visibility to the default stream. + * For simplicity, we reset the visibility of all arguments currently used as const to the default stream. 
+ * This allows scheduling multiple computations that use the same argument as const; + * @param arg an argument to analyse + * @return whether this argument's visibility should be reset or not + */ + @Override + public boolean streamResetAttachFilter(ComputationArgumentWithValue arg) { + return arg.isConst(); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/WithConstDependencyComputationBuilder.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/WithConstDependencyComputationBuilder.java new file mode 100644 index 00000000..80c74245 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/dependency/WithConstDependencyComputationBuilder.java @@ -0,0 +1,43 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
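[Editor's note] The const-aware dependency rule implemented by `WithConstDependencyComputation` above can be illustrated in isolation. The following is a minimal, self-contained sketch (not the real GrCUDA classes: the `Arg` type and `dependencies` helper are hypothetical stand-ins for `ComputationArgumentWithValue` and `computeDependencies`), showing that a shared argument creates a dependency unless it is const on both sides:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative mock of the const-aware dependency rule.
final class ConstDependencyDemo {
    static final class Arg {
        final String name;
        final boolean isConst;
        Arg(String name, boolean isConst) { this.name = name; this.isConst = isConst; }
    }

    // Returns the arguments of `prev` that force a dependency with `next`:
    // a shared argument creates a dependency unless it is read-only (const) in both computations.
    static List<String> dependencies(List<Arg> prev, List<Arg> next) {
        List<String> deps = new ArrayList<>();
        for (Arg a : prev) {
            for (Arg b : next) {
                if (a.name.equals(b.name) && !(a.isConst && b.isConst)) {
                    deps.add(a.name);
                    break;
                }
            }
        }
        return deps;
    }
}
```

For example, an array passed as const to two consecutive kernels creates no dependency between them, so both can be scheduled concurrently; a write on either side does create one.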
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.dependency; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; + +import java.util.List; + +public class WithConstDependencyComputationBuilder implements DependencyComputationBuilder { + + @Override + public WithConstDependencyComputation initialize(List<ComputationArgumentWithValue> argumentList) { + return new WithConstDependencyComputation(argumentList); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/memadvise/MemAdviserEnum.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/memadvise/MemAdviserEnum.java new file mode 100644 index 00000000..f48a5623 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/memadvise/MemAdviserEnum.java @@ -0,0 +1,18 @@ +package com.nvidia.grcuda.runtime.computation.memadvise; + +public enum MemAdviserEnum { + ADVISE_READ_MOSTLY("read-mostly"), + ADVISE_PREFERRED_LOCATION("preferred"), + NONE("none"); + + private final String name; + + MemAdviserEnum(String name) { + this.name = name; + } + + @Override + public String toString() { + return name; + } +} \ No newline at end of file diff --git 
a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/AbstractArrayPrefetcher.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/AbstractArrayPrefetcher.java new file mode 100644 index 00000000..e3a5aa28 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/AbstractArrayPrefetcher.java @@ -0,0 +1,58 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.prefetch; + +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; + +/** + * Class that declares an interface to prefetch the data from CPU to GPU (and possibly vice versa). + * Prefetching requires a GPU with Pascal architecture or newer, and is not required for functionality (it is just a performance optimization). + */ +public abstract class AbstractArrayPrefetcher { + + protected CUDARuntime runtime; + + public AbstractArrayPrefetcher(CUDARuntime runtime) { + this.runtime = runtime; + } + + /** + * Prefetch the arrays of a {@link GrCUDAComputationalElement}. Prefetching is always done asynchronously. 
+ * @param computation a computational element whose array inputs can be prefetched from host to GPU + */ + public abstract void prefetchToGpu(GrCUDAComputationalElement computation); + + public void prefetchToGpu(ExecutionDAG.DAGVertex vertex) { + this.prefetchToGpu(vertex.getComputation()); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/AsyncArrayPrefetcher.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/AsyncArrayPrefetcher.java new file mode 100644 index 00000000..9ef38a33 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/AsyncArrayPrefetcher.java @@ -0,0 +1,66 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.prefetch; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +public class AsyncArrayPrefetcher extends AbstractArrayPrefetcher { + + public AsyncArrayPrefetcher(CUDARuntime runtime) { + super(runtime); + } + + /** + * The default array prefetcher schedules asynchronous prefetching on the arrays used by the computation. + * Only the arrays whose last operation has been a CPU access are prefetched, as the others are already up-to-date on GPU. + * The prefetcher assumes that the GPU allows prefetching (Pascal architecture or newer) and that the arrays are visible to the stream where they are prefetched. + * Technically, we need prefetching only if the array has been modified by the CPU, and we could prefetch only the part that has been modified; + * this simple prefetcher still prefetches everything though. 
+ * @param computation a computational element whose array inputs can be prefetched from host to GPU + */ + @Override + public void prefetchToGpu(GrCUDAComputationalElement computation) { + for (ComputationArgumentWithValue a : computation.getArgumentsThatCanCreateDependencies()) { + if (a.getArgumentValue() instanceof AbstractArray) { + AbstractArray array = (AbstractArray) a.getArgumentValue(); + // The array has been used by the CPU, so we should prefetch it; + if (array.isArrayUpdatedOnCPU()) { + CUDAStream streamToPrefetch = computation.getStream(); + runtime.cudaMemPrefetchAsync(array, streamToPrefetch); + } + } + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/NoneArrayPrefetcher.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/NoneArrayPrefetcher.java new file mode 100644 index 00000000..0aeac116 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/NoneArrayPrefetcher.java @@ -0,0 +1,49 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
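[Editor's note] The gating rule shared by the async and sync prefetchers above — prefetch only the arrays whose last access came from the CPU — can be shown with a small self-contained sketch (the `MockArray` type and `arraysToPrefetch` helper are hypothetical stand-ins for `AbstractArray.isArrayUpdatedOnCPU()` and the loop in `prefetchToGpu`):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: select which array arguments a prefetcher would move to the GPU.
final class PrefetchGateDemo {
    static final class MockArray {
        final String name;
        final boolean updatedOnCPU; // true when the latest access came from host code
        MockArray(String name, boolean updatedOnCPU) { this.name = name; this.updatedOnCPU = updatedOnCPU; }
    }

    // Only arrays last touched by the CPU need a prefetch;
    // the rest are already up to date on the GPU.
    static List<String> arraysToPrefetch(List<MockArray> args) {
        List<String> toPrefetch = new ArrayList<>();
        for (MockArray a : args) {
            if (a.updatedOnCPU) {
                toPrefetch.add(a.name);
            }
        }
        return toPrefetch;
    }
}
```

In the real implementation each selected array is passed to `cudaMemPrefetchAsync` on the computation's stream; the sync variant additionally calls `cudaStreamSynchronize` on that stream.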
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.prefetch; + +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; + +public class NoneArrayPrefetcher extends AbstractArrayPrefetcher { + + public NoneArrayPrefetcher(CUDARuntime runtime) { + super(runtime); + } + + /** + * This array prefetcher doesn't do anything; + * @param computation a computational element whose array inputs can be prefetched from host to GPU + */ + @Override + public void prefetchToGpu(GrCUDAComputationalElement computation) { + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/PrefetcherEnum.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/PrefetcherEnum.java new file mode 100644 index 00000000..d838a44b --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/PrefetcherEnum.java @@ -0,0 +1,37 @@ +/* + * Copyright (c) 2020, 2021, 
NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.computation.prefetch; + +public enum PrefetcherEnum { + NONE, + ASYNC, + SYNC +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/SyncArrayPrefetcher.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/SyncArrayPrefetcher.java new file mode 100644 index 00000000..810d01d6 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/prefetch/SyncArrayPrefetcher.java @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.prefetch; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +public class SyncArrayPrefetcher extends AbstractArrayPrefetcher { + + public SyncArrayPrefetcher(CUDARuntime runtime) { + super(runtime); + } + + /** + * The synchronous array prefetcher schedules prefetching on the arrays used by the computation, and waits for their completion. + * Only the arrays whose last operation has been a CPU access are prefetched, as the others are already up-to-date on GPU. + * The prefetcher assumes that the GPU allows prefetching (Pascal architecture or newer) and that the arrays are visible to the stream where they are prefetched. + * Technically, we need prefetching only if the array has been modified by the CPU, and we could prefetch only the part that has been modified; + * this simple prefetcher still prefetches everything though. 
+ * @param computation a computational element whose array inputs can be prefetched from host to GPU + */ + @Override + public void prefetchToGpu(GrCUDAComputationalElement computation) { + for (ComputationArgumentWithValue a : computation.getArgumentsThatCanCreateDependencies()) { + if (a.getArgumentValue() instanceof AbstractArray) { + AbstractArray array = (AbstractArray) a.getArgumentValue(); + // The array has been used by the CPU, so we should prefetch it; + if (array.isArrayUpdatedOnCPU()) { + CUDAStream streamToPrefetch = computation.getStream(); + runtime.cudaMemPrefetchAsync(array, streamToPrefetch); + runtime.cudaStreamSynchronize(streamToPrefetch); + } + } + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/PostPascalStreamAttachPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/PostPascalStreamAttachPolicy.java new file mode 100644 index 00000000..9461aed8 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/PostPascalStreamAttachPolicy.java @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.streamattach; + +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +import java.util.Optional; +import java.util.concurrent.Callable; + +/** + * GPUs with Pascal architecture or newer (e.g. Tesla P100), with compute capability >= 6.0, + * do not require a managed memory array to be exclusively associated with a single stream to provide + * asynchronous host access to managed memory while a kernel is running. 
+ * As such, no stream association is performed; + */ +public class PostPascalStreamAttachPolicy implements StreamAttachArchitecturePolicy { + + @Override + public void execute(Runnable runnable) { + + } + + @Override + public Optional<CUDAStream> execute(Callable<Optional<CUDAStream>> callable) { + return Optional.empty(); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/PrePascalStreamAttachPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/PrePascalStreamAttachPolicy.java new file mode 100644 index 00000000..4843d04a --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/PrePascalStreamAttachPolicy.java @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.streamattach; + +import com.nvidia.grcuda.GrCUDALogger; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.stream.DefaultStream; + +import java.util.Optional; +import java.util.concurrent.Callable; + +/** + * GPUs with pre-Pascal architectures, with compute capability < 6.0, + * require a managed memory array to be exclusively associated with a single stream to provide + * asynchronous host access to managed memory while a kernel is running. 
+ * This interface wraps and executes the array association function specified in {@link GrCUDAComputationalElement} + */ +public class PrePascalStreamAttachPolicy implements StreamAttachArchitecturePolicy { + + @Override + public void execute(Runnable runnable) { + runnable.run(); + } + + @Override + public Optional execute(Callable> callable) { + try { + return callable.call(); + } catch(Exception e) { + GrCUDALogger.getLogger(GrCUDALogger.COMPUTATION_LOGGER).warning("failed to compute stream dependency, returning default stream"); + return Optional.of(DefaultStream.get()); + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/StreamAttachArchitecturePolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/StreamAttachArchitecturePolicy.java new file mode 100644 index 00000000..af68b759 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/computation/streamattach/StreamAttachArchitecturePolicy.java @@ -0,0 +1,50 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.computation.streamattach; + +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +import java.util.Optional; +import java.util.concurrent.Callable; + +/** + * GPUs with a pre-Pascal architecture (compute capability < 6.0) + * require a managed memory array to be exclusively associated with a single stream to provide + * asynchronous host access to managed memory while a kernel is running. + * This interface wraps and executes the array association function specified in {@link GrCUDAComputationalElement}. + * The array association is performed only if the available GPU has compute capability < 6.0; + */ +public interface StreamAttachArchitecturePolicy { + void execute(Runnable runnable); + + Optional<CUDAStream> execute(Callable<Optional<CUDAStream>> callable); +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/AbstractGrCUDAExecutionContext.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/AbstractGrCUDAExecutionContext.java new file mode 100644 index 00000000..42063248 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/AbstractGrCUDAExecutionContext.java @@ -0,0 +1,197 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED.
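The stream-attach policy pattern added in the files above can be summarized as: select, once, a strategy object based on the device compute capability, so pre-Pascal GPUs always run the stream-association function while newer GPUs skip it entirely. The following is a minimal, self-contained sketch of that pattern; all names (`AttachPolicy`, `forComputeCapability`, the string stream labels) are invented for illustration and are not GrCUDA's API.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

// Hypothetical sketch of the architecture-dependent policy selection:
// pre-Pascal (compute capability < 6.0) must run the association function,
// post-Pascal can skip it and return an empty result.
public class PolicyDemo {

    interface AttachPolicy {
        <T> Optional<T> execute(Callable<Optional<T>> callable);
    }

    // Pre-Pascal: run the association function, fall back to empty on failure;
    static final AttachPolicy RUN = new AttachPolicy() {
        public <T> Optional<T> execute(Callable<Optional<T>> c) {
            try { return c.call(); } catch (Exception e) { return Optional.empty(); }
        }
    };

    // Post-Pascal: no association is needed, the function is never invoked;
    static final AttachPolicy SKIP = new AttachPolicy() {
        public <T> Optional<T> execute(Callable<Optional<T>> c) { return Optional.empty(); }
    };

    // Select the policy from the device compute capability (major version);
    public static AttachPolicy forComputeCapability(int major) {
        return major < 6 ? RUN : SKIP;
    }

    public static void main(String[] args) {
        Optional<String> pre = forComputeCapability(5).execute(() -> Optional.of("stream-0"));
        Optional<String> post = forComputeCapability(7).execute(() -> Optional.of("stream-0"));
        if (!pre.equals(Optional.of("stream-0"))) throw new AssertionError();
        if (post.isPresent()) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Encapsulating the architecture check in a policy object keeps the caller free of `if (computeCapability < 6)` branches at every array access.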
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.executioncontext; + +import com.nvidia.grcuda.Binding; +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.GrCUDALogger; +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.DeviceList; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.Kernel; +import com.nvidia.grcuda.runtime.computation.streamattach.StreamAttachArchitecturePolicy; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.computation.dependency.DefaultDependencyComputationBuilder; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyComputationBuilder; +import com.nvidia.grcuda.runtime.computation.dependency.WithConstDependencyComputationBuilder; +import com.nvidia.grcuda.runtime.computation.prefetch.AbstractArrayPrefetcher; +import com.nvidia.grcuda.runtime.computation.prefetch.NoneArrayPrefetcher; +import com.oracle.truffle.api.TruffleLogger; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +import java.util.HashSet; +import java.util.Set; + +/** + * Abstract class that defines how {@link GrCUDAComputationalElement} are registered and scheduled for execution. 
+ * It monitors the state of the GrCUDA execution, keeping track of allocated memory, + * kernels and other executable functions, and of dependencies between elements. + */ +public abstract class AbstractGrCUDAExecutionContext { + + protected static final TruffleLogger LOGGER = GrCUDALogger.getLogger(GrCUDALogger.EXECUTIONCONTEXT_LOGGER); + + /** + * Reference to the inner {@link CUDARuntime} used to execute kernels and other {@link GrCUDAComputationalElement} + */ + protected final CUDARuntime cudaRuntime; + + /** + * Set that contains all the arrays allocated so far. + */ + protected final Set<AbstractArray> arraySet = new HashSet<>(); + + /** + * Set that contains all the CUDA kernels declared so far. + */ + protected final Set<Kernel> kernelSet = new HashSet<>(); + + /** + * Reference to the computational DAG that represents dependencies between computations; + */ + protected final ExecutionDAG dag; + + /** + * Reference to how dependencies between computational elements are computed within this execution context; + */ + private final DependencyComputationBuilder dependencyBuilder; + + /** + * Reference to the prefetching strategy to use in this execution context; + */ + protected AbstractArrayPrefetcher arrayPrefetcher; + + /** + * True if we consider that an argument can be "const" in the scheduling; + */ + private final boolean isConstAware; + + public AbstractGrCUDAExecutionContext(CUDARuntime cudaRuntime, GrCUDAOptionMap options) { + this.cudaRuntime = cudaRuntime; + // Compute the dependency policy to use; + switch (options.getDependencyPolicy()) { + case WITH_CONST: + this.isConstAware = true; + this.dependencyBuilder = new WithConstDependencyComputationBuilder(); + break; + case NO_CONST: + this.isConstAware = false; + this.dependencyBuilder = new DefaultDependencyComputationBuilder(); + break; + default: + LOGGER.severe(() -> "Cannot create a GrCUDAExecutionContext.
The selected dependency policy is not valid: " + options.getDependencyPolicy()); + throw new GrCUDAException("selected dependency policy is not valid: " + options.getDependencyPolicy()); + } + // By default, assume no prefetching; + arrayPrefetcher = new NoneArrayPrefetcher(this.cudaRuntime); + // Initialize the DAG; + this.dag = new ExecutionDAG(options.getDependencyPolicy()); + } + + /** + * Register this computation for future execution by the {@link AbstractGrCUDAExecutionContext}, + * and add it to the current computational DAG. + * The actual execution might be deferred depending on the inferred data dependencies; + */ + abstract public Object registerExecution(GrCUDAComputationalElement computation) throws UnsupportedTypeException; + + public void registerArray(AbstractArray array) { + arraySet.add(array); + } + + public void registerKernel(Kernel kernel) { + kernelSet.add(kernel); + } + + public ExecutionDAG getDag() { + return dag; + } + + public CUDARuntime getCudaRuntime() { + return cudaRuntime; + } + + public DependencyComputationBuilder getDependencyBuilder() { + return dependencyBuilder; + } + + public boolean isConstAware() { + return isConstAware; + } + + // Functions used to interface directly with the CUDA runtime; + + public Kernel loadKernel(Binding binding) { + return cudaRuntime.loadKernel(this, binding); + } + + public Kernel buildKernel(String code, String kernelName, String signature) { + return cudaRuntime.buildKernel(this, code, kernelName, signature); + } + + public StreamAttachArchitecturePolicy getArrayStreamArchitecturePolicy() { + return cudaRuntime.getArrayStreamArchitecturePolicy(); + } + + public boolean isArchitecturePascalOrNewer() { + return this.cudaRuntime.isArchitectureIsPascalOrNewer(); + } + + public int getCurrentGPU() { + return this.cudaRuntime.getCurrentGPU(); + } + + /** + * Return a list of GPU devices managed by this execution context; + * @return a list of GPU devices managed by this execution context; + */ + 
public abstract DeviceList getDeviceList(); + + /** + * Return a specific GPU device managed by this execution context; + * @return a GPU device managed by this execution context; + */ + public abstract Device getDevice(int deviceId); + + /** + * Check if any computation is currently marked as active, and is running on a stream managed by this context. + * If so, scheduling of new computations is likely to require synchronizations of some sort; + * @return if any computation is considered active on a stream managed by this context + */ + public abstract boolean isAnyComputationActive(); + + /** + * Delete internal structures that require manual cleanup operations; + */ + public void cleanup() { } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/AsyncGrCUDAExecutionContext.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/AsyncGrCUDAExecutionContext.java new file mode 100644 index 00000000..6a8f87c2 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/AsyncGrCUDAExecutionContext.java @@ -0,0 +1,147 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.executioncontext; + +import com.nvidia.grcuda.GrCUDAContext; +import com.nvidia.grcuda.GrCUDALogger; +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.DeviceList; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.computation.prefetch.AsyncArrayPrefetcher; +import com.nvidia.grcuda.runtime.stream.GrCUDAStreamManager; +import com.oracle.truffle.api.TruffleLanguage; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +/** + * Class used to monitor the state of GrCUDA execution, keep track of memory allocated, + * kernels and other executable functions, and dependencies between elements. 
+ */ +public class AsyncGrCUDAExecutionContext extends AbstractGrCUDAExecutionContext { + + /** + * Reference to the {@link com.nvidia.grcuda.runtime.stream.GrCUDAStreamManager} that takes care of + * scheduling computations on different streams; + */ + private final GrCUDAStreamManager streamManager; + + public AsyncGrCUDAExecutionContext(GrCUDAContext context, TruffleLanguage.Env env) { + this(new CUDARuntime(context, env), context.getOptions()); + } + + public AsyncGrCUDAExecutionContext(CUDARuntime cudaRuntime, GrCUDAOptionMap options) { + this(cudaRuntime, options, new GrCUDAStreamManager(cudaRuntime, options)); + } + + public AsyncGrCUDAExecutionContext(CUDARuntime cudaRuntime, GrCUDAOptionMap options, GrCUDAStreamManager streamManager) { + super(cudaRuntime, options); + this.streamManager = streamManager; + // Compute if we should use a prefetcher; + if (options.isInputPrefetch() && this.cudaRuntime.isArchitectureIsPascalOrNewer()) { + arrayPrefetcher = new AsyncArrayPrefetcher(this.cudaRuntime); + } + } + + /** + * Register this computation for future execution by the {@link AsyncGrCUDAExecutionContext}, + * and add it to the current computational DAG. 
+ * The actual execution might be deferred depending on the inferred data dependencies; + */ + @Override + public Object registerExecution(GrCUDAComputationalElement computation) throws UnsupportedTypeException { + // Add the new computation to the DAG + ExecutionDAG.DAGVertex vertex = dag.append(computation); + + // Compute the stream where the computation will be done, if the computation can be performed asynchronously; + streamManager.assignStream(vertex); + + // Prefetching; + arrayPrefetcher.prefetchToGpu(vertex); + + // Start the computation; + Object result = executeComputation(vertex); + + // Associate a CUDA event to this computation, if performed asynchronously; + streamManager.assignEventStop(vertex); + + GrCUDALogger.getLogger(GrCUDALogger.EXECUTIONCONTEXT_LOGGER).finest(() -> "-- running " + vertex.getComputation()); + + return result; + } + + @Override + public DeviceList getDeviceList() { + // The device list is created only once, and we always return the same device list object. + // This is just an optimization to avoid creating new objects; + return this.getStreamManager().getDeviceList(); + } + + @Override + public Device getDevice(int deviceId) { + // The device list is created only once, and we always return the same device object. 
+ // This is just an optimization to avoid creating new objects; + return this.getStreamManager().getDevice(deviceId); + } + + @Override + public boolean isAnyComputationActive() { + return this.streamManager.isAnyComputationActive(); + } + + public GrCUDAStreamManager getStreamManager() { + return streamManager; + } + + /** + * Delete internal structures that require manual cleanup operations; + */ + @Override + public void cleanup() { + streamManager.cleanup(); + } + + private Object executeComputation(ExecutionDAG.DAGVertex vertex) throws UnsupportedTypeException { + // Before starting this computation, ensure that all its parents have finished their computation; + streamManager.syncParentStreams(vertex); + + // Perform the computation; + vertex.getComputation().setComputationStarted(); + + // For all input arrays, update whether this computation is an array access done by the CPU; + vertex.getComputation().updateLocationOfArrays(); + + // Associate a CUDA event to the starting phase of the computation in order to get the Elapsed time from start to the end + streamManager.assignEventStart(vertex); + + return vertex.getComputation().execute(); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/ExecutionDAG.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/ExecutionDAG.java new file mode 100644 index 00000000..4c96751a --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/ExecutionDAG.java @@ -0,0 +1,378 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
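`registerExecution` above schedules a computation only after computing its data dependencies against previously registered ones. The core idea is that two computations conflict when their argument sets overlap. This is a deliberately simplified sketch of that inference, using plain string labels for arrays; the names (`DependencyDemo`, `dependencies`) are invented and ignore the const-awareness refinements described in the real code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of data-dependency inference between computations:
// a child computation depends on a parent when they share at least one array.
public class DependencyDemo {

    public static Set<String> dependencies(Set<String> parentArgs, Set<String> childArgs) {
        Set<String> overlap = new HashSet<>(parentArgs);
        overlap.retainAll(childArgs); // shared arrays force the child to wait for the parent
        return overlap;
    }

    public static void main(String[] args) {
        Set<String> kernel1 = new HashSet<>(Arrays.asList("x", "y"));
        Set<String> kernel2 = new HashSet<>(Arrays.asList("y", "z"));
        Set<String> kernel3 = new HashSet<>(Collections.singletonList("w"));
        // kernel2 depends on kernel1 through "y"; kernel3 is independent,
        // so it can run concurrently on a different stream;
        if (!dependencies(kernel1, kernel2).equals(Collections.singleton("y"))) throw new AssertionError();
        if (!dependencies(kernel1, kernel3).isEmpty()) throw new AssertionError();
        System.out.println("ok");
    }
}
```

An empty overlap is what lets the stream manager assign the new computation to a free stream instead of synchronizing with its predecessors.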
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.executioncontext; + +import com.nvidia.grcuda.runtime.computation.ComputationArgumentWithValue; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.computation.dependency.DependencyPolicyEnum; +import com.oracle.truffle.api.interop.TruffleObject; + +import java.util.ArrayDeque; +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Queue; +import java.util.Set; +import java.util.stream.Collectors; + +/** + * Directed Acyclic Graph (DAG) that represents the execution flow of GrCUDA kernels and other + * computations. Each vertex is a computation, and an edge between vertices represents a dependency + * such that the end vertex must wait for the start vertex to finish before starting. + */ +public class ExecutionDAG implements TruffleObject { + + private final List<DAGVertex> vertices = new ArrayList<>(); + private final List<DAGEdge> edges = new ArrayList<>(); + private final KeepDependency keepDependency; + + /** + * Current frontier of the DAG, i.e. vertices with no children. + */ + private List<DAGVertex> frontier = new ArrayList<>(); + + public ExecutionDAG(DependencyPolicyEnum dependencyPolicy) { + switch (dependencyPolicy) { + case WITH_CONST: + this.keepDependency = new WithConstKeepDependency(); + break; + case NO_CONST: + this.keepDependency = new DefaultKeepDependency(); + break; + default: + this.keepDependency = new DefaultKeepDependency(); + } + } + + /** + * Add a new computation to the graph, and compute its dependencies.
+ * @param kernel a kernel computation, containing kernel configuration and input arguments + * @return the new vertex that has been appended to the DAG + */ + public DAGVertex append(GrCUDAComputationalElement kernel) { + // Add it to the list of vertices; + DAGVertex newVertex = new DAGVertex(kernel); + + ////////////////////////////// + // Compute dependencies with other vertices in the DAG frontier, and create edges; + ////////////////////////////// + + // For each vertex in the frontier, compute dependencies of the vertex; + + // Collect the vertices from which there are dependencies; + Map<DAGVertex, Collection<ComputationArgumentWithValue>> dependentVerticesMap = new HashMap<>(); + List<DAGVertex> dependentVertices = new ArrayList<>(); + for (DAGVertex frontierVertex : cleanFrontier()) { + Collection<ComputationArgumentWithValue> dependencies = computeDependencies(frontierVertex, newVertex); + if (dependencies.size() > 0) { + dependentVerticesMap.put(frontierVertex, dependencies); + dependentVertices.add(frontierVertex); + } + } + + // Filter dependencies that are unnecessary. For example, + // if a computation C depends on computations A and B, and B depends on A, + // then the direct dependency of C on A is redundant and can be dropped; + dependentVertices = dependentVertices.stream() + .filter(v -> keepDependency.keepDependency(v, dependentVerticesMap.keySet())) + .collect(Collectors.toList()); + + // Create new edges; + for (DAGVertex dependentVertex : dependentVertices) { + // Create a new edge between the two vertices (book-keeping is automatic); + new DAGEdge(dependentVertex, newVertex, dependentVerticesMap.get(dependentVertex)); + } + + // Remove from the frontier vertices that no longer belong to it; + frontier = cleanFrontier(); + // Add the new vertex to the frontier if it has no children; + if (newVertex.isFrontier()) { + frontier.add(newVertex); + } + return newVertex; + } + + private Collection<ComputationArgumentWithValue> computeDependencies(DAGVertex startVertex, DAGVertex endVertex) { + return startVertex.getComputation().computeDependencies(endVertex.getComputation()); + } + + public List<DAGVertex> getVertices() { + return vertices; + } + + public List<DAGEdge> getEdges() { + return edges; + } + + public int getNumVertices() { + return vertices.size(); + } + + public int getNumEdges() { + return edges.size(); + } + + public List<DAGVertex> getFrontier() { + return cleanFrontier(); + } + + /** + * Ensure that the internal representation of the frontier is up-to-date. + * Whether a vertex is part of the frontier can change dynamically (e.g.
if a vertex computation is over), + * and we have to ensure that the "cached" internal frontier is up-to-date every time it is accessed; + * @return the updated DAG frontier + */ + private List cleanFrontier() { + frontier = frontier.stream().filter(DAGVertex::isFrontier).collect(Collectors.toList()); + return frontier; + } + + @Override + public String toString() { + return "DAG(" + + "|V|=" + vertices.size() + + ", |E|=" + edges.size() + + "\nvertices=\n" + vertices.stream().map(Object::toString).collect(Collectors.joining(",\n")) + + ')'; + } + + /** + * By default, keep all dependencies; + */ + private static class DefaultKeepDependency implements KeepDependency { + @Override + public boolean keepDependency(DAGVertex vertex, Set dependentVertices) { + return true; + } + } + + private static class WithConstKeepDependency implements KeepDependency { + /** + * Determine if a vertex should really be a dependency, given a set of possible dependencies. + * The vertex is not going to be a dependency if any of its children is included in the dependency set; + * @param vertex a vertex we want to possibly filter, if it's an unnecessary dependency + * @param dependentVertices a list of possible dependencies + * @return if the vertex should be kept in the dependencies + */ + @Override + public boolean keepDependency(DAGVertex vertex, Set dependentVertices) { + // Perform a BFS starting from the children of "vertex"; + Queue queue = new ArrayDeque<>(vertex.getChildVertices()); + Set visitedVertices = new HashSet<>(); + + while (!queue.isEmpty()) { + DAGVertex currentVertex = queue.poll(); + // If the current vertex is in the set of candidate dependencies, we can filter it out; + if (dependentVertices.contains(currentVertex)) { + return false; + } else if (!visitedVertices.contains(currentVertex)) { + // Add children to the queue, but only if the current vertex hasn't been seen yet; + visitedVertices.add(currentVertex); + queue.addAll(currentVertex.getChildVertices()); + } + 
} + return true; + } + } + + /** + * Simple vertex class used to encapsulate {@link GrCUDAComputationalElement}. + */ + public class DAGVertex { + + private final GrCUDAComputationalElement computation; + private final int id; + + /** + * False only if the vertex has parent vertices. + */ + private boolean isStart = true; + /** + * List of edges that connect this vertex to its parents (they are the start of each edge). + */ + private final List<DAGEdge> parents = new ArrayList<>(); + /** + * List of edges that connect this vertex to its children (they are the end of each edge). + */ + private final List<DAGEdge> children = new ArrayList<>(); + + DAGVertex(GrCUDAComputationalElement computation) { + this.computation = computation; + this.id = getNumVertices(); + vertices.add(this); + } + + public GrCUDAComputationalElement getComputation() { + return computation; + } + + int getId() { + return id; + } + + public boolean isStart() { + return isStart; + } + + /** + * A vertex is considered part of the DAG frontier if it could lead to dependencies. + * In general, a vertex is not part of the frontier only if it has no arguments, it has already been executed, + * or all its arguments have already been superseded by the arguments of computations that depend on this one; + * @return if this vertex is part of the DAG frontier + */ + public boolean isFrontier() { + return computation.hasPossibleDependencies() && !computation.isComputationFinished(); + } + + /** + * Check if this vertex corresponds to a computation that can be immediately executed. + * This usually happens if the computation has no parents, or all the parents have already completed their execution; + * @return if the computation can be started immediately + */ + public boolean isExecutable() { + return !computation.isComputationStarted() && (parents.isEmpty() || allParentsHaveFinishedComputation()); + } + + private boolean allParentsHaveFinishedComputation() { + for (DAGEdge e : parents) { + if (!e.getStart().getComputation().isComputationFinished()) return false; + } + return true; + } + + public List<DAGEdge> getParents() { + return parents; + } + + public List<DAGEdge> getChildren() { + return children; + } + + public List<DAGVertex> getParentVertices() { return parents.stream().map(DAGEdge::getStart).collect(Collectors.toList()); } + + public List<DAGVertex> getChildVertices() { return children.stream().map(DAGEdge::getEnd).collect(Collectors.toList()); } + + public List<GrCUDAComputationalElement> getParentComputations() { + return parents.stream().map(e -> e.getStart().getComputation()).collect(Collectors.toList()); + } + + public List<GrCUDAComputationalElement> getChildComputations() { + return children.stream().map(e -> e.getEnd().getComputation()).collect(Collectors.toList()); + } + + public void setStart(boolean start) { + isStart = start; + } + + public void addParent(DAGEdge edge) { + parents.add(edge); + isStart = false; + } + + public void addChild(DAGEdge edge) { + children.add(edge); + } + + @Override + public String toString() { + return "V(" + + "id=" + id + + ", isStart=" + isStart + + ", isFrontier=" + this.isFrontier() + + ", parents=" + parents + + ", children=" + children + + ')'; + } + } + + /** + * Simple edge class used to connect {@link DAGVertex} with dependencies. + * An edge from a source to a destination means that the destination computation must wait + * for the source computation to finish before starting.
+ */ + public class DAGEdge { + + final private DAGVertex start; + final private DAGVertex end; + final private int id; + /** + * Set of objects that represents dependencies between the two vertices; + */ + private Collection<ComputationArgumentWithValue> dependencies; + + DAGEdge(DAGVertex start, DAGVertex end) { + this.start = start; + this.end = end; + this.id = getNumEdges(); + + // Update parents and children of the two vertices; + start.addChild(this); + end.addParent(this); + // Book-keeping of the edge; + edges.add(this); + } + + DAGEdge(DAGVertex start, DAGVertex end, Collection<ComputationArgumentWithValue> dependencies) { + this(start, end); + this.dependencies = dependencies; + } + + public DAGVertex getStart() { + return start; + } + + public DAGVertex getEnd() { + return end; + } + + public int getId() { + return id; + } + + public Collection<ComputationArgumentWithValue> getDependencies() { + return dependencies; + } + + public String toExportGraph(){ + return "\"CE" + start.getId() + start.getComputation().getArgumentsThatCanCreateDependencies() + "\" -> \"CE" + + end.getId() + end.getComputation().getArgumentsThatCanCreateDependencies() + "\""; + } + + @Override + public String toString() { + return "E(" + + "start=" + start.getId() + + ", end=" + end.getId() + + ')'; + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/ExecutionPolicyEnum.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/ExecutionPolicyEnum.java new file mode 100644 index 00000000..9fdcd02f --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/ExecutionPolicyEnum.java @@ -0,0 +1,47 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
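The `WithConstKeepDependency` filter above prunes transitive edges with a BFS: a candidate parent is dropped whenever one of its descendants is also a candidate, because the descendant's edge already enforces the ordering. The sketch below reproduces that traversal on a toy graph of string-labelled vertices; the class and method names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the transitive-dependency filter: keep an edge from "vertex" only
// if no descendant of "vertex" is itself a candidate dependency.
public class KeepDependencyDemo {

    public static boolean keep(String vertex, Set<String> candidates, Map<String, List<String>> children) {
        // BFS starting from the children of "vertex";
        Deque<String> queue = new ArrayDeque<>(children.getOrDefault(vertex, Collections.emptyList()));
        Set<String> visited = new HashSet<>();
        while (!queue.isEmpty()) {
            String v = queue.poll();
            // A reachable candidate subsumes this edge, so the edge is redundant;
            if (candidates.contains(v)) return false;
            if (visited.add(v)) queue.addAll(children.getOrDefault(v, Collections.emptyList()));
        }
        return true;
    }

    public static void main(String[] args) {
        // A -> B; a new computation C depends on both A and B,
        // so the direct edge from A is filtered and only B is kept;
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("A", Collections.singletonList("B"));
        Set<String> candidates = new HashSet<>(Arrays.asList("A", "B"));
        if (keep("A", candidates, graph)) throw new AssertionError();
        if (!keep("B", candidates, graph)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The `visited` set bounds the traversal even when a vertex is reachable through multiple paths, mirroring the `visitedVertices` check in the original code.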
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.executioncontext; + +public enum ExecutionPolicyEnum { + SYNC("sync"), + ASYNC("async"); + + private final String name; + + ExecutionPolicyEnum(String name) { + this.name = name; + } + + @Override + public final String toString() { + return name; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/GraphExport.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/GraphExport.java new file mode 100644 index 00000000..9afd5a91 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/GraphExport.java @@ -0,0 +1,147 @@ +/* + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +package com.nvidia.grcuda.runtime.executioncontext; + +import java.io.File; +import java.io.FileWriter; +import java.io.IOException; +import java.util.*; +import java.util.stream.Collectors; + +/** + * Class that receives the ExecutionDAG in its constructor and can + * export its graphical representation in .dot format. + * The graph is exported to the path specified by the user as option argument. + */ +public class GraphExport { + + private final List<ExecutionDAG.DAGVertex> vertices; + private final List<ExecutionDAG.DAGEdge> edges; + + public GraphExport(ExecutionDAG dag) { + this.vertices = dag.getVertices(); + this.edges = dag.getEdges(); + } + + /** + * If the path is valid, this method creates a .dot file with the graphical representation + * of the scheduling DAG created during the computations. + * @param path Destination path of the .dot file.
+ */ + public void graphGenerator(String path) { + StringBuilder output; + List<Integer> streams = new ArrayList<>(); + List<Integer> devices = new ArrayList<>(); + + for (ExecutionDAG.DAGVertex vertex : vertices) { + streams.add(vertex.getComputation().getStream().getStreamNumber()); + devices.add(vertex.getComputation().getStream().getStreamDeviceId()); + } + streams = streams.stream().distinct().collect(Collectors.toList()); + devices = devices.stream().distinct().collect(Collectors.toList()); + int offset = streams.size(); + + output = new StringBuilder("digraph G {\n" + + "\tfontname=\"Helvetica,Arial,sans-serif\"\n" + + "\tnode [fontname=\"Helvetica,Arial,sans-serif\"]\n" + + "\tedge [fontname=\"Helvetica,Arial,sans-serif\"]\n" + + "\n\n"); + + for (Integer device : devices) { + output.append("\tsubgraph cluster_").append(device).append(" {\n"); + + for (Integer stream : streams) { + // Remap negative stream numbers to distinct non-negative cluster ids using the offset; + if (stream < 0) output.append("\tsubgraph cluster_").append((stream * -1) + offset).append(" {\n").append("\t\tstyle=filled;\n").append("\t\tnode [style=filled];\n"); + else output.append("\tsubgraph cluster_").append(stream).append(" {\n").append("\t\tstyle=filled;\n").append("\t\tnode [style=filled];\n"); + + for (ExecutionDAG.DAGVertex vertex : vertices) { + if (vertex.getComputation().getStream().getStreamNumber() == stream && vertex.getComputation().getStream().getStreamDeviceId() == device) { + output.append("\"CE").append(vertex.getId()).append(vertex.getComputation().getArgumentsThatCanCreateDependencies()).append("\";\n"); + } + } + + output.append("\n"); + output.append("\t\tlabel = \"stream ").append(stream).append("\";\n" + + "\t\tcolor=orange;\n" + + "\t}\n"); + } + + output.append("\n"); + output.append("\t\tlabel = \"device ").append(device).append("\";\n" + + "\t\tcolor=green;\n" + + "\t}\n"); + + } + + output.append("\n"); + + for (ExecutionDAG.DAGVertex vertex : vertices) { + if
(vertex.isStart()) { + output.append("start -> ").append("\"CE").append(vertex.getId()).append(vertex.getComputation().getArgumentsThatCanCreateDependencies()).append("\";\n"); + } + } + + for (ExecutionDAG.DAGEdge dependency : edges) { + output.append(dependency.toExportGraph()).append(";\n"); + } + + output.append("\n"); + + for (ExecutionDAG.DAGVertex vertex : vertices) { + if (vertex.isFrontier()) { + output.append("\"CE").append(vertex.getId()).append(vertex.getComputation().getArgumentsThatCanCreateDependencies()).append("\" -> end;\n"); + } + } + + output.append("\tstart [shape=Mdiamond];\n" + + "\tend [shape=Msquare];\n" + + "}"); + + path = path + ".dot"; + File graph = new File(path); + // Use try-with-resources so the writer is closed even if the write fails; + try (FileWriter writer = new FileWriter(graph)) { + writer.write(output.toString()); + System.out.println("Execution DAG successfully exported at " + path); + } catch (IOException e) { + System.out.println("An error occurred while exporting the Execution DAG, please check the path"); + e.printStackTrace(); + } + } + +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/KeepDependency.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/KeepDependency.java new file mode 100644 index 00000000..50763ba0 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/KeepDependency.java @@ -0,0 +1,44 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.executioncontext; + +import java.util.Set; + +public interface KeepDependency { + /** + * Determine if a vertex should really be a dependency, given a set of possible dependencies. 
+ * The vertex is not going to be a dependency if any of its children is included in the dependency set; + * @param vertex a vertex we want to possibly filter, if it's an unnecessary dependency + * @param dependentVertices a set of possible dependencies + * @return if the vertex should be kept in the dependencies + */ + boolean keepDependency(ExecutionDAG.DAGVertex vertex, Set<ExecutionDAG.DAGVertex> dependentVertices); +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/SyncGrCUDAExecutionContext.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/SyncGrCUDAExecutionContext.java new file mode 100644 index 00000000..d771453e --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/executioncontext/SyncGrCUDAExecutionContext.java @@ -0,0 +1,107 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.executioncontext; + +import com.nvidia.grcuda.GrCUDAContext; +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.DeviceList; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.computation.prefetch.SyncArrayPrefetcher; +import com.oracle.truffle.api.TruffleLanguage; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +/** + * Execute all computations synchronously, without computing dependencies or using streams; + */ +public class SyncGrCUDAExecutionContext extends AbstractGrCUDAExecutionContext { + + public SyncGrCUDAExecutionContext(GrCUDAContext context, TruffleLanguage.Env env) { + this(new CUDARuntime(context, env), context.getOptions()); + } + + public SyncGrCUDAExecutionContext(CUDARuntime cudaRuntime, GrCUDAOptionMap options) { + super(cudaRuntime, options); + // Compute if we should use a prefetcher; + if (options.isInputPrefetch() && this.cudaRuntime.isArchitectureIsPascalOrNewer()) { + arrayPrefetcher = new SyncArrayPrefetcher(this.cudaRuntime); + } + } + + // 
TODO check correctness + /** + * Execute this computation immediately through the {@link SyncGrCUDAExecutionContext}. + * No scheduling DAG is built: the computation starts right away, + * and the host blocks until the device has been fully synchronized; + */ + @Override + public Object registerExecution(GrCUDAComputationalElement computation) throws UnsupportedTypeException { + + // Prefetching; + arrayPrefetcher.prefetchToGpu(computation); + + // Book-keeping; + computation.setComputationStarted(); + + // For all input arrays, update whether this computation is an array access done by the CPU; + computation.updateLocationOfArrays(); + + // Start the computation immediately; + Object result = computation.execute(); + + // Wait for the computation to end; + cudaRuntime.cudaDeviceSynchronize(); + + return result; + } + + @Override + public DeviceList getDeviceList() { + // Create a new device list object when requested; + return new DeviceList(cudaRuntime); + } + + @Override + public Device getDevice(int deviceId) { + // Create a new device object when requested; + return new Device(deviceId, cudaRuntime); + } + + /** + * All computations are synchronous, and atomic; + * @return false + */ + @Override + public boolean isAnyComputationActive() { + return false; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/CUDAStream.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/CUDAStream.java new file mode 100644 index 00000000..343e5c24 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/CUDAStream.java @@ -0,0 +1,94 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.stream; + +import java.util.Objects; + +import com.nvidia.grcuda.GPUPointer; +import com.oracle.truffle.api.interop.InteropLibrary; +import com.oracle.truffle.api.library.ExportLibrary; +import com.oracle.truffle.api.library.ExportMessage; + +@ExportLibrary(InteropLibrary.class) +public class CUDAStream extends GPUPointer { + + private final int streamNumber; + private int deviceId; + + public CUDAStream(long rawPointer, int streamNumber, int deviceId) { + super(rawPointer); + this.streamNumber = streamNumber; + this.deviceId = deviceId; + } + + public int getStreamDeviceId(){ + return this.deviceId; + } + + public int getStreamNumber() { + return streamNumber; + } + + public void setDeviceId(int deviceId) { + this.deviceId = deviceId; + } + + public boolean isDefaultStream() { + return false; + } + + @Override + public String toString() { + return "CUDAStream(streamNumber=" + this.streamNumber + "; address=0x" + Long.toHexString(this.getRawPointer()) + "; device=" + this.deviceId + ")"; + } + + @ExportMessage + public Object toDisplayString(boolean allowSideEffect) { + return this.toString(); + } + + @Override + public boolean equals(Object o) { + if (this == o) + return true; + if (o == null || getClass() != o.getClass()) + return false; + if (!super.equals(o)) + return false; + CUDAStream that = (CUDAStream) o; + return streamNumber == that.streamNumber && this.getRawPointer() == that.getRawPointer(); + } + + @Override + public int hashCode() { + return Objects.hash(super.hashCode(), streamNumber); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/CUSPARSESetStreamFunction.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/CUSPARSESetStreamFunction.java new file mode 100644 index 00000000..8f72c6d1 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/CUSPARSESetStreamFunction.java @@ -0,0 +1,33 @@ +package com.nvidia.grcuda.runtime.stream; 
+ +import static com.nvidia.grcuda.functions.Function.INTEROP; + +import com.nvidia.grcuda.functions.Function; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +/** + * Function that sets the CUDA stream used by the CUSPARSE library, + * so that users do not have to manage streams manually + */ + +public class CUSPARSESetStreamFunction extends LibrarySetStream { + + private final long handle; + + public CUSPARSESetStreamFunction(Function setStreamFunctionNFI, long handle) { + super(setStreamFunctionNFI); + this.handle = handle; + } + + @Override + public void setStream(CUDAStream stream) { + Object[] cusparseSetStreamArgs = {this.handle, stream.getRawPointer()}; + try { + INTEROP.execute(this.setStreamFunctionNFI, cusparseSetStreamArgs); + } catch (ArityException | UnsupportedTypeException | UnsupportedMessageException e) { + System.out.println("failed to set CUSPARSE stream"); + e.printStackTrace(); + } + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/DefaultStream.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/DefaultStream.java new file mode 100644 index 00000000..7202e088 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/DefaultStream.java @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution.
+ * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.stream; + +import com.nvidia.grcuda.runtime.CUDARuntime; + +public class DefaultStream extends CUDAStream { + + static final int DEFAULT_STREAM_NUMBER = -1; + + private static final DefaultStream defaultStream = new DefaultStream(); + + private DefaultStream() { + super(0, DEFAULT_STREAM_NUMBER, CUDARuntime.DEFAULT_DEVICE); + } + + public static DefaultStream get() { return defaultStream; } + + @Override + public boolean isDefaultStream() { + return true; } + + @Override + public String toString() { + return "DefaultCUDAStream(streamNumber=" + DEFAULT_STREAM_NUMBER + "; address=0x" + Long.toHexString(this.getRawPointer()) + ")"; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/GrCUDAStreamManager.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/GrCUDAStreamManager.java new file mode 100644 index 00000000..efdfdbe1 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/GrCUDAStreamManager.java @@ -0,0 +1,403 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream; + +import com.nvidia.grcuda.CUDAEvent; +import com.nvidia.grcuda.GrCUDALogger; +import com.nvidia.grcuda.GrCUDAOptionMap; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.DeviceList; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement; +import com.nvidia.grcuda.runtime.stream.policy.DeviceSelectionPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.GrCUDAStreamPolicy; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveNewStreamPolicyEnum; +import com.nvidia.grcuda.runtime.stream.policy.RetrieveParentStreamPolicyEnum; +import com.oracle.truffle.api.TruffleLogger; + +import java.util.ArrayDeque; +import java.util.Collection; +import java.util.Collections; +import java.util.HashMap; +import java.util.HashSet; +import java.util.Map; +import java.util.Optional; +import 
java.util.Queue; +import java.util.Set; +import java.util.stream.Collectors; + +public class GrCUDAStreamManager { + + private static final TruffleLogger STREAM_LOGGER = GrCUDALogger.getLogger(GrCUDALogger.STREAM_LOGGER); + + /** + * Reference to the CUDA runtime that manages the streams; + */ + protected final CUDARuntime runtime; + /** + * Option that enables logging of kernel execution times; + */ + protected final Boolean isTimeComputation; + /** + * Track the active computations each stream has, excluding the default stream; + */ + protected final Map<CUDAStream, Set<ExecutionDAG.DAGVertex>> activeComputationsPerStream = new HashMap<>(); + + /** + * Handle for all the policies that assign streams and devices to a new computation that can run on a CUDA stream; + */ + private final GrCUDAStreamPolicy streamPolicy; + + public GrCUDAStreamManager(CUDARuntime runtime, GrCUDAOptionMap options) { + this(runtime, + options.getRetrieveNewStreamPolicy(), + options.getRetrieveParentStreamPolicy(), + options.getDeviceSelectionPolicy(), + options.isTimeComputation(), + options.getBandwidthMatrix(), + options.getDataThreshold()); + } + + public GrCUDAStreamManager( + CUDARuntime runtime, + RetrieveNewStreamPolicyEnum retrieveNewStreamPolicyEnum, + RetrieveParentStreamPolicyEnum retrieveParentStreamPolicyEnum, + DeviceSelectionPolicyEnum deviceSelectionPolicyEnum, + boolean isTimeComputation, + String bandwidthMatrixPath, + double dataThreshold) { + this(runtime, isTimeComputation, new GrCUDAStreamPolicy(runtime, retrieveNewStreamPolicyEnum, retrieveParentStreamPolicyEnum, deviceSelectionPolicyEnum, bandwidthMatrixPath, dataThreshold)); + } + + public GrCUDAStreamManager( + CUDARuntime runtime, + boolean isTimeComputation, + GrCUDAStreamPolicy streamPolicy) { + this.runtime = runtime; + this.isTimeComputation = isTimeComputation; + this.streamPolicy = streamPolicy; + } + + /** + * Assign a {@link CUDAStream} to the input computation, based on its dependencies and on the available streams.
+ * This function has no effect if the stream was manually specified by the user; + * @param vertex an input computation for which we want to assign a stream + */ + public void assignStream(ExecutionDAG.DAGVertex vertex) { + // If the computation cannot use customized streams, return immediately; + if (vertex.getComputation().canUseStream()) { + // Else, obtain the stream (and the GPU device) for this computation from the stream policy manager; + CUDAStream stream = this.streamPolicy.retrieveStream(vertex); + // Set the stream; + vertex.getComputation().setStream(stream); + // Update the computation counter; + addActiveComputation(vertex); + // Associate all the arrays in the computation to the selected stream, + // to enable CPU accesses on managed memory arrays currently not being used by the GPU. + // This is required as on pre-Pascal GPUs all unified memory pages are locked by the GPU while code is running on the GPU, + // even if the GPU is not using some of the pages. Enabling memory-stream association allows the CPU to + // access memory not being currently used by a kernel; + vertex.getComputation().associateArraysToStream(); + } + } + + /** + * Associate a new {@link CUDAEvent} to this computation, if the computation is done on a {@link CUDAStream}. 
+ * The event is created and recorded on the stream where the computation is running, + * and can be used to time the execution of the computation; + * @param vertex an input computation for which we want to assign an event + */ + public void assignEventStart(ExecutionDAG.DAGVertex vertex) { + // Only create the event if kernel timing is enabled and the computation runs on a custom stream; + if (isTimeComputation && vertex.getComputation().canUseStream()) { + // cudaEventRecord is sensitive to the context of the device that is currently set, so we call cudaSetDevice; + runtime.cudaSetDevice(vertex.getComputation().getStream().getStreamDeviceId()); + CUDAEvent event = runtime.cudaEventCreate(); + runtime.cudaEventRecord(event, vertex.getComputation().getStream()); + vertex.getComputation().setEventStart(event); + } + } + + /** + * Associate a new {@link CUDAEvent} to this computation, if the computation is done on a {@link CUDAStream}. + * The event is created and recorded on the stream where the computation is running, + * and can be used for precise synchronization of children computations; + * @param vertex an input computation for which we want to assign an event + */ + public void assignEventStop(ExecutionDAG.DAGVertex vertex) { + // If the computation cannot use customized streams, return immediately; + if (vertex.getComputation().canUseStream()) { + CUDAEvent event = runtime.cudaEventCreate(); + runtime.cudaEventRecord(event, vertex.getComputation().getStream()); + vertex.getComputation().setEventStop(event); + } + } + + public void syncParentStreams(ExecutionDAG.DAGVertex vertex) { + // If the vertex can be executed on a CUDA stream, use CUDA events, + // otherwise use stream/device synchronization to block the host until synchronization is done; + if (vertex.getComputation().canUseStream()) { + syncStreamsUsingEvents(vertex); + } else { + if (this.isAnyComputationActive()) { + Optional<CUDAStream> additionalStream = vertex.getComputation().additionalStreamDependency(); + if
(additionalStream.isPresent()) { + CUDAStream stream = additionalStream.get(); + // If we require synchronization on the default stream, perform it in a specialized way; + if (stream.isDefaultStream()) { + STREAM_LOGGER.finest(() -> "--\tsync stream " + stream + " by " + vertex.getComputation()); + // Synchronize the device; + syncDevice(); + // All computations are now finished; + resetActiveComputationState(); + } else { + // Else add the computations related to the additional streams to the set and sync it, + // as long as the additional stream isn't the same as the one that we have to sync anyway; + syncParentStreamsImpl(vertex); + } + } else { + syncParentStreamsImpl(vertex); + } + } + } + } + + /** + * Obtain the set of CUDAStreams that have to be synchronized; + * @param computationsToSync a set of computations to sync + * @return the set of CUDAStreams that have to be synchronized + */ + protected Set<CUDAStream> getParentStreams(Collection<GrCUDAComputationalElement> computationsToSync) { + return computationsToSync.stream().map(GrCUDAComputationalElement::getStream).collect(Collectors.toSet()); + } + + /** + * If a computation can be scheduled on a stream, use {@link CUDAEvent} to synchronize parent computations + * without blocking the CPU host; + * @param vertex a computation whose parents' streams must be synchronized + */ + protected void syncStreamsUsingEvents(ExecutionDAG.DAGVertex vertex) { + for (GrCUDAComputationalElement parent : vertex.getParentComputations()) { + CUDAStream stream = parent.getStream(); + // Skip synchronization on the same stream where the new computation is executed, + // as operations scheduled on a stream are executed in order; + if (!vertex.getComputation().getStream().equals(stream)) { + // Synchronize on the events associated to the parents; + if (parent.getEventStop().isPresent()) { + CUDAEvent event = parent.getEventStop().get(); + runtime.cudaStreamWaitEvent(vertex.getComputation().getStream(), event); + + STREAM_LOGGER.finest(() -> "\t* wait event on
stream; stream to sync=" + stream.getStreamNumber() + + "; stream that waits=" + vertex.getComputation().getStream().getStreamNumber() + + "; event=" + event.getEventNumber()); + } else { + STREAM_LOGGER.warning(() -> "\t* missing event to sync child computation=" + vertex.getComputation() + + " and parent computation=" + parent); + } + } + } + } + + /** + * Synchronization is done in 2 parts: + * 1. Synchronize the streams where each parent computation is executed; + * 2. All computations currently active on the synchronized streams are finished, and so are their parents. + * In this phase, check if any parent is executed on a stream different from the ones we synchronized, and store these streams. + * Also set the streams that have no active computation as free; + * @param vertex the vertex whose parents should be synchronized + */ + protected void syncParentStreamsImpl(ExecutionDAG.DAGVertex vertex) { + + Set<CUDAStream> streamsToSync = getParentStreams(vertex.getParentComputations()); + // Synchronize streams; + streamsToSync.forEach(s -> { + STREAM_LOGGER.finest(() -> "--\tsync stream=" + s.getStreamNumber() + " by " + vertex.getComputation()); + syncStream(s); + }); + + // Book-keeping: all computations on the synchronized streams are guaranteed to be finished; + streamsToSync.forEach(s -> { + activeComputationsPerStream.get(s).forEach(v -> { + // Skip computations that have already finished; + if (!v.getComputation().isComputationFinished()) { + setComputationsFinished(v, streamsToSync); + } + }); + // Now the stream is free to be re-used; + activeComputationsPerStream.remove(s); + this.streamPolicy.updateNewStreamRetrieval(s); + }); + } + + protected void setComputationFinishedInner(GrCUDAComputationalElement computation) { + computation.setComputationFinished(); + if (computation.getEventStop().isPresent()) { + if (isTimeComputation && computation.getEventStart().isPresent()) { + // Switch to the device where the computation has been done, otherwise we cannot call the
cudaEventElapsedTime API; + runtime.cudaSetDevice(computation.getStream().getStreamDeviceId()); + float timeMilliseconds = runtime.cudaEventElapsedTime(computation.getEventStart().get(), computation.getEventStop().get()); + computation.setExecutionTime(timeMilliseconds); + // Destroy the start event associated to this computation: + runtime.cudaEventDestroy(computation.getEventStart().get()); + } + // Destroy the stop event associated to this computation: + runtime.cudaEventDestroy(computation.getEventStop().get()); + + } else { + STREAM_LOGGER.warning(() -> "missing event to destroy for computation=" + computation); + } + } + + private void setComputationsFinished(ExecutionDAG.DAGVertex vertex, Set<CUDAStream> streamsToSync) { + // Vertices to process; + final Queue<ExecutionDAG.DAGVertex> queue = new ArrayDeque<>(Collections.singletonList(vertex)); + // Vertices that have already been seen; + final Set<ExecutionDAG.DAGVertex> seen = new HashSet<>(Collections.singletonList(vertex)); + // Perform a reverse BFS to process all the parents of the starting computation; + while (!queue.isEmpty()) { + ExecutionDAG.DAGVertex currentVertex = queue.poll(); + setComputationFinishedInner(currentVertex.getComputation()); + // Book-keeping on the stream of the current computation; + CUDAStream stream = currentVertex.getComputation().getStream(); + + // Skip streams that have already been synchronized, as they will be freed later; + if (!streamsToSync.contains(stream)) { + // Stop considering this computation as active on its stream; + activeComputationsPerStream.get(stream).remove(currentVertex); + // If this stream doesn't have any computation associated to it, it's free to use; + if (activeComputationsPerStream.get(stream).isEmpty()) { + activeComputationsPerStream.remove(stream); + this.streamPolicy.updateNewStreamRetrieval(stream); + } + } + + // Process parents of the current computation; + for (ExecutionDAG.DAGVertex parent : currentVertex.getParentVertices()) { + if (!parent.getComputation().isComputationFinished() &&
!seen.contains(parent)) { + queue.add(parent); + seen.add(parent); + } + } + } + } + + /** + * Check if a given stream is free to use, and has no active computations on it; + * @param stream a CUDAStream + * @return if the stream has no active computations on it + */ + public boolean isStreamFree(CUDAStream stream) throws IllegalStateException { + if (activeComputationsPerStream.containsKey(stream)) { + if (activeComputationsPerStream.get(stream).isEmpty()) { + // The stream cannot be in the map without at least one active computation; + throw new IllegalStateException("stream " + stream.getStreamNumber() + " is tracked but has 0 active computations"); + } else { + return false; // Stream is active; + } + } else { + return true; // Stream is not active; + } + } + + public void syncStream(CUDAStream stream) { + runtime.cudaStreamSynchronize(stream); + } + + protected void syncDevice() { + runtime.cudaDeviceSynchronize(); + } + + /** + * Obtain the number of streams managed by this manager; + */ + public int getNumberOfStreams() { + return this.streamPolicy.getNumberOfStreams(); + } + + public int getNumActiveComputationsOnStream(CUDAStream stream) { + if (this.isStreamFree(stream)) { + return 0; + } else { + return activeComputationsPerStream.get(stream).size(); + } + } + + /** + * Check if any computation is currently marked as active, and is running on a stream. 
+ * If so, scheduling of new computations is likely to require synchronizations of some sort; + * @return if any computation is considered active on a stream + */ + public boolean isAnyComputationActive() { return !this.activeComputationsPerStream.isEmpty(); } + + protected void addActiveComputation(ExecutionDAG.DAGVertex vertex) { + CUDAStream stream = vertex.getComputation().getStream(); + // Start tracking the stream if it wasn't already tracked; + if (!activeComputationsPerStream.containsKey(stream)) { + activeComputationsPerStream.put(stream, new HashSet<>()); + } + // Associate the computation to the stream; + activeComputationsPerStream.get(stream).add(vertex); + } + + /** + * Reset the association between streams and computations. All computations are finished, and all streams are free; + */ + protected void resetActiveComputationState() { + activeComputationsPerStream.keySet().forEach(s -> + activeComputationsPerStream.get(s).forEach(v -> v.getComputation().setComputationFinished()) + ); + // Streams don't have any active computation; + activeComputationsPerStream.clear(); + // All streams are free; + this.streamPolicy.updateNewStreamRetrieval(); + } + + public DeviceList getDeviceList() { + return this.streamPolicy.getDevicesManager().getDeviceList(); + } + + public Device getDevice(int deviceId) { + return this.streamPolicy.getDevicesManager().getDevice(deviceId); + } + + public GrCUDAStreamPolicy getStreamPolicy() { + return streamPolicy; + } + + /** + * Cleanup and deallocate the streams managed by this manager; + */ + public void cleanup() { + this.activeComputationsPerStream.clear(); + this.streamPolicy.cleanup(); + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStream.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStream.java new file mode 100644 index 00000000..10a0e5a4 --- /dev/null +++ 
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStream.java @@ -0,0 +1,24 @@ +package com.nvidia.grcuda.runtime.stream; + +import com.nvidia.grcuda.functions.Function; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +/** + * Abstract class to manage async streams for supported libraries + */ + +public abstract class LibrarySetStream { + + protected final Function setStreamFunctionNFI; + + protected LibrarySetStream(Function setStreamFunctionNFI) { + this.setStreamFunctionNFI = setStreamFunctionNFI; + } + + /** + * Set stream for the execution of supported libraries' functions + * + * @param stream a CUDAStream + */ + public abstract void setStream(CUDAStream stream); +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStreamCUBLAS.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStreamCUBLAS.java new file mode 100644 index 00000000..0f84e49b --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStreamCUBLAS.java @@ -0,0 +1,33 @@ +package com.nvidia.grcuda.runtime.stream; + +import static com.nvidia.grcuda.functions.Function.INTEROP; + +import com.nvidia.grcuda.functions.Function; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +/** + * Class of functions to manage streams in the CUBLAS library + */ + +public class LibrarySetStreamCUBLAS extends LibrarySetStream { + + private final long handle; + + public LibrarySetStreamCUBLAS(Function setStreamFunctionNFI, long handle) { + super(setStreamFunctionNFI); + this.handle = handle; + } + + @Override + public void setStream(CUDAStream stream) { + Object[] cublasSetStreamArgs = {this.handle, stream.getRawPointer()}; + try { + INTEROP.execute(this.setStreamFunctionNFI, cublasSetStreamArgs); + } catch (ArityException | UnsupportedTypeException
| UnsupportedMessageException e) { + System.out.println("failed to set CUBLAS stream"); + e.printStackTrace(); + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStreamCUML.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStreamCUML.java new file mode 100644 index 00000000..0cb2bc98 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/LibrarySetStreamCUML.java @@ -0,0 +1,33 @@ +package com.nvidia.grcuda.runtime.stream; + +import com.nvidia.grcuda.functions.Function; +import com.oracle.truffle.api.interop.ArityException; +import com.oracle.truffle.api.interop.UnsupportedMessageException; +import com.oracle.truffle.api.interop.UnsupportedTypeException; + +import static com.nvidia.grcuda.functions.Function.INTEROP; + +/** + * Class of functions to manage streams in the CUML library + */ + +public class LibrarySetStreamCUML extends LibrarySetStream { + + private final long handle; + + public LibrarySetStreamCUML(Function setStreamFunctionNFI, long handle) { + super(setStreamFunctionNFI); + this.handle = handle; + } + + @Override + public void setStream(CUDAStream stream) { + Object[] cumlSetStreamArgs = {this.handle, stream.getRawPointer()}; + try { + INTEROP.execute(this.setStreamFunctionNFI, cumlSetStreamArgs); + } catch (ArityException | UnsupportedTypeException | UnsupportedMessageException e) { + System.out.println("failed to set CUML stream"); + e.printStackTrace(); + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/DeviceSelectionPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/DeviceSelectionPolicy.java new file mode 100644 index 00000000..eec42779 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/DeviceSelectionPolicy.java @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; + +/** + * When using multiple GPUs, selecting the stream where a computation is executed implies + * the selection of a GPU, as each stream is uniquely associated to a single GPU. + * This abstract class defines how a {@link GrCUDAStreamPolicy} + * selects a {@link com.nvidia.grcuda.runtime.Device} on which a {@link com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement} + * is executed. Device selection is performed by {@link RetrieveNewStreamPolicy} (when creating a new stream) + * and {@link RetrieveParentStreamPolicy} (when the parent's stream cannot be directly reused). + * For example, we can select the device that requires the least data transfer. + */ +public abstract class DeviceSelectionPolicy { + + protected final GrCUDADevicesManager devicesManager; + + public DeviceSelectionPolicy(GrCUDADevicesManager devicesManager) { + this.devicesManager = devicesManager; + } + + /** + * Select the device where the computation will be executed. 
+ * By default call {@link DeviceSelectionPolicy#retrieve(ExecutionDAG.DAGVertex, List)} on all devices, + * but it can be overridden to provide optimized behavior for the case when no restriction on specific devices is needed; + * @param vertex the computation for which we want to select the device + * @return the chosen device for the computation + */ + public Device retrieve(ExecutionDAG.DAGVertex vertex) { + return retrieveImpl(vertex, devicesManager.getUsableDevices()); + } + + /** + * Restrict the device selection to the specific set of devices; + * @param vertex the computation for which we want to select the device + * @param devices the list of devices where the computation could be executed + * @return the chosen device for the computation + */ + public Device retrieve(ExecutionDAG.DAGVertex vertex, List<Device> devices) { + if (devices == null) { + throw new NullPointerException("the list of devices where the computation can be executed is null"); + } else if (devices.isEmpty()) { + throw new GrCUDAException("the list of devices where the computation can be executed is empty"); + } else { + // Sort the devices by ID; + List<Device> sortedDevices = devices.stream().sorted(Comparator.comparingInt(Device::getDeviceId)).collect(Collectors.toList()); + return this.retrieveImpl(vertex, sortedDevices); + } + } + + /** + * Internal implementation of {@link DeviceSelectionPolicy#retrieve(ExecutionDAG.DAGVertex, List)}, + * assuming that the list of devices contains at least one device; + * @param vertex the computation for which we want to select the device + * @param devices the list of devices where the computation could be executed + * @return the chosen device for the computation + */ + abstract Device retrieveImpl(ExecutionDAG.DAGVertex vertex, List<Device> devices); +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/DeviceSelectionPolicyEnum.java
b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/DeviceSelectionPolicyEnum.java new file mode 100644 index 00000000..7fd90105 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/DeviceSelectionPolicyEnum.java @@ -0,0 +1,51 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream.policy; + +public enum DeviceSelectionPolicyEnum { + SINGLE_GPU("single-gpu"), + ROUND_ROBIN("round-robin"), + STREAM_AWARE("stream-aware"), + MIN_TRANSFER_SIZE("min-transfer-size"), + MINMIN_TRANSFER_TIME("minmin-transfer-time"), + MINMAX_TRANSFER_TIME("minmax-transfer-time"); + + private final String name; + + DeviceSelectionPolicyEnum(String name) { + this.name = name; + } + + @Override + public String toString() { + return name; + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/GrCUDADevicesManager.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/GrCUDADevicesManager.java new file mode 100644 index 00000000..d01e9fd3 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/GrCUDADevicesManager.java @@ -0,0 +1,121 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream.policy; + +import java.util.Collection; +import java.util.List; + +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.DeviceList; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +public class GrCUDADevicesManager { + + private final CUDARuntime runtime; + private final DeviceList deviceList; + + /** + * Initialize the GrCUDADevicesManager, creating a DeviceList that tracks all available GPUs. 
+ * @param runtime reference to the CUDA runtime + */ + public GrCUDADevicesManager(CUDARuntime runtime) { + this(runtime, new DeviceList(runtime)); + } + + /** + * Initialize the GrCUDADevicesManager, using an existing DeviceList that tracks all available GPUs; + * @param runtime reference to the CUDA runtime + * @param deviceList list of available devices + */ + public GrCUDADevicesManager(CUDARuntime runtime, DeviceList deviceList) { + this.runtime = runtime; + this.deviceList = deviceList; + } + + /** + * Find the device with the lowest number of busy streams on it and return it. + * A stream is busy if there's any computation assigned to it that has not been flagged as "finished". + * If multiple devices have the same number of busy streams, return the first. + * In this implementation, consider all usable devices; + * @return the device with the fewest busy streams + */ + public Device findDeviceWithFewerBusyStreams() { + return findDeviceWithFewerBusyStreams(getUsableDevices()); + } + + /** + * Find the device with the lowest number of busy streams on it and return it. + * A stream is busy if there's any computation assigned to it that has not been flagged as "finished".
+ * If multiple devices have the same number of busy streams, return the first; + * @param devices the list of devices to inspect + * @return the device with the fewest busy streams + */ + public Device findDeviceWithFewerBusyStreams(List<Device> devices) { + Device device = devices.get(0); + int min = device.getNumberOfBusyStreams(); + for (Device d : devices) { + int numBusyStreams = d.getNumberOfBusyStreams(); + if (numBusyStreams < min) { + min = numBusyStreams; + device = d; + } + } + return device; + } + + public Device getCurrentGPU() { + return this.getDevice(this.runtime.getCurrentGPU()); + } + + public int getNumberOfGPUsToUse() { + return this.runtime.getNumberOfGPUsToUse(); + } + + public DeviceList getDeviceList() { + return deviceList; + } + + public List<Device> getUsableDevices() { + return deviceList.getDevices().subList(0, this.getNumberOfGPUsToUse()); + } + + public Device getDevice(int deviceId) { + return deviceList.getDevice(deviceId); + } + + /** + * Cleanup and deallocate the streams managed by this manager; + */ + public void cleanup() { + this.deviceList.cleanup(); + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/GrCUDAStreamPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/GrCUDAStreamPolicy.java new file mode 100644 index 00000000..a5a3c7ab --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/GrCUDAStreamPolicy.java @@ -0,0 +1,417 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.GrCUDAException; +import com.nvidia.grcuda.GrCUDALogger; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.CUDARuntime; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.array.AbstractArray; +import com.nvidia.grcuda.runtime.stream.CUDAStream; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.oracle.truffle.api.TruffleLogger; + +import java.io.BufferedReader; +import java.io.FileReader; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Comparator; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.function.Function; +import java.util.stream.Collectors; + +public class GrCUDAStreamPolicy { + + /** + * Reference to the class that manages the GPU devices in this system; + */ + protected final GrCUDADevicesManager devicesManager; + /** + * Total number of CUDA streams created so far; + */ + private int totalNumberOfStreams = 0; + + private final RetrieveNewStreamPolicy retrieveNewStreamPolicy; + private final RetrieveParentStreamPolicy retrieveParentStreamPolicy; + protected final DeviceSelectionPolicy deviceSelectionPolicy; + + private static final TruffleLogger STREAM_LOGGER = GrCUDALogger.getLogger(GrCUDALogger.STREAM_LOGGER); + + public GrCUDAStreamPolicy(GrCUDADevicesManager devicesManager, + RetrieveNewStreamPolicyEnum retrieveNewStreamPolicyEnum, + RetrieveParentStreamPolicyEnum retrieveParentStreamPolicyEnum, + DeviceSelectionPolicyEnum deviceSelectionPolicyEnum, + String bandwidthMatrixPath, + double dataThreshold) { + this.devicesManager = devicesManager; + // When using a stream selection policy that supports multiple GPUs, + // we also need a policy to choose the device where the computation is executed; + switch (deviceSelectionPolicyEnum) { + case 
ROUND_ROBIN: + this.deviceSelectionPolicy = new RoundRobinDeviceSelectionPolicy(devicesManager); + break; + case STREAM_AWARE: + this.deviceSelectionPolicy = new StreamAwareDeviceSelectionPolicy(devicesManager); + break; + case MIN_TRANSFER_SIZE: + this.deviceSelectionPolicy = new MinimizeTransferSizeDeviceSelectionPolicy(devicesManager, dataThreshold); + break; + case MINMIN_TRANSFER_TIME: + this.deviceSelectionPolicy = new TransferTimeDeviceSelectionPolicy.MinMinTransferTimeDeviceSelectionPolicy(devicesManager, dataThreshold, bandwidthMatrixPath); + break; + case MINMAX_TRANSFER_TIME: + this.deviceSelectionPolicy = new TransferTimeDeviceSelectionPolicy.MinMaxTransferTimeDeviceSelectionPolicy(devicesManager, dataThreshold, bandwidthMatrixPath); + break; + default: + STREAM_LOGGER.finer("Disabled device selection policy, it is not necessary to use one as retrieveParentStreamPolicyEnum=" + retrieveParentStreamPolicyEnum); + this.deviceSelectionPolicy = new SingleDeviceSelectionPolicy(devicesManager); + } + // Get how streams are retrieved for computations without parents; + switch (retrieveNewStreamPolicyEnum) { + case REUSE: + this.retrieveNewStreamPolicy = new ReuseRetrieveStreamPolicy(this.deviceSelectionPolicy); + break; + case ALWAYS_NEW: + this.retrieveNewStreamPolicy = new AlwaysNewRetrieveStreamPolicy(this.deviceSelectionPolicy); + break; + default: + STREAM_LOGGER.severe("Cannot select a RetrieveNewStreamPolicy. 
The selected execution policy is not valid: " + retrieveNewStreamPolicyEnum); + throw new GrCUDAException("selected RetrieveNewStreamPolicy is not valid: " + retrieveNewStreamPolicyEnum); + } + // Get how streams are retrieved for computations with parents; + switch (retrieveParentStreamPolicyEnum) { + case DISJOINT: + this.retrieveParentStreamPolicy = new DisjointRetrieveParentStreamPolicy(this.retrieveNewStreamPolicy); + break; + case SAME_AS_PARENT: + this.retrieveParentStreamPolicy = new SameAsParentRetrieveParentStreamPolicy(); + break; + case MULTIGPU_EARLY_DISJOINT: + this.retrieveParentStreamPolicy = new MultiGPUEarlySelectionDisjointRetrieveParentStreamPolicy(this.retrieveNewStreamPolicy, this.deviceSelectionPolicy); + break; + case MULTIGPU_DISJOINT: + this.retrieveParentStreamPolicy = new MultiGPUDisjointRetrieveParentStreamPolicy(this.devicesManager, this.retrieveNewStreamPolicy, this.deviceSelectionPolicy); + break; + default: + STREAM_LOGGER.severe("Cannot select a RetrieveParentStreamPolicy. 
The selected execution policy is not valid: " + retrieveParentStreamPolicyEnum); + throw new GrCUDAException("selected RetrieveParentStreamPolicy is not valid: " + retrieveParentStreamPolicyEnum); + } + } + + public GrCUDAStreamPolicy(CUDARuntime runtime, + RetrieveNewStreamPolicyEnum retrieveNewStreamPolicyEnum, + RetrieveParentStreamPolicyEnum retrieveParentStreamPolicyEnum, + DeviceSelectionPolicyEnum deviceSelectionPolicyEnum, + String bandwidthMatrixPath, + double dataThreshold) { + this(new GrCUDADevicesManager(runtime), retrieveNewStreamPolicyEnum, retrieveParentStreamPolicyEnum, deviceSelectionPolicyEnum, bandwidthMatrixPath, dataThreshold); + } + + /** + * Create a new {@link CUDAStream} on the current device; + */ + public CUDAStream createStream() { + CUDAStream newStream = this.devicesManager.getCurrentGPU().createStream(); + this.totalNumberOfStreams++; + return newStream; + } + + /** + * Create a new {@link CUDAStream} on the specified device; + */ + public CUDAStream createStream(int gpu) { + CUDAStream newStream = this.devicesManager.getDevice(gpu).createStream(); + this.totalNumberOfStreams++; + return newStream; + } + + /** + * Obtain the stream on which to execute the input computation. + * If the computation doesn't have any parent, obtain a new stream or a free stream. + * If the computation has parents, we might reuse the stream of one of the parents. + * Each stream is uniquely associated to a single GPU. If using multiple GPUs, + * the choice of the stream also implies the choice of the GPU where the computation is executed; + * + * @param vertex the input computation for which we choose a stream; + * @return the stream on which we execute the computation + */ + public CUDAStream retrieveStream(ExecutionDAG.DAGVertex vertex) { + if (vertex.isStart()) { + // If the computation doesn't have parents, provide a new stream to it.
+ // When using multiple GPUs, also select the device; + return retrieveNewStream(vertex); + } else { + // Else, compute the streams used by the parent computations. + // When using multiple GPUs, we might want to select the device as well, + // if multiple suitable parent streams are available; + return retrieveParentStream(vertex); + } + } + + CUDAStream retrieveNewStream(ExecutionDAG.DAGVertex vertex) { + return this.retrieveNewStreamPolicy.retrieve(vertex); + } + + CUDAStream retrieveParentStream(ExecutionDAG.DAGVertex vertex) { + return this.retrieveParentStreamPolicy.retrieve(vertex); + } + + /** + * Update the status of a single stream within the NewStreamRetrieval policy; + * + * @param stream a stream to update; + */ + public void updateNewStreamRetrieval(CUDAStream stream) { + this.retrieveNewStreamPolicy.update(stream); + } + + /** + * Update the status of all streams within the NewStreamRetrieval policy, + * saying for example that all can be reused; + */ + public void updateNewStreamRetrieval() { + // All streams are free to be reused; + this.retrieveNewStreamPolicy.update(); + } + + void cleanupNewStreamRetrieval() { + this.retrieveNewStreamPolicy.cleanup(); + } + + /** + * Obtain the number of streams created so far; + */ + public int getNumberOfStreams() { + return this.totalNumberOfStreams; + } + + public GrCUDADevicesManager getDevicesManager() { + return devicesManager; + } + + /** + * Cleanup and deallocate the streams managed by this manager; + */ + public void cleanup() { + this.cleanupNewStreamRetrieval(); + this.devicesManager.cleanup(); + } + + /////////////////////////////////////////////////////////////// + // List of interfaces that implement RetrieveNewStreamPolicy // + /////////////////////////////////////////////////////////////// + + /** + * By default, create a new stream every time; + */ + private class AlwaysNewRetrieveStreamPolicy extends RetrieveNewStreamPolicy { + + AlwaysNewRetrieveStreamPolicy(DeviceSelectionPolicy 
deviceSelectionPolicy) { + super(deviceSelectionPolicy, GrCUDAStreamPolicy.this.devicesManager); + } + + @Override + CUDAStream retrieveStreamFromDevice(Device device) { + return createStream(device.getDeviceId()); + } + } + + /** + * Keep a set of free (currently not utilized) streams, and retrieve one of them instead of always creating new streams; + */ + private class ReuseRetrieveStreamPolicy extends RetrieveNewStreamPolicy { + + ReuseRetrieveStreamPolicy(DeviceSelectionPolicy deviceSelectionPolicy) { + super(deviceSelectionPolicy, GrCUDAStreamPolicy.this.devicesManager); + } + + @Override + CUDAStream retrieveStreamFromDevice(Device device) { + if (device.getNumberOfFreeStreams() == 0) { + // Create a new stream if none is available; + return createStream(device.getDeviceId()); + } else { + return device.getFreeStream(); + } + } + } + + ////////////////////////////////////////////////////////////////// + // List of interfaces that implement RetrieveParentStreamPolicy // + ////////////////////////////////////////////////////////////////// + + /** + * By default, use the same stream as the parent computation; + */ + private static class SameAsParentRetrieveParentStreamPolicy extends RetrieveParentStreamPolicy { + + @Override + public CUDAStream retrieve(ExecutionDAG.DAGVertex vertex) { + return vertex.getParentComputations().get(0).getStream(); + } + } + + /** + * If a vertex has more than one child, each child is independent (otherwise the dependency would be added + * from one child to the other, and not from the actual parent). + * As such, children can be executed on different streams. In practice, this situation happens when children + * depend on disjoint argument subsets of the parent kernel, e.g. K1(X,Y), K2(X), K3(Y).
+ * This policy re-uses the parent(s) stream(s) when possible, + * and computes other streams using the current {@link RetrieveNewStreamPolicy}; + */ + private static class DisjointRetrieveParentStreamPolicy extends RetrieveParentStreamPolicy { + protected final RetrieveNewStreamPolicy retrieveNewStreamPolicy; + + // Keep track of computations for which we have already re-used the stream; + protected final Set<ExecutionDAG.DAGVertex> reusedComputations = new HashSet<>(); + + public DisjointRetrieveParentStreamPolicy(RetrieveNewStreamPolicy retrieveNewStreamPolicy) { + this.retrieveNewStreamPolicy = retrieveNewStreamPolicy; + } + + @Override + public CUDAStream retrieve(ExecutionDAG.DAGVertex vertex) { + // Keep only parent vertices for which we haven't reused the stream yet; + List<ExecutionDAG.DAGVertex> availableParents = vertex.getParentVertices().stream() + .filter(v -> !reusedComputations.contains(v)).collect(Collectors.toList()); + // If there is at least one stream that can be re-used, take it. + // When using multiple devices, we just take a parent stream without considering the device of the parent; + // FIXME: we might take a random parent. Or use round-robin; + if (!availableParents.isEmpty()) { + // The computation cannot be considered again; + reusedComputations.add(availableParents.iterator().next()); + // Return the stream associated to this computation; + return availableParents.iterator().next().getComputation().getStream(); + } else { + // If no parent stream can be reused, provide a new stream to this computation + // (or possibly a free one, depending on the policy); + return retrieveNewStreamPolicy.retrieve(vertex); + } + } + } + + /** + * This policy extends DisjointRetrieveParentStreamPolicy with multi-GPU support for reused streams, + * not only for newly created streams. + * In this policy, we first select the ideal GPU for the input computation. + * Then, we find if any of the reusable streams is allocated on that device.
+ * If not, we create a new stream on the ideal GPU; + */ + private static class MultiGPUEarlySelectionDisjointRetrieveParentStreamPolicy extends DisjointRetrieveParentStreamPolicy { + + private final DeviceSelectionPolicy deviceSelectionPolicy; + + public MultiGPUEarlySelectionDisjointRetrieveParentStreamPolicy(RetrieveNewStreamPolicy retrieveNewStreamPolicy, DeviceSelectionPolicy deviceSelectionPolicy) { + super(retrieveNewStreamPolicy); + this.deviceSelectionPolicy = deviceSelectionPolicy; + } + + @Override + public CUDAStream retrieve(ExecutionDAG.DAGVertex vertex) { + // Keep only parent vertices for which we haven't reused the stream yet; + List<ExecutionDAG.DAGVertex> availableParents = vertex.getParentVertices().stream() + .filter(v -> !reusedComputations.contains(v)) + .collect(Collectors.toList()); + // First, select the ideal device to execute this computation; + Device selectedDevice = deviceSelectionPolicy.retrieve(vertex); + + // If at least one of the parents' streams is on the selected device, use that stream. + // Otherwise, create a new stream on the selected device; + if (!availableParents.isEmpty()) { + for (ExecutionDAG.DAGVertex v : availableParents) { + if (v.getComputation().getStream().getStreamDeviceId() == selectedDevice.getDeviceId()) { + // We found a parent whose stream is on the selected device; + reusedComputations.add(v); + return v.getComputation().getStream(); + } + } + } + // If no parent stream can be reused, provide a new stream to this computation + // (or possibly a free one, depending on the policy); + return retrieveNewStreamPolicy.retrieveStreamFromDevice(selectedDevice); + } + } + + /** + * This policy extends DisjointRetrieveParentStreamPolicy with multi-GPU support for reused streams, + * not only for newly created streams. + * In this policy, we select the streams that can be reused from the computation's parents. + * Then, we find which of the parent's devices is the best for the input computation.
+ * If no stream can be reused, we select a new device and create a stream on it; + */ + private static class MultiGPUDisjointRetrieveParentStreamPolicy extends DisjointRetrieveParentStreamPolicy { + + private final DeviceSelectionPolicy deviceSelectionPolicy; + private final GrCUDADevicesManager devicesManager; + + public MultiGPUDisjointRetrieveParentStreamPolicy(GrCUDADevicesManager devicesManager, RetrieveNewStreamPolicy retrieveNewStreamPolicy, DeviceSelectionPolicy deviceSelectionPolicy) { + super(retrieveNewStreamPolicy); + this.devicesManager = devicesManager; + this.deviceSelectionPolicy = deviceSelectionPolicy; + } + + @Override + public CUDAStream retrieve(ExecutionDAG.DAGVertex vertex) { + // Keep only parent vertices for which we haven't reused the stream yet; + List<ExecutionDAG.DAGVertex> availableParents = vertex.getParentVertices().stream() + .filter(v -> !reusedComputations.contains(v)) + .collect(Collectors.toList()); + // Map each parent's device to the respective parent; + Map<Device, ExecutionDAG.DAGVertex> deviceParentMap = availableParents + .stream().collect(Collectors.toMap( + v -> devicesManager.getDevice(v.getComputation().getStream().getStreamDeviceId()), + Function.identity(), + (x, y) -> x, // If two parents have the same device, use the first parent; + HashMap::new // Use hashmap; + )); + // If there's at least one free stream on the parents' devices, + // select the best device among the available parent devices.
+ // If no stream is available, create a new stream on the best possible device; + if (!availableParents.isEmpty()) { + // First, select the best device among the ones available; + Device selectedDevice = deviceSelectionPolicy.retrieve(vertex, new ArrayList<>(deviceParentMap.keySet())); + ExecutionDAG.DAGVertex selectedParent = deviceParentMap.get(selectedDevice); + // We found a parent whose stream is on the selected device; + reusedComputations.add(selectedParent); + return selectedParent.getComputation().getStream(); + } + // If no parent stream can be reused, provide a new stream to this computation + // (or possibly a free one, depending on the policy); + return retrieveNewStreamPolicy.retrieve(vertex); + } + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/MinimizeTransferSizeDeviceSelectionPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/MinimizeTransferSizeDeviceSelectionPolicy.java new file mode 100644 index 00000000..41fca930 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/MinimizeTransferSizeDeviceSelectionPolicy.java @@ -0,0 +1,170 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.AbstractArray; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; + +/** + * Given a computation, select the device that needs the least amount of data transfer. + * In other words, select the device that already has the maximum amount of bytes available, + * considering the size of the input arrays. + * For each input array, we look at the devices where the array is up to date, and give a "score" + * to that device that is equal to the array size. Then, we pick the device with maximum score. + * In case of ties, pick the device with lower ID. + * We do not consider the CPU as a meaningful location, because computations cannot be scheduled on the CPU. 
+ * If the computation does not have any data already present on any device, + * choose the device with round-robin selection (using {@link RoundRobinDeviceSelectionPolicy}); + */ +public class MinimizeTransferSizeDeviceSelectionPolicy extends DeviceSelectionPolicy { + + /** + * Some policies can use a threshold that specifies how much data (in percentage) must be available + * on a device so that the device can be considered for execution. + * A low threshold favors exploitation (using the same device for most computations), + * while a high threshold favors exploration (distribute the computations on different devices + * even if some device would have slightly lower synchronization time); + */ + protected final double dataThreshold; + + /** + * Fallback policy in case no GPU has any up-to-date data. We assume that for any GPU, transferring all the data + * from the CPU would have the same cost, so we use this policy as tie-breaker; + */ + RoundRobinDeviceSelectionPolicy roundRobin = new RoundRobinDeviceSelectionPolicy(devicesManager); + + public MinimizeTransferSizeDeviceSelectionPolicy(GrCUDADevicesManager devicesManager, double dataThreshold) { + super(devicesManager); + this.dataThreshold = dataThreshold; + } + + /** + * For each input array of the computation, compute if the array is available on other devices and does not need to be + * transferred. We track the total size, in bytes, that is already present on each device; + * @param vertex the input computation + * @param alreadyPresentDataSize the array where we store the size, in bytes, of data that is already present on each device. + * The array must be zero-initialized and have size equal to the number of usable GPUs + * @return if any data is present on any GPU.
If false, we can use a fallback policy instead + */ + boolean computeDataSizeOnDevices(ExecutionDAG.DAGVertex vertex, long[] alreadyPresentDataSize) { + List<AbstractArray> arguments = vertex.getComputation().getArrayArguments(); + boolean isAnyDataPresentOnGPUs = false; // True if there's at least a GPU with some data already available; + for (AbstractArray a : arguments) { + for (int location : a.getArrayUpToDateLocations()) { + if (location != CPUDevice.CPU_DEVICE_ID) { + alreadyPresentDataSize[location] += a.getSizeBytes(); + isAnyDataPresentOnGPUs = true; + } + } + } + return isAnyDataPresentOnGPUs; + } + + /** + * Find if any of the array inputs of the computation is present on the selected devices. + * Used to understand if no device has any data already present, and the device selection policy + * should fall back to a simpler device selection policy. + * @param vertex the computation for which we want to select the device + * @param devices the list of devices where the computation could be executed + * @return if any of the computation's array inputs is already present on the specified devices + */ + boolean isDataPresentOnGPUs(ExecutionDAG.DAGVertex vertex, List<Device> devices) { + for (Device d : devices) { + for (AbstractArray a : vertex.getComputation().getArrayArguments()) { + if (a.getArrayUpToDateLocations().contains(d.getDeviceId())) { + return true; + } + } + } + return false; + } + + /** + * Find if any device has at least TRANSFER_THRESHOLD % of the size of data that is required by the computation; + * @param alreadyPresentDataSize the size in bytes that is available on each device.
+ * The array must contain all devices in the system, not just a subset of them + * @param vertex the computation that we analyze + * @param devices the list of devices considered by the function + * @return if any device has at least TRANSFER_THRESHOLD % of required data + */ + boolean findIfAnyDeviceHasEnoughData(long[] alreadyPresentDataSize, ExecutionDAG.DAGVertex vertex, List<Device> devices) { + // Total size of the input arguments; + long totalSize = vertex.getComputation().getArrayArguments().stream().map(AbstractArray::getSizeBytes).reduce(0L, Long::sum); + // True if at least one device already has at least X% of the data required by the computation; + for (Device d : devices) { + if ((double) alreadyPresentDataSize[d.getDeviceId()] / totalSize > dataThreshold) { + return true; + } + } + return false; + } + + /** + * Find the device with the most bytes in it. It's just an argmax on "alreadyPresentDataSize", + * returning the device whose ID corresponds to the maximum's index + * @param devices the list of devices to consider for the argmax + * @param alreadyPresentDataSize the array where we store the size, in bytes, of data that is already present on each device.
+ * the array must be zero-initialized and have size equal to the number of usable GPUs + * @return the device with the most data + */ + private Device selectDeviceWithMostData(List<Device> devices, long[] alreadyPresentDataSize) { + // Find device with maximum available data; + Device deviceWithMaximumAvailableData = devices.get(0); + for (Device d : devices) { + if (alreadyPresentDataSize[d.getDeviceId()] > alreadyPresentDataSize[deviceWithMaximumAvailableData.getDeviceId()]) { + deviceWithMaximumAvailableData = d; + } + } + return deviceWithMaximumAvailableData; + } + + @Override + Device retrieveImpl(ExecutionDAG.DAGVertex vertex, List<Device> devices) { + // Array that tracks the size, in bytes, of data that is already present on each device; + long[] alreadyPresentDataSize = new long[devicesManager.getNumberOfGPUsToUse()]; + // Compute the amount of data on each device, and if any device has any data at all; + computeDataSizeOnDevices(vertex, alreadyPresentDataSize); + // If no device has at least X% of data available, it's not worth optimizing data locality (exploration preferred to exploitation); + if (findIfAnyDeviceHasEnoughData(alreadyPresentDataSize, vertex, devices)) { + // Find device with maximum available data; + return selectDeviceWithMostData(devices, alreadyPresentDataSize); + } else { + // No data is present on any GPU: select the device with round-robin; + return roundRobin.retrieve(vertex, devices); + } + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveNewStreamPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveNewStreamPolicy.java new file mode 100644 index 00000000..d29bba7f --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveNewStreamPolicy.java @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +/** + * This abstract class defines how a {@link GrCUDAStreamPolicy} + * will assign a {@link CUDAStream} to a {@link com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement} + * that has no dependency on active computations. + * For example, it could create a new stream or provide an existing stream that is currently not used; + */ +public abstract class RetrieveNewStreamPolicy { + + protected final DeviceSelectionPolicy deviceSelectionPolicy; + protected final GrCUDADevicesManager devicesManager; + + RetrieveNewStreamPolicy(DeviceSelectionPolicy deviceSelectionPolicy, GrCUDADevicesManager devicesManager) { + this.deviceSelectionPolicy = deviceSelectionPolicy; + this.devicesManager = devicesManager; + } + + /** + * Inner implementation of how, given a specified device, a stream is created or retrieved on this device. + * For example, create a new stream, or reuse an existing unused stream; + * @param device the device on which we retrieve a stream + * @return the stream where the input computation is executed + */ + abstract CUDAStream retrieveStreamFromDevice(Device device); + + /** + * Obtain a new stream, associated to a unique device, where the input computation is executed. + * First, select the device where the computation is executed. 
Then, create or retrieve a stream on this device; + * @param vertex a computation for which we need to find a stream for execution + * @return the stream where the computation is executed + */ + final CUDAStream retrieve(ExecutionDAG.DAGVertex vertex) { + Device device = this.deviceSelectionPolicy.retrieve(vertex); + return this.retrieveStreamFromDevice(device); + } + + /** + * Initialize the class with the provided stream on the currently active GPU, + * for example a new stream that can be provided by {@link RetrieveNewStreamPolicy#retrieve(ExecutionDAG.DAGVertex)} + * @param stream a stream that should be associated to the class + */ + void update(CUDAStream stream) { + // Free a stream with respect to its device; + devicesManager.getDevice(stream.getStreamDeviceId()).updateFreeStreams(stream); + } + + /** + * Initialize the class with the provided streams on the currently active GPU, + * for example new streams that can be provided by {@link RetrieveNewStreamPolicy#retrieve(ExecutionDAG.DAGVertex)} + */ + void update() { + // Free all streams on all devices; + devicesManager.getDeviceList().forEach(Device::updateFreeStreams); + } + + /** + * Cleanup the internal state of the class, if required; + */ + void cleanup() { } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveNewStreamPolicyEnum.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveNewStreamPolicyEnum.java new file mode 100644 index 00000000..61d0d692 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveNewStreamPolicyEnum.java @@ -0,0 +1,47 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.stream.policy; + +public enum RetrieveNewStreamPolicyEnum { + REUSE("reuse"), + ALWAYS_NEW("always-new"); + + private final String name; + + RetrieveNewStreamPolicyEnum(String name) { + this.name = name; + } + + @Override + public String toString() { + return name; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveParentStreamPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveParentStreamPolicy.java new file mode 100644 index 00000000..4f90a512 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveParentStreamPolicy.java @@ -0,0 +1,45 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.stream.CUDAStream; + +/** + * This abstract class defines how a {@link GrCUDAStreamPolicy} + * will assign a {@link CUDAStream} to a {@link com.nvidia.grcuda.runtime.computation.GrCUDAComputationalElement} + * that has at least one parent active computation, possibly on a different GPU. + * For example, it can use the same stream of the parent, or use a different stream + * to have multiple child computations run in parallel. + */ +public abstract class RetrieveParentStreamPolicy { + abstract CUDAStream retrieve(ExecutionDAG.DAGVertex vertex); +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveParentStreamPolicyEnum.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveParentStreamPolicyEnum.java new file mode 100644 index 00000000..8b13f949 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RetrieveParentStreamPolicyEnum.java @@ -0,0 +1,49 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ +package com.nvidia.grcuda.runtime.stream.policy; + +public enum RetrieveParentStreamPolicyEnum { + SAME_AS_PARENT("same-as-parent"), + DISJOINT("disjoint"), + MULTIGPU_EARLY_DISJOINT("multigpu-early-disjoint"), + MULTIGPU_DISJOINT("multigpu-disjoint"); + + private final String name; + + RetrieveParentStreamPolicyEnum(String name) { + this.name = name; + } + + @Override + public String toString() { + return name; + } +} diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RoundRobinDeviceSelectionPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RoundRobinDeviceSelectionPolicy.java new file mode 100644 index 00000000..a7600550 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/RoundRobinDeviceSelectionPolicy.java @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.Device; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; + +/** + * Basic policy for multi-GPU scheduling. Simply rotate among all the available devices. + * Not recommended for real use, but it can be useful for debugging + * or as a fallback for more complex policies.
+ */ +public class RoundRobinDeviceSelectionPolicy extends DeviceSelectionPolicy { + private int nextDevice = 0; + + public RoundRobinDeviceSelectionPolicy(GrCUDADevicesManager devicesManager) { + super(devicesManager); + } + + private void increaseNextDevice(int startDevice) { + this.nextDevice = (startDevice + 1) % this.devicesManager.getNumberOfGPUsToUse(); + } + + public int getInternalState() { + return nextDevice; + } + + @Override + Device retrieveImpl(ExecutionDAG.DAGVertex vertex, List<Device> devices) { + // Keep increasing the internal state, but make sure that the retrieved device is among the ones in the input list; + Device device = devices.get(nextDevice % devices.size()); + increaseNextDevice(nextDevice); + return device; + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/SingleDeviceSelectionPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/SingleDeviceSelectionPolicy.java new file mode 100644 index 00000000..20b06445 --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/SingleDeviceSelectionPolicy.java @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
+ * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.Device; + +import java.util.List; + +/** + * With some policies (e.g. the ones that don't support multiple GPUs), we never have to perform device selection. 
+ * Simply return the currently active device; + */ +public class SingleDeviceSelectionPolicy extends DeviceSelectionPolicy { + public SingleDeviceSelectionPolicy(GrCUDADevicesManager devicesManager) { + super(devicesManager); + } + + @Override + Device retrieveImpl(ExecutionDAG.DAGVertex vertex, List<Device> devices) { + // There's only one device available, anyway; + return devicesManager.getCurrentGPU(); + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/StreamAwareDeviceSelectionPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/StreamAwareDeviceSelectionPolicy.java new file mode 100644 index 00000000..391a7d8d --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/StreamAwareDeviceSelectionPolicy.java @@ -0,0 +1,53 @@ + +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.Device; + +import java.util.List; + +/** + * We assign computations to the device with fewer active streams. + * A stream is active if there's any computation assigned to it that has not been flagged as "finished". 
+ * The idea is to keep all devices equally busy, and avoid having devices that are used less than others; + */ +public class StreamAwareDeviceSelectionPolicy extends DeviceSelectionPolicy { + public StreamAwareDeviceSelectionPolicy(GrCUDADevicesManager devicesManager) { + super(devicesManager); + } + + @Override + Device retrieveImpl(ExecutionDAG.DAGVertex vertex, List<Device> devices) { + return devicesManager.findDeviceWithFewerBusyStreams(devices); + } +} \ No newline at end of file diff --git a/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/TransferTimeDeviceSelectionPolicy.java b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/TransferTimeDeviceSelectionPolicy.java new file mode 100644 index 00000000..c2c8eddd --- /dev/null +++ b/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/runtime/stream/policy/TransferTimeDeviceSelectionPolicy.java @@ -0,0 +1,309 @@ +/* + * Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission.
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +package com.nvidia.grcuda.runtime.stream.policy; + +import com.nvidia.grcuda.runtime.executioncontext.ExecutionDAG; +import com.nvidia.grcuda.runtime.Device; +import com.nvidia.grcuda.runtime.CPUDevice; +import com.nvidia.grcuda.runtime.array.AbstractArray; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.FileReader; +import java.util.List; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Set; + + +/** + * Given a computation, select the device that requires the least time to transfer data to it. + * Compared to {@link MinimizeTransferSizeDeviceSelectionPolicy} this policy does not simply select the + * device that requires the least data to be transferred to it, but also estimates the time that it takes + * to transfer the data, given a heterogeneous multi-GPU system. + * Given the complexity of CUDA's unified memory heuristics, we allow different heuristics to be used to estimate + * the actual transfer speed, e.g. taking the min or max possible value; + * The speed of each GPU-GPU and CPU-GPU link is assumed to be stored in a file located in "$GRCUDA_HOME/grcuda_data/datasets/connection_graph/connection_graph.csv".
+ * This file is generated as "cd $GRCUDA_HOME/projects/resources/cuda", "make connection_graph", "bin/connection_graph"; + */ +public abstract class TransferTimeDeviceSelectionPolicy extends DeviceSelectionPolicy { + + /** + * This function specifies how the transfer bandwidth for some array and device is computed. + * It can be max, min, mean, etc.; + */ + private final java.util.function.BiFunction<Double, Double, Double> reduction; + /** + * Starting value of the reduction. E.g. it can be 0 if using max or mean, +inf if using min, etc. + */ + private final double startValue; + /** + * Some policies can use a threshold that specifies how much data (in percentage) must be available + * on a device so that the device can be considered for execution. + * A low threshold favors exploitation (using the same device for most computations), + * while a high threshold favors exploration (distribute the computations on different devices + * even if some device would have slightly lower synchronization time); + */ + protected final double dataThreshold; + /** + * Fallback policy in case no GPU has any up-to-date data.
We assume that for any GPU, transferring all the data + * from the CPU would have the same cost, so we use this policy as a tie-breaker; + */ + RoundRobinDeviceSelectionPolicy roundRobin = new RoundRobinDeviceSelectionPolicy(devicesManager); + + private final double[][] linkBandwidth = new double[devicesManager.getNumberOfGPUsToUse() + 1][devicesManager.getNumberOfGPUsToUse() + 1]; + + public TransferTimeDeviceSelectionPolicy(GrCUDADevicesManager devicesManager, double dataThreshold, String bandwidthMatrixPath, java.util.function.BiFunction<Double, Double, Double> reduction, double startValue) { + super(devicesManager); + this.dataThreshold = dataThreshold; + this.reduction = reduction; + this.startValue = startValue; + + List<List<String>> records = new ArrayList<>(); + // Read each line in the CSV and store each line into a string array, splitting strings on ","; + try (BufferedReader br = new BufferedReader(new FileReader(bandwidthMatrixPath))) { + String line; + while ((line = br.readLine()) != null) { + String[] values = line.split(","); + records.add(Arrays.asList(values)); + } + } catch (IOException e) { + e.printStackTrace(); + } + // Read each line, and reconstruct the bandwidth matrix. + // Given N GPUs and 1 CPU, we have an (N + 1) x (N + 1) symmetric matrix.
+ // Each line is "start_id", "end_id", "bandwidth"; + for (int il = 1; il < records.size(); il++) { + int startDevice = Integer.parseInt(records.get(il).get(0)); + int endDevice = Integer.parseInt(records.get(il).get(1)); + // Skip invalid entries, and ignore GPUs with ID larger than the number of GPUs to use; + if (startDevice >= -1 && startDevice < devicesManager.getNumberOfGPUsToUse() + && endDevice >= -1 && endDevice < devicesManager.getNumberOfGPUsToUse()) { + // Approximate to the floor, to smooth random bandwidth fluctuations in data transfer; + double bandwidth = Math.floor(Double.parseDouble(records.get(il).get(2))); + if (startDevice != -1) { + // GPU-GPU interconnection; + this.linkBandwidth[startDevice][endDevice] = bandwidth; + } else { + // -1 identifies CPU-GPU interconnection, store it in the last spot; + this.linkBandwidth[devicesManager.getNumberOfGPUsToUse()][endDevice] = bandwidth; + this.linkBandwidth[endDevice][devicesManager.getNumberOfGPUsToUse()] = bandwidth; + } + } + } + // Interconnections are assumed to be symmetric. Enforce this behavior by averaging results. + // In other words, B[i][j] = B[j][i] = (B[i][j] + B[j][i]) / 2. + // Ignore the last column and row (it represents the CPU and is already symmetric by construction); + for (int i = 0; i < this.linkBandwidth.length - 1; i++) { + for (int j = i; j < this.linkBandwidth.length - 1; j++) { + double averageBandwidth = (this.linkBandwidth[i][j] + this.linkBandwidth[j][i]) / 2; + this.linkBandwidth[i][j] = averageBandwidth; + this.linkBandwidth[j][i] = averageBandwidth; + } + } + } + + /** + * For each input array of the computation, check whether the array is available on other devices and does not need to be + * transferred. We track the total size, in bytes, that is already present on each device;
+ * The array must be zero-initialized and have size equal to the number of usable GPUs + * @return whether any data is present on any GPU. If false, we can use a fallback policy instead + */ + boolean computeDataSizeOnDevices(ExecutionDAG.DAGVertex vertex, long[] alreadyPresentDataSize) { + List<AbstractArray> arguments = vertex.getComputation().getArrayArguments(); + boolean isAnyDataPresentOnGPUs = false; // True if there's at least a GPU with some data already available; + for (AbstractArray a : arguments) { + for (int location : a.getArrayUpToDateLocations()) { + if (location != CPUDevice.CPU_DEVICE_ID) { + alreadyPresentDataSize[location] += a.getSizeBytes(); + isAnyDataPresentOnGPUs = true; + } + } + } + return isAnyDataPresentOnGPUs; + } + + /** + * Estimate the bandwidth to transfer data to a "targetDevice" GPU, assuming + * that data can be transferred from devices with index in "upToDateLocations". + * @param targetDevice where we want to transfer data + * @param upToDateLocations from where we can transfer data + * @return an estimate of the transfer bandwidth + */ + public double computeBandwidth(int targetDevice, Set<Integer> upToDateLocations) { + // Combine the bandwidths towards device targetDevice using the configured reduction (e.g. max or min). + // Initialization: the neutral value of the reduction, e.g. bandwidth = 0 GB/sec when using max; + double bandwidth = startValue; + // Check that data is updated at least in some location. This is a precondition that must hold; + if (upToDateLocations == null || upToDateLocations.isEmpty()) { + throw new IllegalStateException("data is not updated in any location, when estimating bandwidth for device=" + targetDevice); + } + // If the array is already present on device targetDevice, the transfer bandwidth to it is infinity.
+ // We don't need to transfer it, so its transfer time will be 0; + if (upToDateLocations.contains(targetDevice)) { + bandwidth = Double.POSITIVE_INFINITY; + } else { + // Otherwise we consider the bandwidth to move the array to device targetDevice, + // from each possible location containing the array; + for (int location : upToDateLocations) { + // The CPU bandwidth is stored in the last column; + int fromDevice = location != CPUDevice.CPU_DEVICE_ID ? location : linkBandwidth.length - 1; + // The matrix is symmetric, loading [targetDevice][fromDevice] is faster than [fromDevice][targetDevice]; + bandwidth = reduction.apply(linkBandwidth[targetDevice][fromDevice], bandwidth); + } + } + return bandwidth; + } + + /** + * For each device, measure how long it takes to transfer the data that is required + * to run the computation in vertex + * @param vertex the computation that we want to run + * @param argumentTransferTime the array where we store the time, in seconds, to transfer the required data on each device + * The array must be zero-initialized and have size equal to the number of usable GPUs + * @return whether any data is present on any GPU. If false, we can use a fallback policy instead + */ + private boolean computeTransferTimes(ExecutionDAG.DAGVertex vertex, double[] argumentTransferTime) { + List<AbstractArray> arguments = vertex.getComputation().getArrayArguments(); + + // True if there's at least a GPU with some data already available; + boolean isAnyDataPresentOnGPUs = false; + + // For each input array, consider how much time it takes to transfer it from every other device; + for (AbstractArray a : arguments) { + Set<Integer> upToDateLocations = a.getArrayUpToDateLocations(); + if (upToDateLocations.size() > 1 || (upToDateLocations.size() == 1 && !upToDateLocations.contains(CPUDevice.CPU_DEVICE_ID))) { + isAnyDataPresentOnGPUs = true; + } + // Check all available GPUs and compute the tentative transfer time for each of them,
+ // to find the device where transfer time is minimum; + for (int i = 0; i < argumentTransferTime.length; i++) { + // Add estimated transfer time; + argumentTransferTime[i] += a.getSizeBytes() / computeBandwidth(i, upToDateLocations); + } + } + return isAnyDataPresentOnGPUs; + } + + /** + * Find the devices with at least TRANSFER_THRESHOLD % of the size of data that is required by the computation; + * @param alreadyPresentDataSize the size in bytes that is available on each device. + * The array must contain all devices in the system, not just a subset of them + * @param vertex the computation that we analyze + * @param devices the list of devices considered by the function + * @return the list of devices that have at least TRANSFER_THRESHOLD % of required data + */ + List<Device> findDevicesWithEnoughData(long[] alreadyPresentDataSize, ExecutionDAG.DAGVertex vertex, List<Device> devices) { + // List of devices with enough data; + List<Device> devicesWithEnoughData = new ArrayList<>(); + // Total size of the input arguments; + long totalSize = vertex.getComputation().getArrayArguments().stream().map(AbstractArray::getSizeBytes).reduce(0L, Long::sum); + // True if at least one device already has at least X% of the data required by the computation; + for (Device d : devices) { + if ((double) alreadyPresentDataSize[d.getDeviceId()] / totalSize > dataThreshold) { + devicesWithEnoughData.add(d); + } + } + return devicesWithEnoughData; + } + + /** + * Find the device with the lowest synchronization time.
It's just an argmin on "argumentTransferTime", + * returning the device whose ID corresponds to the index of the minimum + * @param devices the list of devices to consider for the argmin + * @param argumentTransferTime the array where we store the time, in seconds, to transfer the required data on each device + * The array must be zero-initialized and have size equal to the number of usable GPUs + * @return the device with the lowest transfer time + */ + private Device findDeviceWithLowestTransferTime(List<Device> devices, double[] argumentTransferTime) { + // The best device is the one with minimum transfer time; + Device deviceWithMinimumTransferTime = devices.get(0); + for (Device d : devices) { + if (argumentTransferTime[d.getDeviceId()] < argumentTransferTime[deviceWithMinimumTransferTime.getDeviceId()]) { + deviceWithMinimumTransferTime = d; + } + } + return deviceWithMinimumTransferTime; + } + + @Override + Device retrieveImpl(ExecutionDAG.DAGVertex vertex, List<Device> devices) { + // Estimated transfer time if the computation is scheduled on the i-th device; + double[] argumentTransferTime = new double[devicesManager.getNumberOfGPUsToUse()]; + // Compute the synchronization time on each device, and if any device has any data at all; + boolean isAnyDataPresentOnGPUs = computeTransferTimes(vertex, argumentTransferTime); + List<Device> devicesWithEnoughData = new ArrayList<>(); + if (isAnyDataPresentOnGPUs) { // Skip this step if no GPU has any data in it; + // Array that tracks the size, in bytes, of data that is already present on each device; + long[] alreadyPresentDataSize = new long[devicesManager.getNumberOfGPUsToUse()]; + // Compute the amount of data on each device, and if any device has any data at all; + computeDataSizeOnDevices(vertex, alreadyPresentDataSize); + // Compute the list of devices that have at least X% of data already available; + devicesWithEnoughData = findDevicesWithEnoughData(alreadyPresentDataSize, vertex, devices); + } + // If no device has at least X% of data available,
it's not worth optimizing data locality (exploration preferred to exploitation); + if (!devicesWithEnoughData.isEmpty()) { + // The best device is the one with minimum transfer time; + return findDeviceWithLowestTransferTime(devicesWithEnoughData, argumentTransferTime); + } else { + // No device has enough data available: select the device with round-robin; + return roundRobin.retrieve(vertex, devices); + } + } + + public double[][] getLinkBandwidth() { + return linkBandwidth; + } + +/** + * Assume that data are transferred from the device that gives the best possible bandwidth. + * In other words, find the minimum transfer time among all devices considering the minimum transfer time for each device; + */ +public static class MinMinTransferTimeDeviceSelectionPolicy extends TransferTimeDeviceSelectionPolicy { + public MinMinTransferTimeDeviceSelectionPolicy(GrCUDADevicesManager devicesManager, double dataThreshold, String bandwidthMatrixPath) { + // Use max, we pick the maximum bandwidth between two devices; + super(devicesManager, dataThreshold, bandwidthMatrixPath, Math::max, 0); + } +} + +/** + * Assume that data are transferred from the device that gives the worst possible bandwidth.
+ * In other words, find the minimum transfer time among all devices considering the maximum transfer time for each device; + */ +public static class MinMaxTransferTimeDeviceSelectionPolicy extends TransferTimeDeviceSelectionPolicy { + public MinMaxTransferTimeDeviceSelectionPolicy(GrCUDADevicesManager devicesManager, double dataThreshold, String bandwidthMatrixPath) { + // Use min, we pick the minimum bandwidth between two devices; + super(devicesManager, dataThreshold, bandwidthMatrixPath, Math::min, Double.POSITIVE_INFINITY); + } +} +} \ No newline at end of file diff --git a/projects/resources/.gitignore b/projects/resources/.gitignore new file mode 100644 index 00000000..888acda7 --- /dev/null +++ b/projects/resources/.gitignore @@ -0,0 +1 @@ +*.pbs diff --git a/projects/resources/connection_graph/Makefile b/projects/resources/connection_graph/Makefile new file mode 100644 index 00000000..7521051b --- /dev/null +++ b/projects/resources/connection_graph/Makefile @@ -0,0 +1,48 @@ +# Copyright (c) 2021, 2022, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+ +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# Use NVCC. +# Set the appropriate GPU architecture, check https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ +CXX=nvcc +FLAGS = -O3 -std=c++11 -arch=sm_70 + +# (Experimental) Use Clang; +# CXX=$(CLANG_DIR)/clang++ +# FLAGS = --cuda-gpu-arch=sm_70 -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -std=c++11 -O3 + +BIN_FOLDER=bin +FILES=connection_graph.cu +.PHONY: all clean + +all: + mkdir -p $(BIN_FOLDER); + $(CXX) $(FILES) $(FLAGS) -o $(BIN_FOLDER)/connection_graph; + +clean: + rm $(BIN_FOLDER)/*; diff --git a/projects/resources/connection_graph/connection_graph.cu b/projects/resources/connection_graph/connection_graph.cu new file mode 100644 index 00000000..95f10aa3 --- /dev/null +++ b/projects/resources/connection_graph/connection_graph.cu @@ -0,0 +1,110 @@ +#include <cstdio> +#include <cstdlib> +#include <string> +#include <fstream> +#include <iomanip> +#include <cuda_runtime.h> + +#define N 500000000 // 500 MB + +// Test bandwidth between two GPUs; +float dtod_copy(size_t size, int from, int to) { + int *pointers[2]; + + cudaSetDevice(from); + cudaDeviceEnablePeerAccess(to, 0); + cudaMalloc(&pointers[0], size); + + cudaSetDevice(to); + cudaDeviceEnablePeerAccess(from, 0); + cudaMalloc(&pointers[1], size); + +
cudaEvent_t begin, end; + cudaEventCreate(&begin); + cudaEventCreate(&end); + + cudaEventRecord(begin); + cudaMemcpyAsync(pointers[0], pointers[1], size, cudaMemcpyDeviceToDevice); + cudaEventRecord(end); + cudaEventSynchronize(end); + + float elapsed; + cudaEventElapsedTime(&elapsed, begin, end); + elapsed /= 1000; + + cudaSetDevice(from); + cudaFree(pointers[0]); + + cudaSetDevice(to); + cudaFree(pointers[1]); + + cudaEventDestroy(end); + cudaEventDestroy(begin); + cudaSetDevice(from); + + return elapsed; +} + +// Test bandwidth from the CPU to a device; +float htod_copy(size_t size, int device_id) { + int *pointer, *d_pointer; + + cudaSetDevice(device_id); + cudaMalloc(&d_pointer, size); + cudaMallocHost(&pointer, size); + + cudaEvent_t begin, end; + cudaEventCreate(&begin); + cudaEventCreate(&end); + + cudaEventRecord(begin); + cudaMemcpyAsync(d_pointer, pointer, size, cudaMemcpyHostToDevice); + cudaEventRecord(end); + cudaEventSynchronize(end); + + float elapsed; + cudaEventElapsedTime(&elapsed, begin, end); + elapsed /= 1000; + + cudaSetDevice(device_id); + cudaFree(d_pointer); + + cudaEventDestroy(end); + cudaEventDestroy(begin); + + return elapsed; +} + +int main() { + int gpu_number = 0; + + cudaGetDeviceCount(&gpu_number); + printf("number of devices = %d\n", gpu_number); + + double **bandwidths = (double**) malloc(gpu_number * sizeof(double*)); + for (int i = 0; i < gpu_number; i++) { + bandwidths[i] = (double*) malloc(gpu_number * sizeof(double)); + } + std::ofstream out_file; + // Note: getenv returns NULL if GRCUDA_HOME is not set, which would crash the std::string constructor; + std::string grcuda_home = getenv("GRCUDA_HOME"); + out_file.open(grcuda_home + "/projects/resources/connection_graph/datasets/connection_graph.csv"); + out_file << "From,To,Bandwidth\n"; + + for (int i = 0; i < gpu_number; i++) { + // Measure CPU-to-GPU transfer time, towards the i-th device; + double time_htod = htod_copy(N, i); + printf("\nfrom: Host, to: %d, time spent: %f, transfer rate: %f GB/s \n", i, time_htod, (float(N) / 1000000000.0) / time_htod); + out_file <<
std::setprecision(15) << "-1" << "," << i << "," << (double(N) /1000000000.0) / time_htod << "\n"; + + for (int j = 0 ; j < gpu_number; j++) { + // Measure GPU-to-GPU transfer time; + double time_dtod = dtod_copy(N, i, j); + bandwidths[i][j] = (double(N) / 1000000000.0) / time_dtod; + printf("from: %d, to: %d, time spent: %f, transfer rate: %f GB/s \n", i, j, time_dtod, bandwidths[i][j]); + out_file << i << "," << j << "," << bandwidths[i][j] << "\n"; + } + } + out_file.close(); + return 0; +} diff --git a/projects/resources/connection_graph/datasets/connection_graph_1_v100.csv b/projects/resources/connection_graph/datasets/connection_graph_1_v100.csv new file mode 100644 index 00000000..8f9510ae --- /dev/null +++ b/projects/resources/connection_graph/datasets/connection_graph_1_v100.csv @@ -0,0 +1,3 @@ +From,To,Bandwidth +-1,0,11.2082293278281 +0,0,386.298477432757 diff --git a/projects/resources/connection_graph/datasets/connection_graph_2_v100.csv b/projects/resources/connection_graph/datasets/connection_graph_2_v100.csv new file mode 100644 index 00000000..67baa52e --- /dev/null +++ b/projects/resources/connection_graph/datasets/connection_graph_2_v100.csv @@ -0,0 +1,7 @@ +From,To,Bandwidth +-1,0,11.2082293278281 +0,0,386.298477432757 +0,1,24.2378090397795 +-1,1,11.8987242032587 +1,0,24.2397635291757 +1,1,355.914443648453 diff --git a/projects/resources/connection_graph/datasets/connection_graph_4_v100.csv b/projects/resources/connection_graph/datasets/connection_graph_4_v100.csv new file mode 100644 index 00000000..ee23c977 --- /dev/null +++ b/projects/resources/connection_graph/datasets/connection_graph_4_v100.csv @@ -0,0 +1,21 @@ +From,To,Bandwidth +-1,0,11.2236144430009 +0,0,382.066735780204 +0,1,48.4246883140089 +0,2,48.4092355651239 +0,3,24.2319824613951 +-1,1,11.2110117353431 +1,0,48.4240855636543 +1,1,357.183679301002 +1,2,24.2413790114003 +1,3,24.239010586642 +-1,2,11.230164622341 +2,0,48.4451076994266 +2,1,24.2381088681883 +2,2,354.396789462445 
+2,3,48.4221332796503 +-1,3,11.2010372359425 +3,0,24.2321705835029 +3,1,24.2387479432744 +3,2,48.4320404552588 +3,3,352.34292982227 diff --git a/projects/resources/connection_graph/datasets/connection_graph_8_v100.csv b/projects/resources/connection_graph/datasets/connection_graph_8_v100.csv new file mode 100644 index 00000000..84199272 --- /dev/null +++ b/projects/resources/connection_graph/datasets/connection_graph_8_v100.csv @@ -0,0 +1,73 @@ +From,To,Bandwidth +-1,0,11.9556544780527 +0,0,383.86888404846 +0,1,24.2384853055984 +0,2,24.2336340982501 +0,3,48.4161332847793 +0,4,8.50896302754308 +0,5,8.5099870084499 +0,6,48.4018859817334 +0,7,8.49216625212189 +-1,1,11.9548313139581 +1,0,24.2428107975716 +1,1,357.20815754871 +1,2,48.4009870766239 +1,3,24.2308931605693 +1,4,8.49467301822259 +1,5,8.49454882752103 +1,6,8.49501926601968 +1,7,48.4279906039737 +-1,2,11.9551411851966 +2,0,24.2400261945531 +2,1,48.4081836173467 +2,2,327.568138199876 +2,3,48.4038366333844 +2,4,24.2346142488113 +2,5,8.49529671474821 +2,6,8.50965302763833 +2,7,8.49319586804279 +-1,3,11.9555895153174 +3,0,48.3885454709329 +3,1,24.2358898759727 +3,2,48.4383459407071 +3,3,328.504745610336 +3,4,8.49362530669834 +3,5,24.2401028063604 +3,6,8.49529187537091 +3,7,8.51084932891115 +-1,4,11.9539720643538 +4,0,8.6911203699105 +4,1,8.6964751789896 +4,2,24.2286754892887 +4,3,8.69462680199385 +4,4,323.981911789954 +4,5,48.4153822984349 +4,6,24.2334853323273 +4,7,48.4101347766449 +-1,5,11.9552434142836 +5,0,8.69790326565072 +5,1,8.67850140654624 +5,2,8.67702246508929 +5,3,24.2314181125966 +5,4,48.4026365494416 +5,5,324.553934889375 +5,6,48.4225874882748 +5,7,24.2296136852879 +-1,6,11.9552604526348 +6,0,48.4221332796503 +6,1,8.68096050484178 +6,2,8.69531344343351 +6,3,8.69406528989518 +6,4,24.235664498114 +6,5,48.4475077690316 +6,6,325.371725519625 +6,7,24.2306678756329 +-1,7,11.9556086845757 +7,0,8.67635273106109 +7,1,48.1619866240754 +7,2,8.67483471355078 +7,3,8.69406022141015 +7,4,48.417635327363 
+7,5,24.2311928178987 +7,6,24.2271732148561 +7,7,328.22181942038 \ No newline at end of file diff --git a/projects/resources/connection_graph/datasets/connection_graph_test.csv b/projects/resources/connection_graph/datasets/connection_graph_test.csv new file mode 100644 index 00000000..1a6bb0bf --- /dev/null +++ b/projects/resources/connection_graph/datasets/connection_graph_test.csv @@ -0,0 +1,7 @@ +From,To,Bandwidth +-1,0,10 +-1,1,20 +0,0,30 +0,1,40 +1,0,50 +1,1,60 diff --git a/projects/resources/connection_graph/run.sh b/projects/resources/connection_graph/run.sh new file mode 100755 index 00000000..c101cc3b --- /dev/null +++ b/projects/resources/connection_graph/run.sh @@ -0,0 +1,4 @@ +#!/bin/sh +make +mkdir -p datasets; +bin/connection_graph diff --git a/projects/resources/cuda/Makefile b/projects/resources/cuda/Makefile new file mode 100644 index 00000000..3b621124 --- /dev/null +++ b/projects/resources/cuda/Makefile @@ -0,0 +1,52 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+ +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# Use NVCC. +# Set the appropriate GPU architecture, check https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ +CXX=nvcc +#FLAGS = -G -g -std=c++11 -arch=sm_75 +FLAGS = -O3 -std=c++11 -arch=sm_70 + +# (Experimental) Use Clang; +# CXX=$(CLANG_DIR)/clang++ +# FLAGS = --cuda-gpu-arch=sm_70 -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread -std=c++11 -O3 + +BIN_FOLDER=bin +FILES=main.cu mmio.cpp benchmark.cu +SINGLE_GPU_FILES=single_gpu/b1.cu single_gpu/b5.cu single_gpu/b6.cu single_gpu/b7.cu single_gpu/b8.cu single_gpu/b10.cu +MULTI_GPU_FILES=multi_gpu/b1.cu multi_gpu/b5.cu multi_gpu/b6.cu multi_gpu/b9.cu multi_gpu/b11.cu multi_gpu/b12.cu multi_gpu/b13.cu +B12_FILES=multi_gpu/b12.cu +.PHONY: all graph clean + +all: + mkdir -p $(BIN_FOLDER); + $(CXX) $(FILES) $(SINGLE_GPU_FILES) $(MULTI_GPU_FILES) $(FLAGS) -o $(BIN_FOLDER)/b; + +clean: + rm $(BIN_FOLDER)/*; diff --git a/projects/resources/cuda/benchmark.cu b/projects/resources/cuda/benchmark.cu new file mode 100644 index 00000000..ff23a1c5 --- /dev/null +++ b/projects/resources/cuda/benchmark.cu @@ -0,0 +1,159 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#include "cuda_profiler_api.h" +#include "benchmark.cuh" + +namespace chrono = std::chrono; +using clock_type = chrono::high_resolution_clock; + +#define GPU_ORDER_8 {0, 4, 1, 5, 2, 6, 3, 7} +#define GPU_ORDER_4 {0, 3, 1, 2} + +int Benchmark::select_gpu(int i, int max_devices) { + if (max_devices > 4) { + int gpu_order[] = GPU_ORDER_8; + return gpu_order[i % 8] % max_devices; + } else if (max_devices > 2) { + int gpu_order[] = GPU_ORDER_4; + return gpu_order[i % 4] % max_devices; + } else { + return i % max_devices; + } +} + +int Benchmark::add_node(void **paramarray, cudaKernelNodeParams &param, void *func, dim3 gridsize, dim3 threads, cudaGraph_t &g, cudaGraphNode_t *n, std::vector<cudaGraphNode_t> &dependencies, int shared_memory) { + param.func = func; + param.blockDim = threads; + param.gridDim = gridsize; + param.kernelParams = paramarray; + param.sharedMemBytes = shared_memory; + param.extra = NULL; + return cudaGraphAddKernelNode(n, g, dependencies.data(), dependencies.size(), &param); +} + +void Benchmark::run() { + auto start_tot = clock_type::now(); + auto start_tmp = clock_type::now(); + auto end_tmp = clock_type::now(); + + // Allocation; + start_tmp = clock_type::now(); + alloc(); + end_tmp = clock_type::now(); + if (debug && err) std::cout << "error=" << err << std::endl; + if (debug) std::cout << "allocation time=" << chrono::duration_cast<chrono::microseconds>(end_tmp - start_tmp).count() / 1000 << " ms" << std::endl; + + // Initialization; + start_tmp = clock_type::now(); + init(); + end_tmp = clock_type::now(); + if (debug && err) std::cout << "error=" << err << std::endl; + if (debug) std::cout << "initialization time=" << chrono::duration_cast<chrono::microseconds>(end_tmp - start_tmp).count() / 1000 << " ms" << std::endl; + + // Print header; + if (!debug) std::cout << "num_iter,gpu_result,total_time_sec,overhead_sec,computation_sec" << std::endl; + + long tot_time = 0; + for (int i = 0; i < num_executions; i++) { + if (debug) std::cout << "\n-- iter=" << i << std::endl; + + // Reset; + start_tmp = 
clock_type::now(); + reset(); + end_tmp = clock_type::now(); + auto reset_time = chrono::duration_cast<chrono::microseconds>(end_tmp - start_tmp).count(); + if (debug) std::cout << " reset=" << (float)reset_time / 1000 << " ms" << std::endl; + + // Execution; + if (nvprof) cudaProfilerStart(); + start_tmp = clock_type::now(); + switch (policy) { + case Policy::Sync: + execute_sync(i); + break; + case Policy::CudaGraph: + execute_cudagraph(i); + break; + case Policy::CudaGraphAsync: + execute_cudagraph_manual(i); + break; + case Policy::CudaGraphSingle: + execute_cudagraph_single(i); + break; + default: + execute_async(i); + } + if (debug && err) std::cout << " error=" << err << std::endl; + end_tmp = clock_type::now(); + auto exec_time = chrono::duration_cast<chrono::microseconds>(end_tmp - start_tmp).count(); + if (nvprof) cudaProfilerStop(); + + if (i >= skip_iterations) + tot_time += exec_time; + + if (debug) { + std::cout << " result=" << print_result() << std::endl; + std::cout << " execution(" << i << ")=" << (float)exec_time / 1000 << " ms" << std::endl; +#if CPU_VALIDATION + cpu_validation(i); +#endif + } else { + std::cout << i << "," << print_result(true) << "," << (float)(reset_time + exec_time) / 1e6 << "," << (float)reset_time / 1e6 << "," << (float)exec_time / 1e6 << std::endl; + } + } + + auto end_time = chrono::duration_cast<chrono::microseconds>(clock_type::now() - start_tot).count(); + if (debug) std::cout << "\ntotal execution time=" << end_time / 1e6 << " sec" << std::endl; + if (debug) std::cout << "mean exec time=" << (float)tot_time / (1000 * (num_executions - skip_iterations)) << " ms" << std::endl; +} + +void Benchmark::execute_async(int iter) { + std::cout << "execution (async) not implemented for " << benchmark_name << std::endl; +} + +void Benchmark::execute_sync(int iter) { + std::cout << "execution (sync) not implemented for " << benchmark_name << std::endl; +} + +void Benchmark::execute_cudagraph(int iter) { + std::cout << "cudagraph (standard) not implemented for " << benchmark_name << std::endl; 
+} + +void Benchmark::execute_cudagraph_manual(int iter) { + std::cout << "cudagraph (manual) not implemented for " << benchmark_name << std::endl; +} + +void Benchmark::execute_cudagraph_single(int iter) { + std::cout << "cudagraph (single) not implemented for " << benchmark_name << std::endl; +} + +void Benchmark::cpu_validation(int iter) { + std::cout << "cpu validation not implemented for " << benchmark_name << std::endl; +} \ No newline at end of file diff --git a/projects/resources/cuda/benchmark.cuh b/projects/resources/cuda/benchmark.cuh new file mode 100644 index 00000000..9500cfa6 --- /dev/null +++ b/projects/resources/cuda/benchmark.cuh @@ -0,0 +1,105 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include <iostream> +#include <string> +#include <vector> + +#include "options.hpp" +#include "utils.hpp" + +struct Benchmark { + public: + virtual void alloc() = 0; + virtual void init() = 0; + virtual void reset() = 0; + virtual void execute_async(int iter); + virtual void execute_sync(int iter); + virtual void execute_cudagraph(int iter); + virtual void execute_cudagraph_manual(int iter); + virtual void execute_cudagraph_single(int iter); + virtual std::string print_result(bool short_form = false) = 0; + virtual void cpu_validation(int iter); + void run(); + int add_node(void **paramarray, cudaKernelNodeParams &param, void *func, dim3 gridsize, dim3 threads, cudaGraph_t &g, cudaGraphNode_t *n, std::vector<cudaGraphNode_t> &dependencies, int shared_memory = 0); + int select_gpu(int i, int max_devices); + + Benchmark(Options &options) : debug(options.debug), + num_executions(options.num_iter), + N(options.N), + block_size_1d(options.block_size_1d), + block_size_2d(options.block_size_2d), + num_blocks(options.num_blocks), + skip_iterations(options.skip_iterations), + do_prefetch(options.prefetch), + stream_attach(options.stream_attach), + policy(options.policy_choice), + benchmark_name(options.benchmark_choice), + max_devices(options.max_devices), + nvprof(options.nvprof), + num_partitions(options.num_partitions) { + cudaDeviceGetAttribute(&pascalGpu, cudaDeviceAttr::cudaDevAttrConcurrentManagedAccess, 0); + if (debug) { + std::cout << 
"------------------------------" << std::endl; + std::cout << "- running " << options.benchmark_map[benchmark_name] << std::endl; + std::cout << "- num executions=" << num_executions << std::endl; + std::cout << "- iterations to skip=" << skip_iterations << std::endl; + std::cout << "- N=" << N << std::endl; + std::cout << "- policy=" << options.policy_map[policy] << std::endl; + std::cout << "- block size 1d=" << block_size_1d << std::endl; + std::cout << "- block size 2d (where applicable)=" << block_size_2d << std::endl; + std::cout << "- num blocks=" << num_blocks << std::endl; + std::cout << "- max devices (where applicable)=" << max_devices << std::endl; + std::cout << "- use nvprof=" << nvprof << std::endl; + std::cout << "- num of partitions (where applicable)=" << num_partitions << std::endl; + std::cout << "------------------------------" << std::endl; + } + } + + virtual ~Benchmark(){}; + + protected: + int debug = DEBUG; + int num_executions = NUM_ITER; + int N = 0; + int block_size_1d = DEFAULT_BLOCK_SIZE_1D; + int block_size_2d = DEFAULT_BLOCK_SIZE_2D; + int num_blocks = DEFAULT_NUM_BLOCKS; + int skip_iterations = 0; + bool do_prefetch = DEFAULT_PREFETCH; + bool stream_attach = DEFAULT_STREAM_ATTACH; + int max_devices = DEFAULT_MAX_DEVICES; + int pascalGpu = 0; + bool nvprof = DEFAULT_NVPROF; + int num_partitions = DEFAULT_NUM_PARTITIONS; + Policy policy; + BenchmarkEnum benchmark_name; + int err = 0; +}; diff --git a/projects/resources/cuda/dvrapi_error_string.h b/projects/resources/cuda/dvrapi_error_string.h new file mode 100644 index 00000000..5f155dd0 --- /dev/null +++ b/projects/resources/cuda/dvrapi_error_string.h @@ -0,0 +1,463 @@ +/* Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NVIDIA CORPORATION nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#pragma once + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> + +// Error Code string definitions here +typedef struct { + char const *error_string; + int error_id; +} s_CudaErrorStr; + +/** + * Error codes + */ +static s_CudaErrorStr sCudaDrvErrorString[] = { + /** + * The API call returned with no errors. In the case of query calls, this + * can also mean that the operation being queried is complete (see + * ::cuEventQuery() and ::cuStreamQuery()). 
+ */ + {"CUDA_SUCCESS", 0}, + + /** + * This indicates that one or more of the parameters passed to the API call + * is not within an acceptable range of values. + */ + {"CUDA_ERROR_INVALID_VALUE", 1}, + + /** + * The API call failed because it was unable to allocate enough memory to + * perform the requested operation. + */ + {"CUDA_ERROR_OUT_OF_MEMORY", 2}, + + /** + * This indicates that the CUDA driver has not been initialized with + * ::cuInit() or that initialization has failed. + */ + {"CUDA_ERROR_NOT_INITIALIZED", 3}, + + /** + * This indicates that the CUDA driver is in the process of shutting down. + */ + {"CUDA_ERROR_DEINITIALIZED", 4}, + + /** + * This indicates profiling APIs are called while application is running + * in visual profiler mode. + */ + {"CUDA_ERROR_PROFILER_DISABLED", 5}, + /** + * This indicates profiling has not been initialized for this context. + * Call cuProfilerInitialize() to resolve this. + */ + {"CUDA_ERROR_PROFILER_NOT_INITIALIZED", 6}, + /** + * This indicates profiler has already been started and probably + * cuProfilerStart() is incorrectly called. + */ + {"CUDA_ERROR_PROFILER_ALREADY_STARTED", 7}, + /** + * This indicates profiler has already been stopped and probably + * cuProfilerStop() is incorrectly called. + */ + {"CUDA_ERROR_PROFILER_ALREADY_STOPPED", 8}, + /** + * This indicates that no CUDA-capable devices were detected by the + * installed CUDA driver. + */ + {"CUDA_ERROR_NO_DEVICE (no CUDA-capable devices were detected)", 100}, + + /** + * This indicates that the device ordinal supplied by the user does not + * correspond to a valid CUDA device. + */ + {"CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)", + 101}, + + /** + * This indicates that the device kernel image is invalid. This can also + * indicate an invalid CUDA module. + */ + {"CUDA_ERROR_INVALID_IMAGE", 200}, + + /** + * This most frequently indicates that there is no context bound to the + * current thread. 
This can also be returned if the context passed to an + * API call is not a valid handle (such as a context that has had + * ::cuCtxDestroy() invoked on it). This can also be returned if a user + * mixes different API versions (i.e. 3010 context with 3020 API calls). + * See ::cuCtxGetApiVersion() for more details. + */ + {"CUDA_ERROR_INVALID_CONTEXT", 201}, + + /** + * This indicated that the context being supplied as a parameter to the + * API call was already the active context. + * \deprecated + * This error return is deprecated as of CUDA 3.2. It is no longer an + * error to attempt to push the active context via ::cuCtxPushCurrent(). + */ + {"CUDA_ERROR_CONTEXT_ALREADY_CURRENT", 202}, + + /** + * This indicates that a map or register operation has failed. + */ + {"CUDA_ERROR_MAP_FAILED", 205}, + + /** + * This indicates that an unmap or unregister operation has failed. + */ + {"CUDA_ERROR_UNMAP_FAILED", 206}, + + /** + * This indicates that the specified array is currently mapped and thus + * cannot be destroyed. + */ + {"CUDA_ERROR_ARRAY_IS_MAPPED", 207}, + + /** + * This indicates that the resource is already mapped. + */ + {"CUDA_ERROR_ALREADY_MAPPED", 208}, + + /** + * This indicates that there is no kernel image available that is suitable + * for the device. This can occur when a user specifies code generation + * options for a particular CUDA source file that do not include the + * corresponding device configuration. + */ + {"CUDA_ERROR_NO_BINARY_FOR_GPU", 209}, + + /** + * This indicates that a resource has already been acquired. + */ + {"CUDA_ERROR_ALREADY_ACQUIRED", 210}, + + /** + * This indicates that a resource is not mapped. + */ + {"CUDA_ERROR_NOT_MAPPED", 211}, + + /** + * This indicates that a mapped resource is not available for access as an + * array. + */ + {"CUDA_ERROR_NOT_MAPPED_AS_ARRAY", 212}, + + /** + * This indicates that a mapped resource is not available for access as a + * pointer. 
+ */ + {"CUDA_ERROR_NOT_MAPPED_AS_POINTER", 213}, + + /** + * This indicates that an uncorrectable ECC error was detected during + * execution. + */ + {"CUDA_ERROR_ECC_UNCORRECTABLE", 214}, + + /** + * This indicates that the ::CUlimit passed to the API call is not + * supported by the active device. + */ + {"CUDA_ERROR_UNSUPPORTED_LIMIT", 215}, + + /** + * This indicates that the ::CUcontext passed to the API call can + * only be bound to a single CPU thread at a time but is already + * bound to a CPU thread. + */ + {"CUDA_ERROR_CONTEXT_ALREADY_IN_USE", 216}, + + /** + * This indicates that peer access is not supported across the given + * devices. + */ + {"CUDA_ERROR_PEER_ACCESS_UNSUPPORTED", 217}, + + /** + * This indicates that a PTX JIT compilation failed. + */ + {"CUDA_ERROR_INVALID_PTX", 218}, + + /** + * This indicates an error with OpenGL or DirectX context. + */ + {"CUDA_ERROR_INVALID_GRAPHICS_CONTEXT", 219}, + + /** + * This indicates that an uncorrectable NVLink error was detected during the + * execution. + */ + {"CUDA_ERROR_NVLINK_UNCORRECTABLE", 220}, + + /** + * This indicates that the PTX JIT compiler library was not found. + */ + {"CUDA_ERROR_JIT_COMPILER_NOT_FOUND", 221}, + + /** + * This indicates that the device kernel source is invalid. + */ + {"CUDA_ERROR_INVALID_SOURCE", 300}, + + /** + * This indicates that the file specified was not found. + */ + {"CUDA_ERROR_FILE_NOT_FOUND", 301}, + + /** + * This indicates that a link to a shared object failed to resolve. + */ + {"CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND", 302}, + + /** + * This indicates that initialization of a shared object failed. + */ + {"CUDA_ERROR_SHARED_OBJECT_INIT_FAILED", 303}, + + /** + * This indicates that an OS call failed. + */ + {"CUDA_ERROR_OPERATING_SYSTEM", 304}, + + /** + * This indicates that a resource handle passed to the API call was not + * valid. Resource handles are opaque types like ::CUstream and ::CUevent. 
+ */ + {"CUDA_ERROR_INVALID_HANDLE", 400}, + + /** + * This indicates that a named symbol was not found. Examples of symbols + * are global/constant variable names, texture names }, and surface names. + */ + {"CUDA_ERROR_NOT_FOUND", 500}, + + /** + * This indicates that asynchronous operations issued previously have not + * completed yet. This result is not actually an error, but must be + * indicated differently than ::CUDA_SUCCESS (which indicates completion). + * Calls that may return this value include ::cuEventQuery() and + * ::cuStreamQuery(). + */ + {"CUDA_ERROR_NOT_READY", 600}, + + /** + * While executing a kernel, the device encountered a + * load or store instruction on an invalid memory address. + * This leaves the process in an inconsistent state and any further CUDA + * work will return the same error. To continue using CUDA, the process must + * be terminated and relaunched. + */ + {"CUDA_ERROR_ILLEGAL_ADDRESS", 700}, + + /** + * This indicates that a launch did not occur because it did not have + * appropriate resources. This error usually indicates that the user has + * attempted to pass too many arguments to the device kernel, or the + * kernel launch specifies too many threads for the kernel's register + * count. Passing arguments of the wrong size (i.e. a 64-bit pointer + * when a 32-bit int is expected) is equivalent to passing too many + * arguments and can also result in this error. + */ + {"CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES", 701}, + + /** + * This indicates that the device kernel took too long to execute. This can + * only occur if timeouts are enabled - see the device attribute + * ::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The + * context cannot be used (and must be destroyed similar to + * ::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from + * this context are invalid and must be reconstructed if the program is to + * continue using CUDA. 
+ */ + {"CUDA_ERROR_LAUNCH_TIMEOUT", 702}, + + /** + * This error indicates a kernel launch that uses an incompatible texturing + * mode. + */ + {"CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING", 703}, + + /** + * This error indicates that a call to ::cuCtxEnablePeerAccess() is + * trying to re-enable peer access to a context which has already + * had peer access to it enabled. + */ + {"CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED", 704}, + + /** + * This error indicates that ::cuCtxDisablePeerAccess() is + * trying to disable peer access which has not been enabled yet + * via ::cuCtxEnablePeerAccess(). + */ + {"CUDA_ERROR_PEER_ACCESS_NOT_ENABLED", 705}, + + /** + * This error indicates that the primary context for the specified device + * has already been initialized. + */ + {"CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE", 708}, + + /** + * This error indicates that the context current to the calling thread + * has been destroyed using ::cuCtxDestroy }, or is a primary context which + * has not yet been initialized. + */ + {"CUDA_ERROR_CONTEXT_IS_DESTROYED", 709}, + + /** + * A device-side assert triggered during kernel execution. The context + * cannot be used anymore, and must be destroyed. All existing device + * memory allocations from this context are invalid and must be + * reconstructed if the program is to continue using CUDA. + */ + {"CUDA_ERROR_ASSERT", 710}, + + /** + * This error indicates that the hardware resources required to enable + * peer access have been exhausted for one or more of the devices + * passed to ::cuCtxEnablePeerAccess(). + */ + {"CUDA_ERROR_TOO_MANY_PEERS", 711}, + + /** + * This error indicates that the memory range passed to + * ::cuMemHostRegister() has already been registered. + */ + {"CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED", 712}, + + /** + * This error indicates that the pointer passed to ::cuMemHostUnregister() + * does not correspond to any currently registered memory region. 
+ */ + {"CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED", 713}, + + /** + * While executing a kernel, the device encountered a stack error. + * This can be due to stack corruption or exceeding the stack size limit. + * This leaves the process in an inconsistent state and any further CUDA + * work will return the same error. To continue using CUDA, the process must + * be terminated and relaunched. + */ + {"CUDA_ERROR_HARDWARE_STACK_ERROR", 714}, + + /** + * While executing a kernel, the device encountered an illegal instruction. + * This leaves the process in an inconsistent state and any further CUDA + * work will return the same error. To continue using CUDA, the process must + * be terminated and relaunched. + */ + {"CUDA_ERROR_ILLEGAL_INSTRUCTION", 715}, + + /** + * While executing a kernel, the device encountered a load or store + * instruction on a memory address which is not aligned. This leaves the + * process in an inconsistent state and any further CUDA work will return + * the same error. To continue using CUDA, the process must be terminated + * and relaunched. + */ + {"CUDA_ERROR_MISALIGNED_ADDRESS", 716}, + + /** + * While executing a kernel, the device encountered an instruction + * which can only operate on memory locations in certain address spaces + * (global, shared, or local), but was supplied a memory address not + * belonging to an allowed address space. + * This leaves the process in an inconsistent state and any further CUDA + * work will return the same error. To continue using CUDA, the process must + * be terminated and relaunched. + */ + {"CUDA_ERROR_INVALID_ADDRESS_SPACE", 717}, + + /** + * While executing a kernel, the device program counter wrapped its address + * space. This leaves the process in an inconsistent state and any further + * CUDA work will return the same error. To continue using CUDA, the process + * must be terminated and relaunched. 
+ */ + {"CUDA_ERROR_INVALID_PC", 718}, + + /** + * An exception occurred on the device while executing a kernel. Common + * causes include dereferencing an invalid device pointer and accessing + * out of bounds shared memory. The context cannot be used }, so it must + * be destroyed (and a new one should be created). All existing device + * memory allocations from this context are invalid and must be + * reconstructed if the program is to continue using CUDA. + */ + {"CUDA_ERROR_LAUNCH_FAILED", 719}, + + /** + * This error indicates that the number of blocks launched per grid for a + * kernel that was launched via either ::cuLaunchCooperativeKernel or + * ::cuLaunchCooperativeKernelMultiDevice exceeds the maximum number of + * blocks as allowed by ::cuOccupancyMaxActiveBlocksPerMultiprocessor or + * ::cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags times the number + * of multiprocessors as specified by the device attribute + * ::CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT. + */ + {"CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE", 720}, + + /** + * This error indicates that the attempted operation is not permitted. + */ + {"CUDA_ERROR_NOT_PERMITTED", 800}, + + /** + * This error indicates that the attempted operation is not supported + * on the current system or device. + */ + {"CUDA_ERROR_NOT_SUPPORTED", 801}, + + /** + * This indicates that an unknown internal error has occurred. 
+ */ + {"CUDA_ERROR_UNKNOWN", 999}, + {NULL, -1}}; + +// This is just a linear search through the array, since the error_id's are not +// always ocurring consecutively +inline const char *getCudaDrvErrorString(int error_id) { + int index = 0; + + while (sCudaDrvErrorString[index].error_id != error_id && + sCudaDrvErrorString[index].error_id != -1) { + index++; + } + + if (sCudaDrvErrorString[index].error_id == error_id) + return (const char *)sCudaDrvErrorString[index].error_string; + else + return (const char *)"CUDA_ERROR not found!"; +} diff --git a/projects/resources/cuda/main.cu b/projects/resources/cuda/main.cu new file mode 100644 index 00000000..8d56bb0d --- /dev/null +++ b/projects/resources/cuda/main.cu @@ -0,0 +1,109 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+#include <iostream>
+#include <string>
+#include <ctime>    // For time()
+#include <cstdlib>  // For srand()
+#include "options.hpp"
+#include "benchmark.cuh"
+
+#include "single_gpu/b1.cuh"
+#include "single_gpu/b5.cuh"
+#include "single_gpu/b6.cuh"
+#include "single_gpu/b7.cuh"
+#include "single_gpu/b8.cuh"
+#include "single_gpu/b10.cuh"
+#include "multi_gpu/b1.cuh"
+#include "multi_gpu/b5.cuh"
+#include "multi_gpu/b6.cuh"
+#include "multi_gpu/b9.cuh"
+#include "multi_gpu/b11.cuh"
+#include "multi_gpu/b12.cuh"
+#include "multi_gpu/b13.cuh"
+
+int main(int argc, char *argv[])
+{
+    // srand(time(0));
+    srand(12);
+
+    Options options = Options(argc, argv);
+    BenchmarkEnum benchmark_choice = options.benchmark_choice;
+    // Initialize to nullptr so the null-check after the switch is well-defined
+    // even when no benchmark matches;
+    Benchmark *b = nullptr;
+
+    switch (benchmark_choice)
+    {
+        case BenchmarkEnum::B1:
+            b = new Benchmark1(options);
+            break;
+        case BenchmarkEnum::B5:
+            b = new Benchmark5(options);
+            break;
+        case BenchmarkEnum::B6:
+            b = new Benchmark6(options);
+            break;
+        case BenchmarkEnum::B7:
+            b = new Benchmark7(options);
+            break;
+        case BenchmarkEnum::B8:
+            b = new Benchmark8(options);
+            break;
+        case BenchmarkEnum::B10:
+            b = new Benchmark10(options);
+            break;
+        case BenchmarkEnum::B1M:
+            b = new Benchmark1M(options);
+            break;
+        case BenchmarkEnum::B5M:
+            b = new Benchmark5M(options);
+            break;
+        case BenchmarkEnum::B6M:
+            b = new Benchmark6M(options);
+            break;
+        case BenchmarkEnum::B9M:
+            b = new Benchmark9M(options);
+            break;
+        case BenchmarkEnum::B11M:
+            b = new
Benchmark11M(options);
+            break;
+        case BenchmarkEnum::B12M:
+            b = new Benchmark12M(options);
+            break;
+        case BenchmarkEnum::B13M:
+            b = new Benchmark13M(options);
+            break;
+        default:
+            break;
+    }
+    if (b != nullptr) {
+        b->run();
+    } else {
+        std::cout << "ERROR: benchmark is null" << std::endl;
+    }
+}
\ No newline at end of file
diff --git a/projects/resources/cuda/mmio.cpp b/projects/resources/cuda/mmio.cpp
new file mode 100644
index 00000000..a04a670a
--- /dev/null
+++ b/projects/resources/cuda/mmio.cpp
@@ -0,0 +1,446 @@
+/*
+* Matrix Market I/O library for ANSI C
+*
+* See http://math.nist.gov/MatrixMarket for details.
+*
+*
+*/
+
+
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <ctype.h>
+
+#include "mmio.hpp"
+extern "C" {
+int mm_read_unsymmetric_sparse(const char *fname, int *M_, int *N_, int *nz_,
+                               float **val_, int **I_, int **J_) {
+    FILE *f;
+    MM_typecode matcode;
+    int M, N, nz;
+    int i;
+    float *val;
+    int *I, *J;
+
+    if ((f = fopen(fname, "r")) == NULL)
+        return -1;
+
+
+    if (mm_read_banner(f, &matcode) != 0) {
+        printf("mm_read_unsymmetric: Could not process Matrix Market banner ");
+        printf(" in file [%s]\n", fname);
+        return -1;
+    }
+
+
+    /* find out size of sparse matrix: M, N, nz .... */
+
+    if (mm_read_mtx_crd_size(f, &M, &N, &nz) != 0) {
+        fprintf(stderr, "read_unsymmetric_sparse(): could not parse matrix size.\n");
+        return -1;
+    }
+
+    *M_ = M;
+    *N_ = N;
+    *nz_ = nz;
+
+    /* reserve memory for matrices */
+
+    I = (int *) malloc(nz * sizeof(int));
+    J = (int *) malloc(nz * sizeof(int));
+    val = (float *) malloc(nz * sizeof(float));
+
+    *val_ = val;
+    *I_ = I;
+    *J_ = J;
+
+    /* NOTE: when reading in doubles, ANSI C requires the use of the "l" */
+    /* specifier as in "%lg", "%lf", "%le", otherwise errors will occur */
+    /* (ANSI C X3.159-1989, Sec. 4.9.6.2, p.
136 lines 13-15) */
+
+    for (i = 0; i < nz; i++) {
+        /* val[] holds floats here, so "%g" (not "%lg") is the correct specifier; */
+        if (fscanf(f, "%d %d %g\n", &I[i], &J[i], &val[i]) != 3) {
+            fclose(f);
+            return -1;
+        }
+        I[i]--;
+        J[i]--;
+    }
+    fclose(f);
+
+    return 0;
+}
+
+int mm_is_valid(MM_typecode matcode) {
+    if (!mm_is_matrix(matcode)) return 0;
+    if (mm_is_dense(matcode) && mm_is_pattern(matcode)) return 0;
+    if (mm_is_real(matcode) && mm_is_hermitian(matcode)) return 0;
+    if (mm_is_pattern(matcode) && (mm_is_hermitian(matcode) ||
+                                   mm_is_skew(matcode)))
+        return 0;
+    return 1;
+}
+
+int mm_read_banner(FILE *f, MM_typecode *matcode) {
+    char line[MM_MAX_LINE_LENGTH];
+    char banner[MM_MAX_TOKEN_LENGTH];
+    char mtx[MM_MAX_TOKEN_LENGTH];
+    char crd[MM_MAX_TOKEN_LENGTH];
+    char data_type[MM_MAX_TOKEN_LENGTH];
+    char storage_scheme[MM_MAX_TOKEN_LENGTH];
+    char *p;
+
+
+    mm_clear_typecode(matcode);
+
+    if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL)
+        return MM_PREMATURE_EOF;
+
+    if (sscanf(line, "%s %s %s %s %s", banner, mtx, crd, data_type,
+               storage_scheme) != 5)
+        return MM_PREMATURE_EOF;
+
+    for (p = mtx; *p != '\0'; *p = tolower(*p), p++);  /* convert to lower case */
+    for (p = crd; *p != '\0'; *p = tolower(*p), p++);
+    for (p = data_type; *p != '\0'; *p = tolower(*p), p++);
+    for (p = storage_scheme; *p != '\0'; *p = tolower(*p), p++);
+
+    /* check for banner */
+    if (strncmp(banner, MatrixMarketBanner, strlen(MatrixMarketBanner)) != 0)
+        return MM_NO_HEADER;
+
+    /* first field should be "mtx" */
+    if (strcmp(mtx, MM_MTX_STR) != 0)
+        return MM_UNSUPPORTED_TYPE;
+    mm_set_matrix(matcode);
+
+
+    /* second field describes whether this is a sparse matrix (in coordinate
+       storage) or a dense array */
+
+
+    if (strcmp(crd, MM_SPARSE_STR) == 0)
+        mm_set_sparse(matcode);
+    else if (strcmp(crd, MM_DENSE_STR) == 0)
+        mm_set_dense(matcode);
+    else
+        return MM_UNSUPPORTED_TYPE;
+
+
+    /* third field */
+
+    if (strcmp(data_type, MM_REAL_STR) == 0)
+        mm_set_real(matcode);
+    else if (strcmp(data_type, MM_COMPLEX_STR) == 0)
+        mm_set_complex(matcode);
+    else if (strcmp(data_type,
MM_PATTERN_STR) == 0)
+        mm_set_pattern(matcode);
+    else if (strcmp(data_type, MM_INT_STR) == 0)
+        mm_set_integer(matcode);
+    else
+        return MM_UNSUPPORTED_TYPE;
+
+
+    /* fourth field */
+
+    if (strcmp(storage_scheme, MM_GENERAL_STR) == 0)
+        mm_set_general(matcode);
+    else if (strcmp(storage_scheme, MM_SYMM_STR) == 0)
+        mm_set_symmetric(matcode);
+    else if (strcmp(storage_scheme, MM_HERM_STR) == 0)
+        mm_set_hermitian(matcode);
+    else if (strcmp(storage_scheme, MM_SKEW_STR) == 0)
+        mm_set_skew(matcode);
+    else
+        return MM_UNSUPPORTED_TYPE;
+
+
+    return 0;
+}
+
+int mm_write_mtx_crd_size(FILE *f, int M, int N, int nz) {
+    /* fprintf returns the number of characters written, or a negative value on error */
+    if (fprintf(f, "%d %d %d\n", M, N, nz) < 0)
+        return MM_COULD_NOT_WRITE_FILE;
+    else
+        return 0;
+}
+
+int mm_read_mtx_crd_size(FILE *f, int *M, int *N, int *nz) {
+    char line[MM_MAX_LINE_LENGTH];
+    int num_items_read;
+
+    /* set return null parameter values, in case we exit with errors */
+    *M = *N = *nz = 0;
+
+    /* now continue scanning until you reach the end-of-comments */
+    do {
+        if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL)
+            return MM_PREMATURE_EOF;
+    } while (line[0] == '%');
+
+    /* line[] is either blank or has M,N, nz */
+    if (sscanf(line, "%d %d %d", M, N, nz) == 3)
+        return 0;
+
+    else
+        do {
+            num_items_read = fscanf(f, "%d %d %d", M, N, nz);
+            if (num_items_read == EOF) return MM_PREMATURE_EOF;
+        } while (num_items_read != 3);
+
+    return 0;
+}
+
+
+int mm_read_mtx_array_size(FILE *f, int *M, int *N) {
+    char line[MM_MAX_LINE_LENGTH];
+    int num_items_read;
+    /* set return null parameter values, in case we exit with errors */
+    *M = *N = 0;
+
+    /* now continue scanning until you reach the end-of-comments */
+    do {
+        if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL)
+            return MM_PREMATURE_EOF;
+    } while (line[0] == '%');
+
+    /* line[] is either blank or has M,N, nz */
+    if (sscanf(line, "%d %d", M, N) == 2)
+        return 0;
+
+    else /* we have a blank line */
+        do {
+            num_items_read = fscanf(f, "%d %d", M, N);
+            if (num_items_read == EOF)
return MM_PREMATURE_EOF;
+        } while (num_items_read != 2);
+
+    return 0;
+}
+
+int mm_write_mtx_array_size(FILE *f, int M, int N) {
+    /* fprintf returns the number of characters written, or a negative value on error */
+    if (fprintf(f, "%d %d\n", M, N) < 0)
+        return MM_COULD_NOT_WRITE_FILE;
+    else
+        return 0;
+}
+
+
+
+/*-------------------------------------------------------------------------*/
+
+/******************************************************************/
+/* use when I[], J[], and val[] are already allocated             */
+/******************************************************************/
+
+int mm_read_mtx_crd_data(FILE *f, int M, int N, int nz, int I[], int J[],
+                         float val[], MM_typecode matcode) {
+    int i;
+    /* val[] holds floats here, so "%g" (not "%lg") is the correct specifier; */
+    if (mm_is_complex(matcode)) {
+        for (i = 0; i < nz; i++)
+            if (fscanf(f, "%d %d %g %g", &I[i], &J[i], &val[2 * i], &val[2 * i + 1])
+                != 4)
+                return MM_PREMATURE_EOF;
+    } else if (mm_is_real(matcode)) {
+        for (i = 0; i < nz; i++) {
+            if (fscanf(f, "%d %d %g\n", &I[i], &J[i], &val[i])
+                != 3)
+                return MM_PREMATURE_EOF;
+
+        }
+    } else if (mm_is_pattern(matcode)) {
+        for (i = 0; i < nz; i++)
+            if (fscanf(f, "%d %d", &I[i], &J[i])
+                != 2)
+                return MM_PREMATURE_EOF;
+    } else
+        return MM_UNSUPPORTED_TYPE;
+
+    return 0;
+
+}
+
+int mm_read_mtx_crd_entry(FILE *f, int *I, int *J,
+                          float *real, float *imag, MM_typecode matcode) {
+    if (mm_is_complex(matcode)) {
+        if (fscanf(f, "%d %d %g %g", I, J, real, imag)
+            != 4)
+            return MM_PREMATURE_EOF;
+    } else if (mm_is_real(matcode)) {
+        if (fscanf(f, "%d %d %g\n", I, J, real)
+            != 3)
+            return MM_PREMATURE_EOF;
+
+    } else if (mm_is_pattern(matcode)) {
+        if (fscanf(f, "%d %d", I, J) != 2) return MM_PREMATURE_EOF;
+    } else
+        return MM_UNSUPPORTED_TYPE;
+
+    return 0;
+
+}
+
+
+/************************************************************************
+    mm_read_mtx_crd()  fills M, N, nz, the array of values, and returns the
+                       type code, e.g.
'MCRS'
+
+                       if matrix is complex, values[] is of size 2*nz,
+                       (nz pairs of real/imaginary values)
+************************************************************************/
+
+int mm_read_mtx_crd(char *fname, int *M, int *N, int *nz, int **I, int **J,
+                    float **val, MM_typecode *matcode) {
+    int ret_code;
+    FILE *f;
+
+    if (strcmp(fname, "stdin") == 0) f = stdin;
+    else if ((f = fopen(fname, "r")) == NULL)
+        return MM_COULD_NOT_READ_FILE;
+
+
+    if ((ret_code = mm_read_banner(f, matcode)) != 0)
+        return ret_code;
+
+    if (!(mm_is_valid(*matcode) && mm_is_sparse(*matcode) &&
+          mm_is_matrix(*matcode)))
+        return MM_UNSUPPORTED_TYPE;
+
+    if ((ret_code = mm_read_mtx_crd_size(f, M, N, nz)) != 0)
+        return ret_code;
+
+
+    *I = (int *) malloc(*nz * sizeof(int));
+    *J = (int *) malloc(*nz * sizeof(int));
+    *val = NULL;
+
+    if (mm_is_complex(*matcode)) {
+        *val = (float *) malloc(*nz * 2 * sizeof(float));
+        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val,
+                                        *matcode);
+        if (ret_code != 0) return ret_code;
+    } else if (mm_is_real(*matcode)) {
+        *val = (float *) malloc(*nz * sizeof(float));
+        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val,
+                                        *matcode);
+        if (ret_code != 0) return ret_code;
+    } else if (mm_is_pattern(*matcode)) {
+        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val,
+                                        *matcode);
+        if (ret_code != 0) return ret_code;
+    }
+
+    if (f != stdin) fclose(f);
+    return 0;
+}
+
+int mm_write_banner(FILE *f, MM_typecode matcode) {
+    char *str = mm_typecode_to_str(matcode);
+    int ret_code;
+
+    /* fprintf returns the number of characters written, or a negative value on error */
+    ret_code = fprintf(f, "%s %s\n", MatrixMarketBanner, str);
+    free(str);
+    if (ret_code < 0)
+        return MM_COULD_NOT_WRITE_FILE;
+    else
+        return 0;
+}
+
+int mm_write_mtx_crd(char fname[], int M, int N, int nz, int I[], int J[],
+                     float val[], MM_typecode matcode) {
+    FILE *f;
+    int i;
+
+    if (strcmp(fname, "stdout") == 0)
+        f = stdout;
+    else if ((f = fopen(fname, "w")) == NULL)
+        return MM_COULD_NOT_WRITE_FILE;
+
+    /* print banner followed by typecode
*/ + fprintf(f, "%s ", MatrixMarketBanner); + fprintf(f, "%s\n", mm_typecode_to_str(matcode)); + + /* print matrix sizes and nonzeros */ + fprintf(f, "%d %d %d\n", M, N, nz); + + /* print values */ + if (mm_is_pattern(matcode)) + for (i = 0; i < nz; i++) + fprintf(f, "%d %d\n", I[i], J[i]); + else if (mm_is_real(matcode)) + for (i = 0; i < nz; i++) + fprintf(f, "%d %d %20.16g\n", I[i], J[i], val[i]); + else if (mm_is_complex(matcode)) + for (i = 0; i < nz; i++) + fprintf(f, "%d %d %20.16g %20.16g\n", I[i], J[i], val[2 * i], + val[2 * i + 1]); + else { + if (f != stdout) fclose(f); + return MM_UNSUPPORTED_TYPE; + } + + if (f != stdout) fclose(f); + + return 0; +} + + +/** +* Create a new copy of a string s. mm_strdup() is a common routine, but +* not part of ANSI C, so it is included here. Used by mm_typecode_to_str(). +* +*/ +char *mm_strdup(const char *s) { + int len = strlen(s); + char *s2 = (char *) malloc((len + 1) * sizeof(char)); + return strcpy(s2, s); +} + +char *mm_typecode_to_str(MM_typecode matcode) { + char buffer[MM_MAX_LINE_LENGTH]; + const char *types[4]; + char *mm_strdup(const char *); + int error = 0; + + /* check for MTX type */ + if (mm_is_matrix(matcode)) + types[0] = MM_MTX_STR; + else + error = 1; + + /* check for CRD or ARR matrix */ + if (mm_is_sparse(matcode)) + types[1] = MM_SPARSE_STR; + else if (mm_is_dense(matcode)) + types[1] = MM_DENSE_STR; + else + return NULL; + + /* check for element data type */ + if (mm_is_real(matcode)) + types[2] = MM_REAL_STR; + else if (mm_is_complex(matcode)) + types[2] = MM_COMPLEX_STR; + else if (mm_is_pattern(matcode)) + types[2] = MM_PATTERN_STR; + else if (mm_is_integer(matcode)) + types[2] = MM_INT_STR; + else + return NULL; + + + /* check for symmetry type */ + if (mm_is_general(matcode)) + types[3] = MM_GENERAL_STR; + else if (mm_is_symmetric(matcode)) + types[3] = MM_SYMM_STR; + else if (mm_is_hermitian(matcode)) + types[3] = MM_HERM_STR; + else if (mm_is_skew(matcode)) + types[3] = MM_SKEW_STR; + 
else
+        return NULL;
+
+    sprintf(buffer, "%s %s %s %s", types[0], types[1], types[2], types[3]);
+    return mm_strdup(buffer);
+
+}
+}
\ No newline at end of file
diff --git a/projects/resources/cuda/mmio.hpp b/projects/resources/cuda/mmio.hpp
new file mode 100644
index 00000000..1e679c31
--- /dev/null
+++ b/projects/resources/cuda/mmio.hpp
@@ -0,0 +1,133 @@
+/*
+* Matrix Market I/O library for ANSI C
+*
+* See http://math.nist.gov/MatrixMarket for details.
+*
+*
+*/
+
+#ifndef MM_IO_H
+#define MM_IO_H
+
+extern "C" {
+#define MM_MAX_LINE_LENGTH 1025
+#define MatrixMarketBanner "%%MatrixMarket"
+#define MM_MAX_TOKEN_LENGTH 64
+
+typedef char MM_typecode[4];
+char *mm_typecode_to_str(MM_typecode matcode);
+
+int mm_read_banner(FILE *f, MM_typecode *matcode);
+int mm_read_mtx_crd_size(FILE *f, int *M, int *N, int *nz);
+int mm_read_mtx_array_size(FILE *f, int *M, int *N);
+
+int mm_write_banner(FILE *f, MM_typecode matcode);
+int mm_write_mtx_crd_size(FILE *f, int M, int N, int nz);
+int mm_write_mtx_array_size(FILE *f, int M, int N);
+
+
+/********************* MM_typecode query functions ***************************/
+
+#define mm_is_matrix(typecode)    ((typecode)[0]=='M')
+
+#define mm_is_sparse(typecode)    ((typecode)[1]=='C')
+#define mm_is_coordinate(typecode)((typecode)[1]=='C')
+#define mm_is_dense(typecode)     ((typecode)[1]=='A')
+#define mm_is_array(typecode)     ((typecode)[1]=='A')
+
+#define mm_is_complex(typecode)   ((typecode)[2]=='C')
+#define mm_is_real(typecode)      ((typecode)[2]=='R')
+#define mm_is_pattern(typecode)   ((typecode)[2]=='P')
+#define mm_is_integer(typecode)   ((typecode)[2]=='I')
+
+#define mm_is_symmetric(typecode) ((typecode)[3]=='S')
+#define mm_is_general(typecode)   ((typecode)[3]=='G')
+#define mm_is_skew(typecode)      ((typecode)[3]=='K')
+#define mm_is_hermitian(typecode) ((typecode)[3]=='H')
+
+int mm_is_valid(MM_typecode matcode); /* too complex for a macro */
+
+
+/********************* MM_typecode modify functions ***************************/
+
+#define mm_set_matrix(typecode)     ((*typecode)[0]='M')
+#define mm_set_coordinate(typecode) ((*typecode)[1]='C')
+#define mm_set_array(typecode)      ((*typecode)[1]='A')
+#define mm_set_dense(typecode)      mm_set_array(typecode)
+#define mm_set_sparse(typecode)     mm_set_coordinate(typecode)
+
+#define mm_set_complex(typecode)((*typecode)[2]='C')
+#define mm_set_real(typecode)   ((*typecode)[2]='R')
+#define mm_set_pattern(typecode)((*typecode)[2]='P')
+#define mm_set_integer(typecode)((*typecode)[2]='I')
+
+
+#define mm_set_symmetric(typecode)((*typecode)[3]='S')
+#define mm_set_general(typecode)  ((*typecode)[3]='G')
+#define mm_set_skew(typecode)     ((*typecode)[3]='K')
+#define mm_set_hermitian(typecode)((*typecode)[3]='H')
+
+#define mm_clear_typecode(typecode) ((*typecode)[0]=(*typecode)[1]= \
+                                     (*typecode)[2]=' ',(*typecode)[3]='G')
+
+#define mm_initialize_typecode(typecode) mm_clear_typecode(typecode)
+
+
+/********************* Matrix Market error codes ***************************/
+
+
+#define MM_COULD_NOT_READ_FILE 11
+#define MM_PREMATURE_EOF 12
+#define MM_NOT_MTX 13
+#define MM_NO_HEADER 14
+#define MM_UNSUPPORTED_TYPE 15
+#define MM_LINE_TOO_LONG 16
+#define MM_COULD_NOT_WRITE_FILE 17
+
+
+/******************** Matrix Market internal definitions ********************
+
+   MM_matrix_typecode: 4-character sequence
+
+                      object    sparse/    data       storage
+                                dense      type       scheme
+
+   string position:   [0]       [1]        [2]        [3]
+
+   Matrix typecode:   M(atrix)  C(oord)    R(eal)     G(eneral)
+                                A(array)   C(omplex)  H(ermitian)
+                                           P(attern)  S(ymmetric)
+                                           I(nteger)  K(skew)
+
+ ***********************************************************************/
+
+#define MM_MTX_STR "matrix"
+#define MM_ARRAY_STR "array"
+#define MM_DENSE_STR "array"
+#define MM_COORDINATE_STR "coordinate"
+#define MM_SPARSE_STR "coordinate"
+#define MM_COMPLEX_STR "complex"
+#define MM_REAL_STR "real"
+#define MM_INT_STR "integer"
+#define MM_GENERAL_STR "general"
+#define MM_SYMM_STR "symmetric"
+#define MM_HERM_STR "hermitian"
+#define
MM_SKEW_STR "skew-symmetric" +#define MM_PATTERN_STR "pattern" + + +/* high level routines */ + +int mm_write_mtx_crd(char fname[], int M, int N, int nz, int I[], int J[], + float val[], MM_typecode matcode); +int mm_read_mtx_crd_data(FILE *f, int M, int N, int nz, int I[], int J[], + float val[], MM_typecode matcode); +int mm_read_mtx_crd_entry(FILE *f, int *I, int *J, float *real, float *img, + MM_typecode matcode); + +int mm_read_unsymmetric_sparse(const char *fname, int *M_, int *N_, int *nz_, + float **val_, int **I_, int **J_); + + +} +#endif diff --git a/projects/resources/cuda/multi_gpu/b1.cu b/projects/resources/cuda/multi_gpu/b1.cu new file mode 100644 index 00000000..ff910534 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b1.cu @@ -0,0 +1,173 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+ +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include "b1.cuh" + +#define P 16 + +////////////////////////////// +////////////////////////////// + +__global__ void square_m(const float *x, float *y, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = x[i] * x[i]; + } +} + +__inline__ __device__ float warp_reduce_m(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +__global__ void reduce_m(const float *x, const float *y, float *z, int N) { + int warp_size = 32; + float sum = float(0); + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] - y[i]; + } + sum = warp_reduce_m(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +////////////////////////////// +////////////////////////////// + +void Benchmark1M::alloc() { + + S = (N + P - 1) / P; + + x = (float**) malloc(sizeof(float*) * P); + y = (float**) malloc(sizeof(float*) * P); + x1 = 
(float**) malloc(sizeof(float*) * P); + y1 = (float**) malloc(sizeof(float*) * P); + res = (float**) malloc(sizeof(float*) * P); + for (int i = 0; i < P; i++) { + err = cudaMallocManaged(&x[i], sizeof(float) * S); + err = cudaMallocManaged(&y[i], sizeof(float) * S); + err = cudaMallocManaged(&x1[i], sizeof(float) * S); + err = cudaMallocManaged(&y1[i], sizeof(float) * S); + err = cudaMallocManaged(&res[i], sizeof(float)); + } + + // Create 2P streams; + s = (cudaStream_t *) malloc(sizeof(cudaStream_t) * 2 * P); + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + err = cudaStreamCreate(&s[i]); + err = cudaStreamCreate(&s[i + P]); + } +} + +void Benchmark1M::init() { + for (int i = 0; i < P; i++) { + for (int j = 0; j < S; j++) { + int index = i * S + j; + if (index < N) { + x[i][j] = 1.0 / (index + 1); + y[i][j] = 2.0 / (index + 1); + } + } + } +} + +void Benchmark1M::reset() { + for (int i = 0; i < P; i++) { + for (int j = 0; j < S; j++) { + int index = i * S + j; + if (index < N) { + x[i][j] = 1.0 / (index + 1); + y[i][j] = 2.0 / (index + 1); + } + } + res[i][0] = 0.0; + } + res_tot = 0.0; +} + +void Benchmark1M::execute_sync(int iter) { + for (int i = 0; i < P; i++) { + if (do_prefetch && pascalGpu) { + cudaMemPrefetchAsync(x[i], sizeof(float) * S, 0, 0); + cudaMemPrefetchAsync(x1[i], sizeof(float) * S, 0, 0); + cudaMemPrefetchAsync(y[i], sizeof(float) * S, 0, 0); + cudaMemPrefetchAsync(y1[i], sizeof(float) * S, 0, 0); + } + square_m<<>>(x[i], x1[i], S); + err = cudaDeviceSynchronize(); + square_m<<>>(y[i], y1[i], S); + err = cudaDeviceSynchronize(); + reduce_m<<>>(x1[i], y1[i], res[i], S); + err = cudaDeviceSynchronize(); + } + for (int i = 0; i < P; i++) { + res_tot += res[i][0]; + } +} + +void Benchmark1M::execute_async(int iter) { + for (int i = 0; i < P; i++) { + int gpu = select_gpu(i, max_devices); + cudaSetDevice(gpu); + if (!pascalGpu || stream_attach) { + cudaStreamAttachMemAsync(s[i], x[i], sizeof(float) * S); + 
cudaStreamAttachMemAsync(s[i], x1[i], sizeof(float) * S); + cudaStreamAttachMemAsync(s[i + P], y[i], sizeof(float) * S); + cudaStreamAttachMemAsync(s[i + P], y1[i], sizeof(float) * S); + } + if (pascalGpu && do_prefetch) { + cudaMemPrefetchAsync(x[i], sizeof(float) * S, gpu, s[i]); + cudaMemPrefetchAsync(x1[i], sizeof(float) * S, gpu, s[i]); + cudaMemPrefetchAsync(y[i], sizeof(float) * S, gpu, s[i + P]); + cudaMemPrefetchAsync(y1[i], sizeof(float) * S, gpu, s[i + P]); + } + + square_m<<>>(x[i], x1[i], S); + square_m<<>>(y[i], y1[i], S); + + // Stream 1 waits stream 2; + cudaEvent_t e1; + cudaEventCreate(&e1); + cudaEventRecord(e1, s[i + P]); + cudaStreamWaitEvent(s[i], e1, 0); + + reduce_m<<>>(x1[i], y1[i], res[i], S); + } + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + cudaStreamSynchronize(s[i]); + res_tot += res[i][0]; + } +} + +std::string Benchmark1M::print_result(bool short_form) { + return std::to_string(res_tot); +} \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b1.cuh b/projects/resources/cuda/multi_gpu/b1.cuh new file mode 100644 index 00000000..94a91787 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b1.cuh @@ -0,0 +1,50 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. 
+// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include "../benchmark.cuh" + +// Implement B1 using multi-GPU. 
+// Partition the computation across D devices, compute D partial results and aggregate them on the CPU +class Benchmark1M : public Benchmark { + public: + Benchmark1M(Options &options) : Benchmark(options) {} + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + std::string print_result(bool short_form = false); + + private: + int S; + float **x, **y, **x1, **y1, **res; + float res_tot = 0.0; + cudaStream_t *s; +}; \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b11.cu b/projects/resources/cuda/multi_gpu/b11.cu new file mode 100644 index 00000000..293f37c3 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b11.cu @@ -0,0 +1,225 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include "b11.cuh" + +////////////////////////////// +////////////////////////////// + +#if PARTITION_Z_B11 +extern "C" __global__ void matrix_vector_mult_1(const float* x, const float* y, float* z, int n, int m) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < m; j++) { + sum += x[i * m + j] * y[j]; + } + z[i] = sum; + } +} + +extern "C" __global__ void copy(const float *x, float *y, int n, int offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i + offset] = x[i]; + } +} +#else +extern "C" __global__ void matrix_vector_mult_1(const float* x, const float* y, float* z, int n, int m, int z_offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < m; j++) { + sum += x[i * m + j] * y[j]; + } + z[z_offset + i] = sum; + } +} +#endif + +#define BLOCK_DIM 16 +extern "C" __global__ void matrix_vector_mult_2(const float* x, const float* y, float* z, int n, int m, int z_offset) { + int tile_size = BLOCK_DIM; + + // In the simplest implementation, each block computes a vertical tile of the Z vector, + // whose coordinates are given by blockIdx.x; + // Here, we allow each block to process more tiles, hence the loops below; + for(int z_tile_i = blockIdx.x; z_tile_i < (m + tile_size - 1) / tile_size; z_tile_i += gridDim.x) { + // 
Index of the tile element computed by this thread, with respect to the current tile; + int z_i = threadIdx.x; + int z_j = threadIdx.y; + // Coordinate of the Z matrix element computed by this specific thread, with respect to the overall Z matrix (not counting host-level data partitioning); + int i = z_tile_i * blockDim.x + threadIdx.x; + // Value of the Z vector block being computed by this specific thread; + float z_val_i = 0; + // Loop over the tiles in the same row of X of the desired output tile in Z; + for (int curr_tile_index = 0; curr_tile_index < (n + tile_size - 1) / tile_size; curr_tile_index++) { + // Shared memory used to store the current tiles of X and Y; + __shared__ float x_tile[BLOCK_DIM][BLOCK_DIM]; + __shared__ float y_tile[BLOCK_DIM]; + // Each thread in the block loads a value into the tile; + if ((i < n) && (curr_tile_index * tile_size + z_j < m)) { + x_tile[z_i][z_j] = x[m * i + curr_tile_index * tile_size + z_j]; + } else { + x_tile[z_i][z_j] = 0; + } + if (curr_tile_index * tile_size + z_j < m) { + y_tile[z_j] = y[curr_tile_index * tile_size + z_j]; + } else { + y_tile[z_j] = 0; + } + // Synchronize threads in the block, ensure the tile has been loaded; + __syncthreads(); + // Multiply the i-th row of the tile with the vector tile; + for (int k = 0; k < tile_size; k++) { + z_val_i += x_tile[z_i][k] * y_tile[k]; + } + + // Synchronize threads in the block, ensure the computation has finished before loading the next tile; + __syncthreads(); + } + // Write the output value into Z, taking into account the offset of the current tile; + if (z_offset + i < n) { + z[z_offset + i] = z_val_i; + } + } +} + +////////////////////////////// +////////////////////////////// + +void Benchmark11M::alloc() { + M = N; + S = (N + P - 1) / P; + // x_cpu = (float *) malloc(sizeof(float) * N * M); + x = (float **) malloc(sizeof(float*) * P); + for (int i = 0; i < P; i++) { + err = cudaMallocManaged(&x[i], sizeof(float) * S * M); + } + err = cudaMallocManaged(&y, 
sizeof(float) * M); +#if PARTITION_Z_B11 + z = (float **) malloc(sizeof(float*) * P); + for (int i = 0; i < P; i++) { + err = cudaMallocManaged(&z[i], sizeof(float) * S); + } + cudaMallocManaged(&z_out, sizeof(float) * N); +#else + err = cudaMallocManaged(&z, sizeof(float) * N); +#endif + + // Create P streams; + s = (cudaStream_t *) malloc(sizeof(cudaStream_t) * P); + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + err = cudaStreamCreate(&s[i]); + } +} + +void Benchmark11M::init() { +} + +void Benchmark11M::reset() { + for (int i = 0; i < M; i++) { + y[i] = float(i + 1) / M; + } + for (int i = 0; i < P; i++) { + for (int j = 0; j < S * M; j++) { + x[i][j] = float(i * S * M + j) / (N * M); + } + } +} + +void Benchmark11M::execute_sync(int iter) { + if (do_prefetch && pascalGpu) { + for (int p = 0; p < P; p++) { + cudaMemPrefetchAsync(x[p], sizeof(float) * S * M, 0, 0); + cudaDeviceSynchronize(); + } + cudaMemPrefetchAsync(y, sizeof(float) * M, 0, 0); + } + cudaDeviceSynchronize(); + for (int p = 0; p < P; p++) { +#if PARTITION_Z_B11 + matrix_vector_mult_1<<<num_blocks, block_size_1d>>>(x[p], y, z[p], std::min(S, N - p * S), M); +#else + matrix_vector_mult_1<<<num_blocks, block_size_1d>>>(x[p], y, z, std::min(S, N - p * S), M, p * S); +#endif + cudaDeviceSynchronize(); + } + // Copy data to the output vector; +#if PARTITION_Z_B11 + for (int p = 0; p < P; p++) { + copy<<<num_blocks, block_size_1d>>>(z[p], z_out, std::min(S, N - p * S), p * S); + } +#else + z_out = z; +#endif + cudaDeviceSynchronize(); +} + +void Benchmark11M::execute_async(int iter) { + dim3 block_size_2d_dim(block_size_2d, block_size_2d); + dim3 grid_size(num_blocks, num_blocks); + for (int p = 0; p < P; p++) { + cudaSetDevice(select_gpu(p, max_devices)); + if (!pascalGpu || stream_attach) { + cudaStreamAttachMemAsync(s[p], x[p], sizeof(float) * S * M); + } + if (pascalGpu && do_prefetch) { + cudaMemPrefetchAsync(x[p], sizeof(float) * S * M, select_gpu(p, max_devices), s[p]); + } +#if PARTITION_Z_B11 + matrix_vector_mult_1<<<num_blocks, block_size_1d, 0, s[p]>>>(x[p], y, z[p], 
std::min(S, N - p * S), M); +#else + matrix_vector_mult_1<<<num_blocks, block_size_1d, 0, s[p]>>>(x[p], y, z, std::min(S, N - p * S), M, p * S); +#endif + // matrix_vector_mult_2<<<grid_size, block_size_2d_dim, 0, s[p]>>>(x[p], y, z, std::min(S, N - p * S), M, p * S); + } + // Copy data to the output vector; +#if PARTITION_Z_B11 + for (int p = 0; p < P; p++) { + copy<<<num_blocks, block_size_1d, 0, s[p]>>>(z[p], z_out, std::min(S, N - p * S), p * S); + } +#else + z_out = z; +#endif + + for (int p = 0; p < P; p++) { + err = cudaStreamSynchronize(s[p]); + } +} + +std::string Benchmark11M::print_result(bool short_form) { + if (short_form) { + return std::to_string(z_out[0]); + } else { + std::string res = "["; + for (int i = 0; i < std::min(100, N); i++) { + res += std::to_string(z_out[i]) + ", "; + } + return res + "...]"; + } +} \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b11.cuh b/projects/resources/cuda/multi_gpu/b11.cuh new file mode 100644 index 00000000..41e7fe4a --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b11.cuh @@ -0,0 +1,63 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+ +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include "../benchmark.cuh" + +#define PARTITION_Z_B11 true + +class Benchmark11M : public Benchmark { + public: + Benchmark11M(Options &options) : Benchmark(options) { + P = num_partitions; + } + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + std::string print_result(bool short_form = false); + + private: + int M; + int S; + int P; + + float **x; + float *y; +#if PARTITION_Z_B11 + float **z; +#else + float *z; +#endif + float *x_cpu; + float *z_out; + + cudaStream_t *s; +}; \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b12.cu b/projects/resources/cuda/multi_gpu/b12.cu new file mode 100644 index 00000000..7b2a2b73 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b12.cu @@ -0,0 +1,675 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. 
+// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
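The `alloc_coo_partitions` routine defined further down splits the row-sorted COO matrix into P chunks of roughly nnz/P nonzeros each, then pushes every cut forward to the next row boundary so that no row straddles two partitions. A minimal host-side C++ sketch of that cut-selection logic (the function name and return convention are illustrative, not part of this diff):

```cpp
#include <cstdint>
#include <vector>

// Split `rows.size()` COO entries (sorted by row index) into `p` partitions,
// advancing each cut until the row index changes, so that no row is split
// across two partitions. Returns partition start indices plus a final sentinel.
std::vector<uint32_t> partition_by_row(const std::vector<int> &rows, uint32_t p) {
    const uint32_t nnz = static_cast<uint32_t>(rows.size());
    const uint32_t chunk = (nnz + p) / p;  // same rounding as nnz_per_partition below
    std::vector<uint32_t> cuts{0};
    uint32_t to = chunk;
    for (uint32_t i = 0; i + 1 < p && to < nnz; ++i) {
        // Move the cut forward while it still points inside the current row;
        while (to < nnz && rows[to] == rows[to - 1]) to++;
        cuts.push_back(to);
        to += chunk;
    }
    cuts.push_back(nnz);
    return cuts;
}
```

Each resulting `[cuts[i], cuts[i+1])` range corresponds to one `coo_matrix_t` partition, with row indices rebased by the partition's first row, mirroring the `offset` subtraction in `assign_partition`.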
+ + +#include <fstream> +#include "b12.cuh" + +void read_matrix(const std::string file_path, i32 &N, i32 &M, i32 &NNZ, std::vector<i32> &x, std::vector<i32> &y, + std::vector<f32> &val) { + std::ifstream input(file_path); + std::string cur_line; + int cur_x, cur_y; + + while (std::getline(input, cur_line)) { + if (cur_line[0] != '%') { + + std::istringstream iss(cur_line); + iss >> N >> M >> NNZ; + break; + } + } + + + while (std::getline(input, cur_line)) { + std::istringstream iss(cur_line); + + iss >> cur_x >> cur_y; + cur_x--; + cur_y--; + x.push_back(cur_x); + y.push_back(cur_y); + val.push_back(1.0); + } + +} + +/** + * From https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api + */ +#define CUDA_CHECK_ERROR(kernel_ret_code) { gpuAssert((kernel_ret_code), __FILE__, __LINE__); } + +inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = false) { + if (code != cudaSuccess) { + fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line); + //if (abort) exit(code); + } +} + +__global__ void subtract(float *v1, const float *v2, const float alpha, int N, int offset) { + int init = threadIdx.x + blockIdx.x * blockDim.x; + int stride = blockDim.x * gridDim.x; + + for (int i = init; i < N; i += stride) { + v1[i] -= alpha * v2[i + offset]; + } +} + +__global__ void +copy_partition_to_vec(const float *vec_in, float *vec_out, const int N, const int offset_out, const int offset_in) { + int init = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (int i = init; i < N; i += stride) { + vec_out[i + offset_out] = vec_in[i + offset_in]; + } +} + +__global__ void normalize(const float *d_v_in, const float denominator, float *d_v_out, int N) { + int init = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + + for (int i = init; i < N; i += stride) { + d_v_out[i] = d_v_in[i] * denominator; + } +} + +__global__ void spmv(const int *x, 
const int *y, const float *val, const float *v_in, float *v_out, int num_nnz) { + int init = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + + //printf("BlockID: %d, ThreadID: %d. Init = %d, Stride = %d\n", blockIdx.x, threadIdx.x, init, stride); + + for (int i = init; i < num_nnz; i += stride) { + //printf("v_out[%d] += v_in[%d] * val[%d]\n", y[i], x[i], i); + v_out[y[i]] += v_in[x[i]] * val[i]; + } +} + + +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +// z = <x, x>; +extern "C" __global__ void l2_norm_b12(const float *x, float *z, int N, int offset) { + int warp_size = 32; + float sum = 0; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + float x_tmp = x[i + offset]; + sum += x_tmp * x_tmp; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +__global__ void dot_product(const float *x, const float *y, float *z, int N, int offset) { + int warp_size = 32; + float sum = 0; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] * y[i + offset]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +__global__ void set(float *v_in, float value, const int N) { + int init = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + + for (u32 i = init; i < N; i += stride) { + v_in[i] = value; + } + +} + +__global__ void +axpb_xtended(const float alpha, const float *x, const 
float *b, const float beta, const float *c, float *out, + const int N, const int offset_x, const int offset_c) { + int init = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (int i = init; i < N; i += stride) { + out[i] = alpha * x[i + offset_x] + b[i] + beta * c[i + offset_c]; + } +} + + +template <typename T> +T accumulate(T *arr, const u32 size, const T init = T(0)){ + T accumulator = init; + for(u32 i = 0; i < size; ++i){ + accumulator += arr[i]; + } + + return accumulator; +} + +void Benchmark12M::alloc_vectors() { + for (const auto &partition: this->coo_partitions) { + f32 *tmp_vec_in, *tmp_spmv_out, *tmp_intermediate_dot_product_values; + f32 *tmp_vec_next, *tmp_lanczos_vectors, *tmp_normalized_out; + + CUDA_CHECK_ERROR(cudaMallocManaged(&tmp_vec_in, sizeof(f32) * this->matrix.N)); + CUDA_CHECK_ERROR(cudaMallocManaged(&tmp_spmv_out, sizeof(f32) * partition->N)); + CUDA_CHECK_ERROR(cudaMallocManaged(&tmp_intermediate_dot_product_values, sizeof(f32) * 32)); + CUDA_CHECK_ERROR(cudaMallocManaged(&tmp_vec_next, sizeof(f32) * partition->N)); + CUDA_CHECK_ERROR( + cudaMallocManaged(&tmp_lanczos_vectors, sizeof(f32) * this->num_eigencomponents * partition->N)); + CUDA_CHECK_ERROR(cudaMallocManaged(&tmp_normalized_out, sizeof(f32) * partition->N)); + + this->vec_in.push_back(tmp_vec_in); + this->spmv_vec_out.push_back(tmp_spmv_out); + this->intermediate_dot_product_values.push_back(tmp_intermediate_dot_product_values); + this->vec_next.push_back(tmp_vec_next); + this->lanczos_vectors.push_back(tmp_lanczos_vectors); + this->normalized_out.push_back(tmp_normalized_out); + } + + CUDA_CHECK_ERROR(cudaMallocManaged(&alpha_intermediate, sizeof(f32) * this->num_partitions)); + CUDA_CHECK_ERROR(cudaMallocManaged(&beta_intermediate, sizeof(f32) * this->num_partitions)); +} + +void Benchmark12M::alloc_coo_partitions() { + + const u32 nnz_per_partition = u32((this->matrix.nnz + this->num_partitions) / this->num_partitions); + u32 from_index = 0; + u32 
to_index = nnz_per_partition; + u32 index_value = this->matrix.y[to_index]; + + for (u32 i = 0; i < this->num_partitions - 1; ++i) { + while (index_value == this->matrix.y[to_index]) { + to_index++; + } + const u32 offset = (from_index == 0) ? from_index : (this->matrix.y[from_index] - 1); + auto coo_partition = (this->assign_partition(from_index, to_index, offset)); + this->coo_partitions.push_back(coo_partition); + + from_index = to_index; + to_index += nnz_per_partition; + index_value = this->matrix.y[to_index]; + } + const u32 offset = this->matrix.y[from_index]; + auto coo_partition = (this->assign_partition(from_index, this->matrix.nnz, offset)); + this->coo_partitions.push_back(coo_partition); +} + +coo_matrix_t *Benchmark12M::assign_partition(u32 from_index, u32 to_index, u32 offset) { + i32 *tmp_x, *tmp_y; + f32 *tmp_val; + coo_matrix_t *coo_partition; + cudaMallocManaged(&coo_partition, sizeof(coo_matrix_t)); + coo_partition->begin = from_index; + coo_partition->end = to_index; + CUDA_CHECK_ERROR(cudaMallocManaged(&tmp_x, sizeof(u32) * (to_index - from_index))); + CUDA_CHECK_ERROR(cudaMallocManaged(&tmp_y, sizeof(u32) * (to_index - from_index))); + CUDA_CHECK_ERROR(cudaMallocManaged(&tmp_val, sizeof(f32) * (to_index - from_index))); + + coo_partition->x = tmp_x; + coo_partition->y = tmp_y; + coo_partition->val = tmp_val; + + u32 j = 0; + for (u32 i = from_index; i < to_index; ++i, ++j) { + coo_partition->x[j] = this->matrix.x[i]; + coo_partition->y[j] = this->matrix.y[i] - offset; + coo_partition->val[j] = this->matrix.val[i]; + } + + coo_partition->N = coo_partition->y[to_index - from_index - 1] + 1; + coo_partition->nnz = to_index - from_index; + return coo_partition; +} + +void Benchmark12M::create_random_matrix(bool normalize = true) { + u32 total_nnz = RANDOM_MATRIX_AVG_NNZ_PER_ROW * RANDOM_MATRIX_NUM_ROWS; + i32 *x = (i32 *) std::malloc(total_nnz * sizeof(i32)); + i32 *y = (i32 *) std::malloc(total_nnz * sizeof(i32)); + f32 *val = (f32 *) 
std::malloc(total_nnz * sizeof(f32)); + + for (u32 i = 0; i < total_nnz; ++i) + val[i] = (f32) std::rand() / RAND_MAX; + + f32 acc = 0.0; + for (u32 i = 0; i < total_nnz; ++i) + acc += val[i] * val[i]; + + f32 norm = std::sqrt(acc); + + for (u32 i = 0; i < total_nnz; ++i) + val[i] /= norm; + + auto random_node = [&]() { + return std::rand() % RANDOM_MATRIX_NUM_ROWS; + }; + + + std::generate(x, x + total_nnz, random_node); + std::generate(y, y + total_nnz, random_node); + + std::sort(y, y + total_nnz); + + this->matrix.x = x; + this->matrix.y = y; + this->matrix.val = val; + this->matrix.begin = 0; + this->matrix.end = total_nnz; + this->matrix.N = RANDOM_MATRIX_NUM_ROWS; + this->matrix.nnz = total_nnz; + +} + +void Benchmark12M::load_matrix(bool normalize = false) { + + i32 M, N, nnz; + std::vector<i32> x, y; + std::vector<f32> val; + + i32 *x_mem, *y_mem; + f32 *val_mem; + + + read_matrix(this->matrix_path, N, M, nnz, x, y, val); + + x_mem = (i32 *) std::malloc(nnz * sizeof(i32)); + y_mem = (i32 *) std::malloc(nnz * sizeof(i32)); + val_mem = (f32 *) std::malloc(nnz * sizeof(f32)); + + std::memcpy(x_mem, x.data(), nnz * sizeof(i32)); + std::memcpy(y_mem, y.data(), nnz * sizeof(i32)); + std::memcpy(val_mem, val.data(), nnz * sizeof(f32)); + + + this->matrix = {x_mem, y_mem, val_mem, 0, nnz, N, nnz}; + + if (normalize) { + + f32 norm = std::sqrt(std::accumulate(this->matrix.val, this->matrix.val + nnz, 0.0f, + [](f32 cur, f32 next) { return cur + next * next; })); + for (u32 i = 0; i < nnz; ++i) + this->matrix.val[i] = 1.0f / norm; + + } + +} + +void Benchmark12M::alloc() { + + cudaMallocManaged(&(this->alpha), sizeof(f32)); + cudaMallocManaged(&(this->beta), sizeof(f32)); + + if (this->matrix_path.empty()) + this->create_random_matrix(); + else + this->load_matrix(); + + + this->create_streams(); + this->alloc_coo_partitions(); + this->alloc_vectors(); + + // Create offsets + this->offsets.push_back(0); + for (u32 i = 1; i < this->num_partitions; ++i) + 
this->offsets.push_back(this->coo_partitions[i]->N - this->offsets[i - 1]); + +} + +void Benchmark12M::reset() { + // std::cout << "Called reset" << std::endl; + // Just call init, it resets all the necessary vectors; + this->init(); + + for (u32 i = 0; i < this->num_partitions; ++i) { + const auto *partition = this->coo_partitions[i]; + for (u32 j = 0; j < partition->N; ++j) { + this->spmv_vec_out[i][j] = 0.0f; + } + } + + // std::cout << "reset end" << std::endl; + + this->tridiagonal_matrix.clear(); + +} + +void Benchmark12M::sync_all() { + + for (u32 i = 0; i < this->num_partitions; ++i) { + auto selected_device = select_gpu(i, this->num_devices); + CUDA_CHECK_ERROR(cudaSetDevice(selected_device)); + CUDA_CHECK_ERROR(cudaStreamSynchronize(this->streams[i])); + } + cudaSetDevice(0); +} + +void Benchmark12M::create_streams() { + + for (u32 i = 0; i < this->num_partitions; ++i) { + auto selected_device = select_gpu(i, this->num_devices); + CUDA_CHECK_ERROR(cudaSetDevice(selected_device)); + auto *stream = (cudaStream_t *) std::malloc(sizeof(cudaStream_t)); + CUDA_CHECK_ERROR(cudaStreamCreate(stream)); + this->streams.push_back(*stream); + } + + cudaSetDevice(0); + +} + +template <typename Function> +void Benchmark12M::launch_multi_kernel(Function kernel_launch_function) { + + for (u32 i = 0; i < this->num_partitions; ++i) { + auto selected_device = select_gpu(i, this->num_devices); + CUDA_CHECK_ERROR(cudaSetDevice(selected_device)); + const cudaStream_t &stream = streams[i]; + kernel_launch_function(i, stream); + + if (policy == Policy::Sync) + this->sync_all(); + } + +} + +void Benchmark12M::execute(i32 iter) { + + if (this->debug) { + std::cout << "[LANCZOS] Iteration " << iter << std::endl; + } + + + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + spmv<<<this->num_blocks, this->block_size, 0, stream>>>( + this->coo_partitions[p_idx]->x, + this->coo_partitions[p_idx]->y, + this->coo_partitions[p_idx]->val, + this->vec_in[p_idx], + this->spmv_vec_out[p_idx], + 
this->coo_partitions[p_idx]->nnz + ); + }); + + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + // std::cout << "dot_product 0th" << std::endl; + this->alpha_intermediate[p_idx] = 0.0; + dot_product<<<this->num_blocks, this->block_size, 0, stream>>>( + this->spmv_vec_out[p_idx], + this->vec_in[p_idx], + &this->alpha_intermediate[p_idx], + this->coo_partitions[p_idx]->N, + this->offsets[p_idx] + ); + + }); + + this->sync_all(); + *alpha = accumulate(this->alpha_intermediate, this->num_partitions); + this->tridiagonal_matrix.push_back(*alpha); + + // std::cout << alpha << std::endl; + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + // std::cout << "axpb_xtended 0th" << std::endl; + axpb_xtended<<<this->num_blocks, this->block_size, 0, stream>>>( + -(*alpha), + this->vec_in[p_idx], + this->spmv_vec_out[p_idx], + 0, + this->vec_in[p_idx], + this->vec_next[p_idx], + this->coo_partitions[p_idx]->N, + this->offsets[p_idx], + 0 + ); + }); + + for (u32 i = 1; i < this->num_eigencomponents; ++i) { + + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + // std::cout << "l2_norm" << std::endl; + this->beta_intermediate[p_idx] = 0.0f; + l2_norm_b12<<<this->num_blocks, this->block_size, 0, stream>>>( + this->vec_next[p_idx], + &this->beta_intermediate[p_idx], + this->coo_partitions[p_idx]->N, + 0 + ); + + }); + + + this->sync_all(); + *beta = accumulate(this->beta_intermediate, this->num_partitions); + *beta = std::sqrt(*beta); + this->tridiagonal_matrix.push_back(*beta); + + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + // std::cout << "normalize" << std::endl; + normalize<<<this->num_blocks, this->block_size, 0, stream>>>( + this->vec_next[p_idx], + 1.0f / (*beta), + this->normalized_out[p_idx], + this->coo_partitions[p_idx]->N + ); + }); + + this->launch_multi_kernel([&, i](u32 p_idx, const cudaStream_t &stream) { + // std::cout << "copy_partition_to_vec" << std::endl; + copy_partition_to_vec<<<this->num_blocks, 
this->block_size, 0, stream>>>( + this->vec_in[p_idx], + this->lanczos_vectors[p_idx], + this->offsets[p_idx], + this->coo_partitions[p_idx]->N * (i - 1), + this->offsets[p_idx] + ); + }); + + for (u32 j = 0; j < this->num_partitions; ++j) { + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + copy_partition_to_vec<<<this->num_blocks, this->block_size, 0, stream>>>( + this->normalized_out[p_idx], + this->vec_in[p_idx], + this->coo_partitions[p_idx]->N, + offsets[p_idx], + 0 + ); + }); + + auto first = this->vec_in.front(); + this->vec_in.erase(this->vec_in.begin()); + this->vec_in.push_back(first); + } + + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream){ + set<<<this->num_blocks, this->block_size, 0, stream>>>( + this->spmv_vec_out[p_idx], + 0.0f, + this->coo_partitions[p_idx]->N + ); + }); + + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + // std::cout << "spmv" << std::endl; + spmv<<<this->num_blocks, this->block_size, 0, stream>>>( + this->coo_partitions[p_idx]->x, + this->coo_partitions[p_idx]->y, + this->coo_partitions[p_idx]->val, + this->vec_in[p_idx], + this->spmv_vec_out[p_idx], + this->coo_partitions[p_idx]->nnz + ); + }); + + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + // std::cout << "dot_product" << std::endl; + this->alpha_intermediate[p_idx] = 0.0f; + dot_product<<<this->num_blocks, this->block_size, 0, stream>>>( + this->spmv_vec_out[p_idx], + this->vec_in[p_idx], + &this->alpha_intermediate[p_idx], + this->coo_partitions[p_idx]->N, + this->offsets[p_idx] + ); + + }); + + this->sync_all(); + *alpha = accumulate(this->alpha_intermediate, this->num_partitions); + tridiagonal_matrix.push_back(*alpha); + + this->launch_multi_kernel([&, i](u32 p_idx, const cudaStream_t &stream) { + // std::cout << "axpb_xtended" << std::endl; + axpb_xtended<<<this->num_blocks, this->block_size, 0, stream>>>( + -(*alpha), + this->vec_in[p_idx], // Right + this->spmv_vec_out[p_idx], // Right + -(*beta), + 
this->lanczos_vectors[p_idx], + this->vec_next[p_idx], // Right + this->coo_partitions[p_idx]->N, // Right + this->offsets[p_idx], // Right + this->coo_partitions[p_idx]->N * (i - 1) + ); + }); + + + if (this->reorthogonalize) { + + for (u32 j = 0; j < i; ++j) { + this->launch_multi_kernel([&, j](u32 p_idx, const cudaStream_t &stream) { + + dot_product<<<this->num_blocks, this->block_size, 0, stream>>>( + this->vec_next[p_idx], + this->lanczos_vectors[p_idx], + this->alpha, + this->coo_partitions[p_idx]->N, + this->offsets[p_idx] * j + ); + }); + this->sync_all(); + + this->launch_multi_kernel([&](u32 p_idx, const cudaStream_t &stream) { + subtract<<<this->num_blocks, this->block_size, 0, stream>>>( + this->vec_next[p_idx], + this->lanczos_vectors[p_idx], + *alpha, + this->coo_partitions[p_idx]->N, + this->coo_partitions[p_idx]->N + ); + }); + + } + + } + + this->sync_all(); + + } + + this->sync_all(); + +} + +void Benchmark12M::execute_sync(i32 iter) { + assert(this->policy == Policy::Sync); + this->execute(iter); +} + +void Benchmark12M::execute_async(int iter) { + assert(this->policy == Policy::Async); + + for (u32 i = 0; i < this->num_partitions; ++i) + assert(this->streams[i] != nullptr); + + this->execute(iter); +} + +std::string Benchmark12M::print_result(bool short_form = false) { + + std::string base = ""; + if(this->debug){ + for (u32 i = 0; i < this->num_eigencomponents * 2 - 1; ++i) { + const auto &r = this->tridiagonal_matrix[i]; + base += std::to_string(i) + " -> " + std::to_string(r) + ", "; + } + + base += "\n"; + } + + return base; +} + +void Benchmark12M::init() { + // Initialize vec_in[0] + std::generate(this->vec_in[0], this->vec_in[0] + this->matrix.N, [this]() { return 1.0f / this->matrix.N; }); + + // copy it to the other vectors + for (u32 i = 1; i < this->num_partitions; ++i) { + cudaMemcpy(this->vec_in[i], this->vec_in[0], this->matrix.N * sizeof(f32), cudaMemcpyHostToHost); + } + + // Initialize the other vectors that get + // both read and written in 
a single computation + for (u32 i = 0; i < this->num_partitions; ++i) { + const auto &partition = this->coo_partitions[i]; + + for (u32 j = 0; j < partition->N; ++j) { + this->spmv_vec_out[i][j] = 0.0f; + this->vec_next[i][j] = 0.0f; + this->normalized_out[i][j] = 0.0f; + } + + + } + + *alpha = 0.0f; + *beta = 0.0f; + +} + +void Benchmark12M::execute_cudagraph(int iter) { + throw std::runtime_error("Benchmark12M::execute_cudagraph not implemented"); +} + +void Benchmark12M::execute_cudagraph_manual(int iter) { + throw std::runtime_error("Benchmark12M::execute_cudagraph_manual not implemented"); +} + +void Benchmark12M::execute_cudagraph_single(int iter) { + throw std::runtime_error("Benchmark12M::execute_cudagraph_single not implemented"); +} + +std::ostream &operator<<(std::ostream &os, const coo_matrix_t &matrix) { + os << "x: " << matrix.x << " y: " << matrix.y << " val: " << matrix.val << " begin: " << matrix.begin << " end: " + << matrix.end << " N: " << matrix.N << " nnz: " << matrix.nnz; + return os; +} diff --git a/projects/resources/cuda/multi_gpu/b12.cuh b/projects/resources/cuda/multi_gpu/b12.cuh new file mode 100644 index 00000000..03de6713 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b12.cuh @@ -0,0 +1,131 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. 
+// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
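For reference, `Benchmark12M::execute` in b12.cu implements a partitioned Lanczos recurrence: per iteration it computes alpha_i = ⟨A v_i, v_i⟩, w = A v_i − alpha_i v_i − beta_i v_{i−1}, beta_{i+1} = ‖w‖, v_{i+1} = w / beta_{i+1}, appending the interleaved alpha/beta values to `tridiagonal_matrix` (hence its `num_eigencomponents * 2 - 1` entries in `print_result`). A single-threaded C++ sketch of the same recurrence, with a dense stand-in for the multi-GPU SpMV (all names here are illustrative, not the benchmark's API):

```cpp
#include <cmath>
#include <vector>

using vec = std::vector<float>;

// Dense stand-in for the benchmark's partitioned SpMV.
vec matvec(const std::vector<vec> &A, const vec &v) {
    vec out(A.size(), 0.0f);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < v.size(); ++j)
            out[i] += A[i][j] * v[j];
    return out;
}

float dot(const vec &a, const vec &b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Lanczos recurrence: returns the interleaved [alpha_0, beta_1, alpha_1, ...]
// entries of the tridiagonal matrix, 2*k - 1 values for k iterations.
vec lanczos(const std::vector<vec> &A, vec v, unsigned k) {
    vec tri;
    vec v_prev(v.size(), 0.0f);
    float beta = 0.0f;
    for (unsigned i = 0; i < k; ++i) {
        vec w = matvec(A, v);
        float alpha = dot(w, v);  // alpha_i = <A v_i, v_i>
        tri.push_back(alpha);
        for (size_t j = 0; j < w.size(); ++j)
            w[j] -= alpha * v[j] + beta * v_prev[j];
        if (i + 1 == k) break;
        beta = std::sqrt(dot(w, w));  // beta_{i+1} = ||w||
        tri.push_back(beta);
        v_prev = v;
        for (size_t j = 0; j < w.size(); ++j) v[j] = w[j] / beta;
    }
    return tri;
}
```

On a symmetric 2x2 matrix the recurrence reproduces the matrix itself as its tridiagonalization, which makes the sketch easy to sanity-check by hand.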
+ +#pragma once +#include <algorithm> +#include <cassert> +#include <cmath> +#include <cstdlib> +#include <cstring> +#include <fstream> +#include <functional> +#include <iostream> +#include <numeric> +#include <sstream> +#include <string> +#include <vector> + +#include "../benchmark.cuh" +#include "../mmio.hpp" + +using f32 = float; +using u32 = unsigned; +using f64 = double; +using u64 = long; +using i32 = int; + +struct coo_matrix_t { + friend std::ostream &operator<<(std::ostream &os, const coo_matrix_t &matrix); + + i32 *x; + i32 *y; + float *val; + i32 begin; + i32 end; + i32 N; + i32 nnz; + +}; + + + +#define DOT_PRODUCT_NUM_BLOCKS 32 +#define DOT_PRODUCT_THREADS_PER_BLOCK 64 + +class Benchmark12M : public Benchmark { +public: + Benchmark12M(Options &options) : Benchmark(options) { + // This test does not run on pascal gpus due to how Managed memory is handled + + this->block_size = this->block_size_1d * this->block_size_2d; + this->num_partitions = options.max_devices; + + + cudaGetDeviceCount(&this->num_devices); + //this->num_devices = std::min(this->num_devices, this->num_partitions); + assert(this->num_devices > 0); + + + } + void alloc(); + void init(); + void reset(); + void execute_sync(i32); + void execute_async(i32); + void execute_cudagraph(i32); + void execute_cudagraph_manual(i32); + void execute_cudagraph_single(i32); + void load_matrix(bool); + std::string print_result(bool); + + +private: + + unsigned num_eigencomponents = 8; + i32 num_partitions = -1; + i32 num_devices = -1; + std::string matrix_path = "../datasets/333SP.mtx"; + bool reorthogonalize = false; + i32 block_size; + coo_matrix_t matrix; + std::vector<coo_matrix_t *> coo_partitions; + f32 *alpha, *beta; + std::vector<f32 *> vec_in, spmv_vec_out, intermediate_dot_product_values, vec_next, lanczos_vectors, normalized_out; + float *alpha_intermediate, *beta_intermediate; + std::vector<cudaStream_t> streams; + std::vector<u32> offsets; + + std::vector<f32> tridiagonal_matrix; + + void alloc_coo_partitions(); + void alloc_vectors(); + void create_random_matrix(bool); + void execute(i32); + void sync_all(); + void create_streams(); + coo_matrix_t 
*assign_partition(unsigned, unsigned, unsigned); + + template + void launch_multi_kernel(Function); + + static constexpr u32 RANDOM_MATRIX_NUM_ROWS = 1000000; + static constexpr u32 RANDOM_MATRIX_AVG_NNZ_PER_ROW = 100; + +}; + diff --git a/projects/resources/cuda/multi_gpu/b13.cu b/projects/resources/cuda/multi_gpu/b13.cu new file mode 100644 index 00000000..66d7b22f --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b13.cu @@ -0,0 +1,348 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+#include "b13.cuh"
+
+//////////////////////////////
+//////////////////////////////
+
+#if PARTITION_Z
+// Assume that z is partitioned in blocks;
+extern "C" __global__ void matrix_matrix_mult_1(const float* x, const float* y, float* z, int x_num_rows, int x_num_cols, int y_num_cols, int z_num_cols) {
+    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < x_num_rows; i += blockDim.x * gridDim.x) {
+        for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < y_num_cols; j += blockDim.y * gridDim.y) {
+            float sum = 0;
+            for (int k = 0; k < x_num_cols; k++) {
+                sum += x[i * x_num_cols + k] * y[k * x_num_cols + j];
+            }
+            z[i * z_num_cols + j] = sum;
+        }
+    }
+}
+#else
+// Use a single array for z, but still access it as if it were divided in blocks;
+extern "C" __global__ void matrix_matrix_mult_1(const float* x, const float* y, float* z, int x_num_rows, int x_num_cols, int y_num_cols, int x_offset, int y_offset) {
+    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < x_num_rows; i += blockDim.x * gridDim.x) {
+        for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < y_num_cols; j += blockDim.y * gridDim.y) {
+            float sum = 0;
+            for (int k = 0; k < x_num_cols; k++) {
+                // Y is partitioned in vertical bands, and accessed column-major.
+ // Here, it is still presented as a horizontal band; + sum += x[i * x_num_cols + k] * y[j * x_num_cols + k]; + } + z[(x_offset + i) * x_num_cols + (y_offset + j)] = sum; + } + } +} + +#define BLOCK_DIM 14 +// Better implementation, using shared memory to compute square tiles of z; +extern "C" __global__ void matrix_matrix_mult_2(const float* x, const float* y, float* z, int x_num_rows, int x_num_cols, int y_num_cols, int z_num_cols, int x_offset, int y_offset) { + + // int tile_size = BLOCK_DIM; + int tile_size = blockDim.x; + + // In the simplest implementation, each block computes a tile of the Z matrix, + // whose coordinates are given by blockIdx.x and blockIdx.y; + // Here, we allow each block to process more tiles, hence the loops below; + for(int z_block_i = blockIdx.x; z_block_i < (x_num_rows + tile_size - 1) / tile_size; z_block_i += gridDim.x) { + for(int z_block_j = blockIdx.y; z_block_j < (y_num_cols + tile_size - 1) / tile_size; z_block_j += gridDim.y) { + // Coordinate of the Z matrix element computed by this specific thread, with respect to the current tile; + int z_i = threadIdx.x; + int z_j = threadIdx.y; + // Coordinate of the Z matrix element computed by this specific thread, with respect to the overall Z matrix (not counting host-level data partitioning); + int i = z_block_i * blockDim.x + threadIdx.x; + int j = z_block_j * blockDim.y + threadIdx.y; + + // Value of the Z matrix block being computed by this specific thread; + float z_val_ij = 0; + + // Loop over the tiles in the same row (for X) and column (for Y) of the desired output tile in Z; + for (int curr_block_index = 0; curr_block_index < (x_num_cols + tile_size - 1) / tile_size; curr_block_index++) { + // Shared memory used to store the current tiles of X and Y; + extern __shared__ float tiles[]; + float *x_tile = tiles; + float *y_tile = tiles + tile_size * tile_size; + // __shared__ float x_tile[BLOCK_DIM][BLOCK_DIM]; + // __shared__ float y_tile[BLOCK_DIM][BLOCK_DIM]; + // Each 
thread in the block loads a value into the tile; + if ((i < x_num_rows) && (curr_block_index * tile_size + z_j < x_num_cols)) { + x_tile[z_i * tile_size + z_j] = x[x_num_cols * i + curr_block_index * tile_size + z_j]; + // x_tile[z_i][z_j] = x[x_num_cols * i + curr_block_index * tile_size + z_j]; + } else { + x_tile[z_i * tile_size + z_j] = 0; + // x_tile[z_i][z_j] = 0; + } + if ((j < y_num_cols) && (curr_block_index * tile_size + z_i < x_num_cols)) { + y_tile[z_i * tile_size + z_j] = y[x_num_cols * j + curr_block_index * tile_size + z_i]; + // y_tile[z_i][z_j] = y[x_num_cols * j + curr_block_index * tile_size + z_i]; + } else { + y_tile[z_i * tile_size + z_j] = 0; + // y_tile[z_i][z_j] = 0; + } + // Synchronize threads in the block, ensure the tile has been loaded; + __syncthreads(); + + // Multiply the i row and j column of the tile; + for (int k = 0; k < tile_size; k++) { + z_val_ij += x_tile[z_i * tile_size + k] * y_tile[k * tile_size + z_j]; + // z_val_ij += x_tile[z_i][k] * y_tile[k][z_j]; + } + + // Synchronize threads in the block, ensure the computation has finished before loading the next tile; + __syncthreads(); + } + // Write the output value into Z, keeping into account the offset of the current block; + if (((x_offset + i) < x_num_cols) & (y_offset + j < z_num_cols)) { + z[(x_offset + i) * x_num_cols + (y_offset + j)] = z_val_ij; + } + } + } +} +#endif + +////////////////////////////// +////////////////////////////// + +void Benchmark13M::alloc() { + S = (N + P - 1) / P; + PZ = P * P; + // X is partitioned by rows (horizontal bands), Y is partitioned by columns (vertical bands). + // Z is partitioned in square blocks. + // Data are copied into the GPU as row-major for X, and column-major for Y (i.e. 
Y GPU contains the transpose of Y CPU); + x_cpu = (float *) malloc(sizeof(float) * N * N); + y_cpu = (float *) malloc(sizeof(float) * N * N); + x = (float **) malloc(sizeof(float*) * P); + y = (float **) malloc(sizeof(float*) * P); + for (int i = 0; i < P; i++) { + err = cudaMallocManaged(&x[i], sizeof(float) * S * N); + err = cudaMallocManaged(&y[i], sizeof(float) * S * N); + } +#if PARTITION_Z + z = (float **) malloc(sizeof(float*) * PZ); + for (int i = 0; i < PZ; i++) { + err = cudaMallocManaged(&z[i], sizeof(float) * S * S); + } +#else + err = cudaMallocManaged(&z, sizeof(float) * N * N); +#endif + // Create P * P streams; +#if P2_STREAMS + s = (cudaStream_t *) malloc(sizeof(cudaStream_t) * P * P); + for (int p1 = 0; p1 < P; p1++) { + for (int p2 = 0; p2 < P; p2++) { + int p = p1 * P + p2; + cudaSetDevice(select_gpu(p, max_devices)); + err = cudaStreamCreate(&s[p]); + } + } +#else + s = (cudaStream_t *) malloc(sizeof(cudaStream_t) * P); + for (int p1 = 0; p1 < P; p1++) { + cudaSetDevice(select_gpu(p1, max_devices)); + err = cudaStreamCreate(&s[p1]); + } +#endif + +#if CPU_VALIDATION + z_cpu = (float*) malloc(sizeof(float) * N * N); +#endif +} + +void Benchmark13M::init() { + // X and Y contains the same data + for (int i = 0; i < N; i++) { + for (int j = 0; j < N; j++) { + x_cpu[i * N + j] = float(i * N + j) / (N * N); + y_cpu[i * N + j] = float(i * N + j) / (N * N); + } + } +} + +void Benchmark13M::reset() { + for (int i = 0; i < N; i++) { + for (int j = 0; j < N; j++) { + int p = i / S; + int s = (i * N + j) % (N * S); + x[p][s] = x_cpu[i * N + j]; + y[p][s] = y_cpu[j * N + i]; // Y is transposed, so the GPU matrix is column-major; + z[i * N + j] = 0; + } + } + // for (int i = 0; i < N; i++) { + // for (int j = 0; j < N; j++) { + // std::cout << "xcpu[" << i << "][" << j << "]=" << x_cpu[i * N + j] << ", "; + // } + // std::cout << std::endl; + // } + // for (int i = 0; i < N; i++) { + // for (int j = 0; j < N; j++) { + // std::cout << "ycpu[" << i << "][" << 
j << "]=" << y_cpu[i * N + j] << ", ";
+    //     }
+    //     std::cout << std::endl;
+    // }
+    // for (int i = 0; i < P; i++) {
+    //     for (int j = 0; j < S * N; j++) {
+    //         std::cout << "x[" << i << "][" << j << "]=" << x[i][j] << ", ";
+    //     }
+    //     std::cout << std::endl;
+    // }
+    // for (int i = 0; i < P; i++) {
+    //     for (int j = 0; j < S * N; j++) {
+    //         std::cout << "y[" << i << "][" << j << "]=" << y[i][j] << ", ";
+    //     }
+    //     std::cout << std::endl;
+    // }
+}
+
+void Benchmark13M::execute_sync(int iter) {
+    dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+    dim3 grid_size(num_blocks, num_blocks);
+    if (do_prefetch && pascalGpu) {
+        for (int p1 = 0; p1 < P; p1++) {
+            for (int p2 = 0; p2 < P; p2++) {
+                // int p = p1 * P + p2;
+                // cudaMemPrefetchAsync(z[p], sizeof(float) * S * S, 0, 0);
+                // Redundant prefetching in the sync implementation, but possibly necessary in multi-GPU;
+                cudaMemPrefetchAsync(x[p1], sizeof(float) * S * N, 0, 0);
+                cudaMemPrefetchAsync(y[p2], sizeof(float) * S * N, 0, 0);
+                cudaDeviceSynchronize();
+            }
+        }
+    }
+    cudaDeviceSynchronize();
+    for (int p1 = 0; p1 < P; p1++) {
+        for (int p2 = 0; p2 < P; p2++) {
+#if PARTITION_Z
+            matrix_matrix_mult_1<<<grid_size, block_size_2d_dim>>>(x[p1], y[p2], z[p1 * P + p2], std::min(S, N - p1 * S), N, std::min(S, N - p2 * S), S);
+#else
+            // matrix_matrix_mult_1<<<grid_size, block_size_2d_dim>>>(x[p1], y[p2], z, std::min(S, N - p1 * S), N, std::min(S, N - p2 * S), p1 * S, p2 * S);
+            // matrix_matrix_mult_2 needs dynamic shared memory for two tile_size * tile_size tiles;
+            matrix_matrix_mult_2<<<grid_size, block_size_2d_dim, 2 * block_size_2d * block_size_2d * sizeof(float)>>>(x[p1], y[p2], z, std::min(S, N - p1 * S), N, std::min(S, N - p2 * S), N, p1 * S, p2 * S);
+#endif
+            cudaDeviceSynchronize();
+        }
+    }
+}
+
+void Benchmark13M::execute_async(int iter) {
+    dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+    dim3 grid_size(num_blocks, num_blocks);
+
+    for (int p1 = 0; p1 < P; p1++) {
+#if !P2_STREAMS
+        if (pascalGpu && do_prefetch) {
+            cudaSetDevice(select_gpu(p1, max_devices));
+            cudaMemPrefetchAsync(x[p1], sizeof(float) * S * N, select_gpu(p1, max_devices), s[p1]);
+            if (p1 == 0) cudaMemPrefetchAsync(z, sizeof(float) * N * N, select_gpu(p1, max_devices), s[p1]);
+        }
+#endif
+        for (int p2 = 0; p2 < P; p2++) {
+#if P2_STREAMS
+            int p = p1 * P + p2;
+            cudaSetDevice(select_gpu(p, max_devices));
+    #if PARTITION_Z
+            matrix_matrix_mult_1<<<grid_size, block_size_2d_dim, 0, s[p]>>>(x[p1], y[p2], z[p], std::min(S, N - p1 * S), N, std::min(S, N - p2 * S), S);
+    #else
+            matrix_matrix_mult_2<<<grid_size, block_size_2d_dim, 2 * block_size_2d * block_size_2d * sizeof(float), s[p]>>>(x[p1], y[p2], z, std::min(S, N - p1 * S), N, std::min(S, N - p2 * S), N, p1 * S, p2 * S);
+    #endif
+#else
+            if (pascalGpu && do_prefetch && (p1 == 0)) {
+                cudaSetDevice(select_gpu(p1, max_devices));
+                cudaMemPrefetchAsync(y[p2], sizeof(float) * S * N, select_gpu(p1, max_devices), s[p2]);
+            }
+            cudaSetDevice(select_gpu(p1, max_devices));
+            matrix_matrix_mult_2<<<grid_size, block_size_2d_dim, 2 * block_size_2d * block_size_2d * sizeof(float), s[p1]>>>(x[p1], y[p2], z, std::min(S, N - p1 * S), N, std::min(S, N - p2 * S), N, p1 * S, p2 * S);
+#endif
+        }
+    }
+    // Synchronization;
+    for (int p1 = 0; p1 < P; p1++) {
+#if P2_STREAMS
+        for (int p2 = 0; p2 < P; p2++) {
+            int p = p1 * P + p2;
+            err = cudaStreamSynchronize(s[p]);
+        }
+#else
+        err = cudaStreamSynchronize(s[p1]);
+#endif
+    }
+}
+
+std::string Benchmark13M::print_result(bool short_form) {
+    if (short_form) {
+#if PARTITION_Z
+        return std::to_string(z[0][0]);
+#else
+        return std::to_string(z[0]);
+#endif
+    } else {
+        int old_precision = std::cout.precision();
+        std::cout.precision(2);
+        std::string res = "[\n";
+        for (int i = 0; i < std::min(30, N); i++) {
+            res += "[";
+            for (int j = 0; j < std::min(30, N); j++) {
+#if PARTITION_Z
+                int p1 = i / S;
+                int p2 = j / S;
+                res += std::to_string(z[p1 * P + p2][(i % S) * S + j % S]) + ", ";
+#else
+                res += std::to_string(z[i * N + j]) + ", ";
+#endif
+            }
+            res += "...]\n";
+        }
+        std::cout.precision(old_precision);
+        return res + "...]";
+    }
+}
+
+void Benchmark13M::cpu_validation(int iter) {
+#if CPU_VALIDATION
+    if (iter > 0) return;
+    int max_errors = 20;
+    int errors = 0;
+    for (int i = 0; i < N; i++) {
+        for (int j = 0; j < N; j++) {
+            float res = 0;
+            for (int k = 0; k < N; k++) {
+                res += x_cpu[i * N + k] * y_cpu[k * N + j];
+            }
+
z_cpu[i * N + j] = res; +#if !PARTITION_Z + if (std::abs(z[i * N + j] - z_cpu[i * N + j]) > 1e-3) { + if (errors < max_errors) std::cout << "error, z[" << i << "][" << j << "]=" << z[i * N + j] << ", cpu=" << z_cpu[i * N + j] << std::endl; + errors += 1; + } +#endif + } + } + std::cout << "errors=" << errors << std::endl; +#endif +} \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b13.cuh b/projects/resources/cuda/multi_gpu/b13.cuh new file mode 100644 index 00000000..da516573 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b13.cuh @@ -0,0 +1,65 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include "../benchmark.cuh" + +#define PARTITION_Z false +#define P2_STREAMS false + +class Benchmark13M : public Benchmark { + public: + Benchmark13M(Options &options) : Benchmark(options) { + P = num_partitions; + } + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + std::string print_result(bool short_form = false); + void cpu_validation(int iter); + + private: + int S; + int P, PZ; + + float *x_cpu, *y_cpu; + float **x, **y; +#if PARTITION_Z + float **z; +#else + float *z; +#endif + +#if CPU_VALIDATION + float *z_cpu; +#endif + cudaStream_t *s; +}; \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b5.cu b/projects/resources/cuda/multi_gpu/b5.cu new file mode 100644 index 00000000..1f4f1449 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b5.cu @@ -0,0 +1,154 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. 
+// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#include "b5.cuh" + +////////////////////////////// +////////////////////////////// + + +__device__ inline double +cndGPU_m(double d) { + const double A1 = 0.31938153f; + const double A2 = -0.356563782f; + const double A3 = 1.781477937f; + const double A4 = -1.821255978f; + const double A5 = 1.330274429f; + const double RSQRT2PI = 0.39894228040143267793994605993438f; + + double K = 1.0 / (1.0 + 0.2316419 * fabs(d)); + + double cnd = RSQRT2PI * exp(-0.5f * d * d) * + (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))); + + if (d > 0) + cnd = 1.0 - cnd; + + return cnd; +} + +extern "C" __global__ void +bs_m(const double *x, double *y, int N, double R, double V, double T, double K) { + double sqrtT = 1.0 / rsqrt(T); + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; + i += blockDim.x * gridDim.x) { + double expRT; + double d1, d2, CNDD1, CNDD2; + d1 = (log(x[i] / K) + (R + 0.5 * V * V) * T) / (V * sqrtT); + d2 = d1 - V * sqrtT; + + CNDD1 = cndGPU_m(d1); + CNDD2 = cndGPU_m(d2); + + // Calculate Call and Put simultaneously + expRT = exp(-R * T); + y[i] = x[i] * CNDD1 - K * expRT * CNDD2; + } +} + +////////////////////////////// +////////////////////////////// + +void Benchmark5M::alloc() { + x = (double**) malloc(sizeof(double *) * M); + y = (double**) malloc(sizeof(double *) * M); + tmp_x = (double*) malloc(sizeof(double) * N); + + for (int i = 0; i < M; i++) { + cudaMallocManaged(&x[i], sizeof(double) * N); + cudaMallocManaged(&y[i], sizeof(double) * N); + } +} + +void Benchmark5M::init() { + for (int j = 0; j < N; j++) { + tmp_x[j] = 60 - 0.5 + (double) rand() / RAND_MAX; + for (int i = 0; i < M; i++) { + x[i][j] = tmp_x[j]; + } + } + + s = (cudaStream_t*) malloc(sizeof(cudaStream_t) * M); + for (int i = 0; i < M; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + err = cudaStreamCreate(&s[i]); + } +} + +void Benchmark5M::reset() { + for (int i = 0; i < M; i++) { + for (int j = 0; j < N; j++) { + x[i][j] = tmp_x[j]; + } + } +} + +void 
Benchmark5M::execute_sync(int iter) {
+    for (int j = 0; j < M; j++) {
+        if (pascalGpu && do_prefetch) {
+            cudaMemPrefetchAsync(x[j], sizeof(double) * N, 0, 0);
+            cudaMemPrefetchAsync(y[j], sizeof(double) * N, 0, 0);
+        }
+        bs_m<<<num_blocks, block_size_1d>>>(x[j], y[j], N, R, V, T, K);
+        err = cudaDeviceSynchronize();
+    }
+}
+
+void Benchmark5M::execute_async(int iter) {
+    for (int j = 0; j < M; j++) {
+        int gpu = select_gpu(j, max_devices);
+        cudaSetDevice(gpu);
+        if (!pascalGpu || stream_attach) {
+            cudaStreamAttachMemAsync(s[j], x[j], sizeof(double) * N);
+            cudaStreamAttachMemAsync(s[j], y[j], sizeof(double) * N);
+        }
+        if (pascalGpu && do_prefetch) {
+            cudaMemPrefetchAsync(x[j], sizeof(double) * N, gpu, s[j]);
+            cudaMemPrefetchAsync(y[j], sizeof(double) * N, gpu, s[j]);
+        }
+        bs_m<<<num_blocks, block_size_1d, 0, s[j]>>>(x[j], y[j], N, R, V, T, K);
+    }
+
+    for (int j = 0; j < M; j++) {
+        err = cudaStreamSynchronize(s[j]);
+    }
+}
+
+std::string
+Benchmark5M::print_result(bool short_form) {
+    if (short_form) {
+        return std::to_string(y[0][0]);
+    } else {
+        std::string res = "[";
+        for (int j = 0; j < M; j++) {
+            res += std::to_string(y[j][0]) + ", ";
+        }
+        return res + ", ...]";
+    }
+}
\ No newline at end of file
diff --git a/projects/resources/cuda/multi_gpu/b5.cuh b/projects/resources/cuda/multi_gpu/b5.cuh
new file mode 100644
index 00000000..b38e063c
--- /dev/null
+++ b/projects/resources/cuda/multi_gpu/b5.cuh
@@ -0,0 +1,63 @@
+// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+// * Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+// * Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#pragma once +#include "../benchmark.cuh" + +class Benchmark5M : public Benchmark { + public: + Benchmark5M(Options &options) : Benchmark(options) { + graphs = std::vector(M); + graphExec = std::vector(M); + kernels = std::vector(M); + kernel_params = std::vector(M); + } + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + std::string print_result(bool short_form = false); + + private: + double R = 0.08; + double V = 0.3; + double T = 1.0; + double K = 60.0; + + int M = 24; + double **x, **y, *tmp_x; + cudaStream_t *s; + std::vector graphs; + std::vector graphExec; + + std::vector nodeDependencies; + std::vector kernels; + std::vector kernel_params; +}; \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b6.cu b/projects/resources/cuda/multi_gpu/b6.cu new file mode 100644 index 00000000..b33675db --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b6.cu @@ -0,0 +1,467 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+ +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include "b6.cuh" + +#define BLOCK_SIZE_V100 64 // Just a recommendation of optimal block size for the V100; +#define P 16 + +////////////////////////////// +////////////////////////////// + +extern "C" __global__ void nb_1_m(const int* x, const float* y, float* z, int n, int partition_rows, int n_feat, int n_classes, int partition_num) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < min(partition_rows, n - partition_num * partition_rows); i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_classes; j++) { + for (int q = 0; q < n_feat; q++) { + z[partition_num * partition_rows * n_classes + i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q]; + } + } + } +} + +extern "C" __global__ void nb_2_m(const float* x, float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float curr_max = x[i * n_col_x]; + for (int j = 0; j < n_col_x; j++) { + curr_max = fmaxf(curr_max, x[i * n_col_x + j]); + } + y[i] = curr_max; + } +} + +extern "C" __global__ void nb_3_m(const float* x, const float* y, float* z, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += 
blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < n_col_x; j++) { + sum += expf(x[i * n_col_x + j] - y[i]); + } + z[i] = logf(sum) + y[i]; + } +} + +extern "C" __global__ void nb_4_m(float* x, const float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] = expf(x[i * n_col_x + j] - y[i]); + } + } +} + +// extern "C" __global__ void rr_1_m(const int* x, float* sum, float *sum_squared, int n_row_x, int n_col_x) { +// for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n_col_x; j += blockDim.x * gridDim.x) { +// float feature_mean = 0; +// float sum_sq = 0; +// // Compute mean and variance; +// for (int i = 0; i < n_row_x; i++) { +// float x_tmp = x[j * n_row_x + i]; +// feature_mean += x_tmp; +// sum_sq += x_tmp * x_tmp; +// } +// sum[j] += feature_mean; +// sum_squared[j] += sum_sq; +// } +// } + +// extern "C" __global__ void rr_1_2_m(const int *x, float *y, const float* sum, const float *sum_squared, int n_row_x, int n_col_x, int partition, int partition_size) { +// // Normalize each row; +// for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < partition_size; i += blockDim.x * gridDim.x) { +// for (int j = 0; j < n_col_x; j++) { +// float mean = sum[j] / n_row_x; +// float std = sqrtf(sum_squared[j] / n_row_x - mean * mean); +// y[partition * partition_size * n_col_x + i * n_col_x + j] = (x[i * n_col_x + j] - mean) / std; +// } +// } +// } + +extern "C" __global__ void rr_1_m(const int* x, float* mean, float *std, int n_row_x, int n_col_x, int partition, int partition_size) { + for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n_col_x; j += blockDim.x * gridDim.x) { + float feature_mean = 0; + float sum_sq = 0; + // Compute mean and variance; + for (int i = 0; i < partition_size; i++) { + float x_tmp = x[j * partition_size + i]; + feature_mean += x_tmp; + sum_sq += x_tmp * x_tmp; + } + // 
feature_mean /= n_row_x; + // std[j] = sqrtf(sum_sq / n_row_x - feature_mean * feature_mean); + // mean[j] = feature_mean; + + // Keep just the sum and squared sum, compute mean and std later; + mean[j] += feature_mean; + std[j] += sum_sq; + } +} + +// extern "C" __global__ void rr_1_m(const int* x, float* mean, float *std, int n_row_x, int n_col_x, int partition, int partition_size) { +// extern __shared__ volatile float scratch[]; +// if (threadIdx.x == 0) { +// for (int k = 0; k < blockDim.x; k++) { +// scratch[k] = 0; +// } +// } +// __syncthreads(); +// for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < partition_size; i += blockDim.x * gridDim.x) { +// // Compute sum and sum of squares for mean and variance; +// for (int j = 0; j < n_col_x; j++) { +// float x_tmp = x[i * n_col_x + j]; +// scratch[threadIdx.x] = x_tmp; +// // We read blockDim.x values at the same time, let the first thread do the reduction; +// __syncthreads(); +// if (threadIdx.x == 0) { +// float mean_tmp = 0; +// float std_tmp = 0; +// for (int k = 0; k < blockDim.x; k++) { +// mean_tmp += scratch[k]; +// std_tmp += scratch[k] * scratch[k]; +// } +// mean[j] += mean_tmp; +// std[j] += std_tmp; +// } +// } +// } +// } + +extern "C" __global__ void rr_1_1_m(float* mean, float *std, const float *mean_curr, const float *std_curr, int n_row_x, int n_col_x, int partition_index, int partition_size) { + // We use partition 0 to accumulate, so skip it; + if (partition_index == 0) return; + + // Aggregate mean and std from different partitions; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_col_x; i += blockDim.x * gridDim.x) { + mean[i] += mean_curr[i]; + std[i] += std_curr[i]; + // When processing the last partition, compute the final mean and std; + if (partition_index == P - 1) { + mean[i] /= n_row_x; + std[i] = sqrtf(std[i] / n_row_x - mean[i] * mean[i]); + } + } +} + +extern "C" __global__ void rr_1_2_m(const int *x, float *y, const float* mean, const float *std, int n_row_x, 
int n_col_x, int partition_size) { + // Normalize each row; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < partition_size; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + // if (i == 0) printf("[i=%d][j=%d] mean=%f std=%f\n", i, j, mean[j], mean[j] * mean[j]); + // float mean_curr = mean[j] / n_row_x; + // float std_curr = sqrtf(std[j] / n_row_x - mean_curr * mean_curr); + float mean_curr = mean[j]; + float std_curr = std[j]; + y[i * n_col_x + j] = (x[i * n_col_x + j] - mean_curr) / std_curr; + } + } +} + +extern "C" __global__ void rr_2_m(const float* x, const float* y, float* z, int n, int partition_rows, int n_feat, int n_classes, int partition_num) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < min(partition_rows, n - partition_num * partition_rows); i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_classes; j++) { + for (int q = 0; q < n_feat; q++) { + z[partition_num * partition_rows * n_classes + i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q]; + } + } + } +} + +extern "C" __global__ void rr_3_m(float* x, const float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] += y[j]; + } + } +} + +extern "C" __global__ void softmax_m(float* x, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float row_exp_sum = 0; + for (int j = 0; j < n_col_x; j++) { + row_exp_sum += expf(x[i * n_col_x + j]); + } + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] = expf(x[i * n_col_x + j]) / row_exp_sum; + } + } +} + +extern "C" __global__ void argmax_m(const float* x, const float* y, int* z, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + int curr_best_index = 0; + float curr_best = x[i * n_col_x] + y[i * n_col_x]; + 
for (int j = 0; j < n_col_x; j++) { + float curr = x[i * n_col_x + j] + y[i * n_col_x + j]; + if (curr > curr_best) { + curr_best = curr; + curr_best_index = j; + } + } + z[i] = curr_best_index; + } +} + +////////////////////////////// +////////////////////////////// + +void Benchmark6M::alloc() { + S = (N + P - 1) / P; + x = (int**) malloc(sizeof(int*) * P); + for (int i = 0; i < P; i++) { + err = cudaMallocManaged(&x[i], sizeof(int) * S * num_features); + } + err = cudaMallocManaged(&x_o, sizeof(int) * N * num_features); + z = (float**) malloc(sizeof(float*) * P); + for (int i = 0; i < P; i++) { + err = cudaMallocManaged(&z[i], sizeof(float) * S * num_features); + } + mean = (float**) malloc(sizeof(float*) * P); + std = (float**) malloc(sizeof(float*) * P); + for (int i = 0; i < P; i++) { + err = cudaMallocManaged(&mean[i], sizeof(float) * num_features); + err = cudaMallocManaged(&std[i], sizeof(float) * num_features); + } + err = cudaMallocManaged(&nb_feat_log_prob, sizeof(float) * num_classes * num_features); + err = cudaMallocManaged(&nb_class_log_prior, sizeof(float) * num_classes); + err = cudaMallocManaged(&ridge_coeff, sizeof(float) * num_classes * num_features); + err = cudaMallocManaged(&ridge_intercept, sizeof(float) * num_classes); + err = cudaMallocManaged(&nb_amax, sizeof(float) * N); + err = cudaMallocManaged(&nb_l, sizeof(float) * N); + err = cudaMallocManaged(&r1, sizeof(float) * N * num_classes); + err = cudaMallocManaged(&r2, sizeof(float) * N * num_classes); + err = cudaMallocManaged(&r, sizeof(int) * N); + + // Stream 0; + int gpu = select_gpu(0, max_devices); + cudaSetDevice(gpu); + err = cudaStreamCreate(&s1); + // Stream 1; + gpu = select_gpu(1, max_devices); + cudaSetDevice(gpu); + err = cudaStreamCreate(&s2); + // Other streams; + s_n = (cudaStream_t *) malloc(sizeof(cudaStream_t) * P); + s_r = (cudaStream_t *) malloc(sizeof(cudaStream_t) * P); + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + err = 
cudaStreamCreate(&s_n[i]); + err = cudaStreamCreate(&s_r[i]); + } +} + +void Benchmark6M::init() { + for (int i = 0; i < num_classes; i++) { + for (int j = 0; j < num_features; j++) { + nb_feat_log_prob[i * num_features + j] = (float)(rand()) / (float)(RAND_MAX); + ridge_coeff[i * num_features + j] = (float)(rand()) / (float)(RAND_MAX); + } + nb_class_log_prior[i] = (float)(rand()) / (float)(RAND_MAX); + ridge_intercept[i] = (float)(rand()) / (float)(RAND_MAX); + } + for (int i = 0; i < N; i++) { + for (int j = 0; j < num_features; j++) { + x_o[i * num_features + j] = rand() % max_occurrence_of_ngram; + } + for (int j = 0; j < num_classes; j++) { + r1[i * num_classes + j] = nb_class_log_prior[j]; + r2[i * num_classes + j] = 0; + } + } + for (int p = 0; p < P; p++) { + for (int i = 0; i < S; i++) { + for (int j = 0; j < num_features; j++) { + int index = p * S * num_features + i * num_features + j; + if (index < N * num_features) { + x[p][i * num_features + j] = x_o[index]; + } else { + x[p][i * num_features + j] = 0; + } + } + } + } +} + +void Benchmark6M::reset() { + for (int i = 0; i < N; i++) { + for (int j = 0; j < num_classes; j++) { + r1[i * num_classes + j] = nb_class_log_prior[j]; + r2[i * num_classes + j] = 0; + } + } + for (int p = 0; p < P; p++) { + for (int i = 0; i < num_features; i++) { + mean[p][i] = 0; + std[p][i] = 0; + } + } +} + +void Benchmark6M::execute_sync(int iter) { + if (do_prefetch && pascalGpu) { + cudaMemPrefetchAsync(r1, sizeof(float) * N * num_classes, 0, 0); + cudaMemPrefetchAsync(r2, sizeof(float) * N * num_classes, 0, 0); + cudaMemPrefetchAsync(r, sizeof(int) * N, 0, 0); + } + for (int i = 0; i < P; i++) { + rr_1_m<<<num_blocks, block_size_1d>>>(x[i], mean[i], std[i], N, num_features, i, S); + cudaDeviceSynchronize(); + } + for (int i = 0; i < P; i++) { + rr_1_1_m<<<num_blocks, block_size_1d>>>(mean[0], std[0], mean[i], std[i], N, num_features, i, S); + cudaDeviceSynchronize(); + } + for (int i = 0; i < P; i++) { + rr_1_2_m<<<num_blocks, block_size_1d>>>(x[i], z[i], mean[0], std[0], N, num_features, S); + 
cudaDeviceSynchronize(); + rr_2_m<<<num_blocks, block_size_1d>>>(z[i], ridge_coeff, r2, N, S, num_features, num_classes, i); + cudaDeviceSynchronize(); + } + rr_3_m<<<num_blocks, block_size_1d>>>(r2, ridge_intercept, N, num_classes); + cudaDeviceSynchronize(); + softmax_m<<<num_blocks, block_size_1d>>>(r2, N, num_classes); + cudaDeviceSynchronize(); + + for (int i = 0; i < P; i++) { + nb_1_m<<<num_blocks, block_size_1d>>>(x[i], nb_feat_log_prob, r1, N, S, num_features, num_classes, i); + cudaDeviceSynchronize(); + } + nb_2_m<<<num_blocks, block_size_1d>>>(r1, nb_amax, N, num_classes); + cudaDeviceSynchronize(); + nb_3_m<<<num_blocks, block_size_1d>>>(r1, nb_amax, nb_l, N, num_classes); + cudaDeviceSynchronize(); + nb_4_m<<<num_blocks, block_size_1d>>>(r1, nb_l, N, num_classes); + cudaDeviceSynchronize(); + softmax_m<<<num_blocks, block_size_1d>>>(r1, N, num_classes); + cudaDeviceSynchronize(); + + argmax_m<<<num_blocks, block_size_1d>>>(r1, r2, r, N, num_classes); + cudaDeviceSynchronize(); +} + +void Benchmark6M::execute_async(int iter) { + + // RR; + int gpu = select_gpu(0, max_devices); + cudaSetDevice(gpu); + if (!pascalGpu || stream_attach) { + cudaStreamAttachMemAsync(s1, ridge_coeff, 0); + cudaStreamAttachMemAsync(s1, r2, 0); + cudaStreamAttachMemAsync(s1, ridge_intercept, 0); + } + if (do_prefetch && pascalGpu) { + cudaMemPrefetchAsync(r2, sizeof(float) * N * num_classes, gpu, s1); + cudaMemPrefetchAsync(r, sizeof(int) * N, gpu, s1); + } + cudaEvent_t e_r0[P]; + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + rr_1_m<<<num_blocks, block_size_1d, 0, s_r[i]>>>(x[i], mean[i], std[i], N, num_features, i, S); + cudaEventCreate(&e_r0[i]); + cudaEventRecord(e_r0[i], s_r[i]); + } + cudaSetDevice(select_gpu(0, max_devices)); + for (int i = 0; i < P; i++) { + cudaStreamWaitEvent(s1, e_r0[i], 0); + rr_1_1_m<<<num_blocks, block_size_1d, 0, s1>>>(mean[0], std[0], mean[i], std[i], N, num_features, i, S); + } + cudaEvent_t e1; + cudaEventCreate(&e1); + cudaEventRecord(e1, s1); + cudaEvent_t e_r1[P]; + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + cudaStreamWaitEvent(s_r[i], e1, 0); + rr_1_2_m<<<num_blocks, block_size_1d, 0, s_r[i]>>>(x[i], z[i], mean[0], std[0], N, num_features, S); + rr_2_m<<<num_blocks, block_size_1d, 0, s_r[i]>>>(z[i], ridge_coeff, r2, N, S, num_features, num_classes, i); + 
cudaEventCreate(&e_r1[i]); + cudaEventRecord(e_r1[i], s_r[i]); + } + // Stream 1 waits all other streams; + cudaSetDevice(gpu); + for (int i = 0; i < P; i++) { + cudaStreamWaitEvent(s1, e_r1[i], 0); + } + rr_3_m<<<num_blocks, block_size_1d, 0, s1>>>(r2, ridge_intercept, N, num_classes); + softmax_m<<<num_blocks, block_size_1d, 0, s1>>>(r2, N, num_classes); + + // NB; + gpu = select_gpu(1, max_devices); + cudaSetDevice(gpu); + if (!pascalGpu || stream_attach) { + cudaStreamAttachMemAsync(s2, nb_feat_log_prob, 0); + cudaStreamAttachMemAsync(s2, r1, 0); + cudaStreamAttachMemAsync(s2, nb_amax, 0); + cudaStreamAttachMemAsync(s2, nb_l, 0); + } + if (do_prefetch && pascalGpu) { + cudaMemPrefetchAsync(r1, sizeof(float) * N * num_classes, gpu, s2); + } + cudaEvent_t e2; + cudaEventCreate(&e2); + cudaEventRecord(e2, s2); + cudaEvent_t e_n[P]; + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + cudaStreamWaitEvent(s_n[i], e2, 0); + nb_1_m<<<num_blocks, block_size_1d, 0, s_n[i]>>>(x[i], nb_feat_log_prob, r1, N, S, num_features, num_classes, i); + cudaEventCreate(&e_n[i]); + cudaEventRecord(e_n[i], s_n[i]); + } + // Stream 2 waits all other streams; + cudaSetDevice(gpu); + for (int i = 0; i < P; i++) { + cudaStreamWaitEvent(s2, e_n[i], 0); + } + nb_2_m<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_amax, N, num_classes); + nb_3_m<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_amax, nb_l, N, num_classes); + nb_4_m<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_l, N, num_classes); + softmax_m<<<num_blocks, block_size_1d, 0, s2>>>(r1, N, num_classes); + + // Stream 1 waits stream 2; + cudaEvent_t e3; + cudaEventCreate(&e3); + cudaEventRecord(e3, s2); + cudaSetDevice(select_gpu(0, max_devices)); + cudaStreamWaitEvent(s1, e3, 0); + argmax_m<<<num_blocks, block_size_1d, 0, s1>>>(r1, r2, r, N, num_classes); + cudaDeviceSynchronize(); +} + +std::string Benchmark6M::print_result(bool short_form) { + if (short_form) { + return std::to_string(r[0]); + } else { + std::string res = "["; + for (int j = 0; j < std::min(10, N); j++) { + res += std::to_string(r[j]) + ", "; + } + + int sum = 0; + for (int j = 0; j < N; j++) { + sum += r[j]; + } + return res + "...], sum=" + std::to_string(sum); + } +} \ No newline at end of file diff --git
a/projects/resources/cuda/multi_gpu/b6.cuh b/projects/resources/cuda/multi_gpu/b6.cuh new file mode 100644 index 00000000..55846ba4 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b6.cuh @@ -0,0 +1,57 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#pragma once +#include "../benchmark.cuh" + +class Benchmark6M : public Benchmark { + public: + Benchmark6M(Options &options) : Benchmark(options) {} + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + std::string print_result(bool short_form = false); + + private: + int S; + int num_features = 1024; + int num_classes = 16; + int max_occurrence_of_ngram = 10; + int **x; + int *x_o; + float **z; + float **mean, **std; + float *nb_feat_log_prob, *nb_class_log_prior, *ridge_coeff, *ridge_intercept, *nb_amax, *nb_l, *r1, *r2; + int *r; + cudaStream_t s1, s2; + cudaStream_t *s_r; + cudaStream_t *s_n; +}; \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b9.cu b/projects/resources/cuda/multi_gpu/b9.cu new file mode 100644 index 00000000..c6fe174e --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b9.cu @@ -0,0 +1,407 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+ +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include "b9.cuh" + +////////////////////////////// +////////////////////////////// + +#define BLOCK_SIZE_V100 64 // Just a recommendation of optimal block size for the V100; +#define P 16 +#define ITER 50 +#define EPS 1e-12 + +// Add a small epsilon to the main diagonal: +extern "C" __global__ void precondition(float *A, int n, int m, int offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < m; i += blockDim.x * gridDim.x) { + A[i * n + i + offset] += EPS; + } +} + +// z = x @ y; +extern "C" __global__ void matrix_vector_mult(const float* x, const float* y, float* z, int n, int m, int z_offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < m; j++) { + sum += x[i * m + j] * y[j]; + } + z[z_offset + i] = sum; + } +} + +// z := w + alpha * A @ y; +extern "C" __global__ void matrix_vector_mult_axpy(const float* x, const float* y, const float *w, const float alpha, float* z, int n, int m, int z_offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < m; j++) { + sum += x[i * m + j] * y[j]; + } + z[z_offset + i] = 
alpha * sum + w[z_offset + i]; + } +} + +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +// z = <x, x>; +extern "C" __global__ void l2_norm(const float *x, float* z, int N, int offset) { + int warp_size = 32; + float sum = 0; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + float x_tmp = x[i + offset]; + sum += x_tmp * x_tmp; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +// z = <x, y>; +extern "C" __global__ void dot(const float *x, const float *y, float* z, int N, int offset) { + int warp_size = 32; + float sum = 0; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i + offset] * y[i + offset]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +// y = val + alpha * x; +extern "C" __global__ void saxpy(float* y, const float *val, const float *x, float alpha, int n, int offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i + offset] = val[i + offset] + alpha * x[i + offset]; + } +} + +// Simply copy array x into y; +extern "C" __global__ void cpy(float *y, const float *x, int n, int offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i + offset] = x[i + offset]; + } +} + +////////////////////////////// +////////////////////////////// + +void Benchmark9M::alloc() { + S = (N + P - 1) / P; + A = (float **)
malloc(sizeof(float*) * P); + + for (int i = 0; i < P; i++) { + err = cudaMallocManaged(&A[i], sizeof(float) * S * N); + } + err = cudaMallocManaged(&x, sizeof(float) * N); + err = cudaMallocManaged(&b, sizeof(float) * N); + err = cudaMallocManaged(&p, sizeof(float) * N); + err = cudaMallocManaged(&r, sizeof(float) * N); + err = cudaMallocManaged(&y, sizeof(float) * N); + + err = cudaMallocManaged(&t1, sizeof(float)); + err = cudaMallocManaged(&t2, sizeof(float)); + // Other implementation; + // t1 = (float **) malloc(sizeof(float*) * P); + // t2 = (float **) malloc(sizeof(float*) * P); + // for (int i = 0; i < P; i++) { + // err = cudaMallocManaged(&t1[i], sizeof(float)); + // err = cudaMallocManaged(&t2[i], sizeof(float)); + // } + + // Create streams; s1 and s2 are class members, do not re-declare them locally; + err = cudaStreamCreate(&s1); + err = cudaStreamCreate(&s2); + // Create P streams; + s = (cudaStream_t *) malloc(sizeof(cudaStream_t) * P); + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + err = cudaStreamCreate(&s[i]); + } + cudaSetDevice(select_gpu(0, max_devices)); +} + +void Benchmark9M::init() { + // Random symmetric invertible input matrix; + float max = float(RAND_MAX); + + for (int i = 0; i < N; i++) { + int p = i / S; + for (int j = i; j < N; j++) { + float val = (float(rand()) / max) * 2 - 1; + A[p][(i % S) * N + j] = val; + // A[j / S][(j % S) * N + i] = val; + } + // A[p][(i % S) * N + i] += 10e-12; + } + + // Random input b; + for (int i = 0; i < N; i++) { + b[i] = (float(rand()) / max) * 2 - 1; + } +} + +void Benchmark9M::reset() { + // Default init of solution x; + for (int i = 0; i < N; i++) { + x[i] = 1.0; + } + // Reset norms; + *t1 = 0.0; + *t2 = 0.0; + // Other implementation; + // t1_tot = 0.0; + // t2_tot = 0.0; + // for (int i = 0; i < P; i++) { + // t1[i][0] = 0; + // t2[i][0] = 0; + // } +} + +void Benchmark9M::execute_sync(int iter) { + + if (pascalGpu && do_prefetch) { + for (int i = 0; i < P; i++) { + cudaMemPrefetchAsync(A[i],
sizeof(float) * S * N, 0); + } + cudaMemPrefetchAsync(x, sizeof(float) * N, 0); + cudaMemPrefetchAsync(b, sizeof(float) * N, 0); + cudaMemPrefetchAsync(r, sizeof(float) * N, 0); + cudaMemPrefetchAsync(p, sizeof(float) * N, 0); + } + + for (int i = 0; i < P; i++) { + precondition<<<num_blocks, block_size_1d>>>(A[i], N, std::min(S, N - i * S), i * S); + matrix_vector_mult_axpy<<<num_blocks, block_size_1d>>>(A[i], x, b, -1, r, std::min(S, N - i * S), N, i * S); + cudaDeviceSynchronize(); + } + cpy<<<num_blocks, block_size_1d>>>(p, r, N, 0); + cudaDeviceSynchronize(); + l2_norm<<<num_blocks, block_size_1d>>>(r, t1, N, 0); + cudaDeviceSynchronize(); + for (int iter = 0; iter < ITER; iter++) { + for (int i = 0; i < P; i++) { + matrix_vector_mult<<<num_blocks, block_size_1d>>>(A[i], p, y, std::min(S, N - i * S), N, i * S); + cudaDeviceSynchronize(); + } + dot<<<num_blocks, block_size_1d>>>(p, y, t2, N, 0); + cudaDeviceSynchronize(); + float alpha = *t1 / *t2; + float old_t1 = *t1; + *t1 = 0.0; + *t2 = 0.0; + saxpy<<<num_blocks, block_size_1d>>>(x, x, p, alpha, N, 0); + cudaDeviceSynchronize(); + saxpy<<<num_blocks, block_size_1d>>>(r, r, y, -1.0 * alpha, N, 0); + cudaDeviceSynchronize(); + l2_norm<<<num_blocks, block_size_1d>>>(r, t1, N, 0); + cudaDeviceSynchronize(); + float beta = *t1 / old_t1; + saxpy<<<num_blocks, block_size_1d>>>(p, r, p, beta, N, 0); + cudaDeviceSynchronize(); + } + cudaDeviceSynchronize(); + + // Other implementation; + // for (int i = 0; i < P; i++) { + // matrix_vector_mult_axpy<<<num_blocks, block_size_1d>>>(A[i], x, b, -1, r, std::min(S, N - i * S), N, i * S); + // cudaDeviceSynchronize(); + // cpy<<<num_blocks, block_size_1d>>>(p, r, std::min(S, N - i * S), i * S); + // cudaDeviceSynchronize(); + // l2_norm<<<num_blocks, block_size_1d>>>(r, t1[i], std::min(S, N - i * S), i * S); + // cudaDeviceSynchronize(); + // t1_tot += t1[i][0]; + // } + // for (int iter = 0; iter < ITER; iter++) { + // for (int i = 0; i < P; i++) { + // matrix_vector_mult<<<num_blocks, block_size_1d>>>(A[i], p, y, std::min(S, N - i * S), N, i * S); + // cudaDeviceSynchronize(); + // dot<<<num_blocks, block_size_1d>>>(p, y, t2[i], std::min(S, N - i * S), i * S); + // cudaDeviceSynchronize(); + // t2_tot += t2[i][0]; + // } + // float alpha = t1_tot / t2_tot; + // float old_t1 = t1_tot; + // t1_tot = 0.0; + // t2_tot = 0.0; + // for (int i = 0; i < P; i++) { + // 
saxpy<<<num_blocks, block_size_1d>>>(x, x, p, alpha, std::min(S, N - i * S), i * S); + // cudaDeviceSynchronize(); + // saxpy<<<num_blocks, block_size_1d>>>(r, r, y, -1.0 * alpha, std::min(S, N - i * S), i * S); + // cudaDeviceSynchronize(); + // t1[i][0] = 0; + // l2_norm<<<num_blocks, block_size_1d>>>(r, t1[i], std::min(S, N - i * S), i * S); + // cudaDeviceSynchronize(); + // t1_tot += t1[i][0]; + // } + // float beta = t1_tot / old_t1; + // for (int i = 0; i < P; i++) { + // saxpy<<<num_blocks, block_size_1d>>>(p, r, p, beta, std::min(S, N - i * S), i * S); + // cudaDeviceSynchronize(); + // t2[i][0] = 0; + // } + // } + // cudaDeviceSynchronize(); +} + +void Benchmark9M::execute_async(int iter) { + if (pascalGpu && do_prefetch) { + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + cudaMemPrefetchAsync(A[i], sizeof(float) * S * N, 0, s[i]); + } + cudaSetDevice(select_gpu(0, max_devices)); + cudaMemPrefetchAsync(x, sizeof(float) * N, 0, s1); + cudaMemPrefetchAsync(b, sizeof(float) * N, 0, s1); + cudaMemPrefetchAsync(r, sizeof(float) * N, 0, s1); + cudaMemPrefetchAsync(p, sizeof(float) * N, 0, s1); + } + + cudaEvent_t e[P]; + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + precondition<<<num_blocks, block_size_1d, 0, s[i]>>>(A[i], N, std::min(S, N - i * S), i * S); + matrix_vector_mult_axpy<<<num_blocks, block_size_1d, 0, s[i]>>>(A[i], x, b, -1, r, std::min(S, N - i * S), N, i * S); + cudaEventCreate(&e[i]); + cudaEventRecord(e[i], s[i]); + } + cudaSetDevice(select_gpu(0, max_devices)); + for (int i = 0; i < P; i++) { + cudaStreamWaitEvent(s1, e[i], 0); + } + cudaEvent_t e_c; + cudaEventCreate(&e_c); + cpy<<<num_blocks, block_size_1d, 0, s1>>>(p, r, N, 0); + cudaEventRecord(e_c, s1); + for (int i = 0; i < P; i++) { + cudaStreamWaitEvent(s2, e[i], 0); + } + l2_norm<<<num_blocks, block_size_1d, 0, s2>>>(r, t1, N, 0); + for (int iter = 0; iter < ITER; iter++) { + cudaEvent_t e2[P]; + for (int i = 0; i < P; i++) { + cudaSetDevice(select_gpu(i, max_devices)); + cudaStreamWaitEvent(s[i], e_c, 0); + matrix_vector_mult<<<num_blocks, block_size_1d, 0, s[i]>>>(A[i], p, y, std::min(S, N - i * S), N, i * S); + cudaEventCreate(&e2[i]); + cudaEventRecord(e2[i], s[i]); + } + 
cudaSetDevice(select_gpu(0, max_devices)); + for (int i = 0; i < P; i++) { + cudaStreamWaitEvent(s1, e2[i], 0); + } + dot<<<num_blocks, block_size_1d, 0, s1>>>(p, y, t2, N, 0); + cudaStreamSynchronize(s1); + cudaStreamSynchronize(s2); + float alpha = *t1 / *t2; + float old_t1 = *t1; + *t1 = 0.0; + *t2 = 0.0; + saxpy<<<num_blocks, block_size_1d, 0, s1>>>(x, x, p, alpha, N, 0); + saxpy<<<num_blocks, block_size_1d, 0, s2>>>(r, r, y, -1.0 * alpha, N, 0); + l2_norm<<<num_blocks, block_size_1d, 0, s2>>>(r, t1, N, 0); + cudaStreamSynchronize(s2); + float beta = *t1 / old_t1; + saxpy<<<num_blocks, block_size_1d, 0, s1>>>(p, r, p, beta, N, 0); + } + cudaStreamSynchronize(s1); + + // Other implementation; + // for (int i = 0; i < P; i++) { + // cudaSetDevice(select_gpu(i, max_devices)); + // matrix_vector_mult_axpy<<<num_blocks, block_size_1d, 0, s[i]>>>(A[i], x, b, -1, r, std::min(S, N - i * S), N, i * S); + // cpy<<<num_blocks, block_size_1d, 0, s[i]>>>(p, r, std::min(S, N - i * S), i * S); + // l2_norm<<<num_blocks, block_size_1d, 0, s[i]>>>(r, t1[i], std::min(S, N - i * S), i * S); + // } + // for (int i = 0; i < P; i++) { + // cudaStreamSynchronize(s[i]); + // t1_tot += t1[i][0]; + // } + + // for (int iter = 0; iter < ITER; iter++) { + // for (int i = 0; i < P; i++) { + // cudaSetDevice(select_gpu(i, max_devices)); + // matrix_vector_mult<<<num_blocks, block_size_1d, 0, s[i]>>>(A[i], p, y, std::min(S, N - i * S), N, i * S); + // dot<<<num_blocks, block_size_1d, 0, s[i]>>>(p, y, t2[i], std::min(S, N - i * S), i * S); + // } + // for (int i = 0; i < P; i++) { + // cudaStreamSynchronize(s[i]); + // t2_tot += t2[i][0]; + // } + // float alpha = t1_tot / t2_tot; + // float old_t1 = t1_tot; + // t1_tot = 0.0; + // t2_tot = 0.0; + // for (int i = 0; i < P; i++) { + // cudaSetDevice(select_gpu(i, max_devices)); + // t1[i][0] = 0; + // saxpy<<<num_blocks, block_size_1d, 0, s[i]>>>(x, x, p, alpha, std::min(S, N - i * S), i * S); + // saxpy<<<num_blocks, block_size_1d, 0, s[i]>>>(r, r, y, -1.0 * alpha, std::min(S, N - i * S), i * S); + // l2_norm<<<num_blocks, block_size_1d, 0, s[i]>>>(r, t1[i], std::min(S, N - i * S), i * S); + // } + // for (int i = 0; i < P; i++) { + // cudaStreamSynchronize(s[i]); + // t1_tot += t1[i][0]; + // } + // float beta = t1_tot / old_t1; + // for (int i = 0; i < P; i++) { + // cudaSetDevice(select_gpu(i, max_devices)); + // saxpy<<<num_blocks, block_size_1d, 0, s[i]>>>(p, r, p, beta, std::min(S, N - i * S), i * S); + // } + // for 
(int i = 0; i < P; i++) { + // cudaStreamSynchronize(s[i]); + // t2[i][0] = 0; + // } + // } +} + +std::string Benchmark9M::print_result(bool short_form) { + if (short_form) { + return std::to_string(x[0]); + } else { + for (int i = 0; i < P; i++) { + matrix_vector_mult_axpy<<<num_blocks, block_size_1d>>>(A[i], x, b, -1, y, std::min(S, N - i * S), N, i * S); + } + cudaDeviceSynchronize(); + std::string res = "["; + for (int j = 0; j < std::min(10, N); j++) { + res += std::to_string(y[j]) + ", "; + } + + float sum = 0; + for (int j = 0; j < N; j++) { + sum += y[j]; + } + return res + "...], sum=" + std::to_string(sum); + } +} \ No newline at end of file diff --git a/projects/resources/cuda/multi_gpu/b9.cuh b/projects/resources/cuda/multi_gpu/b9.cuh new file mode 100644 index 00000000..33da1964 --- /dev/null +++ b/projects/resources/cuda/multi_gpu/b9.cuh @@ -0,0 +1,57 @@ +// Copyright (c) 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+ +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include "../benchmark.cuh" + +class Benchmark9M : public Benchmark { + public: + Benchmark9M(Options &options) : Benchmark(options) {} + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + std::string print_result(bool short_form = false); + + private: + int S; + + float **A; + float *x, *y, *r, *p, *b; + + float *t1, *t2; + // Other implementation; + // float **t1, **t2; + // float t1_tot = 0; + // float t2_tot = 0; + + cudaStream_t s1, s2; + cudaStream_t *s; +}; \ No newline at end of file diff --git a/projects/resources/cuda/options.hpp b/projects/resources/cuda/options.hpp new file mode 100644 index 00000000..abd3694e --- /dev/null +++ b/projects/resources/cuda/options.hpp @@ -0,0 +1,242 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. 
+// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#pragma once + +#include <getopt.h> + +#include <cstdlib> +#include <map> +#include <string> + +#include "utils.hpp" + +////////////////////////////// +////////////////////////////// + +#define CPU_VALIDATION false +#define DEBUG false +#define NUM_ITER 30 +#define DEFAULT_BLOCK_SIZE_1D 32 +#define DEFAULT_BLOCK_SIZE_2D 8 +#define DEFAULT_NUM_BLOCKS 64 +#define DEFAULT_SKIP 3 +#define DEFAULT_POLICY "async" +#define DEFAULT_PREFETCH false +#define DEFAULT_STREAM_ATTACH false +#define DEFAULT_MAX_DEVICES 1 +#define DEFAULT_NVPROF false +// In some benchmarks, allow the computation to be split into an arbitrary number of partitions; +#define DEFAULT_NUM_PARTITIONS 16 + +////////////////////////////// +////////////////////////////// + +enum Policy { + Sync, + Async, + CudaGraph, + CudaGraphAsync, + CudaGraphSingle +}; + +enum BenchmarkEnum { + B1, + B5, + B6, + B7, + B8, + B10, + B1M, + B5M, + B6M, + B9M, + B11M, + B12M, + B13M, + ERR +}; + +////////////////////////////// +////////////////////////////// + +inline Policy get_policy(std::string policy) { + if (policy == "sync") + return Policy::Sync; + else if (policy == "cudagraph") + return Policy::CudaGraph; + else if (policy == "cudagraphmanual") + return Policy::CudaGraphAsync; + else if (policy == "cudagraphsingle") + return Policy::CudaGraphSingle; + else + return Policy::Async; +} + +inline BenchmarkEnum get_benchmark(std::string benchmark) { + if (benchmark == "b1") + return BenchmarkEnum::B1; + else if (benchmark == "b5") + return BenchmarkEnum::B5; + else if (benchmark == "b6") + return BenchmarkEnum::B6; + else if (benchmark == "b7") + return BenchmarkEnum::B7; + else if (benchmark == "b8") + return BenchmarkEnum::B8; + else if (benchmark == "b10") + return BenchmarkEnum::B10; + else if (benchmark == "b1m") + return BenchmarkEnum::B1M; + else if (benchmark == "b5m") + return BenchmarkEnum::B5M; + else if (benchmark == "b6m") + return BenchmarkEnum::B6M; + else if (benchmark == "b9m") + return BenchmarkEnum::B9M; + else if (benchmark == 
"b11m") + return BenchmarkEnum::B11M; + else if (benchmark == "b12m") + return BenchmarkEnum::B12M; + else if (benchmark == "b13m") + return BenchmarkEnum::B13M; + else + return BenchmarkEnum::ERR; +} + +struct Options { + // Testing options; + uint num_iter = NUM_ITER; + int debug = DEBUG; + int block_size_1d = DEFAULT_BLOCK_SIZE_1D; + int block_size_2d = DEFAULT_BLOCK_SIZE_2D; + int num_blocks = DEFAULT_NUM_BLOCKS; + int N = 0; + int max_devices = DEFAULT_MAX_DEVICES; + int skip_iterations = DEFAULT_SKIP; + bool prefetch = DEFAULT_PREFETCH; + bool stream_attach = DEFAULT_STREAM_ATTACH; + bool nvprof = DEFAULT_NVPROF; + int num_partitions = DEFAULT_NUM_PARTITIONS; + BenchmarkEnum benchmark_choice = BenchmarkEnum::ERR; + Policy policy_choice = get_policy(DEFAULT_POLICY); + + // Used for printing; + std::map<BenchmarkEnum, std::string> benchmark_map; + std::map<Policy, std::string> policy_map; + + ////////////////////////////// + ////////////////////////////// + + Options(int argc, char *argv[]) { + map_init(policy_map)(Policy::Sync, "sync")(Policy::Async, "async")(Policy::CudaGraph, "cudagraph")(Policy::CudaGraphAsync, "cudagraphmanual")(Policy::CudaGraphSingle, "cudagraphsingle"); + map_init(benchmark_map) + (BenchmarkEnum::B1, "b1 - VEC") + (BenchmarkEnum::B5, "b5 - B&S") + (BenchmarkEnum::B6, "b6 - ML") + (BenchmarkEnum::B7, "b7 - HITS") + (BenchmarkEnum::B8, "b8 - IMG") + (BenchmarkEnum::B10, "b10 - DL") + (BenchmarkEnum::B1M, "b1m - VEC") + (BenchmarkEnum::B5M, "b5m - B&S") + (BenchmarkEnum::B6M, "b6m - ML") + (BenchmarkEnum::B9M, "b9m - CG") + (BenchmarkEnum::B11M, "b11m - MUL") + (BenchmarkEnum::B12M, "b12m - LANCZOS") + (BenchmarkEnum::B13M, "b13m - MMUL"); + + int opt; + static struct option long_options[] = {{"debug", no_argument, 0, 'd'}, + {"num_iter", required_argument, 0, 't'}, + {"N", required_argument, 0, 'n'}, + {"block_size_1d", required_argument, 0, 'b'}, + {"block_size_2d", required_argument, 0, 'c'}, + {"num_blocks", required_argument, 0, 'g'}, + {"skip_first", required_argument, 0, 's'}, + 
{"benchmark", required_argument, 0, 'k'}, + {"policy", required_argument, 0, 'p'}, + {"prefetch", no_argument, 0, 'r'}, + {"attach", no_argument, 0, 'a'}, + {"max_devices", required_argument, 0, 'm'}, + {"nvprof", no_argument, 0, 'v'}, + {"partitions", required_argument, 0, 'P'}, + {0, 0, 0, 0}}; + // getopt_long stores the option index here; + int option_index = 0; + + while ((opt = getopt_long(argc, argv, "dt:n:b:c:g:s:k:p:ram:vP:", long_options, &option_index)) != EOF) { + switch (opt) { + case 'd': + debug = true; + break; + case 't': + num_iter = atoi(optarg); + break; + case 'n': + N = atoi(optarg); + break; + case 'b': + block_size_1d = atoi(optarg); + break; + case 'c': + block_size_2d = atoi(optarg); + break; + case 'g': + num_blocks = atoi(optarg); + break; + case 's': + skip_iterations = atoi(optarg); + break; + case 'k': + benchmark_choice = get_benchmark(optarg); + break; + case 'p': + policy_choice = get_policy(optarg); + break; + case 'r': + prefetch = true; + break; + case 'a': + stream_attach = true; + break; + case 'm': + max_devices = atoi(optarg); + break; + case 'v': + nvprof = true; + break; + case 'P': + num_partitions = atoi(optarg); + break; + default: + break; + } + } + } +}; \ No newline at end of file diff --git a/projects/resources/cuda/run_nvprof.sh b/projects/resources/cuda/run_nvprof.sh new file mode 100755 index 00000000..c014a000 --- /dev/null +++ b/projects/resources/cuda/run_nvprof.sh @@ -0,0 +1,16 @@ +/usr/local/cuda/bin/nsys nvprof --csv -o b1m --profile-from-start off bin/b -m 8 -d -v -t 30 -k b1m -n 950000000 -b 32 -g 64 +/usr/local/cuda/bin/nsys nvprof --csv -o b5m --profile-from-start off bin/b -m 8 -d -v -t 30 -k b5m -n 35000000 -b 1024 -g 64 +/usr/local/cuda/bin/nsys nvprof --csv -o b6m --profile-from-start off bin/b -m 8 -d -v -t 30 -k b6m -n 1800000 -b 32 -g 64 +/usr/local/cuda/bin/nsys nvprof --csv -o b9m --profile-from-start off bin/b -m 8 -d -v -t 30 -k b9m -n 60000 -b 32 -g 64 +/usr/local/cuda/bin/nsys nvprof --csv 
-o b6m_4 --profile-from-start off bin/b -m 4 -d -v -t 30 -k b6m -n 1800000 -b 32 -g 64 +/usr/local/cuda/bin/nsys nvprof --csv -o b9m_4 --profile-from-start off bin/b -m 4 -d -v -t 30 -k b9m -n 60000 -b 32 -g 64 +/usr/local/cuda/bin/nsys nvprof --csv -o b11m --profile-from-start off bin/b -m 8 -d -v -t 30 -k b11m -n 60000 -b 256 -g 64 + +/usr/local/cuda/bin/nsys stats --report gputrace --format csv b1m.sqlite -o b1m +/usr/local/cuda/bin/nsys stats --report gputrace --format csv b5m.sqlite -o b5m +/usr/local/cuda/bin/nsys stats --report gputrace --format csv b6m.sqlite -o b6m +/usr/local/cuda/bin/nsys stats --report gputrace --format csv b9m.sqlite -o b9m +/usr/local/cuda/bin/nsys stats --report gputrace --format csv b11m.sqlite -o b11m +/usr/local/cuda/bin/nsys stats --report gputrace --format csv b6m_4.sqlite -o b6m_4 +/usr/local/cuda/bin/nsys stats --report gputrace --format csv b9m_4.sqlite -o b9m_4 + diff --git a/projects/resources/cuda/run_partitions.sh b/projects/resources/cuda/run_partitions.sh new file mode 100755 index 00000000..96b16fde --- /dev/null +++ b/projects/resources/cuda/run_partitions.sh @@ -0,0 +1,39 @@ +# # A100 +# DATE=2021_11_02 +# N=20000 +# DIR=../../../grcuda-data/results/scheduling_multi_gpu/A100/${DATE}_partition_scaling +# mkdir -p ${DIR} +# for g in 1 2 4 8 +# do +# echo "start ${g} gpu" +# for p in 1 2 4 6 8 10 12 16 20 24 28 32 +# do +# echo "${p} partition" +# bin/b -k b11m -n ${N} -P ${p} -t 10 -p async -r -g 32 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_32_r.csv +# bin/b -k b11m -n ${N} -P ${p} -t 10 -p async -r -g 64 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_64_r.csv +# bin/b -k b11m -n ${N} -P ${p} -t 10 -p async -r -g 128 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_128_r.csv +# bin/b -k b11m -n ${N} -P ${p} -t 10 -p async -g 32 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_32.csv +# bin/b -k b11m -n ${N} -P ${p} -t 10 -p async -g 64 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_64.csv +# bin/b -k b11m -n ${N} -P ${p} -t 10 -p async -g 128 
-c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_128.csv +# done +# done + +# A100 +DATE=2021_11_02 +N=8192 +DIR=../../../grcuda-data/results/scheduling_multi_gpu/A100/${DATE}_partition_scaling_m13 +mkdir -p ${DIR} +for g in 1 2 4 +do + echo "start ${g} gpu" + for p in 1 2 4 6 8 10 12 16 20 24 28 32 + do + echo "${p} partition" + bin/b -k b13m -n ${N} -P ${p} -t 5 -p async -r -g 6 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_6_r.csv + bin/b -k b13m -n ${N} -P ${p} -t 5 -p async -r -g 12 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_12_r.csv + bin/b -k b13m -n ${N} -P ${p} -t 5 -p async -r -g 24 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_24_r.csv + bin/b -k b13m -n ${N} -P ${p} -t 5 -p async -g 6 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_6.csv + bin/b -k b13m -n ${N} -P ${p} -t 5 -p async -g 12 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_12.csv + bin/b -k b13m -n ${N} -P ${p} -t 5 -p async -g 24 -c 32 -m ${g} | tee ${DIR}/${N}_${g}_${p}_24.csv + done +done diff --git a/projects/resources/cuda/single_gpu/b1.cu b/projects/resources/cuda/single_gpu/b1.cu new file mode 100644 index 00000000..bfb806f9 --- /dev/null +++ b/projects/resources/cuda/single_gpu/b1.cu @@ -0,0 +1,230 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. 
+// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include "b1.cuh" + +////////////////////////////// +////////////////////////////// + +__global__ void square(const float *x, float *y, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + // float tmp = x[i]; + // float sum = 0; + // for (int j = 0; j < 4; j++) { + // sum += tmp + j; + // } + + y[i] = x[i] * x[i]; // tmp + tmp * tmp / 2 + tmp * tmp * tmp / 6; + } +} + +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +// __device__ float atomicAddDouble(float* address, float val) { +// unsigned long long int* address_as_ull = (unsigned long long int*) address; +// unsigned long long int old = *address_as_ull, assumed; +// do { +// assumed = old; +// old = atomicCAS(address_as_ull, assumed, __float_as_longlong(val + __longlong_as_float(assumed))); +// } while (assumed != old); +// return 
__longlong_as_float(old);
+// }
+
+__global__ void reduce(const float *x, const float *y, float *z, int N) {
+    int warp_size = 32;
+    float sum = float(0);
+    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) {
+        sum += x[i] - y[i];
+    }
+    sum = warp_reduce(sum);                    // Obtain the sum of values in the current warp;
+    if ((threadIdx.x & (warp_size - 1)) == 0)  // Same as (threadIdx.x % warp_size) == 0 but faster
+        atomicAdd(z, sum);                     // The first thread in the warp updates the output;
+}
+
+void Benchmark1::prefetch(cudaStream_t &s1, cudaStream_t &s2) {
+    if (pascalGpu) {
+        cudaMemPrefetchAsync(x, sizeof(float) * N, 0, s1);
+        cudaMemPrefetchAsync(y, sizeof(float) * N, 0, s2);
+        cudaMemPrefetchAsync(x1, sizeof(float) * N, 0, s1);
+        cudaMemPrefetchAsync(y1, sizeof(float) * N, 0, s2);
+    }
+}
+
+//////////////////////////////
+//////////////////////////////
+
+void Benchmark1::alloc() {
+    err = cudaMallocManaged(&x, sizeof(float) * N);
+    err = cudaMallocManaged(&y, sizeof(float) * N);
+    err = cudaMallocManaged(&x1, sizeof(float) * N);
+    err = cudaMallocManaged(&y1, sizeof(float) * N);
+    err = cudaMallocManaged(&res, sizeof(float));
+
+    err = cudaStreamCreate(&s1);
+    err = cudaStreamCreate(&s2);
+}
+
+void Benchmark1::init() {
+    for (int i = 0; i < N; i++) {
+        x[i] = 1.0 / (i + 1);
+        y[i] = 2.0 / (i + 1);
+    }
+}
+
+void Benchmark1::reset() {
+    for (int i = 0; i < N; i++) {
+        x[i] = 1.0 / (i + 1);
+        y[i] = 2.0 / (i + 1);
+    }
+    res[0] = 0.0;
+}
+
+void Benchmark1::execute_sync(int iter) {
+    if (do_prefetch && pascalGpu) {
+        cudaMemPrefetchAsync(x, sizeof(float) * N, 0, 0);
+        cudaMemPrefetchAsync(x1, sizeof(float) * N, 0, 0);
+        cudaMemPrefetchAsync(y, sizeof(float) * N, 0, 0);
+        cudaMemPrefetchAsync(y1, sizeof(float) * N, 0, 0);
+        cudaMemPrefetchAsync(res, sizeof(float), 0, 0);
+    }
+
+    square<<<num_blocks, block_size_1d>>>(x, x1, N);
+    err = cudaDeviceSynchronize();
+    square<<<num_blocks, block_size_1d>>>(y, y1, N);
+    err = cudaDeviceSynchronize();
+    reduce<<<num_blocks, block_size_1d>>>(x1, y1, res, N);
+    err = cudaDeviceSynchronize();
+}
+
+void Benchmark1::execute_async(int iter) {
+    if (!pascalGpu || stream_attach) {
+        cudaStreamAttachMemAsync(s1, x, sizeof(float) * N);
+        cudaStreamAttachMemAsync(s1, x1, sizeof(float) * N);
+        cudaStreamAttachMemAsync(s2, y, sizeof(float) * N);
+        cudaStreamAttachMemAsync(s2, y1, sizeof(float) * N);
+    }
+    if (pascalGpu && do_prefetch) {
+        cudaMemPrefetchAsync(x, sizeof(float) * N, 0, s1);
+        cudaMemPrefetchAsync(x1, sizeof(float) * N, 0, s1);
+        cudaMemPrefetchAsync(y, sizeof(float) * N, 0, s2);
+        cudaMemPrefetchAsync(y1, sizeof(float) * N, 0, s2);
+        cudaMemPrefetchAsync(res, sizeof(float), 0, s1);
+    }
+
+    square<<<num_blocks, block_size_1d, 0, s1>>>(x, x1, N);
+    square<<<num_blocks, block_size_1d, 0, s2>>>(y, y1, N);
+
+    // Stream 1 waits stream 2;
+    cudaEvent_t e1;
+    cudaEventCreate(&e1);
+    cudaEventRecord(e1, s2);
+    cudaStreamWaitEvent(s1, e1, 0);
+
+    reduce<<<num_blocks, block_size_1d, 0, s1>>>(x1, y1, res, N);
+    cudaStreamSynchronize(s1);
+}
+
+void Benchmark1::execute_cudagraph(int iter) {
+    if (iter == 0) {
+        cudaEvent_t ef;
+        cudaEventCreate(&ef);
+        cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
+        cudaEventRecord(ef, s1);
+        cudaStreamWaitEvent(s2, ef, 0);
+
+        // prefetch(s1, s2);
+
+        square<<<num_blocks, block_size_1d, 0, s1>>>(x, x1, N);
+        square<<<num_blocks, block_size_1d, 0, s2>>>(y, y1, N);
+        // Stream 1 waits stream 2;
+        cudaEvent_t e1;
+        cudaEventCreate(&e1);
+        cudaEventRecord(e1, s2);
+        cudaStreamWaitEvent(s1, e1, 0);
+        reduce<<<num_blocks, block_size_1d, 0, s1>>>(x1, y1, res, N);
+
+        cudaStreamEndCapture(s1, &graph);
+        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
+    }
+    cudaGraphLaunch(graphExec, s1);
+    err = cudaStreamSynchronize(s1);
+}
+
+void Benchmark1::execute_cudagraph_manual(int iter) {
+    if (iter == 0) {
+        cudaGraphCreate(&graph, 0);
+        void *kernel_1_args[3] = {(void *)&x, (void *)&x1, &N};
+        void *kernel_2_args[3] = {(void *)&y, (void *)&y1, &N};
+        void *kernel_3_args[4] = {(void *)&x1, (void *)&y1, (void *)&res, &N};
+
+        dim3 tb(block_size_1d);
+        dim3 bs(num_blocks);
+
+        // square<<<bs, tb>>>(x, x1, N);
+        add_node(kernel_1_args, kernel_1_params, (void *)square, bs, tb, graph, &kernel_1, nodeDependencies);
+
+        // square<<<bs, tb>>>(y, y1, N);
+        add_node(kernel_2_args, kernel_2_params, (void *)square, bs, tb, graph, &kernel_2, nodeDependencies);
+
+        // reduce<<<bs, tb>>>(x1, y1, res, N);
+        nodeDependencies.push_back(kernel_1);
+        nodeDependencies.push_back(kernel_2);
+        add_node(kernel_3_args, kernel_3_params, (void *)reduce, bs, tb, graph, &kernel_3, nodeDependencies);
+
+        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
+    }
+    cudaGraphLaunch(graphExec, s1);
+    err = cudaStreamSynchronize(s1);
+}
+
+void Benchmark1::execute_cudagraph_single(int iter) {
+    if (iter == 0) {
+        cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
+
+        // prefetch(s1, s1);
+
+        square<<<num_blocks, block_size_1d, 0, s1>>>(x, x1, N);
+        square<<<num_blocks, block_size_1d, 0, s1>>>(y, y1, N);
+        reduce<<<num_blocks, block_size_1d, 0, s1>>>(x1, y1, res, N);
+
+        cudaStreamEndCapture(s1, &graph);
+        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
+    }
+    cudaGraphLaunch(graphExec, s1);
+    err = cudaStreamSynchronize(s1);
+}
+
+std::string Benchmark1::print_result(bool short_form) {
+    return std::to_string(res[0]);
+}
\ No newline at end of file
diff --git a/projects/resources/cuda/single_gpu/b1.cuh b/projects/resources/cuda/single_gpu/b1.cuh
new file mode 100644
index 00000000..d4e89033
--- /dev/null
+++ b/projects/resources/cuda/single_gpu/b1.cuh
@@ -0,0 +1,57 @@
+// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+//  * Redistributions of source code must retain the above copyright
+//    notice, this list of conditions and the following disclaimer.
+//  * Redistributions in binary form must reproduce the above copyright
+//    notice, this list of conditions and the following disclaimer in the
+//    documentation and/or other materials provided with the distribution.
+// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+
+#pragma once
+#include "../benchmark.cuh"
+
+class Benchmark1 : public Benchmark {
+   public:
+    Benchmark1(Options &options) : Benchmark(options) {}
+    void alloc();
+    void init();
+    void reset();
+    void execute_sync(int iter);
+    void execute_async(int iter);
+    void execute_cudagraph(int iter);
+    void execute_cudagraph_manual(int iter);
+    void execute_cudagraph_single(int iter);
+    void prefetch(cudaStream_t &s1, cudaStream_t &s2);
+    std::string print_result(bool short_form = false);
+
+   private:
+    float *x, *y, *x1, *y1, *res;
+    cudaStream_t s1, s2;
+    cudaGraph_t graph;
+    cudaGraphExec_t graphExec;
+    std::vector<cudaGraphNode_t> nodeDependencies;
+    cudaGraphNode_t kernel_1, kernel_2, kernel_3;
+    cudaKernelNodeParams kernel_1_params;
+    cudaKernelNodeParams kernel_2_params;
+    cudaKernelNodeParams kernel_3_params;
+};
\ No newline at end of file
diff --git a/projects/resources/cuda/single_gpu/b10.cu b/projects/resources/cuda/single_gpu/b10.cu
new file mode 100644
index 00000000..6f5c98f6
--- /dev/null
+++ b/projects/resources/cuda/single_gpu/b10.cu
@@ -0,0 +1,465 @@
+// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+//  * Redistributions of source code must retain the above copyright
+//    notice, this list of conditions and the following disclaimer.
+//  * Redistributions in binary form must reproduce the above copyright
+//    notice, this list of conditions and the following disclaimer in the
+//    documentation and/or other materials provided with the distribution.
+//  * Neither the name of NECSTLab nor the names of its
+//    contributors may be used to endorse or promote products derived
+//    from this software without specific prior written permission.
+// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include "b10.cuh" + +////////////////////////////// +////////////////////////////// + +#define NUM_THREADS_PER_BLOCK_2D 8 +#define NUM_THREADS_PER_BLOCK 32 +#define WARP_SIZE 32 +#define NUM_BLOCKS 16 + +extern "C" __global__ void conv2d(float *out, float *x, float *kernels, int N, int M, int L, int K, int k_out, int stride) { + extern __shared__ float kernel_local[]; + int radius = K / 2; + + for (int m = 0; m < k_out; m++) { + for (int i = threadIdx.x; i < K; i += blockDim.x) { + for (int j = threadIdx.y; j < K; j += blockDim.y) { + for (int l = 0; l < L; l++) { + kernel_local[l + L * (j + K * (i + K * m))] = kernels[l + L * (j + K * (i + K * m))]; + } + } + } + } + __syncthreads(); + + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (int)ceilf((float)N / stride) - radius; i += blockDim.x * gridDim.x) { + int out_index = M * i / stride; + for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < (int)ceilf((float)M / stride) - radius; j += blockDim.y * gridDim.y) { + for (int m = 0; 
m < k_out; m++) { + // for (int m = blockIdx.z * blockDim.z + threadIdx.z; m < k_out; m += blockDim.z * gridDim.z) { + float res = 0; + int i_f = i * stride + radius; + int j_f = j * stride + radius; + for (int k_i = -radius; k_i <= radius; k_i++) { + for (int k_j = -radius; k_j <= radius; k_j++) { + int kernel_index = (k_j + radius + K * (k_i + radius + K * m)); + for (int l = 0; l < L; l++) { + int ni = i_f + k_i; + int nj = j_f + k_j; + res += kernel_local[l + L * kernel_index] * x[((ni * M) + nj) * L + l]; + } + } + } + // Apply ReLU operator; + out[m + k_out * (j + out_index)] = max(res, 0.0); + } + } + } +} + +extern "C" __global__ void mean_pooling(float *out, float *x, int N, int M, int L, int K, int stride) { + int radius = K / 2; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (int)ceilf((float)N / stride) - radius; i += blockDim.x * gridDim.x) { + int out_index = M * i / stride; + int i_f = i * stride + radius; + for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < (int)ceilf((float)M / stride) - radius; j += blockDim.y * gridDim.y) { + int j_f = j * stride + radius; + for (int l = blockIdx.z * blockDim.z + threadIdx.z; l < L; l += blockDim.z * gridDim.z) { + float res = 0; + for (int k_i = -radius; k_i <= radius; k_i++) { + int ni = i_f + k_i; + for (int k_j = -radius; k_j <= radius; k_j++) { + int nj = j_f + k_j; + res += x[((ni * M) + nj) * L + l]; + } + } + // Apply mean operator; + out[l + L * (j + out_index)] = res / (K * K); + } + } + } +} + +extern "C" __global__ void gap(float *out, float *x, int N, int M, int L) { + extern __shared__ float out_local[]; + for (int i = threadIdx.x; i < L; i += blockDim.x) { + out_local[i] = 0; + } + __syncthreads(); + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < M; j += blockDim.y * gridDim.y) { + for (int l = 0; l < L; l++) { + atomicAdd(out_local + l, x[l + L * (j + M * i)]); + } + } + } + 
__syncthreads(); + for (int l = threadIdx.x; l < L; l += blockDim.x) { + atomicAdd(out + l, out_local[l] / (M * N)); + } +} + +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +extern "C" __global__ void dot_product(const float *x, const float *y, float *z, int N) { + int warp_size = 32; + float sum = float(0); + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] * y[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void concat(float *z, const float *x, const float *y, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + z[i] = x[i]; + z[i + n] = y[i]; + } +} + +// inline void reset(float *x, float *y, float *x_cpu, float *y_cpu, int N, float *res) { +// for (int i = 0; i < N; i++) { +// x[i] = x_cpu[i]; +// y[i] = y_cpu[i]; +// } +// *res = 0; +// } + +////////////////////////////// +////////////////////////////// + +void Benchmark10::alloc() { + x_cpu = (float *)malloc(sizeof(float) * N * N * channels); + y_cpu = (float *)malloc(sizeof(float) * N * N * channels); + x_len = N * N * channels; + x1_len = (N / stride) * (N / stride) * kn1; + pooled_len = x1_len / (pooling_diameter * pooling_diameter); + x2_len = ((N / stride) / pooling_diameter / stride) * ((N / stride) / pooling_diameter / stride) * kn2; + x3_len = kn2; + + err = cudaMallocManaged(&x, sizeof(float) * x_len); + err = cudaMallocManaged(&x1, sizeof(float) * x1_len); + err = cudaMallocManaged(&x2, sizeof(float) * x2_len); + err = cudaMallocManaged(&x3, sizeof(float) * x3_len); + + err = cudaMallocManaged(&y, sizeof(float) 
* x_len); + err = cudaMallocManaged(&y1, sizeof(float) * x1_len); + err = cudaMallocManaged(&y2, sizeof(float) * x2_len); + err = cudaMallocManaged(&y3, sizeof(float) * x3_len); + + k1_len = channels * K * K * kn1; + k2_len = kn1 * K * K * kn2; + err = cudaMallocManaged(&kernel_1, sizeof(float) * k1_len); + err = cudaMallocManaged(&kernel_2, sizeof(float) * k2_len); + err = cudaMallocManaged(&kernel_3, sizeof(float) * k1_len); + err = cudaMallocManaged(&kernel_4, sizeof(float) * k2_len); + + z_len = 2 * x2_len; + err = cudaMallocManaged(&z, sizeof(float) * z_len); + err = cudaMallocManaged(&dense_weights, sizeof(float) * z_len); + err = cudaMallocManaged(&res, sizeof(float)); + + err = cudaMallocManaged(&x11, sizeof(float) * pooled_len); + err = cudaMallocManaged(&y11, sizeof(float) * pooled_len); + + err = cudaStreamCreate(&s1); + err = cudaStreamCreate(&s2); +} + +void Benchmark10::init() { + for (int i = 0; i < x_len; i++) { + x_cpu[i] = (float)(rand()) / (float)(RAND_MAX); + y_cpu[i] = (float)(rand()) / (float)(RAND_MAX); + } + for (int i = 0; i < k1_len; i++) { + kernel_1[i] = ((float)(rand()) / (float)(RAND_MAX)) * 2 - 1; + kernel_3[i] = ((float)(rand()) / (float)(RAND_MAX)) * 2 - 1; + } + for (int i = 0; i < k2_len; i++) { + kernel_2[i] = ((float)(rand()) / (float)(RAND_MAX)) * 2 - 1; + kernel_4[i] = ((float)(rand()) / (float)(RAND_MAX)) * 2 - 1; + } + + for (int i = 0; i < z_len; i++) { + dense_weights[i] = (((float)(rand()) / (float)(RAND_MAX)) * 2 - 1) / z_len; + } +} + +void Benchmark10::reset() { + for (int i = 0; i < x_len; i++) { + x[i] = x_cpu[i]; + y[i] = y_cpu[i]; + } + *res = 0; +} + +void Benchmark10::execute_sync(int iter) { + dim3 block_size_2d_dim(block_size_2d, block_size_2d); + dim3 grid_size(num_blocks, num_blocks); + dim3 grid_size_2(num_blocks / 2, num_blocks / 2); + + dim3 block_size_3d_dim(block_size_2d / 2, block_size_2d / 2, block_size_2d / 2); + dim3 grid_size_3(num_blocks / 2, num_blocks / 2, num_blocks / 2); + + if (do_prefetch && 
pascalGpu) {
+        cudaMemPrefetchAsync(x, sizeof(float) * x_len, 0, 0);
+        cudaMemPrefetchAsync(y, sizeof(float) * x_len, 0, 0);
+    }
+
+    conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * channels * sizeof(float)>>>(x1, x, kernel_1, N, N, channels, K, kn1, stride);
+    cudaDeviceSynchronize();
+    conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * channels * sizeof(float)>>>(y1, y, kernel_3, N, N, channels, K, kn1, stride);
+    cudaDeviceSynchronize();
+
+    mean_pooling<<<grid_size_3, block_size_3d_dim>>>(x11, x1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter);
+    cudaDeviceSynchronize();
+    mean_pooling<<<grid_size_3, block_size_3d_dim>>>(y11, y1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter);
+    cudaDeviceSynchronize();
+
+    conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float)>>>(x2, x11, kernel_2, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride);
+    cudaDeviceSynchronize();
+    conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float)>>>(y2, y11, kernel_4, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride);
+    cudaDeviceSynchronize();
+
+    // conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float)>>>(x2, x1, kernel_2, N / stride, N / stride, kn1, K, kn2, stride);
+    // cudaDeviceSynchronize();
+    // conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float)>>>(y2, y1, kernel_4, N / stride, N / stride, kn1, K, kn2, stride);
+    // cudaDeviceSynchronize();
+
+    // gap<<<grid_size, block_size_2d_dim, kn2 * sizeof(float)>>>(x3, x2, N / (stride * stride), N / (stride * stride), kn2);
+    // cudaDeviceSynchronize();
+    // gap<<<grid_size, block_size_2d_dim, kn2 * sizeof(float)>>>(y3, y2, N / (stride * stride), N / (stride * stride), kn2);
+    // cudaDeviceSynchronize();
+
+    concat<<<num_blocks, block_size_1d>>>(z, x2, y2, x2_len);
+    cudaDeviceSynchronize();
+
+    dot_product<<<num_blocks, block_size_1d>>>(z, dense_weights, res, x2_len);
+    cudaDeviceSynchronize();
+}
+
+void Benchmark10::execute_async(int iter) {
+    if (!pascalGpu || stream_attach) {
+        cudaStreamAttachMemAsync(s1, x, sizeof(float) * x_len);
+        cudaStreamAttachMemAsync(s1, x1, 0);
+        cudaStreamAttachMemAsync(s1, x2, 0);
+        // cudaStreamAttachMemAsync(s1, x3, 0);
+        cudaStreamAttachMemAsync(s1, kernel_1, 0);
+        cudaStreamAttachMemAsync(s1, kernel_2, 0);
+
+        cudaStreamAttachMemAsync(s2, y, sizeof(float) * x_len);
+        cudaStreamAttachMemAsync(s2, y1, 0);
+        // cudaStreamAttachMemAsync(s2, y2, 0);
+        // cudaStreamAttachMemAsync(s2, y3, 0);
+        cudaStreamAttachMemAsync(s2, kernel_3, 0);
+        cudaStreamAttachMemAsync(s2, kernel_4, 0);
+    }
+    if (do_prefetch && pascalGpu) {
+        cudaMemPrefetchAsync(x, sizeof(float) * x_len, 0, 0);
+        cudaMemPrefetchAsync(y, sizeof(float) * x_len, 0, 0);
+    }
+    dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+    dim3 grid_size(num_blocks, num_blocks);
+    dim3 grid_size_2(num_blocks / 2, num_blocks / 2);
+
+    dim3 block_size_3d_dim(block_size_2d / 2, block_size_2d / 2, block_size_2d / 2);
+    dim3 grid_size_3(num_blocks / 2, num_blocks / 2, num_blocks / 2);
+
+    conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * channels * sizeof(float), s1>>>(x1, x, kernel_1, N, N, channels, K, kn1, stride);
+    conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * channels * sizeof(float), s2>>>(y1, y, kernel_3, N, N, channels, K, kn1, stride);
+
+    mean_pooling<<<grid_size_3, block_size_3d_dim, 0, s1>>>(x11, x1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter);
+    mean_pooling<<<grid_size_3, block_size_3d_dim, 0, s2>>>(y11, y1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter);
+
+    conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s1>>>(x2, x11, kernel_2, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride);
+    conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s2>>>(y2, y11, kernel_4, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride);
+
+    // conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s1>>>(x2, x1, kernel_2, N / stride, N / stride, kn1, K, kn2, stride);
+    // conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s2>>>(y2, y1, kernel_4, N / stride, N / stride, kn1, K, kn2, stride);
+
+    // gap<<<grid_size, block_size_2d_dim, kn2 * sizeof(float), s1>>>(x3, x2, N / (stride * stride), N / (stride * stride), kn2);
+    // gap<<<grid_size, block_size_2d_dim, kn2 * sizeof(float), s2>>>(y3, y2, N / (stride * stride), N / (stride * stride), kn2);
+
+    cudaEvent_t e1;
+    cudaEventCreate(&e1);
+    cudaEventRecord(e1, s2);
+    cudaStreamWaitEvent(s1, e1, 0);
+
+    concat<<<num_blocks, block_size_1d, 0, s1>>>(z, x2, y2, x2_len);
+
+    dot_product<<<num_blocks, block_size_1d, 0, s1>>>(z, dense_weights, res, x2_len);
+    cudaStreamSynchronize(s1);
+}
+
+void Benchmark10::execute_cudagraph(int iter) {
+    if (iter == 0) {
+        cudaEvent_t ef;
+        cudaEventCreate(&ef);
+        cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
+        cudaEventRecord(ef, s1);
+        cudaStreamWaitEvent(s2, ef, 0);
+
+        dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+        dim3 grid_size(num_blocks, num_blocks);
+        dim3 grid_size_2(num_blocks / 2, num_blocks / 2);
+
+        dim3 block_size_3d_dim(block_size_2d / 2, block_size_2d / 2, block_size_2d / 2);
+        dim3 grid_size_3(num_blocks / 2, num_blocks / 2, num_blocks / 2);
+
+        conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * channels * sizeof(float), s1>>>(x1, x, kernel_1, N, N, channels, K, kn1, stride);
+        conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * channels * sizeof(float), s2>>>(y1, y, kernel_3, N, N, channels, K, kn1, stride);
+
+        mean_pooling<<<grid_size_3, block_size_3d_dim, 0, s1>>>(x11, x1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter);
+        mean_pooling<<<grid_size_3, block_size_3d_dim, 0, s2>>>(y11, y1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter);
+
+        conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s1>>>(x2, x11, kernel_2, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride);
+        conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s2>>>(y2, y11, kernel_4, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride);
+
+        // conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s1>>>(x2, x1, kernel_2, N / stride, N / stride, kn1, K, kn2, stride);
+        // conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s2>>>(y2, y1, kernel_4, N / stride, N / stride, kn1, K, kn2, stride);
+
+        // gap<<<grid_size, block_size_2d_dim, kn2 * sizeof(float), s1>>>(x3, x2, N / (stride * stride), N / (stride * stride), kn2);
+        // gap<<<grid_size, block_size_2d_dim, kn2 * sizeof(float), s2>>>(y3, y2, N / (stride * stride), N / (stride * stride), kn2);
+
+        cudaEvent_t e1;
+        cudaEventCreate(&e1);
+        cudaEventRecord(e1, s2);
+        cudaStreamWaitEvent(s1, e1, 0);
+
+        concat<<<num_blocks, block_size_1d, 0, s1>>>(z, x2, y2, x2_len);
+
+        dot_product<<<num_blocks, block_size_1d, 0, s1>>>(z, dense_weights, res, x2_len);
+
+        cudaStreamEndCapture(s1, &graph);
+        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
+    }
+    cudaGraphLaunch(graphExec, s1);
+    err = cudaStreamSynchronize(s1);
+}
+
+void Benchmark10::execute_cudagraph_manual(int iter) {
+    if (iter == 0) {
+        cudaGraphCreate(&graph, 0);
+        int a = N / stride;
+        int b = N / stride / pooling_diameter;
+        void *kernel_1_args[9] = {(void *)&x1, (void *)&x, (void *)&kernel_1, &N, &N, &channels, &K, &kn1, &stride};
+        void *kernel_2_args[9] = {(void *)&y1, (void *)&y, (void *)&kernel_3, &N, &N, &channels, &K, &kn1, &stride};
+        void *kernel_3_args[7] = {(void *)&x11, (void *)&x1, &a, &a, &kn1, &pooling_diameter, &pooling_diameter};
+        void *kernel_4_args[7] = {(void *)&y11, (void *)&y1, &a, &a, &kn1, &pooling_diameter, &pooling_diameter};
+        void
*kernel_5_args[9] = {(void *)&x2, (void *)&x11, (void *)&kernel_2, &b, &b, &kn1, &K, &kn2, &stride}; + void *kernel_6_args[9] = {(void *)&y2, (void *)&y11, (void *)&kernel_4, &b, &b, &kn1, &K, &kn2, &stride}; + void *kernel_7_args[4] = {(void *)&z, (void *)&x2, (void *)&y2, &x2_len}; + void *kernel_8_args[4] = {(void *)&z, (void *)&dense_weights, (void *)&res, &x2_len}; + + dim3 block_size_2d_dim(block_size_2d, block_size_2d); + dim3 grid_size(num_blocks, num_blocks); + dim3 grid_size_2(num_blocks / 2, num_blocks / 2); + dim3 block_size_3d_dim(block_size_2d / 2, block_size_2d / 2, block_size_2d / 2); + dim3 grid_size_3(num_blocks / 2, num_blocks / 2, num_blocks / 2); + dim3 tb(block_size_1d); + dim3 bs(num_blocks); + + // conv2d<<>>(x1, x, kernel_1, N, N, channels, K, kn1, stride); + // conv2d<<>>(y1, y, kernel_3, N, N, channels, K, kn1, stride); + // mean_pooling<<>>(x11, x1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter); + // mean_pooling<<>>(y11, y1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter); + // conv2d<<>>(x2, x11, kernel_2, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride); + // conv2d<<>>(y2, y11, kernel_4, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride); + // concat<<>>(z, x2, y2, x2_len); + // dot_product<<>>(z, dense_weights, res, x2_len); + + add_node(kernel_1_args, kernel_1_params, (void *)conv2d, grid_size_2, block_size_2d_dim, graph, &k_1, nodeDependencies, K * K * kn1 * channels * sizeof(float)); + add_node(kernel_2_args, kernel_2_params, (void *)conv2d, grid_size_2, block_size_2d_dim, graph, &k_2, nodeDependencies, K * K * kn1 * channels * sizeof(float)); + + nodeDependencies.clear(); + nodeDependencies.push_back(k_1); + add_node(kernel_3_args, kernel_3_params, (void *)mean_pooling, grid_size_3, block_size_2d_dim, graph, &k_3, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(k_2); + add_node(kernel_4_args, 
kernel_4_params, (void *)mean_pooling, grid_size_3, block_size_2d_dim, graph, &k_4, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(k_3); + add_node(kernel_5_args, kernel_5_params, (void *)conv2d, grid_size_2, block_size_2d_dim, graph, &k_5, nodeDependencies, K * K * kn1 * kn2 * sizeof(float)); + + nodeDependencies.clear(); + nodeDependencies.push_back(k_4); + add_node(kernel_6_args, kernel_6_params, (void *)conv2d, grid_size_2, block_size_2d_dim, graph, &k_6, nodeDependencies, K * K * kn1 * kn2 * sizeof(float)); + + nodeDependencies.clear(); + nodeDependencies.push_back(k_5); + nodeDependencies.push_back(k_6); + add_node(kernel_7_args, kernel_7_params, (void *)concat, bs, tb, graph, &k_7, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(k_7); + add_node(kernel_8_args, kernel_8_params, (void *)dot_product, bs, tb, graph, &k_8, nodeDependencies); + + cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0); + } + cudaGraphLaunch(graphExec, s1); + err = cudaStreamSynchronize(s1); +} + +void Benchmark10::execute_cudagraph_single(int iter) { + if (iter == 0) { + cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal); + + dim3 block_size_2d_dim(block_size_2d, block_size_2d); + dim3 grid_size(num_blocks, num_blocks); + dim3 grid_size_2(num_blocks / 2, num_blocks / 2); + + dim3 block_size_3d_dim(block_size_2d / 2, block_size_2d / 2, block_size_2d / 2); + dim3 grid_size_3(num_blocks / 2, num_blocks / 2, num_blocks / 2); + + conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * channels * sizeof(float), s1>>>(x1, x, kernel_1, N, N, channels, K, kn1, stride); + conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * channels * sizeof(float), s1>>>(y1, y, kernel_3, N, N, channels, K, kn1, stride); + + mean_pooling<<<grid_size_3, block_size_3d_dim, 0, s1>>>(x11, x1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter); + mean_pooling<<<grid_size_3, block_size_3d_dim, 0, s1>>>(y11, y1, N / stride, N / stride, kn1, pooling_diameter, pooling_diameter); + + conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s1>>>(x2, x11, kernel_2, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride); + conv2d<<<grid_size_2, block_size_2d_dim, K * K * kn1 * kn2 * sizeof(float), s1>>>(y2, y11, kernel_4, N / stride / pooling_diameter, N / stride / pooling_diameter, kn1, K, kn2, stride); + + concat<<<num_blocks, block_size_1d, 0, s1>>>(z, x2, y2, x2_len); + + dot_product<<<num_blocks, block_size_1d, 0, s1>>>(z, dense_weights, res, x2_len); + + cudaStreamEndCapture(s1, &graph); + cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0); + } + cudaGraphLaunch(graphExec, s1); + err = cudaStreamSynchronize(s1); +} + +std::string Benchmark10::print_result(bool short_form) { + return std::to_string(res[0]); +} \ No newline at end of file diff --git a/projects/resources/cuda/single_gpu/b10.cuh b/projects/resources/cuda/single_gpu/b10.cuh new file mode 100644 index 00000000..91aee773 --- /dev/null +++ b/projects/resources/cuda/single_gpu/b10.cuh @@ -0,0 +1,79 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include "../benchmark.cuh" + +class Benchmark10 : public Benchmark { + public: + Benchmark10(Options &options) : Benchmark(options) {} + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + void execute_cudagraph(int iter); + void execute_cudagraph_manual(int iter); + void execute_cudagraph_single(int iter); + std::string print_result(bool short_form = false); + + private: + int K = 3; + int channels = 1; + int stride = 2; + int kn1 = 8; + int kn2 = 16; + int pooling_diameter = 5; + + float *x, *x1, *x2, *x3, *y, *y1, *y2, *y3, *kernel_1, *kernel_2, *kernel_3, *kernel_4, *z, *dense_weights, *res; + float *x11, *y11; + float *x_cpu; + float *y_cpu; + int x_len; + int x1_len; + int pooled_len; + int x2_len; + int x3_len; + int k1_len, k2_len, z_len; + + cudaStream_t s1, s2; + cudaGraph_t graph; + cudaGraphExec_t graphExec; + + std::vector<cudaGraphNode_t> nodeDependencies; + cudaGraphNode_t k_1, k_2, k_3, k_4, k_5, k_6, k_7, k_8; + cudaKernelNodeParams kernel_1_params; + cudaKernelNodeParams kernel_2_params; + cudaKernelNodeParams kernel_3_params; + cudaKernelNodeParams kernel_4_params; + cudaKernelNodeParams kernel_5_params; + cudaKernelNodeParams kernel_6_params; + cudaKernelNodeParams kernel_7_params; + cudaKernelNodeParams kernel_8_params; +}; \ No newline at end of file diff --git a/projects/resources/cuda/single_gpu/b5.cu b/projects/resources/cuda/single_gpu/b5.cu new file
mode 100644 index 00000000..3c1197b2 --- /dev/null +++ b/projects/resources/cuda/single_gpu/b5.cu @@ -0,0 +1,213 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#include "b5.cuh" + +////////////////////////////// +////////////////////////////// + +__device__ inline double +cndGPU(double d) { + const double A1 = 0.31938153f; + const double A2 = -0.356563782f; + const double A3 = 1.781477937f; + const double A4 = -1.821255978f; + const double A5 = 1.330274429f; + const double RSQRT2PI = 0.39894228040143267793994605993438f; + + double K = 1.0 / (1.0 + 0.2316419 * fabs(d)); + + double cnd = RSQRT2PI * exp(-0.5f * d * d) * + (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))); + + if (d > 0) + cnd = 1.0 - cnd; + + return cnd; +} + +extern "C" __global__ void +bs(const double *x, double *y, int N, double R, double V, double T, double K) { + double sqrtT = 1.0 / rsqrt(T); + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; + i += blockDim.x * gridDim.x) { + double expRT; + double d1, d2, CNDD1, CNDD2; + d1 = (log(x[i] / K) + (R + 0.5 * V * V) * T) / (V * sqrtT); + d2 = d1 - V * sqrtT; + + CNDD1 = cndGPU(d1); + CNDD2 = cndGPU(d2); + + // Calculate Call and Put simultaneously + expRT = exp(-R * T); + y[i] = x[i] * CNDD1 - K * expRT * CNDD2; + } +} + +////////////////////////////// +////////////////////////////// + +void Benchmark5::alloc() { + x = (double **)malloc(sizeof(double *) * M); + y = (double **)malloc(sizeof(double *) * M); + tmp_x = (double *)malloc(sizeof(double) * N); + // cudaHostRegister(tmp_x, sizeof(double) * N, 0); + + for (int i = 0; i < M; i++) { + cudaMallocManaged(&x[i], sizeof(double) * N); + cudaMallocManaged(&y[i], sizeof(double) * N); + } +} + +void Benchmark5::init() { + for (int j = 0; j < N; j++) { + tmp_x[j] = 60 - 0.5 + (double)rand() / RAND_MAX; + for (int i = 0; i < M; i++) { + x[i][j] = tmp_x[j]; + // y[i][j] = 0; + } + } + + s = (cudaStream_t *)malloc(sizeof(cudaStream_t) * M); + for (int i = 0; i < M; i++) { + err = cudaStreamCreate(&s[i]); + } +} + +void Benchmark5::reset() { + for (int i = 0; i < M; i++) { + // memcpy(x[i], y, sizeof(int) * N); + // cudaMemcpy(x[i], y, 
sizeof(double) * N, cudaMemcpyDefault); + + // cudaMemcpyAsync(x[i], y, sizeof(int) * N, cudaMemcpyHostToDevice, + // s[i]); + for (int j = 0; j < N; j++) { + x[i][j] = tmp_x[j]; + } + } + // cudaMemPrefetchAsync(x[0], sizeof(double) * N, 0, s[0]); +} + +void Benchmark5::execute_sync(int iter) { + for (int j = 0; j < M; j++) { + if (pascalGpu && do_prefetch) { + cudaMemPrefetchAsync(x[j], sizeof(double) * N, 0, 0); + cudaMemPrefetchAsync(y[j], sizeof(double) * N, 0, 0); + } + bs<<<num_blocks, block_size_1d>>>(x[j], y[j], N, R, V, T, K); + err = cudaDeviceSynchronize(); + } +} + +void Benchmark5::execute_async(int iter) { + for (int j = 0; j < M; j++) { + if (!pascalGpu || stream_attach) { + cudaStreamAttachMemAsync(s[j], x[j], sizeof(double) * N); + cudaStreamAttachMemAsync(s[j], y[j], sizeof(double) * N); + } + if (pascalGpu && do_prefetch) { + cudaMemPrefetchAsync(x[j], sizeof(double) * N, 0, s[j]); + cudaMemPrefetchAsync(y[j], sizeof(double) * N, 0, s[j]); + } + // if (j > 0) cudaMemPrefetchAsync(y[j - 1], sizeof(double) * N, cudaCpuDeviceId, s[j - 1]); + bs<<<num_blocks, block_size_1d, 0, s[j]>>>(x[j], y[j], N, R, V, T, K); + // if (j < M - 1) cudaMemPrefetchAsync(x[j + 1], sizeof(double) * N, 0, s[j + 1]); + } + + // Last tile; + // cudaMemPrefetchAsync(y[M - 1], sizeof(double) * N, cudaCpuDeviceId, s[M - 1]); + + for (int j = 0; j < M; j++) { + err = cudaStreamSynchronize(s[j]); + } +} + +void Benchmark5::execute_cudagraph(int iter) { + if (iter == 0) { + for (int j = 0; j < M; j++) { + cudaStreamBeginCapture(s[j], cudaStreamCaptureModeGlobal); + // prefetch(x[j], y[j], s[j], N); + bs<<<num_blocks, block_size_1d, 0, s[j]>>>(x[j], y[j], N, R, V, T, K); + cudaStreamEndCapture(s[j], &graphs[j]); + cudaGraphInstantiate(&graphExec[j], graphs[j], NULL, NULL, 0); + } + } + for (int j = 0; j < M; j++) { + cudaGraphLaunch(graphExec[j], s[j]); + } + for (int j = 0; j < M; j++) { + cudaStreamSynchronize(s[j]); + } +} + +void Benchmark5::execute_cudagraph_manual(int iter) { + if (iter == 0) { + cudaGraphCreate(&graphs[0], 0); + for (int j = 0; j < M; j++) { + void
*kernel_args[7] = {(void *)&x[j], (void *)&y[j], &N, &R, &V, &T, &K}; + + dim3 tb(block_size_1d); + dim3 b_size(num_blocks); + + // bs<<>>(x[j], y[j], N, R, V, T, K); + add_node(kernel_args, kernel_params[j], (void *)bs, b_size, tb, graphs[0], &kernels[j], nodeDependencies); + } + cudaGraphInstantiate(&graphExec[0], graphs[0], NULL, NULL, 0); + } + cudaGraphLaunch(graphExec[0], s[0]); + err = cudaStreamSynchronize(s[0]); +} + +void Benchmark5::execute_cudagraph_single(int iter) { + if (iter == 0) { + cudaStreamBeginCapture(s[0], cudaStreamCaptureModeGlobal); + for (int j = 0; j < M; j++) { + // prefetch(x[j], y[j], s[0], N); + bs<<<num_blocks, block_size_1d, 0, s[0]>>>(x[j], y[j], N, R, V, T, K); + } + cudaStreamEndCapture(s[0], &graphs[0]); + cudaGraphInstantiate(&graphExec[0], graphs[0], NULL, NULL, 0); + } + cudaGraphLaunch(graphExec[0], s[0]); + cudaStreamSynchronize(s[0]); +} + +std::string +Benchmark5::print_result(bool short_form) { + if (short_form) { + return std::to_string(y[0][0]); + } else { + std::string res = "["; + for (int j = 0; j < M; j++) { + res += std::to_string(y[j][0]) + ", "; + } + return res + ", ...]"; + } +} \ No newline at end of file diff --git a/projects/resources/cuda/single_gpu/b5.cuh b/projects/resources/cuda/single_gpu/b5.cuh new file mode 100644 index 00000000..204a7126 --- /dev/null +++ b/projects/resources/cuda/single_gpu/b5.cuh @@ -0,0 +1,66 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution.
+// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#pragma once +#include "../benchmark.cuh" + +class Benchmark5 : public Benchmark { + public: + Benchmark5(Options &options) : Benchmark(options) { + graphs = std::vector<cudaGraph_t>(M); + graphExec = std::vector<cudaGraphExec_t>(M); + kernels = std::vector<cudaGraphNode_t>(M); + kernel_params = std::vector<cudaKernelNodeParams>(M); + } + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + void execute_cudagraph(int iter); + void execute_cudagraph_manual(int iter); + void execute_cudagraph_single(int iter); + std::string print_result(bool short_form = false); + + private: + double R = 0.08; + double V = 0.3; + double T = 1.0; + double K = 60.0; + + int M = 10; + double **x, **y, *tmp_x; + cudaStream_t *s; + std::vector<cudaGraph_t> graphs; + std::vector<cudaGraphExec_t> graphExec; + + std::vector<cudaGraphNode_t> nodeDependencies; + std::vector<cudaGraphNode_t> kernels; + std::vector<cudaKernelNodeParams> kernel_params; +}; \ No newline at end of file diff --git a/projects/resources/cuda/single_gpu/b6.cu b/projects/resources/cuda/single_gpu/b6.cu new file mode 100644 index 00000000..ac01129b --- /dev/null +++ b/projects/resources/cuda/single_gpu/b6.cu @@ -0,0 +1,475 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission.
+// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#include "b6.cuh" + +////////////////////////////// +////////////////////////////// + +extern "C" __global__ void nb_1(const int* x, const float* y, float* z, int size, int n_feat, int n_classes) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_classes; j++) { + for (int q = 0; q < n_feat; q++) { + z[i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q]; + } + } + } +} + +extern "C" __global__ void nb_2(const float* x, float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float curr_max = x[i * n_col_x]; + for (int j = 0; j < n_col_x; j++) { + curr_max = fmaxf(curr_max, x[i * n_col_x + j]); + } + y[i] = curr_max; + } +} + +extern "C" __global__ void nb_3(const float* x, const float* y, float* z, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float sum = 
0; + for (int j = 0; j < n_col_x; j++) { + sum += expf(x[i * n_col_x + j] - y[i]); + } + z[i] = logf(sum) + y[i]; + } +} + +extern "C" __global__ void nb_4(float* x, float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] = expf(x[i * n_col_x + j] - y[i]); + } + } +} + +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +extern "C" __global__ void rr_1_0(const int* x, float* y, float* z, int n_row_x, int n_col_x) { + int warp_size = 32; + for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n_col_x; j += blockDim.x * gridDim.x) { + // Compute mean and variance; + float feature_mean = float(0); + float sum_sq = float(0); + for (int i = blockIdx.y * blockDim.y + threadIdx.y; i < n_row_x; i += blockDim.y * gridDim.y) { + float x_tmp = x[j * n_row_x + i]; + feature_mean += x_tmp; + sum_sq += x_tmp * x_tmp; + } + feature_mean = warp_reduce(feature_mean); // Obtain the sum of values in the current warp; + sum_sq = warp_reduce(sum_sq); // Obtain the sum of values in the current warp; + if (!(threadIdx.y % warp_size)) { + atomicAdd(y + j, feature_mean); + atomicAdd(z + j, sum_sq); + } + } +} + +extern "C" __global__ void rr_1_1(const int* x, float* y, const float* mean, const float* std, int n_row_x, int n_col_x) { + for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n_col_x; j += blockDim.x * gridDim.x) { + float mean_tmp = mean[j] / n_row_x; + float std_tmp = sqrtf(std[j] / n_row_x - mean_tmp * mean_tmp); + + for (int i = blockIdx.y * blockDim.y + threadIdx.y; i < n_row_x; i += blockDim.y * gridDim.y) { + y[j * n_row_x + i] = ((float)x[j * n_row_x + i] - mean_tmp) / std_tmp; + } + } +} + +extern "C" __global__ void rr_1(const int* x, float* y, int n_row_x, int n_col_x) { + 
for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n_col_x; j += blockDim.x * gridDim.x) { + float feature_mean = 0; + float sum_sq = 0; + // Compute mean and variance; + for (int i = 0; i < n_row_x; i++) { + float x_tmp = x[j * n_row_x + i]; + feature_mean += x_tmp; + sum_sq += x_tmp * x_tmp; + } + feature_mean /= n_row_x; + float std = sqrtf(sum_sq / n_row_x - feature_mean * feature_mean); + + // Update values; + for (int i = 0; i < n_row_x; i++) { + y[j * n_row_x + i] = (x[j * n_row_x + i] - feature_mean) / std; + } + } +} + +extern "C" __global__ void rr_2(const float* x, const float* y, float* z, int size, int n_feat, int n_classes) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_classes; j++) { + for (int q = 0; q < n_feat; q++) { + z[i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q]; + } + } + } +} + +extern "C" __global__ void rr_3(float* x, const float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] += y[j]; + } + } +} + +extern "C" __global__ void softmax(float* x, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float row_exp_sum = 0; + for (int j = 0; j < n_col_x; j++) { + row_exp_sum += expf(x[i * n_col_x + j]); + } + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] = expf(x[i * n_col_x + j]) / row_exp_sum; + } + } +} + +extern "C" __global__ void argmax(const float* x, const float* y, int* z, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + int curr_best_index = 0; + float curr_best = x[i * n_col_x] + y[i * n_col_x]; + for (int j = 0; j < n_col_x; j++) { + float curr = x[i * n_col_x + j] + y[i * n_col_x + j]; + if (curr > curr_best) { + curr_best = curr; + 
curr_best_index = j; + } + } + z[i] = curr_best_index; + } +} + +////////////////////////////// +////////////////////////////// + +void Benchmark6::alloc() { + err = cudaMallocManaged(&x, sizeof(int) * N * num_features); + err = cudaMallocManaged(&z, sizeof(float) * N * num_features); + err = cudaMallocManaged(&nb_feat_log_prob, sizeof(float) * num_classes * num_features); + err = cudaMallocManaged(&nb_class_log_prior, sizeof(float) * num_classes); + err = cudaMallocManaged(&ridge_coeff, sizeof(float) * num_classes * num_features); + err = cudaMallocManaged(&ridge_intercept, sizeof(float) * num_classes); + err = cudaMallocManaged(&nb_amax, sizeof(float) * N); + err = cudaMallocManaged(&nb_l, sizeof(float) * N); + err = cudaMallocManaged(&r1, sizeof(float) * N * num_classes); + err = cudaMallocManaged(&r2, sizeof(float) * N * num_classes); + err = cudaMallocManaged(&r, sizeof(int) * N); + + err = cudaStreamCreate(&s1); + err = cudaStreamCreate(&s2); +} + +void Benchmark6::init() { + for (int i = 0; i < num_classes; i++) { + for (int j = 0; j < num_features; j++) { + nb_feat_log_prob[i * num_features + j] = (float)(rand()) / (float)(RAND_MAX); + ridge_coeff[i * num_features + j] = (float)(rand()) / (float)(RAND_MAX); + } + nb_class_log_prior[i] = (float)(rand()) / (float)(RAND_MAX); + ridge_intercept[i] = (float)(rand()) / (float)(RAND_MAX); + } + int max_occurrence_of_ngram = 10; + for (int i = 0; i < N; i++) { + for (int j = 0; j < num_features; j++) { + x[i * num_features + j] = rand() % max_occurrence_of_ngram; + } + for (int j = 0; j < num_classes; j++) { + r1[i * num_classes + j] = nb_class_log_prior[j]; + r2[i * num_classes + j] = 0; + } + } +} + +void Benchmark6::reset() { + for (int i = 0; i < N; i++) { + for (int j = 0; j < num_classes; j++) { + r1[i * num_classes + j] = nb_class_log_prior[j]; + r2[i * num_classes + j] = 0; + } + // r1_mean[i] = 0; + // r1_std[i] = 0; + } +} + +void Benchmark6::execute_sync(int iter) { + if (do_prefetch && pascalGpu) { + 
cudaMemPrefetchAsync(r1, sizeof(float) * N * num_classes, 0, 0); + cudaMemPrefetchAsync(r2, sizeof(float) * N * num_classes, 0, 0); + cudaMemPrefetchAsync(r, sizeof(int) * N, 0, 0); + } + + rr_1<<<num_blocks, block_size_1d>>>(x, z, N, num_features); + // dim3 num_blocks_2d(8, 8); + // dim3 block_size_1d_2d(1, 32); + // rr_1_0<<>>(x, r1_mean, r1_std, N, num_features); + // cudaDeviceSynchronize(); + // rr_1_1<<>>(x, z, r1_mean, r1_std, N, num_features); + cudaDeviceSynchronize(); + + // auto e1 = clock_type::now(); + // auto rr1time = chrono::duration_cast(e1 - start).count(); + // if (debug) std::cout << " rr1=" << (float) rr1time / 1000 << " ms" << std::endl; + + nb_1<<<num_blocks, block_size_1d>>>(x, nb_feat_log_prob, r1, N, num_features, num_classes); + cudaDeviceSynchronize(); + + rr_2<<<num_blocks, block_size_1d>>>(z, ridge_coeff, r2, N, num_features, num_classes); + cudaDeviceSynchronize(); + + nb_2<<<num_blocks, block_size_1d>>>(r1, nb_amax, N, num_classes); + cudaDeviceSynchronize(); + + nb_3<<<num_blocks, block_size_1d>>>(r1, nb_amax, nb_l, N, num_classes); + cudaDeviceSynchronize(); + + rr_3<<<num_blocks, block_size_1d>>>(r2, ridge_intercept, N, num_classes); + cudaDeviceSynchronize(); + + nb_4<<<num_blocks, block_size_1d>>>(r1, nb_l, N, num_classes); + cudaDeviceSynchronize(); + + softmax<<<num_blocks, block_size_1d>>>(r1, N, num_classes); + cudaDeviceSynchronize(); + + softmax<<<num_blocks, block_size_1d>>>(r2, N, num_classes); + cudaDeviceSynchronize(); + + argmax<<<num_blocks, block_size_1d>>>(r1, r2, r, N, num_classes); + cudaDeviceSynchronize(); +} + +void Benchmark6::execute_async(int iter) { + if (!pascalGpu || stream_attach) { + cudaStreamAttachMemAsync(s1, z, 0); + // cudaStreamAttachMemAsync(s1, r1_mean, 0); + // cudaStreamAttachMemAsync(s1, r1_std, 0); + cudaStreamAttachMemAsync(s2, nb_feat_log_prob, 0); + cudaStreamAttachMemAsync(s2, r1, 0); + cudaStreamAttachMemAsync(s1, ridge_coeff, 0); + cudaStreamAttachMemAsync(s1, r2, 0); + cudaStreamAttachMemAsync(s2, nb_amax, 0); + cudaStreamAttachMemAsync(s2, nb_l, 0); + cudaStreamAttachMemAsync(s1, ridge_intercept, 0); + } + if (do_prefetch && pascalGpu) { + cudaMemPrefetchAsync(r1, sizeof(float) * N * num_classes, 0, s2); + cudaMemPrefetchAsync(r2, sizeof(float) * N * num_classes, 0, s1); + cudaMemPrefetchAsync(r, sizeof(int) * N, 0, s1); + } + + rr_1<<<num_blocks, block_size_1d, 0, s1>>>(x, z, N, num_features); + // dim3 num_blocks_2d(8, 8); + // dim3 block_size_1d_2d(8, 8); + // rr_1_0<<>>(x, r1_mean, r1_std, N, num_features); + // rr_1_1<<>>(x, z, r1_mean, r1_std, N, num_features); + + nb_1<<<num_blocks, block_size_1d, 0, s2>>>(x, nb_feat_log_prob, r1, N, num_features, num_classes); + + rr_2<<<num_blocks, block_size_1d, 0, s1>>>(z, ridge_coeff, r2, N, num_features, num_classes); + + nb_2<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_amax, N, num_classes); + + nb_3<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_amax, nb_l, N, num_classes); + + rr_3<<<num_blocks, block_size_1d, 0, s1>>>(r2, ridge_intercept, N, num_classes); + + nb_4<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_l, N, num_classes); + + softmax<<<num_blocks, block_size_1d, 0, s2>>>(r1, N, num_classes); + + softmax<<<num_blocks, block_size_1d, 0, s1>>>(r2, N, num_classes); + + // Stream 1 waits stream 2; + cudaEvent_t e1; + cudaEventCreate(&e1); + cudaEventRecord(e1, s2); + cudaStreamWaitEvent(s1, e1, 0); + + argmax<<<num_blocks, block_size_1d, 0, s1>>>(r1, r2, r, N, num_classes); + cudaDeviceSynchronize(); +} + +void Benchmark6::execute_cudagraph(int iter) { + if (iter == 0) { + cudaEvent_t ef; + cudaEventCreate(&ef); + cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal); + cudaEventRecord(ef, s1); + cudaStreamWaitEvent(s2, ef, 0); + + rr_1<<<num_blocks, block_size_1d, 0, s1>>>(x, z, N, num_features); + // dim3 num_blocks_2d(8, 8); + // dim3 block_size_1d_2d(8, 8); + // rr_1_0<<>>(x, r1_mean, r1_std, N, num_features); + // rr_1_1<<>>(x, z, r1_mean, r1_std, N, num_features); + + nb_1<<<num_blocks, block_size_1d, 0, s2>>>(x, nb_feat_log_prob, r1, N, num_features, num_classes); + + rr_2<<<num_blocks, block_size_1d, 0, s1>>>(z, ridge_coeff, r2, N, num_features, num_classes); + + nb_2<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_amax, N, num_classes); + + nb_3<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_amax, nb_l, N, num_classes); + + rr_3<<<num_blocks, block_size_1d, 0, s1>>>(r2, ridge_intercept, N, num_classes); + + nb_4<<<num_blocks, block_size_1d, 0, s2>>>(r1, nb_l, N, num_classes); + + softmax<<<num_blocks, block_size_1d, 0, s2>>>(r1, N, num_classes); + + softmax<<<num_blocks, block_size_1d, 0, s1>>>(r2, N, num_classes); + + // Stream 1 waits stream 2; + cudaEvent_t e1; + cudaEventCreate(&e1); + cudaEventRecord(e1, s2); + cudaStreamWaitEvent(s1, e1, 0); + + argmax<<<num_blocks, block_size_1d, 0, s1>>>(r1, r2, r, N, num_classes); + + cudaStreamEndCapture(s1, &graph); + cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
+ } + cudaGraphLaunch(graphExec, s1); + err = cudaStreamSynchronize(s1); +} + +void Benchmark6::execute_cudagraph_manual(int iter) { + if (iter == 0) { + cudaGraphCreate(&graph, 0); + void* kernel_1_args[4] = {(void*)&x, (void*)&z, &N, &num_features}; + void* kernel_2_args[6] = {(void*)&x, (void*)&nb_feat_log_prob, (void*)&r1, &N, &num_features, &num_classes}; + void* kernel_3_args[6] = {(void*)&z, (void*)&ridge_coeff, (void*)&r2, &N, &num_features, &num_classes}; + void* kernel_4_args[4] = {(void*)&r1, (void*)&nb_amax, &N, &num_classes}; + void* kernel_5_args[5] = {(void*)&r1, (void*)&nb_amax, (void*)&nb_l, &N, &num_classes}; + void* kernel_6_args[4] = {(void*)&r2, (void*)&ridge_intercept, &N, &num_classes}; + void* kernel_7_args[4] = {(void*)&r1, (void*)&nb_l, &N, &num_classes}; + void* kernel_8_args[3] = {(void*)&r1, &N, &num_classes}; + void* kernel_9_args[3] = {(void*)&r2, &N, &num_classes}; + void* kernel_10_args[5] = {(void*)&r1, (void*)&r2, (void*)&r, &N, &num_classes}; + + dim3 tb(block_size_1d); + dim3 bs(num_blocks); + + add_node(kernel_1_args, kernel_1_params, (void*)rr_1, bs, tb, graph, &kernel_1, nodeDependencies); + add_node(kernel_2_args, kernel_2_params, (void*)nb_1, bs, tb, graph, &kernel_2, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_1); + add_node(kernel_3_args, kernel_3_params, (void*)rr_2, bs, tb, graph, &kernel_3, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_2); + add_node(kernel_4_args, kernel_4_params, (void*)nb_2, bs, tb, graph, &kernel_4, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_4); + add_node(kernel_5_args, kernel_5_params, (void*)nb_3, bs, tb, graph, &kernel_5, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_3); + add_node(kernel_6_args, kernel_6_params, (void*)rr_3, bs, tb, graph, &kernel_6, nodeDependencies); + + nodeDependencies.clear(); + 
nodeDependencies.push_back(kernel_5); + add_node(kernel_7_args, kernel_7_params, (void*)nb_4, bs, tb, graph, &kernel_7, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_7); + add_node(kernel_8_args, kernel_8_params, (void*)softmax, bs, tb, graph, &kernel_8, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_6); + add_node(kernel_9_args, kernel_9_params, (void*)softmax, bs, tb, graph, &kernel_9, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_8); + nodeDependencies.push_back(kernel_9); + add_node(kernel_10_args, kernel_10_params, (void*)argmax, bs, tb, graph, &kernel_10, nodeDependencies); + + cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0); + } + cudaGraphLaunch(graphExec, s1); + err = cudaStreamSynchronize(s1); +} + +void Benchmark6::execute_cudagraph_single(int iter) { + if (iter == 0) { + cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal); + + rr_1<<<num_blocks, block_size_1d, 0, s1>>>(x, z, N, num_features); + // dim3 num_blocks_2d(8, 8); + // dim3 block_size_1d_2d(8, 8); + // rr_1_0<<>>(x, r1_mean, r1_std, N, num_features); + // rr_1_1<<>>(x, z, r1_mean, r1_std, N, num_features); + + nb_1<<<num_blocks, block_size_1d, 0, s1>>>(x, nb_feat_log_prob, r1, N, num_features, num_classes); + + rr_2<<<num_blocks, block_size_1d, 0, s1>>>(z, ridge_coeff, r2, N, num_features, num_classes); + + nb_2<<<num_blocks, block_size_1d, 0, s1>>>(r1, nb_amax, N, num_classes); + + nb_3<<<num_blocks, block_size_1d, 0, s1>>>(r1, nb_amax, nb_l, N, num_classes); + + rr_3<<<num_blocks, block_size_1d, 0, s1>>>(r2, ridge_intercept, N, num_classes); + + nb_4<<<num_blocks, block_size_1d, 0, s1>>>(r1, nb_l, N, num_classes); + + softmax<<<num_blocks, block_size_1d, 0, s1>>>(r1, N, num_classes); + + softmax<<<num_blocks, block_size_1d, 0, s1>>>(r2, N, num_classes); + + argmax<<<num_blocks, block_size_1d, 0, s1>>>(r1, r2, r, N, num_classes); + + cudaStreamEndCapture(s1, &graph); + cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0); + } + cudaGraphLaunch(graphExec, s1); + err = cudaStreamSynchronize(s1); +} + +std::string Benchmark6::print_result(bool short_form) { + if (short_form) { + return std::to_string(r[0]); + } else { + std::string res = "["; + for (int j = 0; j < 10; j++) { + res +=
std::to_string(r[j]) + ", "; + } + return res + ", ...]"; + } +} \ No newline at end of file diff --git a/projects/resources/cuda/single_gpu/b6.cuh b/projects/resources/cuda/single_gpu/b6.cuh new file mode 100644 index 00000000..3cf0eac7 --- /dev/null +++ b/projects/resources/cuda/single_gpu/b6.cuh @@ -0,0 +1,69 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#pragma once +#include "../benchmark.cuh" + +class Benchmark6 : public Benchmark { + public: + Benchmark6(Options &options) : Benchmark(options) {} + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + void execute_cudagraph(int iter); + void execute_cudagraph_manual(int iter); + void execute_cudagraph_single(int iter); + std::string print_result(bool short_form = false); + + private: + int num_features = 200; + int num_classes = 10; + int *x; + float *z; + float *nb_feat_log_prob, *nb_class_log_prior, *ridge_coeff, *ridge_intercept, *nb_amax, *nb_l, *r1, *r2; + int *r; + cudaStream_t s1, s2; + cudaGraph_t graph; + cudaGraphExec_t graphExec; + + std::vector<cudaGraphNode_t> nodeDependencies; + cudaGraphNode_t kernel_1, kernel_2, kernel_3, kernel_4, kernel_5, kernel_6, kernel_7, kernel_8, kernel_9, kernel_10; + cudaKernelNodeParams kernel_1_params; + cudaKernelNodeParams kernel_2_params; + cudaKernelNodeParams kernel_3_params; + cudaKernelNodeParams kernel_4_params; + cudaKernelNodeParams kernel_5_params; + cudaKernelNodeParams kernel_6_params; + cudaKernelNodeParams kernel_7_params; + cudaKernelNodeParams kernel_8_params; + cudaKernelNodeParams kernel_9_params; + cudaKernelNodeParams kernel_10_params; +}; \ No newline at end of file diff --git a/projects/resources/cuda/single_gpu/b7.cu b/projects/resources/cuda/single_gpu/b7.cu new file mode 100644 index 00000000..d38e83b1 --- /dev/null +++ 
b/projects/resources/cuda/single_gpu/b7.cu @@ -0,0 +1,574 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#include "b7.cuh" + +////////////////////////////// +////////////////////////////// + +#define WARP_SIZE 32 +#define THREADS_PER_VECTOR 4 +#define MAX_NUM_VECTORS_PER_BLOCK (1024 / THREADS_PER_VECTOR) + +///////////////////////////// +///////////////////////////// + +extern "C" __global__ void spmv(const int *ptr, const int *idx, const int *val, const float *vec, float *res, int num_rows, int num_nnz) { + for (int n = blockIdx.x * blockDim.x + threadIdx.x; n < num_rows; n += blockDim.x * gridDim.x) { + float sum = 0; + for (int i = ptr[n]; i < ptr[n + 1]; i++) { + sum += val[i] * vec[idx[i]]; + } + res[n] = sum; + } +} + +extern "C" __global__ void spmv2(const int *ptr, const int *idx, const int *val, const float *vec, float *res, int num_rows, int num_nnz) { + // Thread ID in block + int t = threadIdx.x; + + // Thread ID in warp + int lane = t & (WARP_SIZE - 1); + + // Number of warps per block + int warpsPerBlock = blockDim.x / WARP_SIZE; + + // One row per warp + int row = (blockIdx.x * warpsPerBlock) + (t / WARP_SIZE); + + extern __shared__ volatile float vals[]; + + if (row < num_rows) { + int rowStart = ptr[row]; + int rowEnd = ptr[row + 1]; + float sum = 0; + + // Use all threads in a warp to accumulate multiplied elements + for (int j = rowStart + lane; j < rowEnd; j += WARP_SIZE) { + int col = idx[j]; + sum += val[j] * vec[col]; + } + vals[t] = sum; + __syncthreads(); + + // Reduce partial sums + if (lane < 16) vals[t] += vals[t + 16]; + if (lane < 8) vals[t] += vals[t + 8]; + if (lane < 4) vals[t] += vals[t + 4]; + if (lane < 2) vals[t] += vals[t + 2]; + if (lane < 1) vals[t] += vals[t + 1]; + __syncthreads(); + + // Write result + if (lane == 0) { + res[row] = vals[t]; + } + } +} + +extern "C" __global__ void spmv3(int *cudaRowCounter, int *d_ptr, int *d_cols, int *d_val, float *d_vector, float *d_out, int N) { + int i; + float sum; + int row; + int rowStart, rowEnd; + int laneId = threadIdx.x % THREADS_PER_VECTOR; //lane index in the vector + int 
vectorId = threadIdx.x / THREADS_PER_VECTOR; //vector index in the thread block + int warpLaneId = threadIdx.x & 31; //lane index in the warp + int warpVectorId = warpLaneId / THREADS_PER_VECTOR; //vector index in the warp + + __shared__ volatile int space[MAX_NUM_VECTORS_PER_BLOCK][2]; + + // Get the row index + if (warpLaneId == 0) { + row = atomicAdd(cudaRowCounter, 32 / THREADS_PER_VECTOR); + } + // Broadcast the value to other threads in the same warp and compute the row index of each vector + row = __shfl_sync(0xffffffff, row, 0) + warpVectorId; + + while (row < N) { + // Use two threads to fetch the row offset + if (laneId < 2) { + space[vectorId][laneId] = d_ptr[row + laneId]; + } + rowStart = space[vectorId][0]; + rowEnd = space[vectorId][1]; + + sum = 0; + // Compute dot product + if (THREADS_PER_VECTOR == 32) { + // Ensure aligned memory access + i = rowStart - (rowStart & (THREADS_PER_VECTOR - 1)) + laneId; + + // Process the unaligned part + if (i >= rowStart && i < rowEnd) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + + // Process the aligned part + for (i += THREADS_PER_VECTOR; i < rowEnd; i += THREADS_PER_VECTOR) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + } else { + for (i = rowStart + laneId; i < rowEnd; i += THREADS_PER_VECTOR) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + } + // Intra-vector reduction + for (i = THREADS_PER_VECTOR >> 1; i > 0; i >>= 1) { + sum += __shfl_down_sync(0xffffffff, sum, i); + } + + // Save the results + if (laneId == 0) { + d_out[row] = sum; + } + + // Get a new row index + if (warpLaneId == 0) { + row = atomicAdd(cudaRowCounter, 32 / THREADS_PER_VECTOR); + } + // Broadcast the row index to the other threads in the same warp and compute the row index of each vector + row = __shfl_sync(0xffffffff, row, 0) + warpVectorId; + } +} + +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, 
offset); + return val; +} + +extern "C" __global__ void sum(const float *x, float *z, int N) { + int warp_size = 32; + float sum = float(0); + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void divide(const float *x, float *y, float *val, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = x[i] / val[0]; + } +} + +extern "C" __global__ void reset_kernel(float *n1, float *n2, int *r1, int *r2) { + if (blockIdx.x * blockDim.x + threadIdx.x == 0) { + *n1 = 0; + *n2 = 0; + *r1 = 0; + *r2 = 0; + } +} + +////////////////////////////// +////////////////////////////// + +void CUDART_CB host_callback(void *data) { + // Check status of GPU after stream operations are done + callBackData_t *tmp = (callBackData_t *)(data); + tmp->n1[0] = 0.0; + tmp->n2[0] = 0.0; + tmp->r1[0] = 0; + tmp->r2[0] = 0; +} + +////////////////////////////// +////////////////////////////// + +void Benchmark7::alloc() { + nnz = degree * N; + ptr_tmp = (int *)malloc(sizeof(int) * (N + 1)); + ptr2_tmp = (int *)malloc(sizeof(int) * (N + 1)); + idx_tmp = (int *)malloc(sizeof(int) * nnz); + idx2_tmp = (int *)malloc(sizeof(int) * nnz); + val_tmp = (int *)malloc(sizeof(int) * nnz); + val2_tmp = (int *)malloc(sizeof(int) * nnz); + + err = cudaMallocManaged(&ptr, sizeof(int) * (N + 1)); + err = cudaMallocManaged(&ptr2, sizeof(int) * (N + 1)); + err = cudaMallocManaged(&idx, sizeof(int) * nnz); + err = cudaMallocManaged(&idx2, sizeof(int) * nnz); + err = cudaMallocManaged(&val, sizeof(int) * nnz); + err = cudaMallocManaged(&val2, sizeof(int) * nnz); + err = cudaMallocManaged(&rowCounter1, sizeof(int)); + err = 
cudaMallocManaged(&rowCounter2, sizeof(int)); + + err = cudaMallocManaged(&auth1, sizeof(float) * N); + err = cudaMallocManaged(&auth2, sizeof(float) * N); + err = cudaMallocManaged(&hub1, sizeof(float) * N); + err = cudaMallocManaged(&hub2, sizeof(float) * N); + err = cudaMallocManaged(&auth_norm, sizeof(float)); + err = cudaMallocManaged(&hub_norm, sizeof(float)); + + x = (int *)malloc(nnz * sizeof(int)); + y = (int *)malloc(nnz * sizeof(int)); + v = (int *)malloc(nnz * sizeof(int)); + + err = cudaStreamCreate(&s1); + err = cudaStreamCreate(&s2); +} + +void Benchmark7::init() { + random_coo(x, y, v, N, degree); + // Create a CSR; + coo2csr(ptr_tmp, idx_tmp, val_tmp, x, y, v, N, N, nnz); + coo2csr(ptr2_tmp, idx2_tmp, val2_tmp, y, x, v, N, N, nnz); +} + +void Benchmark7::reset() { + // FIXME: using the same data for CSC and CSR, because ptr2 is giving data-dependent performance differences + for (int j = 0; j < nnz; j++) { + idx[j] = idx_tmp[j]; + idx2[j] = idx_tmp[j]; + val[j] = val_tmp[j]; + val2[j] = val_tmp[j]; + } + for (int j = 0; j < N + 1; j++) { + ptr[j] = ptr_tmp[j]; + ptr2[j] = ptr_tmp[j]; + } + for (int i = 0; i < N; i++) { + auth1[i] = 1; + auth2[i] = 1; + hub1[i] = 1; + hub2[i] = 1; + } + auth_norm[0] = 0; + hub_norm[0] = 0; + rowCounter1[0] = 0; + rowCounter2[0] = 0; +} + +void Benchmark7::execute_sync(int iter) { + for (int iter = 0; iter < iterations; iter++) { + if (pascalGpu && do_prefetch) { + cudaMemPrefetchAsync(auth1, N * sizeof(float), 0); + cudaMemPrefetchAsync(auth2, N * sizeof(float), 0); + cudaMemPrefetchAsync(hub1, N * sizeof(float), 0); + cudaMemPrefetchAsync(hub2, N * sizeof(float), 0); + cudaMemPrefetchAsync(auth_norm, sizeof(float), 0); + cudaMemPrefetchAsync(hub_norm, sizeof(float), 0); + } + + int nb = ceil(N / ((float)block_size_1d)); + + // spmv<<<nb, block_size_1d>>>(ptr2, idx2, val2, hub1, auth2, N, nnz); + spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float)>>>(rowCounter1, ptr2, idx2, val2, hub1, auth2, N); + err = cudaDeviceSynchronize(); + + // spmv<<<nb, block_size_1d>>>(ptr, idx, val, auth1, hub2, N, 
nnz); + spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float)>>>(rowCounter2, ptr, idx, val, auth1, hub2, N); + err = cudaDeviceSynchronize(); + + sum<<<num_blocks, block_size_1d>>>(auth2, auth_norm, N); + err = cudaDeviceSynchronize(); + + sum<<<num_blocks, block_size_1d>>>(hub2, hub_norm, N); + err = cudaDeviceSynchronize(); + + divide<<<num_blocks, block_size_1d>>>(auth2, auth1, auth_norm, N); + err = cudaDeviceSynchronize(); + + divide<<<num_blocks, block_size_1d>>>(hub2, hub1, hub_norm, N); + err = cudaDeviceSynchronize(); + + auth_norm[0] = 0; + hub_norm[0] = 0; + rowCounter1[0] = 0; + rowCounter2[0] = 0; + + if (debug && err) std::cout << err << std::endl; + } +} + +void Benchmark7::execute_async(int iter) { + if (!pascalGpu || stream_attach) { + cudaStreamAttachMemAsync(s1, ptr2, 0); + cudaStreamAttachMemAsync(s1, idx2, 0); + cudaStreamAttachMemAsync(s1, val2, 0); + cudaStreamAttachMemAsync(s2, ptr, 0); + cudaStreamAttachMemAsync(s2, idx, 0); + cudaStreamAttachMemAsync(s2, val, 0); + } + for (int iter = 0; iter < iterations; iter++) { + if (!pascalGpu || stream_attach) { + cudaStreamAttachMemAsync(s1, hub1, 0); + cudaStreamAttachMemAsync(s1, auth2, 0); + cudaStreamAttachMemAsync(s2, auth1, 0); + cudaStreamAttachMemAsync(s2, hub2, 0); + } + if (pascalGpu && do_prefetch) { + cudaMemPrefetchAsync(auth1, N * sizeof(float), 0, s2); + cudaMemPrefetchAsync(auth2, N * sizeof(float), 0, s1); + cudaMemPrefetchAsync(hub1, N * sizeof(float), 0, s1); + cudaMemPrefetchAsync(hub2, N * sizeof(float), 0, s2); + cudaMemPrefetchAsync(auth_norm, sizeof(float), 0, s1); + cudaMemPrefetchAsync(hub_norm, sizeof(float), 0, s2); + } + + cudaEvent_t e1, e2; + cudaEventCreate(&e1); + cudaEventCreate(&e2); + + int nb = ceil(N / ((float)block_size_1d)); + + // spmv<<<nb, block_size_1d, 0, s1>>>(ptr2, idx2, val2, hub1, auth2, N, nnz); + spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float), s1>>>(rowCounter1, ptr2, idx2, val2, hub1, auth2, N); + err = cudaEventRecord(e1, s1); + // spmv<<<nb, block_size_1d, 0, s2>>>(ptr, idx, val, auth1, hub2, N, nnz); + spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float), s2>>>(rowCounter2, ptr, idx, val, auth1, hub2, N); + err = cudaEventRecord(e2, s2); + + sum<<<num_blocks, block_size_1d, 0, s1>>>(auth2, auth_norm, N); + + sum<<<num_blocks, block_size_1d, 0, s2>>>(hub2, hub_norm, N); + + // Stream 1 waits stream 2; + 
err = cudaStreamWaitEvent(s1, e2, 0); + cudaStreamAttachMemAsync(s1, auth1, 0); + divide<<<num_blocks, block_size_1d, 0, s1>>>(auth2, auth1, auth_norm, N); + // Stream 2 waits stream 1; + err = cudaStreamWaitEvent(s2, e1, 0); + cudaStreamAttachMemAsync(s2, hub1, 0); + divide<<<num_blocks, block_size_1d, 0, s2>>>(hub2, hub1, hub_norm, N); + + // cudaEvent_t e3; + // cudaEventCreate(&e3); + // cudaEventRecord(e3, s2); + // checkCudaErrors(cudaStreamWaitEvent(s1, e3, 0)); + // reset_kernel<<<1, 1, 0, s1>>>(auth_norm, hub_norm, rowCounter1, rowCounter2); + + err = cudaStreamSynchronize(s1); + err = cudaStreamSynchronize(s2); + auth_norm[0] = 0; + hub_norm[0] = 0; + rowCounter1[0] = 0; + rowCounter2[0] = 0; + + if (debug && err) std::cout << err << std::endl; + } + // err = cudaStreamSynchronize(s1); +} + +void Benchmark7::execute_cudagraph(int iter) { + if (iter == 0) { + cudaEvent_t ef; + cudaEventCreate(&ef); + cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal); + cudaEventRecord(ef, s1); + cudaStreamWaitEvent(s2, ef, 0); + + // callBackData_t hostFnData = {auth_norm, hub_norm, rowCounter1, rowCounter2}; + // cudaHostFn_t fn = host_callback; + + for (int i = 0; i < iterations; i++) { + cudaEvent_t e1, e2; + cudaEventCreate(&e1); + cudaEventCreate(&e2); + + int nb = ceil(N / ((float)block_size_1d)); + + // spmv<<<nb, block_size_1d, 0, s1>>>(ptr2, idx2, val2, hub1, auth2, N, nnz); + spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float), s1>>>(rowCounter1, ptr2, idx2, val2, hub1, auth2, N); + + // spmv<<<nb, block_size_1d, 0, s2>>>(ptr, idx, val, auth1, hub2, N, nnz); + spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float), s2>>>(rowCounter2, ptr, idx, val, auth1, hub2, N); + + sum<<<num_blocks, block_size_1d, 0, s1>>>(auth2, auth_norm, N); + err = cudaEventRecord(e1, s1); + sum<<<num_blocks, block_size_1d, 0, s2>>>(hub2, hub_norm, N); + err = cudaEventRecord(e2, s2); + // Stream 1 waits stream 2; + err = cudaStreamWaitEvent(s1, e2, 0); + divide<<<num_blocks, block_size_1d, 0, s1>>>(auth2, auth1, auth_norm, N); + // Stream 2 waits stream 1; + err = cudaStreamWaitEvent(s2, e1, 0); + divide<<<num_blocks, block_size_1d, 0, s2>>>(hub2, hub1, hub_norm, N); + // Stream 1 waits stream 2; + cudaEvent_t e3; + cudaEventCreate(&e3); + cudaEventRecord(e3, s2); + checkCudaErrors(cudaStreamWaitEvent(s1, e3, 0)); + + // This 
doesn't work for some reason; + // checkCudaErrors(cudaLaunchHostFunc(s1, fn, &hostFnData)); + + reset_kernel<<<1, 1, 0, s1>>>(auth_norm, hub_norm, rowCounter1, rowCounter2); + cudaEvent_t e4; + cudaEventCreate(&e4); + cudaEventRecord(e4, s1); + checkCudaErrors(cudaStreamWaitEvent(s2, e4, 0)); + } + + checkCudaErrors(cudaStreamEndCapture(s1, &graph)); + checkCudaErrors(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0)); + } + checkCudaErrors(cudaGraphLaunch(graphExec, s1)); + err = cudaStreamSynchronize(s1); +} + +void Benchmark7::execute_cudagraph_manual(int iter) { + if (iter == 0) { + // callBackData_t hostFnData = {auth_norm, hub_norm, rowCounter1, rowCounter2}; + // cudaHostFn_t fn = host_callback; + + int pascalGpu = 0; + // cudaDeviceGetAttribute(&pascalGpu, cudaDeviceAttr::cudaDevAttrConcurrentManagedAccess, 0); + cudaGraphCreate(&graph, 0); + void *kernel_1_args[7] = {(void *)&rowCounter1, (void *)&ptr2, (void *)&idx2, (void *)&val2, (void *)&hub1, (void *)&auth2, &N}; + void *kernel_2_args[7] = {(void *)&rowCounter2, (void *)&ptr, (void *)&idx, (void *)&val, (void *)&auth1, (void *)&hub2, &N}; + void *kernel_3_args[3] = {(void *)&auth2, (void *)&auth_norm, &N}; + void *kernel_4_args[3] = {(void *)&hub2, (void *)&hub_norm, &N}; + void *kernel_5_args[4] = {(void *)&auth2, (void *)&auth1, (void *)&auth_norm, &N}; + void *kernel_6_args[4] = {(void *)&hub2, (void *)&hub1, (void *)&hub_norm, &N}; + void *kernel_7_args[4] = {(void *)&auth_norm, (void *)&hub_norm, (void *)&rowCounter1, (void *)&rowCounter2}; + + callback_data = {0}; + callback_data.n1 = auth_norm; + callback_data.n2 = hub_norm; + callback_data.r1 = rowCounter1; + callback_data.r2 = rowCounter2; + + for (int i = 0; i < iterations; i++) { + dim3 tb(block_size_1d); + dim3 bs(num_blocks); + dim3 nb(ceil(N / ((float)block_size_1d))); + + if (i > 0) { + nodeDependencies.clear(); + if (pascalGpu) { + nodeDependencies.push_back(host_node); + } else { + nodeDependencies.push_back(kernel_7); + } + } + 
checkCudaErrors(add_node(kernel_1_args, kernel_1_params, (void *)spmv3, nb, tb, graph, &kernel_1, nodeDependencies, block_size_1d * sizeof(float))); + if (i > 0) { + nodeDependencies.clear(); + if (pascalGpu) { + nodeDependencies.push_back(host_node); + } else { + nodeDependencies.push_back(kernel_7); + } + } + add_node(kernel_2_args, kernel_2_params, (void *)spmv3, nb, tb, graph, &kernel_2, nodeDependencies, block_size_1d * sizeof(float)); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_1); + add_node(kernel_3_args, kernel_3_params, (void *)sum, bs, tb, graph, &kernel_3, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_2); + add_node(kernel_4_args, kernel_4_params, (void *)sum, bs, tb, graph, &kernel_4, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_2); + nodeDependencies.push_back(kernel_3); + add_node(kernel_5_args, kernel_5_params, (void *)divide, bs, tb, graph, &kernel_5, nodeDependencies); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_1); + nodeDependencies.push_back(kernel_4); + checkCudaErrors(add_node(kernel_6_args, kernel_6_params, (void *)divide, bs, tb, graph, &kernel_6, nodeDependencies)); + + nodeDependencies.clear(); + nodeDependencies.push_back(kernel_5); + nodeDependencies.push_back(kernel_6); + if (pascalGpu) { + host_params.fn = host_callback; + host_params.userData = (void *)&callback_data; + + checkCudaErrors(cudaGraphAddHostNode(&host_node, graph, + nodeDependencies.data(), + nodeDependencies.size(), &host_params)); + } else { + add_node(kernel_7_args, kernel_7_params, (void *)reset_kernel, bs, tb, graph, &kernel_7, nodeDependencies); + } + + // spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float), s1>>>(rowCounter1, ptr2, idx2, val2, hub1, auth2, N); + // spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float), s2>>>(rowCounter2, ptr, idx, val, auth1, hub2, N); + // sum<<<num_blocks, block_size_1d, 0, s1>>>(auth2, auth_norm, N); + // sum<<<num_blocks, block_size_1d, 0, s2>>>(hub2, hub_norm, N); + // divide<<<num_blocks, block_size_1d, 0, s1>>>(auth2, auth1, auth_norm, N); + // divide<<<num_blocks, block_size_1d, 0, s2>>>(hub2, hub1, hub_norm, N); + + // 
reset_kernel<<<1, 1, 0, s1>>>(auth_norm, hub_norm, rowCounter1, rowCounter2); + } + checkCudaErrors(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0)); + } + checkCudaErrors(cudaGraphLaunch(graphExec, s1)); + err = cudaStreamSynchronize(s1); +} + +void Benchmark7::execute_cudagraph_single(int iter) { + if (iter == 0) { + cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal); + + for (int i = 0; i < iterations; i++) { + int nb = ceil(N / ((float)block_size_1d)); + + // spmv<<<nb, block_size_1d, 0, s1>>>(ptr2, idx2, val2, hub1, auth2, N, nnz); + spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float), s1>>>(rowCounter1, ptr2, idx2, val2, hub1, auth2, N); + + // spmv<<<nb, block_size_1d, 0, s1>>>(ptr, idx, val, auth1, hub2, N, nnz); + spmv3<<<nb, block_size_1d, block_size_1d * sizeof(float), s1>>>(rowCounter2, ptr, idx, val, auth1, hub2, N); + + sum<<<num_blocks, block_size_1d, 0, s1>>>(auth2, auth_norm, N); + sum<<<num_blocks, block_size_1d, 0, s1>>>(hub2, hub_norm, N); + + divide<<<num_blocks, block_size_1d, 0, s1>>>(auth2, auth1, auth_norm, N); + + divide<<<num_blocks, block_size_1d, 0, s1>>>(hub2, hub1, hub_norm, N); + + reset_kernel<<<1, 1, 0, s1>>>(auth_norm, hub_norm, rowCounter1, rowCounter2); + } + + checkCudaErrors(cudaStreamEndCapture(s1, &graph)); + checkCudaErrors(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0)); + } + checkCudaErrors(cudaGraphLaunch(graphExec, s1)); + err = cudaStreamSynchronize(s1); +} + +std::string Benchmark7::print_result(bool short_form) { + if (short_form) { + return std::to_string(auth1[0]); + } else { + std::string res = "["; + for (int j = 0; j < 10; j++) { + res += std::to_string(auth1[j]) + ", "; + } + return res + ", ...]"; + } +} \ No newline at end of file diff --git a/projects/resources/cuda/single_gpu/b7.cuh b/projects/resources/cuda/single_gpu/b7.cuh new file mode 100644 index 00000000..937cbe9b --- /dev/null +++ b/projects/resources/cuda/single_gpu/b7.cuh @@ -0,0 +1,95 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#pragma once +#include <set> + +#include "../benchmark.cuh" + +typedef struct callBackData { + float *n1; + float *n2; + int *r1; + int *r2; +} callBackData_t; + +class Benchmark7 : public Benchmark { + public: + Benchmark7(Options &options) : Benchmark(options) {} + void alloc(); + void init(); + void reset(); + void execute_sync(int iter); + void execute_async(int iter); + void execute_cudagraph(int iter); + void execute_cudagraph_manual(int iter); + void execute_cudagraph_single(int iter); + std::string print_result(bool short_form = false); + + private: + int degree = 3; + int iterations = 5; + int nnz; + + int *ptr, *idx, *val, *ptr2, *idx2, *val2, *rowCounter1, *rowCounter2, *x, *y, *v; + int *ptr_tmp, *idx_tmp, *val_tmp, *ptr2_tmp, *idx2_tmp, *val2_tmp; + float *auth1, *auth2, *hub1, *hub2, *auth_norm, *hub_norm; + + cudaStream_t s1, s2; + cudaGraph_t graph; + cudaGraphExec_t graphExec; + + std::vector<cudaGraphNode_t> nodeDependencies; + cudaGraphNode_t kernel_1, kernel_2, kernel_3, kernel_4, kernel_5, kernel_6, kernel_7; + cudaGraphNode_t host_node; + callBackData_t callback_data; + cudaHostNodeParams host_params; + cudaKernelNodeParams kernel_1_params; + cudaKernelNodeParams kernel_2_params; + cudaKernelNodeParams kernel_3_params; + cudaKernelNodeParams kernel_4_params; + cudaKernelNodeParams kernel_5_params; + cudaKernelNodeParams kernel_6_params; + cudaKernelNodeParams kernel_7_params; + + inline void random_coo(int *x, int *y, int *val, int N, int degree) { + for (int i = 0; i < N; i++) { + std::set<int> edges; + while (edges.size() < degree) { + edges.insert(rand() % N); + } + int j = 0; + for (auto iter = edges.begin(); iter != edges.end(); iter++, j++) { + x[i * degree + j] = i; + y[i * degree + j] = *iter; + val[i * degree + j] = 1; + } + } + } +}; \ No newline at end of file diff --git a/projects/resources/cuda/single_gpu/b8.cu b/projects/resources/cuda/single_gpu/b8.cu new file mode 100644 index 00000000..12d3e729 --- /dev/null +++ 
b/projects/resources/cuda/single_gpu/b8.cu @@ -0,0 +1,543 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#include "b8.cuh" + +////////////////////////////// +////////////////////////////// + +extern "C" __global__ void gaussian_blur(const float *image, float *result, int rows, int cols, const float *kernel, int diameter) { + extern __shared__ float kernel_local[]; + for (int i = threadIdx.x; i < diameter; i += blockDim.x) { + for (int j = threadIdx.y; j < diameter; j += blockDim.y) { + kernel_local[i * diameter + j] = kernel[i * diameter + j]; + } + } + __syncthreads(); + + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum = 0; + int radius = diameter / 2; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + sum += kernel_local[(x + radius) * diameter + (y + radius)] * image[nx * cols + ny]; + } + } + } + result[i * cols + j] = sum; + } + } +} + +extern "C" __global__ void sobel(const float *image, float *result, int rows, int cols) { + // int SOBEL_X[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}}; + // int SOBEL_Y[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}; + __shared__ int SOBEL_X[9]; + __shared__ int SOBEL_Y[9]; + if (threadIdx.x == 0 && threadIdx.y == 0) { + SOBEL_X[0] = -1; + SOBEL_X[1] = -2; + SOBEL_X[2] = -1; + SOBEL_X[3] = 0; + SOBEL_X[4] = 0; + SOBEL_X[5] = 0; + SOBEL_X[6] = 1; + SOBEL_X[7] = 2; + SOBEL_X[8] = 1; + + SOBEL_Y[0] = -1; + SOBEL_Y[1] = 0; + SOBEL_Y[2] = 1; + SOBEL_Y[3] = -2; + SOBEL_Y[4] = 0; + SOBEL_Y[5] = 2; + SOBEL_Y[6] = -1; + SOBEL_Y[7] = 0; + SOBEL_Y[8] = 1; + } + __syncthreads(); + + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum_gradient_x = 0.0, sum_gradient_y = 0.0; + int radius = 1; + for (int x = -radius; x <= 
radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + float neighbour = image[nx * cols + ny]; + int s = (x + radius) * 3 + y + radius; + sum_gradient_x += SOBEL_X[s] * neighbour; + sum_gradient_y += SOBEL_Y[s] * neighbour; + } + } + } + result[i * cols + j] = sqrt(sum_gradient_x * sum_gradient_x + sum_gradient_y * sum_gradient_y); + } + } +} + +__device__ float atomicMinf(float *address, float val) { + int *address_as_int = (int *)address; + int old = *address_as_int, assumed; + while (val < __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, + __float_as_int(val)); + } + return __int_as_float(old); +} + +__device__ float atomicMaxf(float *address, float val) { + int *address_as_int = (int *)address; + int old = *address_as_int, assumed; + // If val is smaller than current, don't do anything, else update the current value atomically; + while (val > __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, __float_as_int(val)); + } + return __int_as_float(old); +} + +__inline__ __device__ float warp_reduce_max(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = max(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +__inline__ __device__ float warp_reduce_min(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +extern "C" __global__ void maximum_kernel(const float *in, float *out, int N) { + int warp_size = 32; + float maximum = -1000; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + maximum = max(maximum, in[i]); + } + maximum = warp_reduce_max(maximum); // Obtain the max of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) 
== 0 but faster + atomicMaxf(out, maximum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void minimum_kernel(const float *in, float *out, int N) { + int warp_size = 32; + float minimum = 1000; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + minimum = min(minimum, in[i]); + } + minimum = warp_reduce_min(minimum); // Obtain the min of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicMinf(out, minimum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void extend(float *x, const float *minimum, const float *maximum, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = 5 * (x[i] - *minimum) / (*maximum - *minimum); + x[i] = res_tmp > 1 ? 1 : res_tmp; + } +} + +extern "C" __global__ void unsharpen(const float *x, const float *y, float *res, float amount, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = x[i] * (1 + amount) - y[i] * amount; + res_tmp = res_tmp > 1 ? 1 : res_tmp; + res[i] = res_tmp < 0 ? 
0 : res_tmp; + } +} + +extern "C" __global__ void combine(const float *x, const float *y, const float *mask, float *res, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + res[i] = x[i] * mask[i] + y[i] * (1 - mask[i]); + } +} + +extern "C" __global__ void reset_image(float *x, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + x[i] = 0.0; + } +} + +////////////////////////////// +////////////////////////////// + +void Benchmark8::alloc() { + err = cudaMallocManaged(&image, sizeof(float) * N * N); + err = cudaMallocManaged(&image2, sizeof(float) * N * N); + err = cudaMallocManaged(&image3, sizeof(float) * N * N); + err = cudaMallocManaged(&image_unsharpen, sizeof(float) * N * N); + err = cudaMallocManaged(&mask_small, sizeof(float) * N * N); + err = cudaMallocManaged(&mask_large, sizeof(float) * N * N); + err = cudaMallocManaged(&mask_unsharpen, sizeof(float) * N * N); + err = cudaMallocManaged(&blurred_small, sizeof(float) * N * N); + err = cudaMallocManaged(&blurred_large, sizeof(float) * N * N); + err = cudaMallocManaged(&blurred_unsharpen, sizeof(float) * N * N); + + err = cudaMallocManaged(&kernel_small, sizeof(float) * kernel_small_diameter * kernel_small_diameter); + err = cudaMallocManaged(&kernel_large, sizeof(float) * kernel_large_diameter * kernel_large_diameter); + err = cudaMallocManaged(&kernel_unsharpen, sizeof(float) * kernel_unsharpen_diameter * kernel_unsharpen_diameter); + err = cudaMallocManaged(&maximum, sizeof(float)); + err = cudaMallocManaged(&minimum, sizeof(float)); + + err = cudaStreamCreate(&s1); + err = cudaStreamCreate(&s2); + err = cudaStreamCreate(&s3); + err = cudaStreamCreate(&s4); + err = cudaStreamCreate(&s5); +} + +void Benchmark8::init() { + for (int i = 0; i < N; i++) { + for (int j = 0; j < N; j++) { + image[i * N + j] = (float)(rand()) / (float)(RAND_MAX); + } + } + gaussian_kernel(kernel_small, kernel_small_diameter, 1); 
+    gaussian_kernel(kernel_large, kernel_large_diameter, 10);
+    gaussian_kernel(kernel_unsharpen, kernel_unsharpen_diameter, 5);
+}
+
+void Benchmark8::reset() {
+    // for (int i = 0; i < N; i++) {
+    //     for (int j = 0; j < N; j++) {
+    //         image3[i * N + j] = 0;
+    //     }
+    // }
+    memset(image3, 0, N * N * sizeof(float));
+    *maximum = 0;
+    *minimum = 0;
+    reset_image<<<num_blocks, block_size_1d>>>(image3, N * N);
+    cudaDeviceSynchronize();
+}
+
+void Benchmark8::execute_sync(int iter) {
+    dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+    dim3 grid_size(num_blocks, num_blocks);
+    dim3 grid_size_2(num_blocks / 2, num_blocks / 2);
+
+    if (pascalGpu && do_prefetch) {
+        cudaMemPrefetchAsync(image3, N * N * sizeof(float), 0, 0);
+    }
+
+    gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_small_diameter * kernel_small_diameter * sizeof(float)>>>(image, blurred_small, N, N, kernel_small, kernel_small_diameter);
+    cudaDeviceSynchronize();
+
+    gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_large_diameter * kernel_large_diameter * sizeof(float)>>>(image, blurred_large, N, N, kernel_large, kernel_large_diameter);
+    cudaDeviceSynchronize();
+
+    gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_unsharpen_diameter * kernel_unsharpen_diameter * sizeof(float)>>>(image, blurred_unsharpen, N, N, kernel_unsharpen, kernel_unsharpen_diameter);
+    cudaDeviceSynchronize();
+
+    sobel<<<grid_size_2, block_size_2d_dim>>>(blurred_small, mask_small, N, N);
+    cudaDeviceSynchronize();
+
+    sobel<<<grid_size_2, block_size_2d_dim>>>(blurred_large, mask_large, N, N);
+    cudaDeviceSynchronize();
+
+    maximum_kernel<<<num_blocks, block_size_1d>>>(mask_large, maximum, N * N);
+    cudaDeviceSynchronize();
+
+    minimum_kernel<<<num_blocks, block_size_1d>>>(mask_large, minimum, N * N);
+    cudaDeviceSynchronize();
+
+    extend<<<num_blocks, block_size_1d>>>(mask_large, minimum, maximum, N * N);
+    cudaDeviceSynchronize();
+
+    unsharpen<<<num_blocks, block_size_1d>>>(image, blurred_unsharpen, image_unsharpen, 0.5, N * N);
+    cudaDeviceSynchronize();
+
+    combine<<<num_blocks, block_size_1d>>>(image_unsharpen, blurred_large, mask_large, image2, N * N);
+    cudaDeviceSynchronize();
+
+    combine<<<num_blocks, block_size_1d>>>(image2, blurred_small, mask_small, image3, N * N);
+
+    // Extra
+    // combine<<<num_blocks, block_size_1d>>>(blurred_small, blurred_large, blurred_unsharpen, image3, N * N);
+
+    cudaDeviceSynchronize();
+}
+
+void Benchmark8::execute_async(int iter) {
+    dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+    dim3 grid_size(num_blocks, num_blocks);
int nb = num_blocks / 2;
+    dim3 grid_size_2(nb, nb);
+    if (!pascalGpu || stream_attach) {
+        cudaStreamAttachMemAsync(s1, blurred_small, 0);
+        cudaStreamAttachMemAsync(s1, mask_small, 0);
+        cudaStreamAttachMemAsync(s2, blurred_large, 0);
+        cudaStreamAttachMemAsync(s2, mask_large, 0);
+        cudaStreamAttachMemAsync(s2, image2, 0);
+        cudaStreamAttachMemAsync(s3, blurred_unsharpen, 0);
+        cudaStreamAttachMemAsync(s3, image_unsharpen, 0);
+        cudaStreamAttachMemAsync(s1, image3, 0);
+    }
+
+    gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_small_diameter * kernel_small_diameter * sizeof(float), s1>>>(image, blurred_small, N, N, kernel_small, kernel_small_diameter);
+
+    gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_large_diameter * kernel_large_diameter * sizeof(float), s2>>>(image, blurred_large, N, N, kernel_large, kernel_large_diameter);
+
+    gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_unsharpen_diameter * kernel_unsharpen_diameter * sizeof(float), s3>>>(image, blurred_unsharpen, N, N, kernel_unsharpen, kernel_unsharpen_diameter);
+
+    sobel<<<grid_size_2, block_size_2d_dim, 0, s1>>>(blurred_small, mask_small, N, N);
+
+    sobel<<<grid_size_2, block_size_2d_dim, 0, s2>>>(blurred_large, mask_large, N, N);
+
+    cudaEvent_t e1, e2, e3, e4, e5;
+    cudaEventCreate(&e1);
+    cudaEventCreate(&e2);
+    cudaEventCreate(&e3);
+    cudaEventCreate(&e4);
+    cudaEventCreate(&e5);
+
+    cudaEventRecord(e1, s2);
+    cudaStreamWaitEvent(s5, e1, 0);
+    maximum_kernel<<<num_blocks, block_size_1d, 0, s5>>>(mask_large, maximum, N * N);
+
+    cudaStreamWaitEvent(s4, e1, 0);
+    minimum_kernel<<<num_blocks, block_size_1d, 0, s4>>>(mask_large, minimum, N * N);
+
+    cudaEventRecord(e2, s4);
+    cudaEventRecord(e5, s5);
+
+    cudaStreamWaitEvent(s2, e2, 0);
+    cudaStreamWaitEvent(s2, e5, 0);
+
+    extend<<<num_blocks, block_size_1d, 0, s2>>>(mask_large, minimum, maximum, N * N);
+
+    unsharpen<<<num_blocks, block_size_1d, 0, s3>>>(image, blurred_unsharpen, image_unsharpen, 0.5, N * N);
+    cudaEventRecord(e3, s3);
+    cudaStreamWaitEvent(s2, e3, 0);
+    combine<<<num_blocks, block_size_1d, 0, s2>>>(image_unsharpen, blurred_large, mask_large, image2, N * N);
+    cudaEventRecord(e4, s2);
+    cudaStreamWaitEvent(s1, e4, 0);
+    cudaStreamAttachMemAsync(s1, image2, 0);
+    if (pascalGpu && do_prefetch) {
+        cudaMemPrefetchAsync(image3, N * N * sizeof(float), 0, s1);
+    }
+    combine<<<num_blocks, block_size_1d, 0, s1>>>(image2, blurred_small, mask_small, image3, N * N);
+
+    // Extra
+    // cudaEventRecord(e1, s2);
+    // cudaEventRecord(e2, s3);
+    // cudaStreamWaitEvent(s1, e1, 0);
+    // cudaStreamWaitEvent(s1, e2, 0);
+    // combine<<<num_blocks, block_size_1d, 0, s1>>>(blurred_small, blurred_large, blurred_unsharpen, image3, N * N);
+
+    cudaStreamSynchronize(s1);
+}
+
+void Benchmark8::execute_cudagraph(int iter) {
+    dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+    dim3 grid_size(num_blocks, num_blocks);
+    int nb = num_blocks / 2;
+    dim3 grid_size_2(nb, nb);
+    if (iter == 0) {
+        cudaEvent_t ef;
+        cudaEventCreate(&ef);
+        cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
+        cudaEventRecord(ef, s1);
+        cudaStreamWaitEvent(s2, ef, 0);
+        cudaStreamWaitEvent(s3, ef, 0);
+        cudaStreamWaitEvent(s4, ef, 0);
+        cudaStreamWaitEvent(s5, ef, 0);
+
+        gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_small_diameter * kernel_small_diameter * sizeof(float), s1>>>(image, blurred_small, N, N, kernel_small, kernel_small_diameter);
+
+        gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_large_diameter * kernel_large_diameter * sizeof(float), s2>>>(image, blurred_large, N, N, kernel_large, kernel_large_diameter);
+
+        gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_unsharpen_diameter * kernel_unsharpen_diameter * sizeof(float), s3>>>(image, blurred_unsharpen, N, N, kernel_unsharpen, kernel_unsharpen_diameter);
+
+        sobel<<<grid_size_2, block_size_2d_dim, 0, s1>>>(blurred_small, mask_small, N, N);
+
+        sobel<<<grid_size_2, block_size_2d_dim, 0, s2>>>(blurred_large, mask_large, N, N);
+
+        cudaEvent_t e1, e2, e3, e4, e5;
+        cudaEventCreate(&e1);
+        cudaEventCreate(&e2);
+        cudaEventCreate(&e3);
+        cudaEventCreate(&e4);
+        cudaEventCreate(&e5);
+
+        cudaEventRecord(e1, s2);
+        cudaStreamWaitEvent(s5, e1, 0);
+        maximum_kernel<<<num_blocks, block_size_1d, 0, s5>>>(mask_large, maximum, N * N);
+
+        cudaStreamWaitEvent(s4, e1, 0);
+        minimum_kernel<<<num_blocks, block_size_1d, 0, s4>>>(mask_large, minimum, N * N);
+
+        cudaEventRecord(e2, s4);
+        cudaEventRecord(e5, s5);
+
+        cudaStreamWaitEvent(s2, e2, 0);
+        cudaStreamWaitEvent(s2, e5, 0);
+
+        extend<<<num_blocks, block_size_1d, 0, s2>>>(mask_large, minimum, maximum, N * N);
+
+        unsharpen<<<num_blocks, block_size_1d, 0, s3>>>(image, blurred_unsharpen, image_unsharpen, 0.5, N * N);
+        cudaEventRecord(e3, s3);
+        cudaStreamWaitEvent(s2, e3, 0);
+        combine<<<num_blocks, block_size_1d, 0, s2>>>(image_unsharpen, blurred_large, mask_large, image2, N * N);
+        cudaEventRecord(e4, s2);
+        cudaStreamWaitEvent(s1, e4, 0);
+
+        combine<<<num_blocks, block_size_1d, 0, s1>>>(image2, blurred_small, mask_small, image3, N * N);
+
+        cudaStreamEndCapture(s1, &graph);
+        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
+    }
cudaGraphLaunch(graphExec, s1);
+    err = cudaStreamSynchronize(s1);
+}
+
+void Benchmark8::execute_cudagraph_manual(int iter) {
+    if (iter == 0) {
+        dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+        dim3 grid_size(num_blocks, num_blocks);
+        int nb = num_blocks / 2;
+        dim3 grid_size_2(nb, nb);
+        dim3 tb(block_size_1d);
+        dim3 bs(num_blocks);
+        int N2 = N * N;
+        float a = 0.5;  // The unsharpen "amount"; it must be a float, as its address is passed as the kernel's float argument;
+        cudaGraphCreate(&graph, 0);
+        void *kernel_1_args[6] = {(void *)&image, (void *)&blurred_small, &N, &N, (void *)&kernel_small, &kernel_small_diameter};
+        void *kernel_2_args[6] = {(void *)&image, (void *)&blurred_large, &N, &N, (void *)&kernel_large, &kernel_large_diameter};
+        void *kernel_3_args[6] = {(void *)&image, (void *)&blurred_unsharpen, &N, &N, (void *)&kernel_unsharpen, &kernel_unsharpen_diameter};
+        void *kernel_4_args[4] = {(void *)&blurred_small, (void *)&mask_small, &N, &N};
+        void *kernel_5_args[4] = {(void *)&blurred_large, (void *)&mask_large, &N, &N};
+        void *kernel_6_args[3] = {(void *)&mask_large, (void *)&maximum, &N2};
+        void *kernel_7_args[3] = {(void *)&mask_large, (void *)&minimum, &N2};
+        void *kernel_8_args[4] = {(void *)&mask_large, (void *)&minimum, (void *)&maximum, &N2};
+        void *kernel_9_args[5] = {(void *)&image, (void *)&blurred_unsharpen, (void *)&image_unsharpen, &a, &N2};
+        void *kernel_10_args[5] = {(void *)&image_unsharpen, (void *)&blurred_large, (void *)&mask_large, (void *)&image2, &N2};
+        void *kernel_11_args[5] = {(void *)&image2, (void *)&blurred_small, (void *)&mask_small, (void *)&image3, &N2};
+
+        add_node(kernel_1_args, kernel_1_params, (void *)gaussian_blur, grid_size_2, block_size_2d_dim, graph, &kernel_1, nodeDependencies, kernel_small_diameter * kernel_small_diameter * sizeof(float));
+        add_node(kernel_2_args, kernel_2_params, (void *)gaussian_blur, grid_size_2, block_size_2d_dim, graph, &kernel_2, nodeDependencies, kernel_large_diameter * kernel_large_diameter * sizeof(float));
+        add_node(kernel_3_args, kernel_3_params, (void *)gaussian_blur, grid_size_2, block_size_2d_dim, graph, &kernel_3, nodeDependencies, kernel_unsharpen_diameter * kernel_unsharpen_diameter * sizeof(float));
+
+        nodeDependencies.clear();
+        nodeDependencies.push_back(kernel_1);
+        add_node(kernel_4_args, kernel_4_params, (void *)sobel, grid_size_2, block_size_2d_dim, graph, &kernel_4, nodeDependencies);
+
+        nodeDependencies.clear();
+        nodeDependencies.push_back(kernel_2);
+        add_node(kernel_5_args, kernel_5_params, (void *)sobel, grid_size_2, block_size_2d_dim, graph, &kernel_5, nodeDependencies);
+
+        nodeDependencies.clear();
+        nodeDependencies.push_back(kernel_5);
+        add_node(kernel_6_args, kernel_6_params, (void *)maximum_kernel, bs, tb, graph, &kernel_6, nodeDependencies);
+
+        nodeDependencies.clear();
+        nodeDependencies.push_back(kernel_5);
+        add_node(kernel_7_args, kernel_7_params, (void *)minimum_kernel, bs, tb, graph, &kernel_7, nodeDependencies);
+
+        nodeDependencies.clear();
+        nodeDependencies.push_back(kernel_6);
+        nodeDependencies.push_back(kernel_7);
+        add_node(kernel_8_args, kernel_8_params, (void *)extend, bs, tb, graph, &kernel_8, nodeDependencies);
+
+        nodeDependencies.clear();
+        nodeDependencies.push_back(kernel_3);
+        add_node(kernel_9_args, kernel_9_params, (void *)unsharpen, bs, tb, graph, &kernel_9, nodeDependencies);
+
+        nodeDependencies.clear();
+        nodeDependencies.push_back(kernel_8);
+        nodeDependencies.push_back(kernel_9);
+        add_node(kernel_10_args, kernel_10_params, (void *)combine, bs, tb, graph, &kernel_10, nodeDependencies);
+
+        nodeDependencies.clear();
+        nodeDependencies.push_back(kernel_4);
+        nodeDependencies.push_back(kernel_10);
+        add_node(kernel_11_args, kernel_11_params, (void *)combine, bs, tb, graph, &kernel_11, nodeDependencies);
+
+        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
+    }
+    cudaGraphLaunch(graphExec, s1);
+    err = cudaStreamSynchronize(s1);
+}
+
+void Benchmark8::execute_cudagraph_single(int iter) {
+    dim3 block_size_2d_dim(block_size_2d, block_size_2d);
+    dim3 grid_size(num_blocks, num_blocks);
+    int nb = num_blocks / 2;
+    dim3 grid_size_2(nb, nb);
+    if (iter == 0) {
+        cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
+
+        gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_small_diameter * kernel_small_diameter * sizeof(float), s1>>>(image, blurred_small, N, N, kernel_small, kernel_small_diameter);
+
+        gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_large_diameter * kernel_large_diameter * sizeof(float), s1>>>(image, blurred_large, N, N, kernel_large, kernel_large_diameter);
+
+        gaussian_blur<<<grid_size_2, block_size_2d_dim, kernel_unsharpen_diameter * kernel_unsharpen_diameter * sizeof(float), s1>>>(image, blurred_unsharpen, N, N, kernel_unsharpen, kernel_unsharpen_diameter);
+
+        sobel<<<grid_size_2, block_size_2d_dim, 0, s1>>>(blurred_small, mask_small, N, N);
+
+        sobel<<<grid_size_2, block_size_2d_dim, 0, s1>>>(blurred_large, mask_large, N, N);
+
+        maximum_kernel<<<num_blocks, block_size_1d, 0, s1>>>(mask_large, maximum, N * N);
+
+        minimum_kernel<<<num_blocks, block_size_1d, 0, s1>>>(mask_large, minimum, N * N);
+
+        extend<<<num_blocks, block_size_1d, 0, s1>>>(mask_large, minimum, maximum, N * N);
+
+        unsharpen<<<num_blocks, block_size_1d, 0, s1>>>(image, blurred_unsharpen, image_unsharpen, 0.5, N * N);
+
+        combine<<<num_blocks, block_size_1d, 0, s1>>>(image_unsharpen, blurred_large, mask_large, image2, N * N);
+
+        combine<<<num_blocks, block_size_1d, 0, s1>>>(image2, blurred_small, mask_small, image3, N * N);
+
+        cudaStreamEndCapture(s1, &graph);
+        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
+    }
+    cudaGraphLaunch(graphExec, s1);
+    err = cudaStreamSynchronize(s1);
+}
+
+std::string Benchmark8::print_result(bool short_form) {
+    if (short_form) {
+        return std::to_string(image3[0]);
+    } else {
+        std::string res = "[";
+        for (int j = 0; j < 10; j++) {
+            res += std::to_string(image3[j]) + ", ";
+        }
+        return res + "...]";
+    }
+}
\ No newline at end of file
diff --git a/projects/resources/cuda/single_gpu/b8.cuh b/projects/resources/cuda/single_gpu/b8.cuh
new file mode 100644
index 00000000..049a2159
--- /dev/null
+++ b/projects/resources/cuda/single_gpu/b8.cuh
@@ -0,0 +1,86 @@
+// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.
+
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+//  * Redistributions of source code must retain the above copyright
+//    notice, this list of conditions and the following disclaimer.
+// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+
+#pragma once
+#include "../benchmark.cuh"
+
+class Benchmark8 : public Benchmark {
+   public:
+    Benchmark8(Options &options) : Benchmark(options) {}
+    void alloc();
+    void init();
+    void reset();
+    void execute_sync(int iter);
+    void execute_async(int iter);
+    void execute_cudagraph(int iter);
+    void execute_cudagraph_manual(int iter);
+    void execute_cudagraph_single(int iter);
+    std::string print_result(bool short_form = false);
+
+   private:
+    int kernel_small_diameter = 3;
+    int kernel_large_diameter = 5;
+    int kernel_unsharpen_diameter = 3;
+
+    float *image, *image2, *image3, *image_unsharpen, *mask_small, *mask_large, *mask_unsharpen, *blurred_small, *blurred_large, *blurred_unsharpen;
+    float *kernel_small, *kernel_large, *kernel_unsharpen, *maximum, *minimum;
+    cudaStream_t s1, s2, s3, s4, s5;
+    cudaGraph_t graph;
+    cudaGraphExec_t graphExec;
+
+    std::vector<cudaGraphNode_t> nodeDependencies;
+    cudaGraphNode_t kernel_1, kernel_2, kernel_3, kernel_4, kernel_5, kernel_6, kernel_7, kernel_8, kernel_9, kernel_10, kernel_11;
+    cudaKernelNodeParams kernel_1_params;
+    cudaKernelNodeParams kernel_2_params;
+    cudaKernelNodeParams kernel_3_params;
+    cudaKernelNodeParams kernel_4_params;
+    cudaKernelNodeParams kernel_5_params;
+    cudaKernelNodeParams kernel_6_params;
+    cudaKernelNodeParams kernel_7_params;
+    cudaKernelNodeParams kernel_8_params;
+    cudaKernelNodeParams kernel_9_params;
+    cudaKernelNodeParams kernel_10_params;
+    cudaKernelNodeParams kernel_11_params;
+
+    inline void gaussian_kernel(float *kernel, int diameter, float sigma) {
+        int mean = diameter / 2;
+        float sum_tmp = 0;
+        for (int i = 0; i < diameter; i++) {
+            for (int j = 0; j < diameter; j++) {
+                kernel[i * diameter + j] = exp(-0.5 * ((i - mean) * (i - mean) + (j - mean) * (j - mean)) / (sigma * sigma));
+                sum_tmp += kernel[i * diameter + j];
+            }
+        }
+        // Normalize the kernel so that its entries sum to 1;
+        for (int i = 0; i < diameter; i++) {
+            for (int j = 0; j < diameter; j++) {
+                kernel[i * diameter + j] /= sum_tmp;
+            }
+        }
+    }
+};
\ No newline at end of file
diff
--git a/projects/resources/cuda/utils.hpp b/projects/resources/cuda/utils.hpp new file mode 100644 index 00000000..3cc67e67 --- /dev/null +++ b/projects/resources/cuda/utils.hpp @@ -0,0 +1,177 @@ +// Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NECSTLab nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// * Neither the name of Politecnico di Milano nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. + +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+
+#pragma once
+
+#include <algorithm>
+#include <iostream>
+#include <tuple>
+#include <vector>
+#include <cstdio>
+#include <cstdlib>
+#include "dvrapi_error_string.h"
+
+#define checkCudaErrors(err) __checkCudaErrors(err, __FILE__, __LINE__)
+
+// These are the inline versions for all of the SDK helper functions
+inline void __checkCudaErrors(int err, const char *file, const int line) {
+    if (0 != err) {
+        fprintf(stderr,
+                "checkCudaErrors() Driver API error = %04d \"%s\" from file <%s>, "
+                "line %i.\n",
+                err, getCudaDrvErrorString(err), file, line);
+        exit(EXIT_FAILURE);
+    }
+}
+
+template <typename T>
+struct map_init_helper {
+    T &data;
+    map_init_helper(T &d) : data(d) {}
+    map_init_helper &operator()(typename T::key_type const &key, typename T::mapped_type const &value) {
+        data[key] = value;
+        return *this;
+    }
+};
+
+template <typename T>
+map_init_helper<T> map_init(T &item) {
+    return map_init_helper<T>(item);
+}
+
+// Order (row, column, value, index) tuples by row first, then by column;
+template <typename I, typename T>
+inline bool compare(const std::tuple<I, I, T, I> &lhs, const std::tuple<I, I, T, I> &rhs) {
+    I a = std::get<0>(lhs);
+    I b = std::get<0>(rhs);
+    I c = std::get<1>(lhs);
+    I d = std::get<1>(rhs);
+    if (a == b)
+        return c < d;
+    else
+        return a < b;
+}
+
+template <typename I, typename T>
+inline void customSort(I *row_indices, I *col_indices, T *values, I nnz) {
+    I nvals = nnz;
+    std::vector<std::tuple<I, I, T, I>> my_tuple;
+
+    for (I i = 0; i < nvals; ++i)
+        my_tuple.push_back(std::make_tuple(row_indices[i], col_indices[i], values[i], i));
+
+    std::sort(my_tuple.begin(), my_tuple.end(), compare<I, T>);
+
+    for (I i = 0; i < nvals; ++i) {
+        row_indices[i] = std::get<0>(my_tuple[i]);
+        col_indices[i] = std::get<1>(my_tuple[i]);
+        values[i] = std::get<2>(my_tuple[i]);
+    }
+}
+
+template <typename I, typename T>
+inline void coo2csr(I *csrRowPtr, I *csrColInd, T *csrVal, I *row_indices,
+                    I *col_indices, T *values, I nrows, I ncols, I nnz) {
+    I temp, row, col, dest, cumsum = 0;
+
+    std::vector<I> row_indices_t(row_indices, row_indices + nnz);
+    std::vector<I> col_indices_t(col_indices, col_indices + nnz);
+    std::vector<T> values_t(values, values + nnz);
+
+    customSort(row_indices_t.data(), col_indices_t.data(), values_t.data(), nnz);
+
+    // Set all rowPtr to 0
+    for (I i = 0; i <= nrows; i++)
+        csrRowPtr[i] = 0;
+
+    // Go through all elements to see how many fall in each row
+    for (I i = 0; i < nnz; i++) {
+        row = row_indices_t[i];
+        if (row >= nrows)
+            std::cout << "Error: Index out of bounds!\n";
+        csrRowPtr[row]++;
+    }
+
+    // Cumulative sum to obtain rowPtr
+    for (I i = 0; i < nrows; i++) {
+        temp = csrRowPtr[i];
+        csrRowPtr[i] = cumsum;
+        cumsum += temp;
+    }
+    csrRowPtr[nrows] = nnz;
+
+    // Store colInd and val
+    for (I i = 0; i < nnz; i++) {
+        row = row_indices_t[i];
+        dest = csrRowPtr[row];
+        col = col_indices_t[i];
+        if (col >= ncols)
+            std::cout << "Error: Index out of bounds!\n";
+        csrColInd[dest] = col;
+        csrVal[dest] = values_t[i];
+        csrRowPtr[row]++;
+    }
+    cumsum = 0;
+
+    // Undo damage done to rowPtr
+    for (I i = 0; i < nrows; i++) {
+        temp = csrRowPtr[i];
+        csrRowPtr[i] = cumsum;
+        cumsum = temp;
+    }
+    temp = csrRowPtr[nrows];
+    csrRowPtr[nrows] = cumsum;
+    cumsum = temp;
+}
+
+template <typename T>
+inline void print_graph(T *ptr, T *idx, int N, int max_N = 20, int max_E = 20) {
+    std::cout << "-) degree: " << ptr[0] << std::endl;
+    for (int v = 1; v < std::min((int) N + 1, max_N); v++) {
+        std::cout << v - 1 << ") degree: " << ptr[v] - ptr[v - 1] << ", edges: ";
+        for (int e = 0; e < ptr[v] - ptr[v - 1]; e++) {
+            if (e < max_E) {
+                std::cout << idx[ptr[v - 1] + e] << ", ";
+            }
+        }
+        std::cout << std::endl;
+    }
+}
diff --git a/projects/resources/java/grcuda-benchmark/pom.xml b/projects/resources/java/grcuda-benchmark/pom.xml
new file mode 100644
index 00000000..8b1001c5
--- /dev/null
+++ b/projects/resources/java/grcuda-benchmark/pom.xml
@@ -0,0 +1,92 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+
+  <groupId>it.necst.grcuda.benchmark</groupId>
+  <artifactId>grcuda-benchmark</artifactId>
+  <version>1.0-SNAPSHOT</version>
+  <name>grcuda-benchmark</name>
+
+  <properties>
+    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+    <maven.compiler.source>11</maven.compiler.source>
+    <maven.compiler.target>11</maven.compiler.target>
+  </properties>
+
+  <dependencies>
+    <dependency>
+      <groupId>com.google.code.gson</groupId>
+      <artifactId>gson</artifactId>
+      <version>2.9.0</version>
+    </dependency>
+    <dependency>
+      <groupId>junit</groupId>
+      <artifactId>junit</artifactId>
+      <version>4.13.2</version>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-databind</artifactId>
+      <version>2.13.3</version>
+    </dependency>
+  </dependencies>
+
+  <build>
+    <pluginManagement>
+      <plugins>
+        <plugin>
+          <artifactId>maven-clean-plugin</artifactId>
+          <version>3.1.0</version>
+        </plugin>
+        <plugin>
+          <artifactId>maven-resources-plugin</artifactId>
+          <version>3.0.2</version>
+        </plugin>
+        <plugin>
+          <artifactId>maven-compiler-plugin</artifactId>
+          <version>3.8.0</version>
+        </plugin>
+        <plugin>
+          <artifactId>maven-surefire-plugin</artifactId>
+          <version>2.22.1</version>
+          <configuration>
+            ${env.GRCUDA_HOME}
+          </configuration>
+        </plugin>
+        <plugin>
+          <artifactId>maven-jar-plugin</artifactId>
+          <version>3.0.2</version>
+        </plugin>
+        <plugin>
+          <artifactId>maven-install-plugin</artifactId>
+          <version>2.5.2</version>
+        </plugin>
+        <plugin>
+          <artifactId>maven-deploy-plugin</artifactId>
+          <version>2.8.2</version>
+        </plugin>
+        <plugin>
+          <artifactId>maven-site-plugin</artifactId>
+          <version>3.7.1</version>
+        </plugin>
+        <plugin>
+          <artifactId>maven-project-info-reports-plugin</artifactId>
+          <version>3.0.0</version>
+        </plugin>
+      </plugins>
+    </pluginManagement>
+  </build>
+</project>
diff --git a/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/Benchmark.java b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/Benchmark.java
new file mode 100644
index 00000000..91e09777
--- /dev/null
+++ b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/Benchmark.java
@@ -0,0 +1,213 @@
+/*
+ * Copyright (c) 2022 NECSTLab, Politecnico di Milano. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NECSTLab nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *  * Neither the name of Politecnico di Milano nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+package it.necst.grcuda.benchmark;
+
+import java.util.ArrayList;
+import java.util.function.Consumer;
+import org.graalvm.polyglot.Context;
+import org.graalvm.polyglot.Value;
+
+
+public abstract class Benchmark {
+    public Context context;
+    public final BenchmarkConfig config;
+    public final BenchmarkResults benchmarkResults;
+    public ArrayList<Value> deviceArrayList = new ArrayList<>(); // Stores all the device arrays, so they can be freed at the end of the benchmark
+
+    public Benchmark(BenchmarkConfig currentConfig) {
+        this.config = currentConfig;
+        this.benchmarkResults = new BenchmarkResults(currentConfig);
+        this.context = createContext(currentConfig);
+    }
+
+    /**
+     * Run the current benchmark.
+     * It uses the information stored in the config attribute to decide whether to perform
+     * an additional initialization phase and the CPU validation.
+ */ + public void run() { + if(config.debug) + System.out.println("INSIDE run()"); + + for (int i = 0; i < config.totIter; ++i) { + if(config.debug) + System.out.println("["+i+"] START"); + benchmarkResults.startNewIteration(i, config.timePhases); // create the current iteration in the result class + + // Start a timer to monitor the total GPU execution time + long overall_gpu_start = System.nanoTime(); + + // Allocate memory for the benchmark + + if (config.reAlloc || i == 0){ + if(config.debug) + System.out.println("["+i+"] alloc"); + time(i, "alloc", this::allocateTest); + } + + // Initialize memory for the benchmark + + if (config.reInit || i == 0){ + if(config.debug) + System.out.println("["+i+"] init"); + time(i, "init", this::initializeTest); + } + + // Reset the result + if(config.debug) + System.out.println("["+i+"] reset"); + time(i, "reset", this::resetIteration); + + if(config.nvprof_profile){ + context.eval("grcuda", "cudaProfilerStart").execute(); + } + + // Execute the benchmark + if(config.debug) + System.out.println("["+i+"] execution"); + time(i, "execution", this::runTest); + + if(config.nvprof_profile){ + context.eval("grcuda", "cudaProfilerStop").execute(); + } + + // Stop the timer + long overall_gpu_end = System.nanoTime(); + + benchmarkResults.setCurrentTotalTime((overall_gpu_end - overall_gpu_start) / 1000000000F); + + // Perform validation on CPU + if (config.cpuValidate && i == 0) + cpuValidation(); + + if(config.debug) + System.out.println("["+i+"] VALIDATION \nCPU: " + benchmarkResults.cpu_result+"\nGPU: " + benchmarkResults.currentIteration().gpu_result); + } + + // Save the benchmark results + benchmarkResults.saveToJsonFile(); + + + // Free the allocated arrays + deallocDeviceArrays(); + + // Gracefully close the current context + context.close(); + } + + /** + * This method is used to time the function passed to it. + * It will add the timing and the phase name to the benchmarkResult attribute. 
+     * @param iteration the current iteration of the benchmark
+     * @param phaseName the current phase of the benchmark
+     * @param functionToTime the function to time, passed like "class::funName"
+     */
+    private void time(int iteration, String phaseName, Consumer<Integer> functionToTime) {
+        long begin = System.nanoTime();
+        functionToTime.accept(iteration);
+        long end = System.nanoTime();
+        benchmarkResults.addPhaseToCurrentIteration(phaseName, (end - begin) / 1000000000F);
+    }
+
+    protected void deallocDeviceArrays() {
+        for (Value v : deviceArrayList)
+            v.invokeMember("free");
+    }
+
+    protected Value requestArray(String type, int size) {
+        Value vector = context.eval("grcuda", type + "[" + size + "]");
+        deviceArrayList.add(vector);
+        return vector;
+    }
+
+    private Context createContext(BenchmarkConfig config) {
+        return Context
+                .newBuilder()
+                .allowAllAccess(true)
+                .allowExperimentalOptions(true)
+                // Logging settings
+                .option("log.grcuda.com.nvidia.grcuda.level", "WARNING")
+                .option("log.grcuda.com.nvidia.grcuda.GrCUDAContext.level", "SEVERE")
+                // GrCUDA environment settings
+                .option("grcuda.ExecutionPolicy", config.executionPolicy)
+                .option("grcuda.InputPrefetch", String.valueOf(config.inputPrefetch))
+                .option("grcuda.RetrieveNewStreamPolicy", config.retrieveNewStreamPolicy)
+                .option("grcuda.RetrieveParentStreamPolicy", config.retrieveParentStreamPolicy)
+                .option("grcuda.DependencyPolicy", config.dependencyPolicy)
+                .option("grcuda.DeviceSelectionPolicy", config.deviceSelectionPolicy)
+                .option("grcuda.ForceStreamAttach", String.valueOf(config.forceStreamAttach))
+                .option("grcuda.EnableComputationTimers", String.valueOf(config.enableComputationTimers))
+                .option("grcuda.MemAdvisePolicy", config.memAdvisePolicy)
+                .option("grcuda.NumberOfGPUs", String.valueOf(config.numGpus))
+                .option("grcuda.BandwidthMatrix", config.bandwidthMatrix)
+                .build();
+    }
+
+    /*
+    ###################################################################################
+    METHODS TO BE IMPLEMENTED IN THE
BENCHMARKS + ################################################################################### + */ + + /** + * Read the test parameters, + * initialize the necessary arrays, + * and create the kernels (if applicable) + * @param iteration the current number of the iteration + */ + protected abstract void initializeTest(int iteration); + + /** + * Allocate new memory on GPU used for the benchmark + * @param iteration the current number of the iteration + */ + protected abstract void allocateTest(int iteration); + + /** + * Reset code, run before each test iteration: + * clean up the arrays and any other state that must be reinitialized + * @param iteration the current number of the iteration + */ + protected abstract void resetIteration(int iteration); + + /** + * Run the actual test + * @param iteration the current number of the iteration + */ + protected abstract void runTest(int iteration); + + /** + * Numerically validate the GPU results against a CPU reference + */ + protected abstract void cpuValidation(); + +} diff --git a/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/BenchmarkConfig.java b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/BenchmarkConfig.java new file mode 100644 index 00000000..1c073963 --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/BenchmarkConfig.java @@ -0,0 +1,105 @@ +/* + * Copyright (c) 2022 NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +package it.necst.grcuda.benchmark; + +import com.fasterxml.jackson.annotation.JsonIgnore; + +/** + * This class will be passed to initialize the configuration of a benchmark. 
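The configuration fields declared below (benchmark name, size, number of GPUs, scheduling policies, and so on) are later joined with underscores to form the result-file name in `BenchmarkResults.saveToJsonFile()`. A sketch of that naming scheme over a few representative fields (the field selection here is illustrative):

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: derive a result-file name from a few config fields,
// mirroring the underscore-joined naming used by the result writer.
class ResultFileName {
    static String build(String benchmarkName, int size, int numGpus, String executionPolicy) {
        return Stream.of(benchmarkName, String.valueOf(size),
                         String.valueOf(numGpus), executionPolicy)
                     .collect(Collectors.joining("_")) + ".json";
    }
}
```

Encoding the full configuration in the file name lets result files from different policy combinations coexist in the same results directory without clobbering each other.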
+ */ +public class BenchmarkConfig { + /** + * Default parameters + */ + public String benchmarkName = ""; + public String setupId = ""; + public int totIter; + public int currentIter; + public int randomSeed = 42; + public int size; + public int blockSize1D = 32; + public int blockSize2D = 8; + boolean timePhases = false; + public int numBlocks = 8; + public boolean randomInit = false; + public boolean reInit = false; + public boolean reAlloc = false; + public boolean cpuValidate = false; + // GrCUDA context settings + public String executionPolicy; + public boolean inputPrefetch; + public String retrieveNewStreamPolicy; + public String retrieveParentStreamPolicy; + public String dependencyPolicy; + public String deviceSelectionPolicy; + public boolean forceStreamAttach; + public boolean enableComputationTimers; + public int numGpus; + public String memAdvisePolicy; + @JsonIgnore public String bandwidthMatrix; + // Debug parameters + public boolean debug; + public boolean nvprof_profile; + public String gpuModel; + @JsonIgnore public String results_path; + + @Override + public String toString() { + return "BenchmarkConfig{" + + "benchmarkName='" + benchmarkName + '\'' + + ", setupId='" + setupId + '\'' + + ", totIter=" + totIter + + ", currentIter=" + currentIter + + ", randomSeed=" + randomSeed + + ", size=" + size + + ", blockSize1D=" + blockSize1D + + ", blockSize2D=" + blockSize2D + + ", timePhases=" + timePhases + + ", numBlocks=" + numBlocks + + ", randomInit=" + randomInit + + ", reInit=" + reInit + + ", reAlloc=" +reAlloc+ + ", cpuValidate=" + cpuValidate + + ", executionPolicy='" + executionPolicy + '\'' + + ", inputPrefetch=" + inputPrefetch + + ", retrieveNewStreamPolicy='" + retrieveNewStreamPolicy + '\'' + + ", retrieveParentStreamPolicy='" + retrieveParentStreamPolicy + '\'' + + ", dependencyPolicy='" + dependencyPolicy + '\'' + + ", deviceSelectionPolicy='" + deviceSelectionPolicy + '\'' + + ", forceStreamAttach=" + forceStreamAttach + + ", 
enableComputationTimers=" + enableComputationTimers + + ", numGpus=" + numGpus + + ", memAdvisePolicy='" + memAdvisePolicy + '\'' + + ", bandwidthMatrix='" + bandwidthMatrix + '\'' + + '}'; + } +} \ No newline at end of file diff --git a/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/BenchmarkResults.java b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/BenchmarkResults.java new file mode 100644 index 00000000..a45836d4 --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/BenchmarkResults.java @@ -0,0 +1,154 @@ +/* + * Copyright (c) 2022 NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +package it.necst.grcuda.benchmark; + + +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.SerializationFeature; + +import java.io.File; +import java.io.IOException; +import java.util.ArrayList; +import java.util.LinkedList; + +/** + * This class stores all the results coming from a benchmark. + * It is mainly composed of a linked list of Iteration objects; each iteration records information on the individual phases of the benchmark (such as timings). 
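In `setCurrentTotalTime` below, the harness sums the timed phases while excluding the "filtered" ones (`alloc`, `init`, `reset`) and reports total wall time minus that sum as overhead. A standalone sketch of that bookkeeping (names are illustrative):

```java
import java.util.List;
import java.util.Set;

// Sketch of the overhead bookkeeping: total wall time minus the sum of
// the non-filtered phase times gives the harness/scheduling overhead.
class OverheadCalc {
    record Phase(String name, double seconds) {}

    static double overhead(double totalSeconds, List<Phase> phases, Set<String> filtered) {
        double timed = phases.stream()
                .filter(p -> !filtered.contains(p.name()))   // skip alloc/init/reset
                .mapToDouble(Phase::seconds)
                .sum();
        return totalSeconds - timed;
    }
}
```

Filtering out allocation and initialization keeps the overhead figure focused on the measured computation, rather than on setup work that a real application would amortize.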
+ */ +public class BenchmarkResults { + public final BenchmarkConfig config; + public LinkedList iterations = new LinkedList<>(); + public ArrayList filteredPhases = new ArrayList<>(); + public double cpu_result; + + BenchmarkResults(BenchmarkConfig config){ + this.config = config; + filteredPhases.add("alloc"); + filteredPhases.add("reset"); + filteredPhases.add("init"); + } + + public void startNewIteration(int iter_num, boolean time_phases){ + iterations.add(new Iteration(iter_num, time_phases)); + } + public void addPhaseToCurrentIteration(String phaseName, double execTime){ + iterations.getLast().addPhase(phaseName, execTime); + } + public void saveToJsonFile() { + try { + ObjectMapper objectMapper = new ObjectMapper().enable(SerializationFeature.INDENT_OUTPUT); + if(config.debug) + System.out.println(objectMapper.writeValueAsString(this)); + + String sb = + config.benchmarkName + + "_" + config.size + + "_" + config.numGpus + + "_" + config.numBlocks + + "_" + config.executionPolicy + + "_" + config.dependencyPolicy + + "_" + config.retrieveNewStreamPolicy + + "_" + config.retrieveParentStreamPolicy + + "_" + config.deviceSelectionPolicy + + "_" + config.memAdvisePolicy + + "_" + config.inputPrefetch + + "_" + config.forceStreamAttach + + ".json"; + + objectMapper.writeValue(new File(config.results_path+"/"+ sb), this); + } catch (IOException e) { + throw new RuntimeException(e); + } + + } + + + public void setCurrentGpuResult(double gpuResult){ + iterations.getLast().gpu_result = gpuResult; + } + public void setCurrentCpuResult(double cpuResult){ + this.cpu_result = cpuResult; + } + public void setCurrentComputationSec(double computationSec){iterations.getLast().computation_sec = computationSec;} + public void setCurrentOverheadSec(double overheadSec){iterations.getLast().overhead_sec = overheadSec;} + public void setCurrentTotalTime(double totalTime){ + iterations.getLast().total_time_sec = totalTime; + double tot_time_phases = 0; + for(Phase phase : 
iterations.getLast().phases){ + if(!filteredPhases.contains(phase.phaseName)) + tot_time_phases += phase.executionTime_sec; + } + iterations.getLast().overhead_sec = totalTime-tot_time_phases; + iterations.getLast().computation_sum_phases_sec = tot_time_phases; + } + + + public double currentGpuResult(){ + return iterations.getLast().gpu_result; + } + public double currentCpuResult(){ + return this.cpu_result; + } + public Iteration currentIteration(){return iterations.getLast();} + +} + +class Iteration{ + public int iteration; + public boolean time_phases; + public double gpu_result; + + public double computation_sec; + public double total_time_sec; + public double overhead_sec; + public double computation_sum_phases_sec; + + public ArrayList phases = new ArrayList<>(); + + public Iteration(int iteration, boolean time_phases) { + this.iteration = iteration; + this.time_phases = time_phases; + } + + public void addPhase(String phaseName, double execTime){phases.add(new Phase(phaseName, execTime));} + +} + +class Phase { + public String phaseName; // the phase of the benchmark that this class is representing + public double executionTime_sec; // the execution time of the current phase + + public Phase(String phaseName, double executionTime_sec) { + this.phaseName = phaseName; + this.executionTime_sec = executionTime_sec; + } +} \ No newline at end of file diff --git a/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B1.java b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B1.java new file mode 100644 index 00000000..9ff1cb47 --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B1.java @@ -0,0 +1,162 @@ +/* + * Copyright (c) 2022 NECSTLab, Politecnico di Milano. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
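B1 (below) launches two independent `square` kernels (steps A and B) followed by a `reduce` that depends on both (step C); GrCUDA's scheduler can run A and B concurrently. The same dependency structure can be sketched on the CPU with `CompletableFuture` — this is only an analogy for the scheduling DAG, not how GrCUDA dispatches kernels:

```java
import java.util.concurrent.CompletableFuture;

// CPU analogy of B1's task graph: square(x) and square(y) are independent
// (and may run in parallel); reduce depends on both results.
class B1TaskGraph {
    static float run(float[] x, float[] y) {
        CompletableFuture<float[]> a = CompletableFuture.supplyAsync(() -> square(x));
        CompletableFuture<float[]> b = CompletableFuture.supplyAsync(() -> square(y));
        return a.thenCombine(b, B1TaskGraph::reduce).join();
    }

    static float[] square(float[] v) {
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] * v[i];
        return out;
    }

    // Sum of (x1[i] - y1[i]), mirroring what the reduce kernel accumulates.
    static float reduce(float[] x1, float[] y1) {
        float sum = 0;
        for (int i = 0; i < x1.length; i++) sum += x1[i] - y1[i];
        return sum;
    }
}
```

Applications whose task graphs expose this kind of independence are exactly the ones that benefit from GrCUDA's asynchronous scheduling; a fully serialized graph leaves nothing to parallelize.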
+ */ + +package it.necst.grcuda.benchmark.bench; + +import static org.junit.Assert.assertEquals; +import org.graalvm.polyglot.Value; +import it.necst.grcuda.benchmark.Benchmark; +import it.necst.grcuda.benchmark.BenchmarkConfig; + + +public class B1 extends Benchmark { + + private static final String SQUARE_KERNEL = "" + + "extern \"C\" __global__ void square(float* x, float* y, int n) { \n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {\n" + + " y[i] = x[i] * x[i];\n" + + " }\n" + + "}\n"; + + private static final String REDUCE_KERNEL = "" + + "// From https://devblogs.nvidia.com/faster-parallel-reductions-kepler/\n" + + "\n" + "__inline__ __device__ float warp_reduce(float val) {\n" + + " int warp_size = 32;\n" + " for (int offset = warp_size / 2; offset > 0; offset /= 2)\n" + + " val += __shfl_down_sync(0xFFFFFFFF, val, offset);\n" + + " return val;\n" + "}\n" + "\n" + "__global__ void reduce(float *x, float *y, float* z, int N) {\n" + + " int warp_size = 32;\n" + " float sum = float(0);\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) {\n" + + " sum += x[i] - y[i];\n" + " }\n" + + " sum = warp_reduce(sum); // Obtain the sum of values in the current warp;\n" + + " if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster\n" + + " atomicAdd(z, sum); // The first thread in the warp updates the output;\n" + + "}"; + + private Value squareKernelFunction; + private Value reduceKernelFunction; + private Value x, x1, y, y1, res; + + public B1(BenchmarkConfig currentConfig) { + super(currentConfig); + } + + @Override + public void initializeTest(int iteration) { + assert (!config.randomInit); + for (int i = 0; i < config.size; i++) { + x.setArrayElement(i, 1.0f / (i + 1)); + y.setArrayElement(i, 2.0f / (i + 1)); + } + } + + @Override + public void allocateTest(int iteration) { + // Alloc arrays + x = requestArray("float", config.size); + x1 
= requestArray("float", config.size); + y = requestArray("float", config.size); + y1 = requestArray("float", config.size); + + // Allocate a support vector + res = requestArray("float", 1); + + // Context initialization + Value buildKernel = context.eval("grcuda", "buildkernel"); + + // Build the kernels; + squareKernelFunction = buildKernel.execute(SQUARE_KERNEL, "square", "pointer, pointer, sint32"); + reduceKernelFunction = buildKernel.execute(REDUCE_KERNEL, "reduce", "pointer, pointer, pointer, sint32"); + } + + @Override + public void resetIteration(int iteration) { + initializeTest(iteration); + res.setArrayElement(0, 0.0f); + } + + @Override + public void runTest(int iteration) { + long start = System.nanoTime(); + + if(config.debug) + System.out.println(" INSIDE runTest() - " + iteration); + + // A, B. Call the kernel. The 2 computations are independent, and can be done in parallel + squareKernelFunction.execute(config.numBlocks, config.blockSize1D) // Set parameters + .execute(x, x1, config.size); // Execute actual kernel + squareKernelFunction.execute(config.numBlocks, config.blockSize1D) // Set parameters + .execute(y, y1, config.size); // Execute actual kernel + + // C. 
Compute the sum of the result + reduceKernelFunction.execute(config.numBlocks, config.blockSize1D) // Set parameters + .execute(x1, y1, res, config.size); // Execute actual kernel + + long end = System.nanoTime(); + + + // Sync step to measure the real computation time + benchmarkResults.setCurrentGpuResult(res.getArrayElement(0).asFloat()); + benchmarkResults.setCurrentComputationSec((end-start)/1000000000F); + + } + + + @Override + public void cpuValidation() { + assert (!config.randomInit); + + float[] xHost = new float[config.size]; + float[] yHost = new float[config.size]; + + for (int i = 0; i < config.size; i++) { + xHost[i] = 1.0f / (i + 1); + yHost[i] = 2.0f / (i + 1); + } + + for (int i = 0; i < config.size; i++) { + xHost[i] = xHost[i] * xHost[i]; + yHost[i]= yHost[i] * yHost[i]; + xHost[i] -= yHost[i]; + } + + float acc = 0.0f; + + for (int i = 0; i < config.size; i++) { + acc += xHost[i]; + } + + benchmarkResults.setCurrentCpuResult(acc); + + assertEquals(benchmarkResults.currentCpuResult(), benchmarkResults.currentGpuResult(), 1e-3); //TODO: IT IS FAILING WITH THIS DELTA --> INVESTIGATE + + } + +} diff --git a/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B11M.java b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B11M.java new file mode 100644 index 00000000..3f99ee7a --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B11M.java @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2022 NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
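The TODO in B1's `cpuValidation` above notes that the 1e-3 delta can fail. One plausible contributor is that the GPU warp reduction accumulates in a different order than the sequential CPU loop, and plain float accumulation over large arrays loses low-order bits. A common CPU-side mitigation is compensated (Kahan) summation — a sketch of the technique, not part of the benchmark:

```java
// Kahan (compensated) summation: reduces the rounding error of
// sequential float accumulation, tightening the CPU reference value.
class KahanSum {
    static float sum(float[] values) {
        float sum = 0.0f;
        float c = 0.0f;            // running compensation for lost low-order bits
        for (float v : values) {
            float y = v - c;       // subtract the compensation
            float t = sum + y;     // low-order bits of y may be lost here
            c = (t - sum) - y;     // recover what was lost
            sum = t;
        }
        return sum;
    }
}
```

With a tighter CPU reference, a small validation delta becomes meaningful again instead of being dominated by the reference's own accumulation error.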
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
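B11M (below) splits the N×M matrix into P = 16 row blocks of S = ⌈N/P⌉ rows each; the last blocks may be truncated or empty, which is why every kernel launch passes `Math.min(this.S, this.N - p * this.S)` as its row count. A standalone sketch of that partition-size computation:

```java
// Row-block partitioning as used by B11M: p blocks of s = ceil(n / p)
// rows each; trailing blocks are truncated (possibly to zero) to fit n.
class RowPartition {
    static int[] blockSizes(int n, int p) {
        int s = Math.floorDiv(n + p - 1, p);   // ceil(n / p)
        int[] sizes = new int[p];
        for (int i = 0; i < p; i++) {
            sizes[i] = Math.max(0, Math.min(s, n - i * s));
        }
        return sizes;
    }
}
```

The block sizes always sum to n, so every matrix row is processed exactly once regardless of how unevenly n divides by p.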
+ */ + +package it.necst.grcuda.benchmark.bench; + +import it.necst.grcuda.benchmark.Benchmark; +import it.necst.grcuda.benchmark.BenchmarkConfig; +import org.graalvm.polyglot.Value; + +import static org.junit.Assert.assertEquals; + +public class B11M extends Benchmark { + /* + * Dense matrix-vector multiplication, partitioning the matrix in blocks of rows; + */ + + private static final String MATRIX_VECTOR_MULT_KERNEL = "" + + "extern \"C\" __global__ void matrix_vector_mult_1(const float* x, const float* y, float* z, int n, int m, int z_offset) {\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {\n" + + " float sum = 0;\n" + + " for (int j = 0; j < m; j++) { \n" + + " sum += x[i * m + j] * y[j];\n" + + " }\n" + + " z[z_offset + i] = sum;\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void matrix_vector_mult_2(const float* x, const float* y, float* z, int n, int m) {\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {\n" + + " float sum = 0;\n" + + " for (int j = 0; j < m; j++) { \n" + + " sum += x[i * m + j] * y[j];\n" + + " }\n" + + " z[i] = sum;\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void copy(const float *x, float *y, int n, int offset) {\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {\n" + + " y[i + offset] = x[i];\n" + + " }\n" + + "}"; + + private Value matrix_vector_mult_kernel, copy_kernel, initialize; + private Value[] x, z; + private Value y, z_out; + private float[][] x_cpu_matrix; + private float[] x_cpu_array, y_cpu; + private int N, M, P, S; + + public B11M(BenchmarkConfig currentConfig) { + super(currentConfig); + + // Square matrix of size x size + this.N = config.size; + this.M = config.size; + + // Use P horizontal partitions + this.P = 16; + + // Size of partitions + this.S = Math.floorDiv(this.N + this.P - 1, this.P); + + // Full matrix + this.x_cpu_array = null; + 
this.x_cpu_matrix = null; + // Dense vector + this.y_cpu = null; + + // The GPU matrix is stored using P arrays + this.x = new Value[this.P]; + for (int i = 0; i < this.P; i++) { + this.x[i] = null; + } + // Dense vector + this.y = null; + // Result + // this.z = null; + this.z = new Value[this.P]; + for (int i = 0; i < this.P; i++) { + this.z[i] = null; + } + this.z_out = null; + + this.matrix_vector_mult_kernel = null; + } + + @Override + public void allocateTest(int iteration) { + this.N = config.size; + this.M = config.size; + this.S = Math.floorDiv(this.N + this.P - 1, this.P); + + // Allocate vectors + for (int i = 0; i < this.P; i++) { + this.x[i] = requestArray("float", this.S * this.M); + } + this.y = requestArray("float", this.M); + // this.z = requestArray("float", this.N); + for (int i = 0; i < this.P; i++) { + this.z[i] = requestArray("float", this.S); + } + this.z_out = requestArray("float", this.N); + + // Build the kernels; + Value buildKernel = context.eval("grcuda", "buildkernel"); + // this.matrix_vector_mult_kernel = buildKernel.execute(MATRIX_VECTOR_MULT_KERNEL, "matrix_vector_mult_2", "const pointer, const pointer, pointer, sint32, sint32, sint32") + this.matrix_vector_mult_kernel = buildKernel.execute(MATRIX_VECTOR_MULT_KERNEL, "matrix_vector_mult_2", "const pointer, const pointer, pointer, sint32, sint32"); + this.copy_kernel = buildKernel.execute(MATRIX_VECTOR_MULT_KERNEL, "copy", "const pointer, pointer, sint32, sint32"); + this.initialize = context.eval("js", "x => { for (let i = 0; i < x.length; i++) { x[i] = i / x.length }}"); + } + + @Override + public void initializeTest(int iteration) { + assert (!config.randomInit); // randomInit not supported yet + } + + @Override + public void resetIteration(int iteration) { + // Reset result + + for (int i = 0; i < this.P; i++) this.initialize.execute(this.x[i]); + for (int i = 0; i < this.M; i++) { + this.y.setArrayElement(i, (float)(i) / this.M); + } + } + + @Override + public void runTest(int 
iteration) { + long start = System.nanoTime(); + + // Compute all partitions + for (int p = 0; p < this.P; p++) { + // this.matrix_vector_mult_kernel.execute(config.numBlocks, config.blockSize1D) + // .execute(this.x[p], this.y, this.z, Math.min(this.S, this.N - p * this.S), this.M, p * this.S); + this.matrix_vector_mult_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.x[p], this.y, this.z[p], Math.min(this.S, this.N - p * this.S), this.M); + } + + // Aggregate results + for (int p = 0; p < this.P; p++) { + this.copy_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.z[p], this.z_out, Math.min(this.S, this.N - p * this.S), p * this.S); + } + + // Add a final sync step to measure the real computation time Math.sum(self.z_out[:10]) + float sum = 0; + for (int i = 0; i < 10; i++) sum += this.z_out.getArrayElement(i).asFloat(); + + long end = System.nanoTime(); + + benchmarkResults.setCurrentGpuResult(sum); + benchmarkResults.setCurrentComputationSec((end-start)/1000000000F); + + } + + @Override + public void cpuValidation() { + float[] z_cpu; + float sum; + + x_cpu_array = new float[this.N * this.M]; + x_cpu_matrix = new float[this.N][this.M]; + y_cpu = new float[this.M]; + + for (int i = 0; i < this.N * this.M; i++) x_cpu_array[i] = 0.0F; + for (int i = 0; i < this.M; i++) y_cpu[i] = this.y.getArrayElement(i).asFloat(); + + for (int i = 0; i < this.P; i++) { + for (int j = 0; j < this.S * this.M; j++) { + if (i * this.S * this.M + j < x_cpu_array.length) { + x_cpu_array[i * this.S * this.M + j] = (float) (j) / (this.S * this.M); + } + } + } + for (int r = 0; r < this.N; r++) + for (int c = 0; c < this.M; c++) { + x_cpu_matrix[r][c] = x_cpu_array[r * this.M + c]; + } + z_cpu = matrixMult(x_cpu_matrix, y_cpu); + + sum = 0; + for (int i = 0; i < 10; i++) sum += z_cpu[i]; + benchmarkResults.setCurrentCpuResult(sum); + + // Compare GPU and CPU results + assertEquals(benchmarkResults.currentGpuResult(), sum, 1e-4); + } + + private 
float[] matrixMult(float[][] a, float[] b) { + float[] res = new float[a.length]; + float tempSum; + + for (int r = 0; r < a.length; r++) { + tempSum = 0; + for (int k = 0; k < b.length; k++) { + tempSum += a[r][k] * b[k]; + } + res[r] = tempSum; + } + return res; + } +} \ No newline at end of file diff --git a/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B1M.java b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B1M.java new file mode 100644 index 00000000..a0ac5fad --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B1M.java @@ -0,0 +1,195 @@ +/* + * Copyright (c) 2022 NECSTLab, Politecnico di Milano. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
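B1M (below) partitions each logical input array of size N into P chunks and initializes element j of chunk i from its global index i·len + j, skipping indices past N (the last chunk may be padded). In the benchmark this runs through a JS callback executed on the DeviceArrays; a plain-Java sketch of the same index mapping:

```java
// Partitioned initialization: element j of chunk i gets value
// a / (globalIndex + 1), where globalIndex = i * chunkLen + j;
// indices >= n are left at their default (the last chunk may be padded).
class PartitionedInit {
    static float[][] init(int n, int p, float a) {
        int chunkLen = Math.floorDiv(n + p - 1, p);   // ceil(n / p)
        float[][] chunks = new float[p][chunkLen];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < chunkLen; j++) {
                int index = i * chunkLen + j;
                if (index < n) chunks[i][j] = a / (index + 1);
            }
        }
        return chunks;
    }
}
```

Concatenating the chunks (minus padding) reproduces the single-array initialization used by the unpartitioned B1, so both variants compute over the same input values.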
IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +package it.necst.grcuda.benchmark.bench; + +import it.necst.grcuda.benchmark.Benchmark; +import it.necst.grcuda.benchmark.BenchmarkConfig; +import org.graalvm.polyglot.Value; +import java.util.ArrayList; +import static org.junit.Assert.assertEquals; + + +public class B1M extends Benchmark { + + private static final String SQUARE_KERNEL = "" + + "extern \"C\" __global__ void square(float* x, float* y, int n) { \n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {\n" + + " y[i] = x[i] * x[i];\n" + + " }\n" + + "}\n"; + + private static final String REDUCE_KERNEL = "" + + "// From https://devblogs.nvidia.com/faster-parallel-reductions-kepler/\n" + + "\n" + "__inline__ __device__ float warp_reduce(float val) {\n" + + " int warp_size = 32;\n" + " for (int offset = warp_size / 2; offset > 0; offset /= 2)\n" + + " val += __shfl_down_sync(0xFFFFFFFF, val, offset);\n" + + " return val;\n" + "}\n" + "\n" + "__global__ void reduce(float *x, float *y, float* z, int N) {\n" + + " int warp_size = 32;\n" + " float sum = float(0);\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) {\n" + + " sum += x[i] - y[i];\n" + " }\n" + + " sum = warp_reduce(sum); // Obtain the sum of values in the current warp;\n" + + " if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster\n" + + " atomicAdd(z, sum); // The first thread in 
the warp updates the output;\n" + + "}"; + + private Value squareKernelFunction; + private Value reduceKernelFunction; + private Value initialize; + private ArrayList x, x1, y, y1, res; + //private Value initialize; + double res_tot=0; + private int partitionSize; + private final int P = 16; + + public B1M(BenchmarkConfig currentConfig) { + super(currentConfig); + } + + @Override + public void initializeTest(int iteration) { + assert (!config.randomInit); + for(int i=0; i(); + x1 = new ArrayList<>(); + y = new ArrayList<>(); + y1 = new ArrayList<>(); + res = new ArrayList<>(); + + for(int i=0; i { for (let j = 0; j < x.length; j++) { let index = i * x.length + j; if (index < N) {x[j] = a / (index + 1); }}}"); + + } + + @Override + public void resetIteration(int iteration) { + for(int i=0; i 0)\n" + " cnd = 1.0 - cnd;\n" + + "\n" + + " return cnd;\n" + "}\n" + "\n" + + "extern \"C\" __global__ void bs(const double *x, double *y, int N, double R, double V, double T, double K) {\n" + + "\n" + + " double sqrtT = 1.0 / rsqrt(T);\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) {\n" + + " double expRT;\n" + + " double d1, d2, CNDD1, CNDD2;\n" + + " d1 = (log(x[i] / K) + (R + 0.5 * V * V) * T) / (V * sqrtT);\n" + + " d2 = d1 - V * sqrtT;\n" + "\n" + " CNDD1 = cndGPU(d1);\n" + + " CNDD2 = cndGPU(d2);\n" + "\n" + " //Calculate Call and Put simultaneously\n" + + " expRT = exp(-R * T);\n" + " y[i] = x[i] * CNDD1 - K * expRT * CNDD2;\n" + + " }\n" + + "}"; + + private Value bs_kernelFunction; + private final Value[] x, y; + private double[] x_tmp; + private static final float R = 0.08f; + private static final float V = 0.3f; + private static final float T = 1.0f; + private static final float global_K = 60.0f; + private final int local_K; + + public B5M(BenchmarkConfig currentConfig) { + super(currentConfig); + + this.local_K = 11; //previously was 24 + this.x = new Value[this.local_K]; + this.x_tmp = null; + this.y = new 
Value[this.local_K]; + + this.bs_kernelFunction = null; + } + + @Override + public void initializeTest(int iteration) { + // Initialization + this.x_tmp = new double[this.config.size]; + + assert (!config.randomInit); // randomInit not supported yet + + for (int i = 0; i < this.config.size; i++) + this.x_tmp[i] = global_K; + } + + @Override + public void allocateTest(int iteration) { + // Allocate vectors + for (int i = 0; i < this.local_K; i++) { + this.x[i] = requestArray("double", config.size); + this.y[i] = requestArray("double", config.size); + } + + // Context initialization + Value buildKernel = context.eval("grcuda", "buildkernel"); + + // Build the kernels + bs_kernelFunction = buildKernel.execute(BS_KERNEL, "bs", "const pointer, pointer, sint32, double, double, double, double"); + } + + @Override + public void resetIteration(int iteration) { + // Reset result + for (int i = 0; i < local_K; i++) { + for (int j = 0; j < this.config.size; j++) { + this.x[i].setArrayElement(j, this.x_tmp[j]); + } + } + } + + @Override + public void runTest(int iteration) { + long start = System.nanoTime(); + + double[] result = new double[local_K]; // default 0 + + for (int i = 0; i < local_K; i++) { + bs_kernelFunction.execute(config.numBlocks, config.blockSize1D) // Set parameters + .execute(this.x[i], this.y[i], this.config.size, R, V, T, global_K); // Execute actual kernel + } + + for (int i = 0; i < local_K; i++){ + result[i] = this.y[i].getArrayElement(0).asDouble(); + } + + long end = System.nanoTime(); + + benchmarkResults.setCurrentGpuResult(result[0]); + benchmarkResults.setCurrentComputationSec((end-start)/1000000000F); + + } + + @Override + public void cpuValidation() { + double[] res; + res = BS(this.x_tmp, R, V, T, global_K); + + benchmarkResults.setCurrentCpuResult(res[0]); + + assertEquals(benchmarkResults.currentGpuResult(), res[0], 1e-5); + } + + private double[] CND(double[] X) { + /* + * Cumulative normal distribution. + * Helper function used by BS(...). 
+ */ + + double a1 = 0.31938153f; + double a2 = -0.356563782f; + double a3 = 1.781477937f; + double a4 = -1.821255978f; + double a5 = 1.330274429f; + double[] L = new double[X.length]; + double[] K = new double[X.length]; + double[] w = new double[X.length]; + + for (int i = 0; i < X.length; i++) + L[i] = Math.abs(X[i]); + for (int i = 0; i < X.length; i++) + K[i] = (1.0f) / (1.0 + 0.2316419 * L[i]); + for (int i = 0; i < X.length; i++) + w[i] = 1.0 - 1.0 / Math.sqrt(2 * Math.PI) * Math.exp(-L[i] * L[i] / 2.) * + ( a1 * K[i] + + a2 * (Math.pow(K[i], 2)) + + a3 * (Math.pow(K[i], 3)) + + a4 * (Math.pow(K[i], 4)) + + a5 * (Math.pow(K[i], 5))); + + for (int i = 0; i < X.length; i++) + w[i] = (w[i] < 0) ? (1.0 - w[i]) : w[i]; + + return w; + } + + private double[] BS(double[] X, float R, float V, float T, float K) { + /* + * Black Scholes Function. + */ + double[] d1_arr = new double[X.length]; + double[] d2_arr = new double[X.length]; + double[] result = new double[X.length]; + double[] w_arr; + double[] w2_arr; + + for (int i = 0; i < X.length; i++) + d1_arr[i] = (Math.log(X[i] / K) + (R + V * V / 2.) * T) / (V * Math.sqrt(T)); + for (int i = 0; i < X.length; i++) + d2_arr[i] = d1_arr[i] - V * Math.sqrt(T); + w_arr = CND(d1_arr); + w2_arr = CND(d2_arr); + + for (int i = 0; i < X.length; i++) + result[i] = X[i] * w_arr[i] - X[i] * Math.exp(-R * T) * w2_arr[i]; + + return result; + } +} diff --git a/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B6M.java b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B6M.java new file mode 100644 index 00000000..ded0b4e0 --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/main/java/it/necst/grcuda/benchmark/bench/B6M.java @@ -0,0 +1,519 @@ +/* + * Copyright (c) 2022 NECSTLab, Politecnico di Milano. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of NECSTLab nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * * Neither the name of Politecnico di Milano nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ + +package it.necst.grcuda.benchmark.bench; + +import it.necst.grcuda.benchmark.Benchmark; +import it.necst.grcuda.benchmark.BenchmarkConfig; +import org.graalvm.polyglot.Value; + +import java.util.ArrayList; +import java.util.Iterator; +import java.util.List; +import java.util.Random; + +// Just a recommendation of optimal block size for the V100 (BLOCK_SIZE_V100 = 64) +public class B6M extends Benchmark { + static final int P = 16; + + private static final String NB_KERNEL = "" + + "extern \"C\" __global__ void nb_1(const int* x, const float* y, float* z, int n, int partition_rows, int n_feat, int n_classes, int partition_num) {\n" + + " for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < min(partition_rows, n - partition_num * partition_rows); i += blockDim.x * gridDim.x) {\n" + + " for (int j = 0; j < n_classes; j++) {\n" + + " for (int q = 0; q < n_feat; q++) {\n" + + " z[partition_num * partition_rows * n_classes + i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q];\n" + + " }\n" + + " }\n" + + " }\n" + + "}\n" + + " \n" + + "extern \"C\" __global__ void nb_2(const float* x, float* y, int n_row_x, int n_col_x) {\n" + + " for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) {\n" + + " float curr_max = x[i * n_col_x];\n" + + " for (int j = 0; j < n_col_x; j++) {\n" + + " curr_max = fmaxf(curr_max, x[i * n_col_x + j]);\n" + + " }\n" + + " y[i] = curr_max;\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void nb_3(const float* x, const float* y, float* z, int n_row_x, int n_col_x) {\n" + + " for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) {\n" + + " float sum = 0;\n" + + " for (int j = 0; j < n_col_x; j++) {\n" + + " sum += expf(x[i * n_col_x + j] - y[i]);\n" + + " }\n" + + " z[i] = logf(sum) + y[i];\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void nb_4(float* x, float* y, int n_row_x, int n_col_x) {\n" + + " for (int i = 
blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) {\n" + + " for (int j = 0; j < n_col_x; j++) {\n" + + " x[i * n_col_x + j] = expf(x[i * n_col_x + j] - y[i]);\n" + + " }\n" + + " }\n" + + "}"; + + private static final String RR_KERNEL = "" + + "extern \"C\" __global__ void rr_1(const int* x, float* mean, float *std, int n_row_x, int n_col_x, int partition, int partition_size) {\n" + + " for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n_col_x; j += blockDim.x * gridDim.x) {\n" + + " float feature_mean = 0;\n" + + " float sum_sq = 0;\n" + + " // Compute mean and variance;\n" + + " for (int i = 0; i < partition_size; i++) {\n" + + " float x_tmp = x[j * partition_size + i];\n" + + " feature_mean += x_tmp;\n" + + " sum_sq += x_tmp * x_tmp;\n" + + " }\n" + + " // feature_mean /= n_row_x;\n" + + " // std[j] = sqrtf(sum_sq / n_row_x - feature_mean * feature_mean);\n" + + " // mean[j] = feature_mean;\n" + + "\n" + + " // Keep just the sum and squared sum, compute mean and std later;\n" + + " mean[j] += feature_mean;\n" + + " std[j] += sum_sq;\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void rr_1_1(float* mean, float *std, const float *mean_curr, const float *std_curr, int n_row_x, int n_col_x, int partition_index, int partition_size) {\n" + + " // We use partition 0 to accumulate, so skip it;\n" + + " if (partition_index == 0) return;\n" + + "\n" + + " // Aggregate mean and std from different partitions;\n" + + " for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_col_x; i += blockDim.x * gridDim.x) {\n" + + " mean[i] += mean_curr[i];\n" + + " std[i] += std_curr[i];\n" + + " // When processing the last partition, compute the final mean and std;\n" + + " if (partition_index == " + P + "- 1) {\n" + + " mean[i] /= n_row_x;\n" + + " std[i] = sqrtf(std[i] / n_row_x - mean[i] * mean[i]);\n" + + " }\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void rr_1_2(const int *x, float *y, const float* mean, 
const float *std, int n_row_x, int n_col_x, int partition_size) {\n" + + " // Normalize each row;\n" + + " for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < partition_size; i += blockDim.x * gridDim.x) {\n" + + " for (int j = 0; j < n_col_x; j++) {\n" + + " float mean_curr = mean[j];\n" + + " float std_curr = std[j];\n" + + " y[i * n_col_x + j] = (x[i * n_col_x + j] - mean_curr) / std_curr;\n" + + " }\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void rr_2(const float* x, const float* y, float* z, int n, int partition_rows, int n_feat, int n_classes, int partition_num) {\n" + + " for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < min(partition_rows, n - partition_num * partition_rows); i += blockDim.x * gridDim.x) {\n" + + " for (int j = 0; j < n_classes; j++) {\n" + + " for (int q = 0; q < n_feat; q++) {\n" + + " z[partition_num * partition_rows * n_classes + i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q];\n" + + " }\n" + + " }\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void rr_3(float* x, const float* y, int n_row_x, int n_col_x) {\n" + + " for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) {\n" + + " for (int j = 0; j < n_col_x; j++) {\n" + + " x[i * n_col_x + j] += y[j];\n" + + " }\n" + + " }\n" + + "}"; + + private static final String ENSEMBLE_KERNEL = "" + + "extern \"C\" __global__ void softmax(float* x, int n_row_x, int n_col_x) {\n" + + " for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) {\n" + + " float row_exp_sum = 0;\n" + + " for (int j = 0; j < n_col_x; j++) {\n" + + " row_exp_sum += expf(x[i * n_col_x + j]);\n" + + " }\n" + + " for (int j = 0; j < n_col_x; j++) {\n" + + " x[i * n_col_x + j] = expf(x[i * n_col_x + j]) / row_exp_sum;\n" + + " }\n" + + " }\n" + + "}\n" + + "\n" + + "extern \"C\" __global__ void argmax(const float* x, const float* y, int* z, int n_row_x, int n_col_x) {\n" + + " for (int i = 
blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) {\n" + + " int curr_best_index = 0;\n" + + " float curr_best = x[i * n_col_x] + y[i * n_col_x];\n" + + " for (int j = 0; j < n_col_x; j++) {\n" + + " float curr = x[i * n_col_x + j] + y[i * n_col_x + j];\n" + + " if (curr > curr_best) {\n" + + " curr_best = curr;\n" + + " curr_best_index = j;\n" + + " }\n" + + " }\n" + + " z[i] = curr_best_index;\n" + + " }\n" + + "}"; + + /* + Compute an ensemble of Categorical Naive Bayes and Ridge Regression classifiers. + Predictions are aggregated averaging the class scores after softmax normalization. + The computation is done on mock data and parameters, but is conceptually identical to a real ML pipeline. + In the DAG below, input arguments that are not involved in the computation of dependencies are omitted; + + RR-1: standard column normalization (partitioned row-wise) + RR-1-1: aggregate mean/std across partitions (partitioned row-wise, but partitions are not independent) + RR-1-2: apply normalization (partitioned row-wise) + RR-2: matrix multiplication (partitioned row-wise) + RR-3: add vector to matrix, row-wise + NB-1: matrix multiplication (partitioned row-wise) + NB-2: row-wise maximum + NB-3: log of sum of exponential, row-wise + NB-4: exponential, element-wise + + ┌─> RR-1(const X,MEAN,STD) ─> RR-1-1(MEAN,STD) -> RR-1-2(X, Z, MEAN, STD) ─> (...) + │ (...) -> RR-2(const Z,R2) ─> RR-3(R2) ─> SOFTMAX(R1) ─────────────────────┐ + ─┤ ├─> ARGMAX(const R1,const R2,R) + └─> NB-1(const X,R1) ─> NB-2(const R1,AMAX) ─> (...) │ + (...) 
-> NB-3(const R1,const AMAX,L) ─> NB-4(R1,const L) ─> SOFTMAX(R2) ──┘ + */ + + private Value[] x, z, mean, std; + private Value nb_1, nb_2, nb_3, nb_4; + private Value rr_1, rr_1_1, rr_1_2, rr_2, rr_3; + private Value argmax, softmax; + private Value initialize_rand; + private Value nb_feat_log_prob, nb_class_log_prior; + private Value ridge_coeff, ridge_intercept; + private Value nb_amax, nb_l; + private Value r1, r2, r; + private int S; + private int num_features; + private int num_classes; + private int max_occurrence_of_ngram; + + private List>> x_cpu; + private float[][] nb_feat_log_prob_cpu; + private float[][] ridge_coeff_cpu; + private float nb_class_log_prior_cpu; + private float ridge_intercept_cpu; + + public B6M(BenchmarkConfig currentConfig) { + super(currentConfig); + + this.S = 0; + + x = new Value[P]; + z = new Value[P]; + mean = new Value[P]; + std = new Value[P]; + + + /*for (int i = 0; i < P; i++) { + this.z[i] = null; + this.mean[i] = null; + this.std[i] = null; + }*/ + this.r1 = null; + this.r2 = null; + this.r = null; + + this.nb_1 = null; + this.nb_2 = null; + this.nb_3 = null; + this.nb_4 = null; + this.rr_1 = null; + this.rr_1_1 = null; + this.rr_1_2 = null; + this.rr_2 = null; + this.rr_3 = null; + this.softmax = null; + this.argmax = null; + + // Internal arrays used by the algorithms, they do not affect the DAG structure + this.nb_feat_log_prob = null; + this.nb_class_log_prior = null; + this.ridge_coeff = null; + this.ridge_intercept = null; + this.nb_amax = null; + this.nb_l = null; + + this.num_features = 1024; + this.num_classes = 16; + this.max_occurrence_of_ngram = 10; + + this.x_cpu = null; + this.nb_feat_log_prob_cpu = null; + this.ridge_coeff_cpu = null; + this.nb_class_log_prior_cpu = 0; + this.ridge_intercept_cpu = 0; + } + + @Override + public void allocateTest(int iteration) { + this.S = Math.floorDiv(config.size + P - 1, P); + + // Allocate vectors + for (int i = 0; i < P; i++) { + this.x[i] = requestArray("int", this.S * 
this.num_features); + this.z[i] = requestArray("float", this.S * this.num_features); + this.mean[i] = requestArray("float", this.num_features); + this.std[i] = requestArray("float", this.num_features); + } + + this.nb_feat_log_prob = requestArray("float", this.num_classes * this.num_features); + this.nb_class_log_prior = requestArray("float", this.num_classes); + this.ridge_coeff = requestArray("float", this.num_classes * this.num_features); + this.ridge_intercept = requestArray("float", this.num_classes); + + this.nb_amax = requestArray("float", config.size); + this.nb_l = requestArray("float", config.size); + + this.r1 = requestArray("float", config.size * this.num_classes); + this.r2 = requestArray("float", config.size * this.num_classes); + this.r = requestArray("int", config.size); + + // Build the kernels + Value buildKernel = context.eval("grcuda", "buildkernel"); + this.nb_1 = buildKernel.execute(NB_KERNEL, "nb_1", "const pointer, const pointer, const pointer, sint32, sint32, sint32, sint32, sint32"); + this.nb_2 = buildKernel.execute(NB_KERNEL, "nb_2", "pointer, pointer, sint32, sint32"); + this.nb_3 = buildKernel.execute(NB_KERNEL, "nb_3", "const pointer, const pointer, pointer, sint32, sint32"); + this.nb_4 = buildKernel.execute(NB_KERNEL, "nb_4", "pointer, const pointer, sint32, sint32"); + + this.rr_1 = buildKernel.execute(RR_KERNEL, "rr_1", "const pointer, pointer, pointer, sint32, sint32, sint32, sint32"); + this.rr_1_1 = buildKernel.execute(RR_KERNEL, "rr_1_1", "pointer, pointer, const pointer, const pointer, sint32, sint32, sint32, sint32"); + this.rr_1_2 = buildKernel.execute(RR_KERNEL, "rr_1_2", "const pointer, pointer, const pointer, const pointer, sint32, sint32, sint32"); + this.rr_2 = buildKernel.execute(RR_KERNEL, "rr_2", "const pointer, const pointer, const pointer, sint32, sint32, sint32, sint32, sint32"); + this.rr_3 = buildKernel.execute(RR_KERNEL, "rr_3", "pointer, const pointer, sint32, sint32"); + + this.softmax = 
buildKernel.execute(ENSEMBLE_KERNEL, "softmax", "pointer, sint32, sint32"); + this.argmax = buildKernel.execute(ENSEMBLE_KERNEL, "argmax", "const pointer, const pointer, pointer, sint32, sint32"); + this.initialize_rand = context.eval("js", "(x, m) => { for (let i = 0; i < x.length; i++) { x[i] = Math.floor(Math.random() * m) }}"); + } + + + @Override + public void initializeTest(int iteration) { + assert (!config.randomInit); // randomInit not supported yet + // Random init not optional + Random random = new Random(System.currentTimeMillis()); + + for (int i = 0; i < P; i++) + this.initialize_rand.execute(this.x[i], this.max_occurrence_of_ngram); + + for(int i=0; i(new ArrayList<>(new ArrayList<>())); + + for (int r = 0; r < config.size; r++) { + for (int c = 0; c < this.num_classes; c++) { + x_cpu.get(r).get(c).add(0); + } + } + + for (int r = 0; r < config.size; r++) { + for (int c = 0; c < this.num_classes; c++) { + for (int i = 0; i < this.S * this.num_features; i++) { + x_cpu.get(r).get(c).add(this.x[r * this.num_features + c].getArrayElement(i).asInt()); + } + } + } + + float[] r_g; + + // TODO: + /* + r1_g = naive_bayes_predict(x_cpu, self.nb_feat_log_prob_cpu, self.nb_class_log_prior_cpu) + r2_g = ridge_pred(normalize(x_cpu), self.ridge_coeff_cpu, self.ridge_intercept_cpu) + r_g = np.argmax(softmax(r1_g) + softmax(r2_g), axis=1) + self.cpu_result = r_g + + # Compare GPU and CPU results; + difference = 0 + for i in range(self.size): + difference += np.abs(self.cpu_result[i] - gpu_result[i]) + + */ + } + + private float[] softmax(float[] X, int n_col_x, int n_row_x) { + float row_exp_sum = 0; + float[] result = new float[X.length]; + + for (int r=0; r 0; offset /= 2) \n" + + " val += __shfl_down_sync(0xFFFFFFFF, val, offset);\n" + + " return val;\n" + + "}\n" + + "// z = ;\n" + + "extern \"C\" __global__ void l2_norm(const float *x, float* z, int N) {\n" + + " int warp_size = 32;\n" + + " float sum = float(0);\n" + + " for(int i = blockIdx.x * blockDim.x + 
threadIdx.x; i < N; i += blockDim.x * gridDim.x) {\n" + + " float x_tmp = x[i];\n" + + " sum += x_tmp * x_tmp;\n" + + " }\n" + + " sum = warp_reduce(sum); // Obtain the sum of values in the current warp;\n" + + " if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster\n" + + " atomicAdd(z, sum); // The first thread in the warp updates the output;\n" + + "}\n" + + "// z = ;\n" + + "extern \"C\" __global__ void dot(const float *x, const float *y, float* z, int N) {\n" + + " int warp_size = 32;\n" + + " float sum = float(0);\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) {\n" + + " sum += x[i] * y[i];\n" + + " }\n" + + " sum = warp_reduce(sum); // Obtain the sum of values in the current warp;\n" + + " if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster\n" + + " atomicAdd(z, sum); // The first thread in the warp updates the output;\n" + + "}"; + + private static final String SAXPY_KERNEL = "" + + "// y = val + alpha * x;\n" + + "extern \"C\" __global__ void saxpy(float* y, const float *val, const float *x, float alpha, int n) {\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {\n" + + " y[i] = val[i] + alpha * x[i];\n" + + " }\n" + + "}\n" + + "// Simply copy array x into y;\n" + + "extern \"C\" __global__ void cpy(float *y, const float *x, int n) {\n" + + " for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {\n" + + " y[i] = x[i];\n" + + " }\n" + + "}"; + + private Value precondition_kernel, mmul_kernel, mmul_axpy_kernel, l2_norm_kernel, dp_kernel, saxpy_kernel, copy_kernel, initialize_random_symmetric_matrix; + private Value[] A; + private Value x, b, p, r, y, t1, t2; + private int S; + + private final int P = 16; + private final int ITER = 50; + + public B9M(BenchmarkConfig currentConfig) { + super(currentConfig); + + this.S = 0; + this.A = new 
Value[this.P]; + for (int i = 0; i < this.P; i ++) this.A[i] = null; + this.x = null; + this.b = null; + this.p = null; + this.r = null; + this.y = null; + this.t1 = null; + this.t2 = null; + + this.mmul_axpy_kernel = null; + this.mmul_kernel = null; + this.l2_norm_kernel = null; + this.dp_kernel = null; + this.saxpy_kernel = null; + this.copy_kernel = null; + } + + @Override + public void allocateTest(int iteration) { + this.S = Math.floorDiv(config.size + this.P - 1, this.P); + + // Allocate vectors + for (int i = 0; i < this.P; i++) + this.A[i] = requestArray("float", this.S * config.size); + this.x = requestArray("float", config.size); + this.b = requestArray("float", config.size); + this.p = requestArray("float", config.size); + this.r = requestArray("float", config.size); + this.y = requestArray("float", config.size); + this.t1 = requestArray("float", 1); + this.t2 = requestArray("float", 1); + + // Build the kernels + Value buildKernel = context.eval("grcuda", "buildkernel"); + + this.precondition_kernel = buildKernel.execute(PRECONDITION_KERNEL, "precondition", "pointer, sint32, sint32, sint32"); + this.mmul_kernel = buildKernel.execute(MMUL_KERNEL, "matrix_vector_mult", "const pointer, const pointer, const pointer, sint32, sint32, sint32"); + this.mmul_axpy_kernel = buildKernel.execute(MMUL_KERNEL, "matrix_vector_mult_axpy", "const pointer, const pointer, pointer, float, const pointer, sint32, sint32, sint32"); + this.l2_norm_kernel = buildKernel.execute(DP_KERNEL, "l2_norm", "const pointer, pointer, sint32"); + this.dp_kernel = buildKernel.execute(DP_KERNEL, "dot", "const pointer, pointer, pointer, sint32"); + this.saxpy_kernel = buildKernel.execute(SAXPY_KERNEL, "saxpy", "pointer, const pointer, const pointer, float, sint32"); + this.copy_kernel = buildKernel.execute(SAXPY_KERNEL, "cpy", "pointer, pointer, sint32"); + this.initialize_random_symmetric_matrix = context.eval("js", "(X, S, N) => { \n" + + " for (let i = 0; i < N; i++) {\n" + + " s = (i / S) 
>> 0;\n" + + " k = i % S;\n" + + " Xs = X[s];\n" + + " i_N = k * N;\n" + + " for (let j = i; j < N; j++) {\n" + + " val = 2 * Math.random() - 1; \n" + + " Xs[i_N + j] = val;\n" + + " X[(j / S) >> 0][(j % S) * N + i] = val;\n" + + " }\n" + + " }}"); + } + + @Override + public void initializeTest(int iteration) { + this.initialize_random_symmetric_matrix.execute(this.A, this.S, config.size); + } + + @Override + public void resetIteration(int iteration) { + // Reset result + for (int i = 0; i < config.size; i++) + this.x.setArrayElement(i, 1.0 / config.size); + this.t1.setArrayElement(0, 0.0); + this.t2.setArrayElement(0, 0.0); + } + + @Override + public void runTest(int iteration) { + long start_comp = System.nanoTime(); + long end; + + // Initialization phase + // precondition: A += I * np.eps; + for (int i = 0; i < this.P; i++) { + this.precondition_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.A[i], config.size, Math.min(this.S, config.size - i * this.S), i * this.S); + } + + // r = b - A * x + for (int i = 0; i < this.P; i++) { + this.mmul_axpy_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.A[i], this.x, this.b, -1, this.r, this.S, config.size, i * this.S); + } + + // p = r + this.copy_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.p, this.r, config.size); + + // t1 = r^t * r + this.l2_norm_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.r, this.t1, config.size); + + for (int curr_iter = 0; curr_iter < this.ITER; curr_iter++) { + // t2 = p^t * A * p + for (int i = 0; i < this.P; i++) { + this.mmul_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.A[i], this.p, this.y, this.S, config.size, i * this.S); + } + this.dp_kernel.execute(config.numBlocks, config.blockSize1D). 
+ execute(this.p, this.y, this.t2, config.size); + + float alpha = this.t1.getArrayElement(0).asFloat() / this.t2.getArrayElement(0).asFloat(); + float old_r_norm_squared = this.t1.getArrayElement(0).asFloat(); + this.t1.setArrayElement(0, 0); + this.t2.setArrayElement(0, 0); + + // Update x: x = x + alpha * p + this.saxpy_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.x, this.x, this.p, alpha, config.size); + + // r = r - alpha * y + this.saxpy_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.r, this.r, this.y, -1 * alpha, config.size); + + // t1 = r^t * r + this.l2_norm_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.r, this.t1, config.size); + + float beta = this.t1.getArrayElement(0).asFloat() / old_r_norm_squared; + + this.saxpy_kernel.execute(config.numBlocks, config.blockSize1D). + execute(this.p, this.r, this.p, beta, config.size); + } + + // Add final sync step + float tmp = x.getArrayElement(0).asFloat(); + end = System.nanoTime(); + + benchmarkResults.setCurrentComputationSec((end - start_comp) / 1000000000F); + + // Compute GPU result + for (int i = 0; i < this.P; i++) { + this.mmul_axpy_kernel.execute(config.numBlocks, config.blockSize1D). 
+ execute(this.A[i], this.x, this.b, -1, this.y, Math.min(this.S, config.size - i * this.S), config.size, i * this.S); + } + + float sum = 0; + for (int i = 0; i < 10; i++) + sum += this.y.getArrayElement(i).asFloat(); + + benchmarkResults.setCurrentGpuResult(0); + } + + @Override + public void cpuValidation() { + float[][] A_cpu = new float[config.size][config.size]; + float[] b_cpu = new float[config.size]; + float[] x_cpu_1 = new float[config.size]; + float[] x_cpu = new float[config.size]; + float[] r_cpu = new float[config.size]; + float[] p_cpu = new float[config.size]; + float[] y_cpu = new float[config.size]; + float[] tmp; + float t1_cpu = 0; + float t2_cpu = 0; + float alpha_cpu; + float beta_cpu; + float t1_old_cpu; + + for (int i = 0; i < config.size; i++) x_cpu_1[i] = 0; + + int p_counter; + for (int i = 0; i < config.size; i++) { + p_counter = Math.floorDiv(i, this.S); + for (int j = 0; j < config.size; j++) + A_cpu[i][j] = this.A[p_counter].getArrayElement((i % this.S) * config.size + j).asFloat(); + } + +// System.out.println("Matrix test A-CPU"); +// System.out.println("Matrix A-CPU -> rowSize: " + A_cpu.length + "; colSize: " + A_cpu[0].length); +// for (int r=0; r detectedGPUS = new HashSet<>(); + ProcessBuilder builder = new ProcessBuilder(); + builder.command("nvidia-smi --query-gpu=gpu_name --format=csv".split("\\s+")); + Process process = builder.start(); + int exitCode = process.waitFor(); + assertEquals("Return value should be 0", 0, exitCode); + BufferedReader br=new BufferedReader(new InputStreamReader(process.getInputStream())); + String line; + StringBuilder sb = new StringBuilder(); + br.readLine(); // discard "name" at the beginning of the output + while((line=br.readLine())!=null){ + GPU g = GPU.valueOfName(line); + assertNotEquals("There is no configuration file for the current GPU model, please add it and modify the GPU enum in TestBenchmarks.java",null, g); + detectedGPUS.add(g); + } + assertEquals(1, detectedGPUS.size()); + 
this.currentGPU = detectedGPUS.iterator().next(); + } + + @Test + public void runAll_gtx1660_super() throws FileNotFoundException, ClassNotFoundException, InvocationTargetException, NoSuchMethodException, InstantiationException, IllegalAccessException, JsonProcessingException { + assumeTrue(this.currentGPU.equals(GPU.GTX1660_SUPER)); + + // get the configuration for the selected GPU into a Config class + String CONFIG_PATH = PATH + "/config_GTX1660_super.json"; + Gson gson = new GsonBuilder().setPrettyPrinting().create(); + JsonReader reader = new JsonReader(new FileReader(CONFIG_PATH)); + Config parsedConfig = gson.fromJson(reader, Config.class); + //System.out.println(gson.toJson(parsedConfig)); // print the current configuration + + iterateAllPossibleConfig(parsedConfig); + } + + @Test + public void runAll_gtx960_multi() throws FileNotFoundException, ClassNotFoundException, InvocationTargetException, NoSuchMethodException, InstantiationException, IllegalAccessException, JsonProcessingException { + assumeTrue(this.currentGPU.equals(GPU.GTX960)); + + // get the configuration for the selected GPU into a Config class + String CONFIG_PATH = PATH + "/config_GTX960.json"; + Gson gson = new GsonBuilder().setPrettyPrinting().create(); + JsonReader reader = new JsonReader(new FileReader(CONFIG_PATH)); + Config parsedConfig = gson.fromJson(reader, Config.class); + //System.out.println(gson.toJson(parsedConfig)); // print the current configuration + + iterateAllPossibleConfig(parsedConfig); + } + + @Test + public void runAll_V100_multi() throws FileNotFoundException, ClassNotFoundException, InvocationTargetException, NoSuchMethodException, InstantiationException, IllegalAccessException, JsonProcessingException { + assumeTrue(this.currentGPU.equals(GPU.V100)); + + // get the configuration for the selected GPU into a Config class + String CONFIG_PATH = PATH + "/config_V100.json"; + Gson gson = new GsonBuilder().setPrettyPrinting().create(); + JsonReader reader = new 
JsonReader(new FileReader(CONFIG_PATH)); + Config parsedConfig = gson.fromJson(reader, Config.class); + //System.out.println(gson.toJson(parsedConfig)); // print the current configuration + + iterateAllPossibleConfig(parsedConfig); + } + + @Test + public void runAll_A100_multi() throws FileNotFoundException, ClassNotFoundException, InvocationTargetException, NoSuchMethodException, InstantiationException, IllegalAccessException, JsonProcessingException { + assumeTrue(this.currentGPU.equals(GPU.A100)); + + // get the configuration for the selected GPU into a Config class + String CONFIG_PATH = PATH + "/config_A100.json"; + Gson gson = new GsonBuilder().setPrettyPrinting().create(); + JsonReader reader = new JsonReader(new FileReader(CONFIG_PATH)); + Config parsedConfig = gson.fromJson(reader, Config.class); + //System.out.println(gson.toJson(parsedConfig)); // print the current configuration + + iterateAllPossibleConfig(parsedConfig); + } + + /* + This method mirrors the configuration-iteration pattern of benchmark_wrapper.py from the Python benchmark suite. 
+ */ + private void iterateAllPossibleConfig(Config parsedConfig) throws ClassNotFoundException, InvocationTargetException, NoSuchMethodException, InstantiationException, IllegalAccessException, JsonProcessingException { + String BANDWIDTH_MATRIX; + ArrayList<String> dp, nsp, psp, cdp; + ArrayList<Integer> ng, block_sizes; + Integer nb; // number of blocks + Integer blockSize1D, blockSize2D; + int num_iter = parsedConfig.num_iter; + + Benchmark benchToRun; + for(String bench : parsedConfig.benchmarks){ // given bench X from the set of all the benchmarks iterate over the number of elements associated with that benchmark + ArrayList<Integer> sizes = parsedConfig.num_elem.get(bench); + if(sizes == null) continue; //skip everything if no sizes are specified for the current bench + for(Integer curr_size : sizes){ // given a specific input size iterate over the various execution policies + for(String policy : parsedConfig.exec_policies){ + if(policy.equals("sync")){ + dp = new ArrayList<>(List.of(parsedConfig.dependency_policies.get(0))); + nsp = new ArrayList<>(List.of(parsedConfig.new_stream_policies.get(0))); + psp = new ArrayList<>(List.of(parsedConfig.parent_stream_policies.get(0))); + cdp = new ArrayList<>(List.of(parsedConfig.choose_device_policies.get(0))); + ng = new ArrayList<>(List.of(1)); + } + else{ + dp = parsedConfig.dependency_policies; + nsp = parsedConfig.new_stream_policies; + psp = parsedConfig.parent_stream_policies; + cdp = parsedConfig.choose_device_policies; + ng = parsedConfig.num_gpus; + } + for(int num_gpu : ng){ + if(policy.equals("async") && num_gpu == 1){ + dp = new ArrayList<>(List.of(parsedConfig.dependency_policies.get(0))); + nsp = new ArrayList<>(List.of(parsedConfig.new_stream_policies.get(0))); + psp = new ArrayList<>(List.of(parsedConfig.parent_stream_policies.get(0))); + cdp = new ArrayList<>(List.of(parsedConfig.choose_device_policies.get(0))); + } + else{ + dp = parsedConfig.dependency_policies; + nsp = parsedConfig.new_stream_policies; + psp = 
parsedConfig.parent_stream_policies; + cdp = parsedConfig.choose_device_policies; + } + for(String m : parsedConfig.memory_advise){ + for(Boolean p : parsedConfig.prefetch ){ + for(Boolean s : parsedConfig.stream_attach){ + for(Boolean t : parsedConfig.time_computation){ + BANDWIDTH_MATRIX= GRCUDA_HOME+"/projects/resources/connection_graph/datasets/connection_graph.csv"; + for(String dependency_policy : dp){ + for(String new_stream_policy : nsp){ + for(String parent_stream_policy : psp){ + for(String choose_device_policy : cdp){ + BenchmarkConfig config = new BenchmarkConfig(); + + nb = parsedConfig.numBlocks.get(bench); + if(nb != null) config.numBlocks = nb; + + blockSize1D = parsedConfig.block_size1d.get(bench); + if(blockSize1D != null) config.blockSize1D = blockSize1D; + + blockSize2D = parsedConfig.block_size2d.get(bench); + if(blockSize2D != null) config.blockSize2D = blockSize2D; + + config.debug = parsedConfig.debug; + config.benchmarkName = bench; + config.size = curr_size; + config.numGpus = num_gpu; + config.executionPolicy = policy; + config.dependencyPolicy = dependency_policy; + config.retrieveNewStreamPolicy = new_stream_policy; + config.retrieveParentStreamPolicy = parent_stream_policy; + config.deviceSelectionPolicy = choose_device_policy; + config.inputPrefetch = p; + config.totIter = num_iter; + config.forceStreamAttach = s; + config.memAdvisePolicy = m; + config.bandwidthMatrix = BANDWIDTH_MATRIX; + config.enableComputationTimers =t; + config.nvprof_profile = parsedConfig.nvprof_profile; + config.gpuModel = this.currentGPU.name; + config.results_path = this.results_path; + config.reInit = parsedConfig.reInit; + + System.out.println(config); + benchToRun = createBench(config); + benchToRun.run(); + } + } + } + } + } + } + } + } + } + } + } + } + } + + private Benchmark createBench(BenchmarkConfig config) throws ClassNotFoundException, NoSuchMethodException, InvocationTargetException, InstantiationException, IllegalAccessException { + // Courtesy 
of https://stackoverflow.com/questions/7495785/java-how-to-instantiate-a-class-from-string + + Class<?> currBenchClass = Class.forName("it.necst.grcuda.benchmark.bench."+config.benchmarkName); + + Class<?>[] types = {BenchmarkConfig.class}; + Constructor<?> constructor = currBenchClass.getConstructor(types); + + Object[] parameters = {config}; + + return (Benchmark) constructor.newInstance(parameters); + + } + +} + +enum GPU { + GTX1660_SUPER("GeForce GTX 1660 SUPER"), + A100("NVIDIA A100-SXM4-40GB"), + V100("Tesla V100-SXM2-16GB"), + GTX960("GeForce GTX 960"); + + public final String name; + + GPU(String name){ + this.name = name; + } + + public static GPU valueOfName(String toGet){ + for(GPU g : values()){ + if(g.name.equals(toGet)) + return g; + } + return null; + } +} + +/** + * Used to map/parse the JSON config files to a class + */ +class Config { + int num_iter; + int heap_size; + + boolean reInit = false; + boolean randomInit; + boolean cpuValidation; + boolean debug; + boolean nvprof_profile; + + ArrayList<String> benchmarks; + ArrayList<String> exec_policies; + ArrayList<String> dependency_policies; + ArrayList<String> new_stream_policies; + ArrayList<String> parent_stream_policies; + ArrayList<String> choose_device_policies; + ArrayList<String> memory_advise; + + ArrayList<Boolean> prefetch; + ArrayList<Boolean> stream_attach; + ArrayList<Boolean> time_computation; + + ArrayList<Integer> num_gpus; + + HashMap<String, ArrayList<Integer>> num_elem; + HashMap<String, Integer> numBlocks; + HashMap<String, Integer> block_size1d; + HashMap<String, Integer> block_size2d; + + @Override + public String toString() { + return "Config{" + + "num_iter=" + num_iter + + ", heap_size=" + heap_size + + ", reInit=" + reInit + + ", randomInit=" + randomInit + + ", cpuValidation=" + cpuValidation + + ", benchmarks=" + benchmarks + + ", exec_policies=" + exec_policies + + ", dependency_policies=" + dependency_policies + + ", new_stream_policies=" + new_stream_policies + + ", parent_stream_policies=" + parent_stream_policies + + ", choose_device_policies=" + choose_device_policies + + ", memory_advise=" + memory_advise + + ", prefetch=" + prefetch + + 
", stream_attach=" + stream_attach + + ", time_computation=" + time_computation + + ", num_gpus=" + num_gpus + + ", num_elem=" + num_elem + + ", numBlocks=" + numBlocks + + ", block_size1d=" + block_size1d + + ", block_size2d=" + block_size2d + + '}'; + } +} + + diff --git a/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_A100.json b/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_A100.json new file mode 100644 index 00000000..473c11aa --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_A100.json @@ -0,0 +1,48 @@ +{ + "num_iter": 30, + "reInit": false, + "randomInit": false, + "cpuValidation": false, + "heap_size": 470, + "debug": false, + "nvprof_profile": false, + "num_elem": { + "B1M": [160000000,250000000, 500000000, 800000000, 950000000], + "B5M": [10000000,16000000, 21000000, 28000000, 35000000], + "B6M": [1000000,1200000, 1400000, 1600000, 1800000], + "B9M": [20000, 30000, 40000, 50000, 60000], + "B11M": [20000, 30000, 40000, 50000, 60000] + }, + "benchmarks": ["B1M", "B5M", "B6M", "B9M", "B11M"], + "numBlocks": { + "B1M": 64, + "B5M": 64, + "B6M": 64, + "B9M": 64, + "B11M": 64 + }, + "exec_policies" : ["async"], + "dependency_policies": ["with-const"], + "new_stream_policies": ["always-new"], + "parent_stream_policies": ["multigpu-disjoint"], + "choose_device_policies": ["round-robin","stream-aware","min-transfer-size", "minmax-transfer-time"], + "memory_advise": ["none"], + "prefetch": [false], + "stream_attach": [false], + "time_computation": [false], + "num_gpus": [1, 2, 4, 8], + "block_size1d": { + "B1M": 32, + "B5M": 1024, + "B6M": 32, + "B9M": 32, + "B11M": 256 + }, + "block_size2d": { + "B1M": 8, + "B5M": 8, + "B6M": 8, + "B9M": 8, + "B11M": 8 + } +} \ No newline at end of file diff --git a/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_GTX1660_super.json 
b/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_GTX1660_super.json new file mode 100644 index 00000000..692d5feb --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_GTX1660_super.json @@ -0,0 +1,36 @@ +{ + "num_iter": 2, + "reInit": false, + "randomInit": false, + "cpuValidation": true, + "heap_size": 26, + "debug": true, + "nvprof_profile": false, + "num_elem": { + "B1": [60000000, 80000000], + "B5M": [6000000, 8000000] + }, + "benchmarks": ["B1", "B5M"], + "numBlocks": { + "B1": 32, + "B5M": 64 + }, + "exec_policies" : ["sync", "async"], + "dependency_policies": ["with-const"], + "new_stream_policies": ["always-new"], + "parent_stream_policies": ["disjoint", "multigpu-disjoint"], + "choose_device_policies": ["round-robin", "stream-aware", "minmax-transfer-time"], + "memory_advise": ["none"], + "prefetch": [false], + "stream_attach": [false], + "time_computation": [false], + "num_gpus": [1], + "block_size1d": { + "B1": 32, + "B5M": 1024 + }, + "block_size2d": { + "B1": 8, + "B5M": 8 + } +} \ No newline at end of file diff --git a/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_GTX960.json b/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_GTX960.json new file mode 100644 index 00000000..262df93c --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_GTX960.json @@ -0,0 +1,44 @@ +{ + "num_iter": 10, + "reInit": false, + "randomInit": false, + "cpuValidation": true, + "heap_size": 26, + "debug": false, + "nvprof_profile": false, + "num_elem": { + "B1": [20000000, 60000000, 80000000, 100000000, 120000000], + "B1M": [20000000, 60000000, 80000000, 100000000, 120000000], + "B5M": [2000000,6000000, 8000000, 10000000, 12000000], + "B11M": [10000, 15000, 20000] + }, + "benchmarks": ["B1M","B5M","B11M"], + "numBlocks": { + "B1": 32, + "B1M": 64, + 
"B5M": 64, + "B11M": 64 + }, + "exec_policies" : ["async"], + "dependency_policies": ["with-const"], + "new_stream_policies": ["always-new"], + "parent_stream_policies": ["disjoint", "multigpu-disjoint"], + "choose_device_policies": ["round-robin", "stream-aware", "minmax-transfer-time"], + "memory_advise": ["none"], + "prefetch": [false], + "stream_attach": [false], + "time_computation": [false], + "num_gpus": [2], + "block_size1d": { + "B1": 32, + "B1M": 32, + "B5M": 1024, + "B11M": 256 + }, + "block_size2d": { + "B1": 8, + "B1M": 8, + "B5M": 8, + "B11M": 8 + } +} \ No newline at end of file diff --git a/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_V100.json b/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_V100.json new file mode 100644 index 00000000..473c11aa --- /dev/null +++ b/projects/resources/java/grcuda-benchmark/src/test/java/it/necst/grcuda/benchmark/config_V100.json @@ -0,0 +1,48 @@ +{ + "num_iter": 30, + "reInit": false, + "randomInit": false, + "cpuValidation": false, + "heap_size": 470, + "debug": false, + "nvprof_profile": false, + "num_elem": { + "B1M": [160000000,250000000, 500000000, 800000000, 950000000], + "B5M": [10000000,16000000, 21000000, 28000000, 35000000], + "B6M": [1000000,1200000, 1400000, 1600000, 1800000], + "B9M": [20000, 30000, 40000, 50000, 60000], + "B11M": [20000, 30000, 40000, 50000, 60000] + }, + "benchmarks": ["B1M", "B5M", "B6M", "B9M", "B11M"], + "numBlocks": { + "B1M": 64, + "B5M": 64, + "B6M": 64, + "B9M": 64, + "B11M": 64 + }, + "exec_policies" : ["async"], + "dependency_policies": ["with-const"], + "new_stream_policies": ["always-new"], + "parent_stream_policies": ["multigpu-disjoint"], + "choose_device_policies": ["round-robin","stream-aware","min-transfer-size", "minmax-transfer-time"], + "memory_advise": ["none"], + "prefetch": [false], + "stream_attach": [false], + "time_computation": [false], + "num_gpus": [1, 2, 4, 8], + 
"block_size1d": { + "B1M": 32, + "B5M": 1024, + "B6M": 32, + "B9M": 32, + "B11M": 256 + }, + "block_size2d": { + "B1M": 8, + "B5M": 8, + "B6M": 8, + "B9M": 8, + "B11M": 8 + } +} \ No newline at end of file diff --git a/projects/resources/python/benchmark/__init__.py b/projects/resources/python/benchmark/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/projects/resources/python/benchmark/bench/__init__.py b/projects/resources/python/benchmark/bench/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/projects/resources/python/benchmark/bench/multi_gpu/__init__.py b/projects/resources/python/benchmark/bench/multi_gpu/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/projects/resources/python/benchmark/bench/multi_gpu/bench_1.py b/projects/resources/python/benchmark/bench/multi_gpu/bench_1.py new file mode 100644 index 00000000..af193acb --- /dev/null +++ b/projects/resources/python/benchmark/bench/multi_gpu/bench_1.py @@ -0,0 +1,187 @@ +# coding=utf-8 +import polyglot +from java.lang import System +import numpy as np +from random import random, seed + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D, DEFAULT_NUM_BLOCKS +from benchmark_result import BenchmarkResult + +############################## +############################## + +# Number of partitions; +P = 16 + +SQUARE_KERNEL = """ +extern "C" __global__ void square(const float *x, float *y, int n) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = x[i] * x[i]; + } +} +""" + +REDUCE_KERNEL = """ + +// From https://devblogs.nvidia.com/faster-parallel-reductions-kepler/ + +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +extern "C" __global__ void reduce(const float *x, const float *y, float *z, int N) { + int warp_size = 32; + 
float sum = float(0); + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] - y[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} +""" + +############################## +############################## + + +class Benchmark1M(Benchmark): + """ + Compute the sum of difference of squares of 2 vectors, using multiple GrCUDA kernels. + Parallelize the computation on multiple GPUs, by computing a chunk of the output on each. + Then aggregate results on the CPU; + Structure of the computation: + * GPU0: + A: x^2 ──┐ + ├─> C: z0=sum(x-y) + B: x^2 ──┘ + * GPU1: + A: x^2 ──┐ + ├─> C: z1=sum(x-y) + B: x^2 ──┘ + * GPU2: [...] + * CPU: z = z0 + z1 + ... + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b1m", benchmark, nvprof_profile) + self.size = 0 + self.S = 0 + self.x = None + self.y = None + self.x1 = None + self.y1 = None + self.z = None + self.res = None + self.square_kernel = None + self.diff_kernel = None + self.reduce_kernel = None + self.initialize = None + self.res_tot = 0 + self.cpu_result = 0 + + # self.num_blocks = DEFAULT_NUM_BLOCKS + self.block_size = DEFAULT_BLOCK_SIZE_1D + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.block_size = block_size["block_size_1d"] + + # Number of items in each partition; + self.S = (self.size + P - 1) // P + + self.x = [None for _ in range(P)] + self.y = [None for _ in range(P)] + self.x1 = [None for _ in range(P)] + self.y1 = [None for _ in range(P)] + self.res = [None for _ in range(P)] + + # Allocate 2 vectors; + for i in range(P): + self.x[i] = polyglot.eval(language="grcuda", string=f"float[{self.S}]") + self.y[i] = polyglot.eval(language="grcuda", 
string=f"float[{self.S}]") + self.x1[i] = polyglot.eval(language="grcuda", string=f"float[{self.S}]") + self.y1[i] = polyglot.eval(language="grcuda", string=f"float[{self.S}]") + self.res[i] = polyglot.eval(language="grcuda", string=f"float[1]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.square_kernel = build_kernel(SQUARE_KERNEL, "square", "const pointer, pointer, sint32") + self.reduce_kernel = build_kernel(REDUCE_KERNEL, "reduce", "const pointer, const pointer, pointer, sint32") + + self.initialize = polyglot.eval(language="js", string="(x, i, N, a) => { for (let j = 0; j < x.length; j++) { let index = i * x.length + j; if (index < N) {x[j] = a / (index + 1); }}}") + + + @time_phase("initialization") + def init(self): + for i in range(P): + self.initialize(self.x[i], i, self.size, 1) + self.initialize(self.y[i], i, self.size, 2) + + @time_phase("reset_result") + def reset_result(self) -> None: + for i in range(P): + self.initialize(self.x[i], i, self.size, 1) + self.initialize(self.y[i], i, self.size, 2) + self.res[i][0] = 0.0 + self.res_tot = 0 + + def execute(self) -> object: + self.block_size = self._block_size["block_size_1d"] + start_comp = System.nanoTime() + start = 0 + for i in range(P): + # A, B. Call the kernel. The 2 computations are independent, and can be done in parallel; + self.execute_phase(f"square_1_{i}", self.square_kernel(self.num_blocks, self.block_size), self.x[i], self.x1[i], self.S) + self.execute_phase(f"square_2_{i}", self.square_kernel(self.num_blocks, self.block_size), self.y[i], self.y1[i], self.S) + # C. 
Compute the sum of the result; + self.execute_phase(f"reduce_{i}", self.reduce_kernel(self.num_blocks, self.block_size), self.x1[i], self.y1[i], self.res[i], self.S) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + for i in range(P): + self.res_tot += self.res[i][0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + self.benchmark.add_to_benchmark("gpu_result", self.res_tot) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: {self.res_tot:.4f}") + + return self.res_tot + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + seed(self.random_seed) + if self.benchmark.random_init: + x_g = np.zeros(self.size) + y_g = np.zeros(self.size) + for i in range(self.size): + x_g[i] = random() + y_g[i] = 2 * random() + else: + x_g = 1 / np.linspace(1, self.size, self.size) + y_g = 2 / np.linspace(1, self.size, self.size) + + x_g = x_g ** 2 + y_g = y_g ** 2 + x_g -= y_g + self.cpu_result = np.sum(x_g) + cpu_time = System.nanoTime() - start + difference = np.abs(self.cpu_result - gpu_result) + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", difference) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: {self.cpu_result:.4f}, " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/bench/multi_gpu/bench_11.py b/projects/resources/python/benchmark/bench/multi_gpu/bench_11.py new file mode 100644 index 00000000..a5e246de --- /dev/null +++ 
b/projects/resources/python/benchmark/bench/multi_gpu/bench_11.py @@ -0,0 +1,193 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Sat Sep 18 09:25:57 2021 +@author: alberto.parravicini +""" + +import polyglot +from java.lang import System +import numpy as np +from random import random, randint, seed, sample, uniform + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D, DEFAULT_BLOCK_SIZE_2D, DEFAULT_NUM_BLOCKS +from benchmark_result import BenchmarkResult + +############################## +############################## + +MATRIX_VECTOR_MULT_KERNEL = """ +extern "C" __global__ void matrix_vector_mult_1(const float* x, const float* y, float* z, int n, int m, int z_offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < m; j++) { + sum += x[i * m + j] * y[j]; + } + z[z_offset + i] = sum; + } +} + +extern "C" __global__ void matrix_vector_mult_2(const float* x, const float* y, float* z, int n, int m) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < m; j++) { + sum += x[i * m + j] * y[j]; + } + z[i] = sum; + } +} + +extern "C" __global__ void copy(const float *x, float *y, int n, int offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i + offset] = x[i]; + } +} +""" + +'''THIS IS EMPLOYED AS DEFAULT BENCHMARK FOR MULTIGPU TEST''' + +class Benchmark11M(Benchmark): + """ + Dense matrix-vector multiplication, partitioning the matrix in blocks of rows; + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b11m", benchmark, nvprof_profile) + self.size = 0 + + # Square matrix of size x size; + self.N = self.size + self.M = self.size + + # Use P horizontal partitions; + self.P = 16 + + # Size of partitions; + self.S = (self.N + self.P - 1) // self.P + + # Full matrix; + 
self.x_cpu = None + # Dense vector; + self.y_cpu = None + + # The GPU matrix is stored using P arrays; + self.x = [None for _ in range(self.P)] + # Dense vector; + self.y = None + # Result; + # self.z = None + self.z = [None for _ in range(self.P)] + self.z_out = None + + self.cpu_result = None + self.gpu_result = None + + self.block_size_1d = DEFAULT_BLOCK_SIZE_1D + self.block_size_2d = DEFAULT_BLOCK_SIZE_2D + + self.matrix_vector_mult_kernel = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.N = self.size + self.M = self.size + self.S = (self.N + self.P - 1) // self.P + self.block_size_1d = block_size["block_size_1d"] + self.block_size_2d = block_size["block_size_2d"] + + self.gpu_result = 0.0 + + # Allocate vectors; + for p in range(self.P): + self.x[p] = polyglot.eval(language="grcuda", string=f"float[{self.S * self.M}]") + self.y = polyglot.eval(language="grcuda", string=f"float[{self.M}]") + # self.z = polyglot.eval(language="grcuda", string=f"float[{self.N}]") + for p in range(self.P): + self.z[p] = polyglot.eval(language="grcuda", string=f"float[{self.S}]") + self.z_out = polyglot.eval(language="grcuda", string=f"float[{self.N}]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + # self.matrix_vector_mult_kernel = build_kernel(MATRIX_VECTOR_MULT_KERNEL, "matrix_vector_mult_2", "const pointer, const pointer, pointer, sint32, sint32, sint32") + self.matrix_vector_mult_kernel = build_kernel(MATRIX_VECTOR_MULT_KERNEL, "matrix_vector_mult_2", "const pointer, const pointer, pointer, sint32, sint32") + self.copy_kernel = build_kernel(MATRIX_VECTOR_MULT_KERNEL, "copy", "const pointer, pointer, sint32, sint32") + self.initialize = polyglot.eval(language="js", string="x => { for (let i = 0; i < x.length; i++) { x[i] = i / x.length }}") + + @time_phase("initialization") + def init(self): + self.random_seed = 10 + seed(self.random_seed) + + 
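The `execute()` method below multiplies each of the P row-blocks of the matrix by the shared dense vector, then stitches the partial results into `z_out` with the `copy` kernel. The same partitioning scheme can be sketched in plain NumPy (a standalone illustration with assumed toy sizes, not code from the suite):

```python
import numpy as np

# Row-partitioning scheme used by Benchmark11M: the N x M matrix is split into
# P horizontal blocks of at most S rows, each block is multiplied by the dense
# vector independently (the "mmul" phases), and the partial results are
# concatenated into the output (the role of the "copy" kernel).
N = M = 40                    # toy size; the benchmark uses size x size
P = 16                        # number of horizontal partitions
S = (N + P - 1) // P          # rows per partition

x = np.linspace(0, 1, N * M, dtype=np.float32).reshape(N, M)
y = np.linspace(0, 1, M, dtype=np.float32)

z_out = np.zeros(N, dtype=np.float32)
for p in range(P):
    rows = min(S, N - p * S)  # the last partitions can be shorter or empty
    if rows <= 0:
        continue
    z_out[p * S : p * S + rows] = x[p * S : p * S + rows] @ y

# The partitioned result matches the unpartitioned product;
assert np.allclose(z_out, x @ y, rtol=1e-5)
```

Since each block only reads its own S x M slice of the matrix plus the shared vector, the P multiplications are independent and can be scheduled on different GPUs.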
@time_phase("reset_result") + def reset_result(self) -> None: + self.gpu_result = 0.0 + for p in range(self.P): + self.initialize(self.x[p]) + for i in range(self.M): + self.y[i] = i / self.M + # for p in range(self.P): + # for i in range(len(self.x[p])): + # print(f"p={p}, x[{p}][{i}]={self.x[p][i]}") + # for i in range(len(self.y)): + # print(f"i={i}, y[{i}]={self.y[i]}") + + def execute(self) -> object: + self.block_size = self._block_size["block_size_1d"] + # Schedule the matrix-vector multiplication kernels, one per partition; + start_comp = System.nanoTime() + start = 0 + + # Compute all partitions; + for p in range(self.P): + # self.execute_phase(f"mmul_{p}", self.matrix_vector_mult_kernel(self.num_blocks, self.block_size), + # self.x[p], self.y, self.z, min(self.S, self.N - p * self.S), self.M, p * self.S) + self.execute_phase(f"mmul_{p}", self.matrix_vector_mult_kernel(self.num_blocks, self.block_size), + self.x[p], self.y, self.z[p], min(self.S, self.N - p * self.S), self.M) + # Aggregate results; + for p in range(self.P): + self.execute_phase(f"copy_{p}", self.copy_kernel(self.num_blocks, self.block_size), + self.z[p], self.z_out, min(self.S, self.N - p * self.S), p * self.S) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + tmp = self.z_out[0] + end = System.nanoTime() + self.gpu_result = sum(self.z_out[:10]) + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + self.benchmark.add_to_benchmark("gpu_result", self.gpu_result) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: [" + ", ".join([f"{x:.4f}" for x in self.z_out[:10]]) + "...]") + + return self.gpu_result + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + start = System.nanoTime() + x_cpu = [0.0] * self.N * self.M + y_cpu = 
[self.y[i] for i in range(len(self.y))] + for i in range(self.P): + for j in range(self.S * self.M): + if i * self.S * self.M + j < len(x_cpu): + x_cpu[i * self.S * self.M + j] = j / (self.S * self.M) + z_cpu = np.array(x_cpu).reshape((self.N, self.M)) @ np.array(y_cpu) + self.cpu_result = sum(z_cpu[:10]) + cpu_time = System.nanoTime() - start + + # Compare GPU and CPU results; + difference = np.abs(self.cpu_result - gpu_result) + + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", str(difference)) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: [" + ", ".join([f"{x:.4f}" for x in z_cpu[:10]]) + "...]; " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + + + + + diff --git a/projects/resources/python/benchmark/bench/multi_gpu/bench_5.py b/projects/resources/python/benchmark/bench/multi_gpu/bench_5.py new file mode 100644 index 00000000..27b6ac02 --- /dev/null +++ b/projects/resources/python/benchmark/bench/multi_gpu/bench_5.py @@ -0,0 +1,186 @@ +# coding=utf-8 +import polyglot +import time +import numpy as np +from random import random, randint, seed + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D, DEFAULT_NUM_BLOCKS +from benchmark_result import BenchmarkResult +from java.lang import System +import math + +############################## +############################## + +R = 0.08 +V = 0.3 +T = 1.0 +K = 60.0 + +BS_KERNEL = """ +__device__ inline double cndGPU(double d) { + const double A1 = 0.31938153f; + const double A2 = -0.356563782f; + const double A3 = 1.781477937f; + const double A4 = -1.821255978f; + const double A5 = 1.330274429f; + const double RSQRT2PI = 0.39894228040143267793994605993438f; + + double + K = 1.0 / (1.0 + 0.2316419 * fabs(d)); + + double + cnd = RSQRT2PI * exp(- 0.5f * d * d) * + (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))); + + if (d > 0) + cnd = 1.0 - cnd; + + return cnd; +} + +extern "C" __global__ 
void bs(const double *x, double *y, int N, double R, double V, double T, double K) { + + double sqrtT = 1.0 / rsqrt(T); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + double expRT; + double d1, d2, CNDD1, CNDD2; + d1 = (log(x[i] / K) + (R + 0.5 * V * V) * T) / (V * sqrtT); + d2 = d1 - V * sqrtT; + + CNDD1 = cndGPU(d1); + CNDD2 = cndGPU(d2); + + //Calculate Call and Put simultaneously + expRT = exp(-R * T); + y[i] = x[i] * CNDD1 - K * expRT * CNDD2; + } +} +""" + +############################## +############################## + + +class Benchmark5M(Benchmark): + """ + Black & Scholes equation benchmark, executed concurrently on different input vectors; + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b5m", benchmark, nvprof_profile) + self.size = 0 + + # self.num_blocks = DEFAULT_NUM_BLOCKS + self.sum_kernel = None + self.cpu_result = 0 + self.block_size = DEFAULT_BLOCK_SIZE_1D + + self.K = 24 + self.x = [[]] * self.K + self.x_tmp = None + self.y = [[]] * self.K + + self.bs_kernel = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.block_size = block_size["block_size_1d"] + self.x_tmp = None + + # Allocate vectors; + for i in range(self.K): + self.x[i] = polyglot.eval(language="grcuda", string=f"double[{size}]") + self.y[i] = polyglot.eval(language="grcuda", string=f"double[{size}]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.bs_kernel = build_kernel(BS_KERNEL, "bs", "const pointer, pointer, sint32, double, double, double, double") + + @time_phase("initialization") + def init(self): + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + self.x_tmp = [K] * self.size + if self.benchmark.random_init: + for i in range(len(self.x_tmp)): + self.x_tmp[i] = random() - 0.5 + K + + @time_phase("reset_result") + def 
reset_result(self) -> None: + for i in range(self.K): + X = self.x[i] + for j in range(self.size): + X[j] = self.x_tmp[j] + + def execute(self) -> object: + self.block_size = self._block_size["block_size_1d"] + result = [0] * self.K + + # Call the kernels; + start_comp = System.nanoTime() + start = System.nanoTime() + for i in range(self.K): + self.execute_phase(f"bs_{i}", self.bs_kernel(self.num_blocks, self.block_size), self.x[i], self.y[i], self.size, R, V, T, K) + + if self.time_phases: + start = System.nanoTime() + for i in range(self.K): + result[i] = self.y[i][0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + + self.benchmark.add_to_benchmark("gpu_result", result[0]) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: {result[0]}") + + return result[0] + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + def CND(X): + """ + Cumulative normal distribution. + Helper function used by BS(...). + """ + + (a1, a2, a3, a4, a5) = (0.31938153, -0.356563782, 1.781477937, -1.821255978, 1.330274429) + L = np.absolute(X) + K = np.float64(1.0) / (1.0 + 0.2316419 * L) + w = 1.0 - 1.0 / math.sqrt(2 * np.pi) * np.exp(-L * L / 2.) * \ + (a1 * K + + a2 * (K ** 2) + + a3 * (K ** 3) + + a4 * (K ** 4) + + a5 * (K ** 5)) + + mask = X < 0 + w = w * ~mask + (1.0 - w) * mask + + return w + + def BS(X, R, V, T, K): + """Black Scholes Function.""" + d1_arr = (np.log(X / K) + (R + V * V / 2.) 
* T) / (V * math.sqrt(T)) + d2_arr = d1_arr - V * math.sqrt(T) + w_arr = CND(d1_arr) + w2_arr = CND(d2_arr) + return X * w_arr - X * math.exp(-R * T) * w2_arr + + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + res = BS(np.array(self.x_tmp), R, V, T, K) + self.cpu_result = res[0] + cpu_time = System.nanoTime() - start + difference = np.abs(self.cpu_result - gpu_result) + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", difference) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: {self.cpu_result:.4f}, " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/bench/multi_gpu/bench_6.py b/projects/resources/python/benchmark/bench/multi_gpu/bench_6.py new file mode 100644 index 00000000..2f7937f7 --- /dev/null +++ b/projects/resources/python/benchmark/bench/multi_gpu/bench_6.py @@ -0,0 +1,420 @@ +# coding=utf-8 +import polyglot +import time +from java.lang import System +import numpy as np +from random import random, randint, seed + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D, DEFAULT_NUM_BLOCKS +from benchmark_result import BenchmarkResult + +############################## +############################## + +BLOCK_SIZE_V100 = 64 # Just a recommendation of optimal block size for the V100; +P = 16 + +NB_KERNEL = """ +extern "C" __global__ void nb_1(const int* x, const float* y, float* z, int n, int partition_rows, int n_feat, int n_classes, int partition_num) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < min(partition_rows, n - partition_num * partition_rows); i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_classes; j++) { + for (int q = 0; q < n_feat; q++) { + z[partition_num * partition_rows * n_classes + i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q]; + } + } + } +} + +extern "C" __global__ 
void nb_2(const float* x, float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float curr_max = x[i * n_col_x]; + for (int j = 0; j < n_col_x; j++) { + curr_max = fmaxf(curr_max, x[i * n_col_x + j]); + } + y[i] = curr_max; + } +} + +extern "C" __global__ void nb_3(const float* x, const float* y, float* z, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < n_col_x; j++) { + sum += expf(x[i * n_col_x + j] - y[i]); + } + z[i] = logf(sum) + y[i]; + } +} + +extern "C" __global__ void nb_4(float* x, float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] = expf(x[i * n_col_x + j] - y[i]); + } + } +} +""" + +RR_KERNEL = """ +extern "C" __global__ void rr_1(const int* x, float* mean, float *std, int n_row_x, int n_col_x, int partition, int partition_size) { + for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n_col_x; j += blockDim.x * gridDim.x) { + float feature_mean = 0; + float sum_sq = 0; + // Compute mean and variance; + for (int i = 0; i < partition_size; i++) { + float x_tmp = x[j * partition_size + i]; + feature_mean += x_tmp; + sum_sq += x_tmp * x_tmp; + } + // feature_mean /= n_row_x; + // std[j] = sqrtf(sum_sq / n_row_x - feature_mean * feature_mean); + // mean[j] = feature_mean; + + // Keep just the sum and squared sum, compute mean and std later; + mean[j] += feature_mean; + std[j] += sum_sq; + } +} + +extern "C" __global__ void rr_1_1(float* mean, float *std, const float *mean_curr, const float *std_curr, int n_row_x, int n_col_x, int partition_index, int partition_size) { + // We use partition 0 to accumulate, so skip it; + if (partition_index == 0) return; + + // Aggregate mean and std from different partitions; + for 
(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_col_x; i += blockDim.x * gridDim.x) { + mean[i] += mean_curr[i]; + std[i] += std_curr[i]; + // When processing the last partition, compute the final mean and std; + if (partition_index == %d - 1) { + mean[i] /= n_row_x; + std[i] = sqrtf(std[i] / n_row_x - mean[i] * mean[i]); + } + } +} + +extern "C" __global__ void rr_1_2(const int *x, float *y, const float* mean, const float *std, int n_row_x, int n_col_x, int partition_size) { + // Normalize each row; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < partition_size; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + float mean_curr = mean[j]; + float std_curr = std[j]; + y[i * n_col_x + j] = (x[i * n_col_x + j] - mean_curr) / std_curr; + } + } +} + +extern "C" __global__ void rr_2(const float* x, const float* y, float* z, int n, int partition_rows, int n_feat, int n_classes, int partition_num) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < min(partition_rows, n - partition_num * partition_rows); i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_classes; j++) { + for (int q = 0; q < n_feat; q++) { + z[partition_num * partition_rows * n_classes + i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q]; + } + } + } +} + +extern "C" __global__ void rr_3(float* x, const float* y, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] += y[j]; + } + } +} +""" % (P) + +ENSEMBLE_KERNEL = """ +extern "C" __global__ void softmax(float* x, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float row_exp_sum = 0; + for (int j = 0; j < n_col_x; j++) { + row_exp_sum += expf(x[i * n_col_x + j]); + } + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] = expf(x[i * n_col_x + j]) / row_exp_sum; + } + } +} + +extern "C" 
__global__ void argmax(const float* x, const float* y, int* z, int n_row_x, int n_col_x) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + int curr_best_index = 0; + float curr_best = x[i * n_col_x] + y[i * n_col_x]; + for (int j = 0; j < n_col_x; j++) { + float curr = x[i * n_col_x + j] + y[i * n_col_x + j]; + if (curr > curr_best) { + curr_best = curr; + curr_best_index = j; + } + } + z[i] = curr_best_index; + } +} +""" + +############################## +############################## + + +class Benchmark6M(Benchmark): + """ + Compute an ensemble of Categorical Naive Bayes and Ridge Regression classifiers. + Predictions are aggregated by averaging the class scores after softmax normalization. + The computation is done on mock data and parameters, but is conceptually identical to a real ML pipeline. + In the DAG below, input arguments that are not involved in the computation of dependencies are omitted; + + RR-1: standard column normalization (partitioned row-wise) + RR-1-1: aggregate mean/std across partitions (partitioned row-wise, but partitions are not independent) + RR-1-2: apply normalization (partitioned row-wise) + RR-2: matrix multiplication (partitioned row-wise) + RR-3: add vector to matrix, row-wise + NB-1: matrix multiplication (partitioned row-wise) + NB-2: row-wise maximum + NB-3: log of sum of exponential, row-wise + NB-4: exponential, element-wise + + ┌─> RR-1(const X,MEAN,STD) ─> RR-1-1(MEAN,STD) -> RR-1-2(X, Z, MEAN, STD) ─> (...) + │ (...) -> RR-2(const Z,R2) ─> RR-3(R2) ─> SOFTMAX(R2) ─────────────────────┐ + ─┤ ├─> ARGMAX(const R1,const R2,R) + └─> NB-1(const X,R1) ─> NB-2(const R1,AMAX) ─> (...) │ + (...) 
-> NB-3(const R1,const AMAX,L) ─> NB-4(R1,const L) ─> SOFTMAX(R1) ──┘ + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b6m", benchmark, nvprof_profile) + self.size = 0 + self.S = 0 + self.x = [None for _ in range(P)] + self.z = [None for _ in range(P)] + self.mean = [None for _ in range(P)] + self.std = [None for _ in range(P)] + self.r1 = None + self.r2 = None + self.r = None + + self.nb_1 = None + self.nb_2 = None + self.nb_3 = None + self.nb_4 = None + self.rr_1 = None + self.rr_1_1 = None + self.rr_1_2 = None + self.rr_2 = None + self.rr_3 = None + self.softmax = None + self.argmax = None + + self.cpu_result = None + + # Internal arrays used by the algorithms; they do not affect the DAG structure; + self.nb_feat_log_prob = None + self.nb_class_log_prior = None + self.ridge_coeff = None + self.ridge_intercept = None + self.nb_amax = None + self.nb_l = None + + self.num_features = 1024 + self.num_classes = 16 + self.max_occurrence_of_ngram = 10 + + self.num_blocks_size = self.num_blocks + self.num_blocks_feat = self.num_blocks + self.block_size = DEFAULT_BLOCK_SIZE_1D + + self.x_cpu = None + self.nb_feat_log_prob_cpu = None + self.ridge_coeff_cpu = None + self.nb_class_log_prior_cpu = None + self.ridge_intercept_cpu = None + self.r1_cpu = None + self.r2_cpu = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.S = (self.size + P - 1) // P + self.block_size = block_size["block_size_1d"] + + # Allocate vectors; + for i in range(P): + self.x[i] = polyglot.eval(language="grcuda", string=f"int[{self.S * self.num_features}]") + self.z[i] = polyglot.eval(language="grcuda", string=f"float[{self.S * self.num_features}]") + self.mean[i] = polyglot.eval(language="grcuda", string=f"float[{self.num_features}]") + self.std[i] = polyglot.eval(language="grcuda", string=f"float[{self.num_features}]") + + self.nb_feat_log_prob = 
polyglot.eval(language="grcuda", string=f"float[{self.num_classes * self.num_features}]") + self.nb_class_log_prior = polyglot.eval(language="grcuda", string=f"float[{self.num_classes}]") + self.ridge_coeff = polyglot.eval(language="grcuda", string=f"float[{self.num_classes * self.num_features}]") + self.ridge_intercept = polyglot.eval(language="grcuda", string=f"float[{self.num_classes}]") + + self.nb_amax = polyglot.eval(language="grcuda", string=f"float[{self.size}]") + self.nb_l = polyglot.eval(language="grcuda", string=f"float[{self.size}]") + + self.r1 = polyglot.eval(language="grcuda", string=f"float[{self.size * self.num_classes}]") + self.r2 = polyglot.eval(language="grcuda", string=f"float[{self.size * self.num_classes}]") + self.r = polyglot.eval(language="grcuda", string=f"int[{self.size}]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.nb_1 = build_kernel(NB_KERNEL, "nb_1", "const pointer, const pointer, const pointer, sint32, sint32, sint32, sint32, sint32") + self.nb_2 = build_kernel(NB_KERNEL, "nb_2", "pointer, pointer, sint32, sint32") + self.nb_3 = build_kernel(NB_KERNEL, "nb_3", "const pointer, const pointer, pointer, sint32, sint32") + self.nb_4 = build_kernel(NB_KERNEL, "nb_4", "pointer, const pointer, sint32, sint32") + + self.rr_1 = build_kernel(RR_KERNEL, "rr_1", "const pointer, pointer, pointer, sint32, sint32, sint32, sint32") + self.rr_1_1 = build_kernel(RR_KERNEL, "rr_1_1", "pointer, pointer, const pointer, const pointer, sint32, sint32, sint32, sint32") + self.rr_1_2 = build_kernel(RR_KERNEL, "rr_1_2", "const pointer, pointer, const pointer, const pointer, sint32, sint32, sint32") + self.rr_2 = build_kernel(RR_KERNEL, "rr_2", "const pointer, const pointer, const pointer, sint32, sint32, sint32, sint32, sint32") + self.rr_3 = build_kernel(RR_KERNEL, "rr_3", "pointer, const pointer, sint32, sint32") + + self.softmax = build_kernel(ENSEMBLE_KERNEL, "softmax", "pointer, sint32, 
sint32") + self.argmax = build_kernel(ENSEMBLE_KERNEL, "argmax", "const pointer, const pointer, pointer, sint32, sint32") + self.initialize_rand = polyglot.eval(language="js", string="(x, m) => { for (let i = 0; i < x.length; i++) { x[i] = Math.floor(Math.random() * m) }}") + + @time_phase("initialization") + def init(self): + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + + self.nb_feat_log_prob_cpu = np.random.random_sample((self.num_classes, self.num_features)).astype(dtype=np.float32) + self.ridge_coeff_cpu = np.random.random_sample((self.num_classes, self.num_features)).astype(dtype=np.float32) + self.nb_class_log_prior_cpu = np.random.random_sample(self.num_classes).astype(dtype=np.float32) + self.ridge_intercept_cpu = np.random.random_sample(self.num_classes).astype(dtype=np.float32) + + for i in range(P): + self.initialize_rand(self.x[i], self.max_occurrence_of_ngram) + self.nb_feat_log_prob.copyFrom(int(np.int64(self.nb_feat_log_prob_cpu.ctypes.data)), len(self.nb_feat_log_prob)) + self.ridge_coeff.copyFrom(int(np.int64(self.ridge_coeff_cpu.ctypes.data)), len(self.ridge_coeff)) + self.nb_class_log_prior.copyFrom(int(np.int64(self.nb_class_log_prior_cpu.ctypes.data)), len(self.nb_class_log_prior)) + self.ridge_intercept.copyFrom(int(np.int64(self.ridge_intercept_cpu.ctypes.data)), len(self.ridge_intercept)) + + @time_phase("reset_result") + def reset_result(self) -> None: + for i in range(self.size): + for j in range(self.num_classes): + self.r1[i * self.num_classes + j] = self.nb_class_log_prior[j] + self.r2[i * self.num_classes + j] = 0 + for i in range(P): + for j in range(self.num_features): + self.mean[i][j] = 0.0 + self.std[i][j] = 0.0 + + def execute(self) -> object: + self.num_blocks_size = self.num_blocks + self.num_blocks_feat = self.num_blocks + self.block_size = self._block_size["block_size_1d"] + # Schedule the categorical Naive Bayes and Ridge Regression kernels + start_comp = System.nanoTime() + start = 0 + + # RR - 1. 
+ for i in range(P): + self.execute_phase(f"rr_1_{i}", self.rr_1(self.num_blocks_feat, self.block_size), + self.x[i], self.mean[i], self.std[i], self.size, self.num_features, i, self.S) + + # RR - 1.1 + for i in range(P): + self.execute_phase(f"rr_1_1_{i}", self.rr_1_1(self.num_blocks_feat, self.block_size), + self.mean[0], self.std[0], self.mean[i], self.std[i], self.size, self.num_features, i, self.S) + + # RR - 1.2 and 2. + for i in range(P): + self.execute_phase(f"rr_1_2_{i}", self.rr_1_2(self.num_blocks_feat, self.block_size), + self.x[i], self.z[i], self.mean[0], self.std[0], self.size, self.num_features, self.S) + self.execute_phase(f"rr_2_{i}", self.rr_2(self.num_blocks_size, self.block_size), + self.z[i], self.ridge_coeff, self.r2, self.size, self.S, self.num_features, self.num_classes, i) + + # RR - 3. + self.execute_phase("rr_3", self.rr_3(self.num_blocks_size, self.block_size), + self.r2, self.ridge_intercept, self.size, self.num_classes) + + # NB - 1. + for i in range(P): + self.execute_phase(f"nb_1_{i}", self.nb_1(self.num_blocks_size, self.block_size), + self.x[i], self.nb_feat_log_prob, self.r1, self.size, self.S, self.num_features, self.num_classes, i) + + # NB - 2. + self.execute_phase("nb_2", self.nb_2(self.num_blocks_size, self.block_size), + self.r1, self.nb_amax, self.size, self.num_classes) + + # NB - 3. + self.execute_phase("nb_3", self.nb_3(self.num_blocks_size, self.block_size), + self.r1, self.nb_amax, self.nb_l, self.size, self.num_classes) + + # NB - 4. 
+ self.execute_phase("nb_4", self.nb_4(self.num_blocks_size, self.block_size), + self.r1, self.nb_l, self.size, self.num_classes) + + # Ensemble results; + + # Softmax normalization; + self.execute_phase("softmax_1", self.softmax(self.num_blocks_size, self.block_size), self.r1, self.size, self.num_classes) + self.execute_phase("softmax_2", self.softmax(self.num_blocks_size, self.block_size), self.r2, self.size, self.num_classes) + + # Prediction; + self.execute_phase("argmax", self.argmax(self.num_blocks_size, self.block_size), self.r1, self.r2, self.r, self.size, self.num_classes) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + tmp = self.r[0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + self.benchmark.add_to_benchmark("gpu_result", 0) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: [" + ", ".join([f"{x:.4f}" for x in self.r[:10]]) + "...]") + + return self.r + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + def softmax(X): + return np.exp(X) / np.sum(np.exp(X), axis=1).reshape(X.shape[0], 1) + + def logsumexp(X): + return np.log(np.sum(np.exp(X))) + + def naive_bayes_predict(X, feature_log_prob, log_class_prior): + jll = X.dot(feature_log_prob.T) + log_class_prior + amax = np.amax(jll, axis=1) + l = logsumexp(jll - np.atleast_2d(amax).T) + amax + + return np.exp(jll - np.atleast_2d(l).T) + + def normalize(X): + return (X - np.mean(X, axis=0)) / np.std(X, axis=0) + + def ridge_pred(X, coef, intercept): + return np.dot(X, coef.T) + intercept + + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + 
seed(self.random_seed) + + x_cpu = np.zeros((self.size, self.num_features), dtype=np.int32) + for i in range(self.size): + p = i // self.S + for j in range(self.num_features): + x_cpu[i, j] = self.x[p][(i % self.S) * self.num_features + j] + + r1_g = naive_bayes_predict(x_cpu, self.nb_feat_log_prob_cpu, self.nb_class_log_prior_cpu) + r2_g = ridge_pred(normalize(x_cpu), self.ridge_coeff_cpu, self.ridge_intercept_cpu) + r_g = np.argmax(softmax(r1_g) + softmax(r2_g), axis=1) + self.cpu_result = r_g + + cpu_time = System.nanoTime() - start + + # Compare GPU and CPU results; + difference = 0 + for i in range(self.size): + difference += np.abs(self.cpu_result[i] - gpu_result[i]) + + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", str(difference)) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: [" + ", ".join([f"{x:.4f}" for x in self.cpu_result[:10]]) + "...]; " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/bench/multi_gpu/bench_9.py b/projects/resources/python/benchmark/bench/multi_gpu/bench_9.py new file mode 100644 index 00000000..47ee811e --- /dev/null +++ b/projects/resources/python/benchmark/bench/multi_gpu/bench_9.py @@ -0,0 +1,316 @@ +# coding=utf-8 +import polyglot +from java.lang import System +import numpy as np +from random import random, randint, seed, sample + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D +from benchmark_result import BenchmarkResult + +############################## +############################## + +BLOCK_SIZE_V100 = 64 # Just a recommendation of optimal block size for the V100; +P = 16 +ITER = 50 + +PRECONDITION_KERNEL = """ +// Add a small epsilon to the main diagonal: +extern "C" __global__ void precondition(float *A, int n, int m, int offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < m; i += blockDim.x * gridDim.x) { + A[i * n + i + offset] += 1e-12; + } +} +""" + 
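Benchmark9M stores the dense N x N matrix as P flat row partitions of S = ceil(N / P) rows each. The sketch below is a toy NumPy illustration of that layout (the names `A_full`, `A_parts` and the sizes are hypothetical, not grcuda code); the final assertion uses the same index math that the benchmark's CPU validation applies to reassemble the matrix:

```python
import numpy as np

# Toy sizes, standing in for self.size and P of the benchmark;
N, P_PARTS = 8, 4
S = (N + P_PARTS - 1) // P_PARTS  # rows per partition, as in alloc()

# A full matrix, split into P flat row blocks of length S * N each,
# mirroring the grcuda "float[size * S]" device arrays;
A_full = np.arange(N * N, dtype=np.float32).reshape(N, N)
A_parts = [A_full[p * S:(p + 1) * S].ravel() for p in range(P_PARTS)]

# Element (i, j) lives in partition i // S at flat offset (i % S) * N + j,
# which is exactly the lookup used in cpu_validation;
i, j = 5, 3
assert A_parts[i // S][(i % S) * N + j] == A_full[i, j]
```

Each kernel invocation then receives one such flat partition plus an `offset` (here `i * S`) so it knows which global rows it owns.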
+MMUL_KERNEL = """ +// z = x @ y; +extern "C" __global__ void matrix_vector_mult(const float* x, const float* y, float* z, int n, int m, int z_offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < m; j++) { + sum += x[i * m + j] * y[j]; + } + z[z_offset + i] = sum; + } +} + +// z = w + alpha * x @ y; +extern "C" __global__ void matrix_vector_mult_axpy(const float* x, const float* y, const float *w, const float alpha, float* z, int n, int m, int z_offset) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < m; j++) { + sum += x[i * m + j] * y[j]; + } + z[z_offset + i] = alpha * sum + w[z_offset + i]; + } +} +""" + +DP_KERNEL = """ +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +// z = <x, x>; +extern "C" __global__ void l2_norm(const float *x, float* z, int N) { + int warp_size = 32; + float sum = float(0); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + float x_tmp = x[i]; + sum += x_tmp * x_tmp; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +// z = <x, y>; +extern "C" __global__ void dot(const float *x, const float *y, float* z, int N) { + int warp_size = 32; + float sum = float(0); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] * y[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread 
in the warp updates the output; +} +""" + +SAXPY_KERNEL = """ +// y = val + alpha * x; +extern "C" __global__ void saxpy(float* y, const float *val, const float *x, float alpha, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = val[i] + alpha * x[i]; + } +} + +// Simply copy array x into y; +extern "C" __global__ void cpy(float *y, const float *x, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = x[i]; + } +} +""" + +############################## +############################## + + +class Benchmark9M(Benchmark): + """ + Compute the conjugate gradient algorithm on a dense symmetric matrix. + The matrix-vector multiplications are row-partitioned to scale across multiple GPUs; + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b9m", benchmark, nvprof_profile) + self.size = 0 + self.S = 0 + self.A = [None for _ in range(P)] + self.x = None + self.b = None + self.p = None + self.r = None + self.y = None + self.t1 = None + self.t2 = None + + self.num_blocks_size = BLOCK_SIZE_V100 + self.block_size = None + + self.mmul_axpy_kernel = None + self.mmul_kernel = None + self.l2_norm_kernel = None + self.dp_kernel = None + self.saxpy_kernel = None + self.copy_kernel = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.S = (self.size + P - 1) // P + self.block_size = self._block_size["block_size_1d"] + + self.random_seed = 12 + seed(self.random_seed) + + # Allocate vectors; + for i in range(P): + self.A[i] = polyglot.eval(language="grcuda", string=f"float[{self.size * self.S}]") + self.x = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.b = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.p = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.r = polyglot.eval(language="grcuda", 
string=f"float[{size}]") + self.y = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.t1 = polyglot.eval(language="grcuda", string=f"float[1]") + self.t2 = polyglot.eval(language="grcuda", string=f"float[1]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.precondition_kernel = build_kernel(PRECONDITION_KERNEL, "precondition", "pointer, sint32, sint32, sint32") + self.mmul_kernel = build_kernel(MMUL_KERNEL, "matrix_vector_mult", "const pointer, const pointer, const pointer, sint32, sint32, sint32") + self.mmul_axpy_kernel = build_kernel(MMUL_KERNEL, "matrix_vector_mult_axpy", "const pointer, const pointer, const pointer, float, const pointer, sint32, sint32, sint32") + self.l2_norm_kernel = build_kernel(DP_KERNEL, "l2_norm", "const pointer, pointer, sint32") + self.dp_kernel = build_kernel(DP_KERNEL, "dot", "const pointer, pointer, pointer, sint32") + self.saxpy_kernel = build_kernel(SAXPY_KERNEL, "saxpy", "pointer, const pointer, const pointer, float, sint32") + self.cpy_kernel = build_kernel(SAXPY_KERNEL, "cpy", "pointer, pointer, sint32") + self.initialize_random_symmetric_matrix = polyglot.eval(language="js", string="""(X, S, N) => { + for (let i = 0; i < N; i++) { + s = (i / S) >> 0; + k = i % S; + Xs = X[s]; + i_N = k * N; + for (let j = i; j < N; j++) { + val = 2 * Math.random() - 1; + Xs[i_N + j] = val; + X[(j / S) >> 0][(j % S) * N + i] = val; + } + }} + """) + + @time_phase("initialization") + def init(self): + self.initialize_random_symmetric_matrix(self.A, self.S, self.size) + + @time_phase("reset_result") + def reset_result(self) -> None: + seed(self.random_seed) + # Random initial solution; + for i in range(self.size): + self.x[i] = 1.0 / self.size + self.t1[0] = 0.0 + self.t2[0] = 0.0 + + def execute(self) -> object: + + start_comp = System.nanoTime() + start = 0 + + # Initialization phase; + # precondition: A += I * np.eps; + for i in range(P): + 
self.execute_phase(f"precondition_{i}", self.precondition_kernel(self.num_blocks, self.block_size), + self.A[i], self.size, min(self.S, self.size - i * self.S), i * self.S) + # r = b - A * x + for i in range(P): + self.execute_phase(f"mmul_init_{i}", self.mmul_axpy_kernel(self.num_blocks, self.block_size), + self.A[i], self.x, self.b, -1, self.r, self.S, self.size, i * self.S) + # p = r + self.execute_phase("cpy_init", self.cpy_kernel(self.num_blocks, self.block_size), + self.p, self.r, self.size) + # t1 = r^t * r + self.execute_phase("norm_init", self.l2_norm_kernel(self.num_blocks, self.block_size), + self.r, self.t1, self.size) + for curr_iter in range(ITER): + # t2 = p^t * A * p + for i in range(P): + self.execute_phase(f"mmul_{i}_{curr_iter}", self.mmul_kernel(self.num_blocks, self.block_size), + self.A[i], self.p, self.y, self.S, self.size, i * self.S) + self.execute_phase(f"dp_{curr_iter}", self.dp_kernel(self.num_blocks, self.block_size), + self.p, self.y, self.t2, self.size) + + if self.time_phases: + start = System.nanoTime() + alpha = self.t1[0] / self.t2[0] + old_r_norm_squared = self.t1[0] + self.t1[0] = 0 + self.t2[0] = 0 + if self.time_phases: + end = System.nanoTime() + self.benchmark.add_phase({"name": f"alpha_{curr_iter}", "time_sec": (end - start) / 1_000_000_000}) + + # Update x: x = x + alpha * p + self.execute_phase(f"saxpy_x_{curr_iter}", self.saxpy_kernel(self.num_blocks, self.block_size), + self.x, self.x, self.p, alpha, self.size) + # r = r - alpha * y + self.execute_phase(f"saxpy_r_{curr_iter}", self.saxpy_kernel(self.num_blocks, self.block_size), + self.r, self.r, self.y, -1 * alpha, self.size) + # t1 = r^t * r + self.execute_phase(f"norm_{curr_iter}", self.l2_norm_kernel(self.num_blocks, self.block_size), + self.r, self.t1, self.size) + + if self.time_phases: + start = System.nanoTime() + beta = self.t1[0] / old_r_norm_squared + if self.time_phases: + end = System.nanoTime() + self.benchmark.add_phase({"name": f"beta_{curr_iter}", 
"time_sec": (end - start) / 1_000_000_000}) + + self.execute_phase(f"saxpy_p_{curr_iter}", self.saxpy_kernel(self.num_blocks, self.block_size), + self.p, self.r, self.p, beta, self.size) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + tmp = self.x[0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + # Compute GPU result; + for i in range(P): + self.mmul_axpy_kernel(self.num_blocks, self.block_size)(self.A[i], self.x, self.b, -1, self.y, min(self.S, self.size - i * self.S), self.size, i * self.S) + + self.gpu_result = sum(self.y[:10]) + self.benchmark.add_to_benchmark("gpu_result", 0) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: [" + ", ".join([f"{x:.4f}" for x in self.y[:10]]) + f"...] = {self.gpu_result:.4f}") + + return self.gpu_result + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + # Recompute the CPU result only if necessary; + start = System.nanoTime() + x_cpu = np.zeros(self.size) + A_cpu = np.zeros((self.size, self.size)) + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + seed(self.random_seed) + # Initialize the support device arrays; + N = self.size + + for i in range(N): + p = i // self.S + for j in range(N): + A_cpu[i, j] = self.A[p][(i % self.S) * N + j] + + b = np.random.random(N) + x = np.ones(N) + r = b - A_cpu @ x + p = r.copy() + t1 = r.T.dot(r) + + # Main iteration; + for i in range(ITER): + y = A_cpu @ p + t2 = p.dot(y) + alpha = t1 / t2 + t1_old = t1 + x += alpha * p + r -= alpha * y + t1 = r.T.dot(r) + beta = t1 / t1_old + p = r + beta * p + + self.cpu_result = x + + cpu_time = System.nanoTime() - start + + # Compare GPU and CPU results; + difference = 
np.sum(np.abs(self.cpu_result - gpu_result)) + + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", str(difference)) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: [" + ", ".join([f"{x:.4f}" for x in self.cpu_result[:10]]) + "...]; " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + + + diff --git a/projects/resources/python/benchmark/bench/single_gpu/__init__.py b/projects/resources/python/benchmark/bench/single_gpu/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_1.py b/projects/resources/python/benchmark/bench/single_gpu/bench_1.py new file mode 100644 index 00000000..fef2b98b --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_1.py @@ -0,0 +1,216 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+ +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# coding=utf-8 +import polyglot +from java.lang import System +import numpy as np +from random import random, randint, seed + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D, DEFAULT_NUM_BLOCKS +from benchmark_result import BenchmarkResult + +############################## +############################## + +SQUARE_KERNEL = """ +extern "C" __global__ void square(float* x, float* y, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = x[i] * x[i]; + } +} +""" + +DIFF_KERNEL = """ +extern "C" __global__ void diff(const float* x, const float* y, float* z, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + z[i] = x[i] - y[i]; + } +} +""" + +REDUCE_KERNEL = """ + +// From https://devblogs.nvidia.com/faster-parallel-reductions-kepler/ + +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +__global__ void reduce(float *x, float *y, float* z, int N) { + int warp_size = 32; + float sum = float(0); + for(int i = blockIdx.x * blockDim.x 
+ threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] - y[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} +""" + +############################## +############################## + + +class Benchmark1(Benchmark): + """ + Compute the sum of difference of squares of 2 vectors, using multiple GrCUDA kernels. + It's a fairly artificial benchmark that measures a simple case of parallelism. + Most of the execution time is spent in the reduction computation, limiting the amount of parallelism available, + especially on large input data. + Speedups are achievable by overlapping data-transfer and computations, + although the data-transfer takes about 4x-5x longer than the square computation, limiting the maximum achievable speedup. + + Structure of the computation: + + A: x^2 ──┐ + ├─> C: z=sum(x-y) + B: x^2 ──┘ + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b1", benchmark, nvprof_profile) + self.size = 0 + self.x = None + self.y = None + self.x1 = None + self.y1 = None + self.z = None + self.res = None + self.square_kernel = None + self.diff_kernel = None + self.reduce_kernel = None + self.cpu_result = 0 + + # self.num_blocks = DEFAULT_NUM_BLOCKS + self.block_size = DEFAULT_BLOCK_SIZE_1D + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.block_size = block_size["block_size_1d"] + + # Allocate 2 vectors; + self.x = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.y = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.x1 = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.y1 = polyglot.eval(language="grcuda", string=f"float[{size}]") + + # Allocate a support vector; + self.res = 
polyglot.eval(language="grcuda", string=f"float[1]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.square_kernel = build_kernel(SQUARE_KERNEL, "square", "pointer, pointer, sint32") + self.reduce_kernel = build_kernel(REDUCE_KERNEL, "reduce", "const pointer, const pointer, pointer, sint32") + + @time_phase("initialization") + def init(self): + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + for i in range(self.size): + if self.benchmark.random_init: + self.x[i] = random() + self.y[i] = 2 * random() + else: + self.x[i] = 1 / (i + 1) + self.y[i] = 2 / (i + 1) + + @time_phase("reset_result") + def reset_result(self) -> None: + if self.benchmark.random_init: + seed(self.random_seed) + for i in range(self.size): + self.x[i] = random() + self.y[i] = 2 * random() + else: + for i in range(self.size): + self.x[i] = 1 / (i + 1) + self.y[i] = 2 / (i + 1) + self.res[0] = 0.0 + + def execute(self) -> object: + self.block_size = self._block_size["block_size_1d"] + start_comp = System.nanoTime() + start = 0 + + # A, B. Call the kernel. The 2 computations are independent, and can be done in parallel; + self.execute_phase("square_1", self.square_kernel(self.num_blocks, self.block_size), self.x, self.x1, self.size) + self.execute_phase("square_2", self.square_kernel(self.num_blocks, self.block_size), self.y, self.y1, self.size) + + # C. 
Compute the sum of the result; + self.execute_phase("reduce", self.reduce_kernel(self.num_blocks, self.block_size), self.x1, self.y1, self.res, self.size) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + result = self.res[0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + self.benchmark.add_to_benchmark("gpu_result", result) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: {result:.4f}") + + return result + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + seed(self.random_seed) + if self.benchmark.random_init: + x_g = np.zeros(self.size) + y_g = np.zeros(self.size) + for i in range(self.size): + x_g[i] = random() + y_g[i] = 2 * random() + else: + x_g = 1 / np.linspace(1, self.size, self.size) + y_g = 2 / np.linspace(1, self.size, self.size) + + x_g = x_g ** 2 + y_g = y_g ** 2 + x_g -= y_g + self.cpu_result = np.sum(x_g) + cpu_time = System.nanoTime() - start + difference = np.abs(self.cpu_result - gpu_result) + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", difference) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: {self.cpu_result:.4f}, " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_10.py b/projects/resources/python/benchmark/bench/single_gpu/bench_10.py new file mode 100644 index 00000000..6b62d670 --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_10.py 
@@ -0,0 +1,476 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
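Benchmark1 above implements `sum(x^2 - y^2)` as three GrCUDA kernels (the two `square` launches A and B, plus `reduce`). A plain-NumPy sketch of the same end-to-end computation is a handy sanity check; `benchmark1_reference` is an illustrative helper name, not part of the suite:

```python
import numpy as np

def benchmark1_reference(x, y):
    # A, B: square both inputs element-wise (the two `square` kernel launches);
    # C: reduce the element-wise difference of the squared vectors, mirroring
    # `sum += x[i] - y[i]` in REDUCE_KERNEL, applied to x^2 and y^2.
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.sum(x ** 2 - y ** 2))
```

With the deterministic initialization used when `random_init` is off (`x[i] = 1 / (i + 1)`, `y[i] = 2 / (i + 1)`), every term equals `-3 / (i + 1)^2`, so the expected result is always negative.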
+ +# coding=utf-8 +import polyglot +from java.lang import System +import numpy as np +from random import random, randint, seed, sample, uniform + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D, DEFAULT_BLOCK_SIZE_2D, DEFAULT_NUM_BLOCKS +from benchmark_result import BenchmarkResult + +############################## +############################## + +NUM_THREADS_PER_BLOCK_2D = 8 +NUM_THREADS_PER_BLOCK = 32 +WARP_SIZE = 32 + +CONV2D = """ +extern "C" __global__ void conv2d(float *out, float *x, float *kernels, int N, int M, int L, int K, int k_out, int stride) { + extern __shared__ float kernel_local[]; + int radius = K / 2; + + for (int m = 0; m < k_out; m++) { + for (int i = threadIdx.x; i < K; i += blockDim.x) { + for (int j = threadIdx.y; j < K; j += blockDim.y) { + for (int l = 0; l < L; l++) { + kernel_local[l + L * (j + K * (i + K * m))] = kernels[l + L * (j + K * (i + K * m))]; + } + } + } + } + __syncthreads(); + + + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (int) ceilf((float) N / stride) - radius; i += blockDim.x * gridDim.x) { + int out_index = M * i / stride; + for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < (int) ceilf((float) M / stride) - radius; j += blockDim.y * gridDim.y) { + for (int m = 0; m < k_out; m++) { + // for (int m = blockIdx.z * blockDim.z + threadIdx.z; m < k_out; m += blockDim.z * gridDim.z) { + float res = 0; + int i_f = i * stride + radius; + int j_f = j * stride + radius; + for (int k_i = -radius; k_i <= radius; k_i++) { + for (int k_j = -radius; k_j <= radius; k_j++) { + int kernel_index = (k_j + radius + K * (k_i + radius + K * m)); + for (int l = 0; l < L; l++) { + int ni = i_f + k_i; + int nj = j_f + k_j; + res += kernel_local[l + L * kernel_index] * x[((ni * M) + nj) * L + l]; + } + } + } + // Apply ReLU operator; + out[m + k_out * (j + out_index)] = max(res, 0.0); + } + } + } +} +""" + +POOLING = """ +extern "C" __global__ void mean_pooling(float *out, float *x, int N, int M, int L, 
int K, int stride) { + int radius = K / 2; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (int) ceilf((float) N / stride) - radius; i += blockDim.x * gridDim.x) { + int out_index = M * i / stride; + int i_f = i * stride + radius; + for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < (int) ceilf((float) M / stride) - radius; j += blockDim.y * gridDim.y) { + int j_f = j * stride + radius; + for (int l = blockIdx.z * blockDim.z + threadIdx.z; l < L; l += blockDim.z * gridDim.z) { + float res = 0; + for (int k_i = -radius; k_i <= radius; k_i++) { + int ni = i_f + k_i; + for (int k_j = -radius; k_j <= radius; k_j++) { + int nj = j_f + k_j; + res += x[((ni * M) + nj) * L + l]; + } + } + // Apply mean operator; + out[l + L * (j + out_index)] = res / (K * K); + } + } + } +} +""" + +GAP = """ +extern "C" __global__ void gap(float *out, float *x, int N, int M, int L) { + extern __shared__ float out_local[]; + for(int i = threadIdx.x; i < L; i += blockDim.x) { + out_local[i] = 0; + } + __syncthreads(); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < M; j += blockDim.y * gridDim.y) { + for (int l = 0; l < L; l++) { + atomicAdd(out_local + l, x[l + L * (j + M * i)]); + } + } + } + __syncthreads(); + for(int l = threadIdx.x; l < L; l += blockDim.x) { + atomicAdd(out + l, out_local[l] / (M * N)); + } +} +""" + +DOT_PRODUCT = """ +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +extern "C" __global__ void dot_product(const float *x, const float *y, float* z, int N) { + int warp_size = 32; + float sum = float(0); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] * y[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if 
((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} +""" + +CONCAT = """ +extern "C" __global__ void concat(float *z, const float *x, const float *y, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + z[i] = x[i]; + z[i + n] = y[i]; + } +} +""" + + +def sigmoid(x): + return 1 / (1 + np.exp(-x)) + +############################## +############################## + + +class Benchmark10(Benchmark): + """ + Compute a convolutional neural network that takes 2 images as inputs, computes their low-dimensional embeddings, + concatenates them and applies a dense classifier. It can represent, for example, a network that detects whether 2 images contain the same object; + + CONV(x) ─> POOL(x1) ─> CONV(x11) ──┬─> CONCAT(x2, y2) ─> DENSE(z) + CONV(y) ─> POOL(y1) ─> CONV(y11) ──┘ + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b10", benchmark, nvprof_profile) + self.size = 0 + + self.x = None + self.y = None + self.x_cpu = None + self.y_cpu = None + + self.kernel_1 = None + self.kernel_2 = None + self.kernel_3 = None + self.kernel_4 = None + self.channels = 1 + self.K = 3 + self.kn1 = 8 + self.kn2 = 16 + self.stride = 2 + self.pooling = 5 + + self.x1 = None + self.x2 = None + self.x11 = None + self.y11 = None + self.x3 = None + self.y1 = None + self.y2 = None + self.y3 = None + self.z = None + self.res = None + self.dense_weights = None + + self.cpu_result = None + self.gpu_result = None + + self.num_blocks_per_processor = self.num_blocks + + self.block_size_1d = DEFAULT_BLOCK_SIZE_1D + self.block_size_2d = DEFAULT_BLOCK_SIZE_2D + + self.conv2d_kernel = None + self.gap_kernel = None + self.concat_kernel = None + self.dp_kernel = None + self.pooling_kernel = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = 
size + self.block_size_1d = block_size["block_size_1d"] + self.block_size_2d = block_size["block_size_2d"] + + self.gpu_result = 0.0 + + # Allocate vectors; + self.x = polyglot.eval(language="grcuda", string=f"float[{size * size * self.channels}]") + self.x1 = polyglot.eval(language="grcuda", string=f"float[{(size // self.stride) * (size // self.stride) * self.kn1}]") + self.x11 = polyglot.eval(language="grcuda", string=f"float[{(size // self.stride // self.pooling) * (size // self.stride // self.pooling) * self.kn1}]") + self.x2 = polyglot.eval(language="grcuda", string=f"float[{(size // self.stride // self.pooling // self.stride) * (size // self.stride // self.pooling // self.stride) * self.kn2}]") + self.x3 = polyglot.eval(language="grcuda", string=f"float[{self.kn2}]") + self.y = polyglot.eval(language="grcuda", string=f"float[{size * size * self.channels}]") + self.y1 = polyglot.eval(language="grcuda", string=f"float[{(size // self.stride) * (size // self.stride) * self.kn1}]") + self.y11 = polyglot.eval(language="grcuda", string=f"float[{(size // self.stride // self.pooling) * (size // self.stride // self.pooling) * self.kn1}]") + self.y2 = polyglot.eval(language="grcuda", string=f"float[{(size // self.stride // self.pooling // self.stride) * (size // self.stride // self.pooling // self.stride) * self.kn2}]") + self.y3 = polyglot.eval(language="grcuda", string=f"float[{self.kn2}]") + self.kernel_1 = polyglot.eval(language="grcuda", string=f"float[{self.kn1 * self.K * self.K * self.channels}]") + self.kernel_2 = polyglot.eval(language="grcuda", string=f"float[{self.kn1 * self.K * self.K * self.kn2}]") + self.kernel_3 = polyglot.eval(language="grcuda", string=f"float[{self.kn1 * self.K * self.K * self.channels}]") + self.kernel_4 = polyglot.eval(language="grcuda", string=f"float[{self.kn1 * self.K * self.K * self.kn2}]") + self.z = polyglot.eval(language="grcuda", string=f"float[{len(self.y2) * 2}]") + self.dense_weights = polyglot.eval(language="grcuda", 
string=f"float[{len(self.z)}]") + self.res = polyglot.eval(language="grcuda", string=f"float[1]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.conv2d_kernel = build_kernel(CONV2D, "conv2d", "pointer, pointer, const pointer, sint32, sint32, sint32, sint32, sint32, sint32") + self.pooling_kernel = build_kernel(POOLING, "mean_pooling", "pointer, const pointer, sint32, sint32, sint32, sint32, sint32") + self.gap_kernel = build_kernel(GAP, "gap", "pointer, pointer, sint32, sint32, sint32") + self.concat_kernel = build_kernel(CONCAT, "concat", "pointer, const pointer, const pointer, sint32") + self.dp_kernel = build_kernel(DOT_PRODUCT, "dot_product", "const pointer, const pointer, pointer, sint32") + + @time_phase("initialization") + def init(self): + + self.random_seed = 10 # randint(0, 10000000) + seed(self.random_seed) + + # Random weights; + for i in range(len(self.kernel_1)): + self.kernel_1[i] = uniform(-1, 1) + self.kernel_3[i] = uniform(-1, 1) + for i in range(len(self.kernel_2)): + self.kernel_2[i] = uniform(-1, 1) + self.kernel_4[i] = uniform(-1, 1) + + for i in range(len(self.dense_weights)): + self.dense_weights[i] = uniform(-1, 1) / len(self.dense_weights) + + # Create random images. 
Leave it for last so that we can re-create identical random weights from the same seed; + self.x_cpu = [0] * len(self.x) + self.y_cpu = [0] * len(self.y) + for i in range(len(self.x_cpu)): + self.x_cpu[i] = random() + self.y_cpu[i] = random() + + @time_phase("reset_result") + def reset_result(self) -> None: + self.gpu_result = 0.0 + self.res[0] = 0.0 + for i in range(len(self.x_cpu)): + self.x[i] = self.x_cpu[i] + self.y[i] = self.y_cpu[i] + + def execute(self) -> object: + self.num_blocks_per_processor = self.num_blocks + self.block_size_1d = self._block_size["block_size_1d"] + self.block_size_2d = self._block_size["block_size_2d"] + start_comp = System.nanoTime() + start = 0 + + a = self.num_blocks_per_processor / 2 + # Convolutions; + self.execute_phase("conv_x1", + self.conv2d_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * (self.K ** 2) * self.kn1 * self.channels), + self.x1, self.x, self.kernel_1, self.size, self.size, self.channels, self.K, self.kn1, self.stride) + self.execute_phase("conv_y1", + self.conv2d_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * (self.K ** 2) * self.kn1 * self.channels), + self.y1, self.y, self.kernel_3, self.size, self.size, self.channels, self.K, self.kn1, self.stride) + # Pooling; + self.execute_phase("pool_x1", + self.pooling_kernel((a / 2, a / 2, a / 2), (self.block_size_2d / 2, self.block_size_2d / 2, self.block_size_2d / 2)), + self.x11, self.x1, self.size // self.stride, self.size // self.stride, self.kn1, self.pooling, self.pooling) + self.execute_phase("pool_y1", + self.pooling_kernel((a / 2, a / 2, a / 2), (self.block_size_2d / 2, self.block_size_2d / 2, self.block_size_2d / 2)), + self.y11, self.y1, self.size // self.stride, self.size // self.stride, self.kn1, self.pooling, self.pooling) + # Other convolutions; + self.execute_phase("conv_x2", + self.conv2d_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * (self.K ** 2) * self.kn1 * self.kn2), + self.x2, self.x11, self.kernel_2, 
self.size // self.stride // self.pooling, self.size // self.stride // self.pooling, self.kn1, self.K, self.kn2, self.stride) + self.execute_phase("conv_y2", + self.conv2d_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * (self.K ** 2) * self.kn1 * self.kn2), + self.y2, self.y11, self.kernel_4, self.size // self.stride // self.pooling, self.size // self.stride // self.pooling, self.kn1, self.K, self.kn2, self.stride) + + # Global average pooling; + # self.execute_phase("gap_x", + # self.gap_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * self.kn2), + # self.x3, self.x2, self.size // self.stride**2, self.size // self.stride**2, self.kn2) + # self.execute_phase("gap_y", + # self.gap_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * self.kn2), + # self.y3, self.y2, self.size // self.stride ** 2, self.size // self.stride ** 2, self.kn2) + + # Dense layer; + self.execute_phase("concat", + self.concat_kernel(self.num_blocks_per_processor, self.block_size_1d), + self.z, self.x2, self.y2, len(self.x2)) + self.execute_phase("dot_product", + self.dp_kernel(self.num_blocks_per_processor, self.block_size_1d), + self.z, self.dense_weights, self.res, len(self.z)) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + # self.gpu_result = sigmoid(self.res[0]) + self.gpu_result = self.res[0] + # self.gpu_result = [self.x1[i] for i in range(100)] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + + self.benchmark.add_to_benchmark("gpu_result", self.gpu_result) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: {self.gpu_result:.4f}") + # BenchmarkResult.log_message( + # f"\tgpu result: [" + ", ".join([f"{x:.2f}" for x in self.gpu_result[:100]]) + "...]") + + return self.gpu_result + + def 
cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + def relu(x): + return np.maximum(x, 0) + + def conv3d2(x, kernels, shape, K, k_out, stride=1, operator=relu): + N, M, L = shape + out = np.zeros((N // stride) * (M // stride) * k_out) + radius = K // 2 + + for m in range(k_out): + for i in range(0, int(np.ceil(N / stride)) - radius): + for j in range(0, int(np.ceil(M / stride)) - radius): + res = 0 + i_f = i * stride + radius + j_f = j * stride + radius + for k_i in range(-radius, radius + 1): + for k_j in range(-radius, radius + 1): + for l in range(L): + ni = i_f + k_i + nj = j_f + k_j + res += kernels[l + L * (k_j + radius + K * (k_i + radius + K * m))] * x[((ni * M) + nj) * L + l] + out[m + k_out * (j + M * i // stride)] = operator(res) + return out + + def pooling(x, shape, K, stride): + N, M, L = shape + out = np.zeros((N // stride) * (M // stride) * L) # Flat output buffer, indexed with the same layout as conv3d2; + radius = K // 2 + for i in range(0, int(np.ceil(N / stride)) - radius): + for j in range(0, int(np.ceil(M / stride)) - radius): + for l in range(L): + res = 0 + i_f = i * stride + radius + j_f = j * stride + radius + for k_i in range(-radius, radius + 1): + for k_j in range(-radius, radius + 1): + ni = i_f + k_i + nj = j_f + k_j + res += x[((ni * M) + nj) * L + l] + out[l + L * (j + M * i // stride)] = res / K**2 + return out + + def gap2(x, shape): + N, M, L = shape + out = np.zeros(L) + for n in range(N): + for m in range(M): + for i in range(L): + out[i] += x[i + L * (m + M * n)] / (N * M) + return out + + def concat(x, y): + # x and y have the same length; + out = np.zeros(2 * len(x)) + for i in range(len(x)): + out[i] = x[i] + out[i + len(x)] = y[i] + return out + + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + + # Initialize weights; + N = self.size + kernel_1 = np.zeros(len(self.kernel_1)) + kernel_2 = np.zeros(len(self.kernel_2)) + kernel_3 = np.zeros(len(self.kernel_3)) + kernel_4 = np.zeros(len(self.kernel_4)) + 
dense_weights = np.zeros(len(self.dense_weights)) + # Random weights; + for i in range(len(self.kernel_1)): + kernel_1[i] = self.kernel_1[i] + kernel_3[i] = self.kernel_3[i] + for i in range(len(self.kernel_2)): + kernel_2[i] = self.kernel_2[i] + kernel_4[i] = self.kernel_4[i] + + for i in range(len(self.dense_weights)): + dense_weights[i] = self.dense_weights[i] + + # First convolution (N,N,1) -> (N/stride,N/stride,kn1) + x_1 = conv3d2(np.array(self.x_cpu), kernel_1, (N, N, self.channels), self.K, self.kn1, stride=self.stride) + x_11 = pooling(x_1, (N // self.stride, N // self.stride, self.kn1), self.pooling, self.pooling) + # Second convolution (N/stride,N/stride,kn1) -> (N/stride^2,N/stride^2,kn2) + x_2 = conv3d2(x_11, kernel_2, (N // self.stride // self.pooling, N // self.stride // self.pooling, self.kn1), self.K, self.kn2, stride=self.stride) + + # First convolution (N,N,1) -> (N/stride,N/stride,kn1) + y_1 = conv3d2(np.array(self.y_cpu), kernel_3, (N, N, self.channels), self.K, self.kn1, stride=self.stride) + y_11 = pooling(y_1, (N // self.stride, N // self.stride, self.kn1), self.pooling, self.pooling) + # Second convolution (N/stride,N/stride,kn1) -> (N/stride^2,N/stride^2,kn2) + y_2 = conv3d2(y_11, kernel_4, (N // self.stride // self.pooling, N // self.stride // self.pooling, self.kn1), self.K, self.kn2, stride=self.stride) + + # Global average pooling 2D; + # x_3 = gap2(x_2, (N // (self.stride * self.stride), N // (self.stride * self.stride), self.kn2)) + # y_3 = gap2(y_2, (N // (self.stride * self.stride), N // (self.stride * self.stride), self.kn2)) + + # Concatenate; + out = concat(x_2, y_2) + + # Final dense layer; + self.cpu_result = out.dot(dense_weights[:len(out)]) + # self.cpu_result = x_1[:100] + + cpu_time = (System.nanoTime() - start) / 1_000_000_000 + + # Compare GPU and CPU results; + difference = np.abs(self.cpu_result - gpu_result) + + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + 
self.benchmark.add_to_benchmark("cpu_gpu_res_difference", str(difference)) + if self.benchmark.debug: + # BenchmarkResult.log_message( + # f"\tcpu result: [" + ", ".join([f"{x:.2f}" for x in self.cpu_result[:100]]) + "...]"+ + # f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + BenchmarkResult.log_message(f"\tcpu result: {self.cpu_result:.4f}; " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_2.py b/projects/resources/python/benchmark/bench/single_gpu/bench_2.py new file mode 100644 index 00000000..3892849b --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_2.py @@ -0,0 +1,243 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# coding=utf-8 +import polyglot +import time +import numpy as np +from random import random, randint, seed + +from benchmark import Benchmark +from benchmark_result import BenchmarkResult + +############################## +############################## + +NUM_THREADS_PER_BLOCK = 128 + +SQUARE_KERNEL = """ + extern "C" __global__ void square(float* x, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + x[idx] = x[idx] * x[idx]; + } + } + """ + +DIFF_KERNEL = """ + extern "C" __global__ void diff(float* x, float* y, float* z, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + z[idx] = x[idx] - y[idx]; + } + } + """ + +ADDTWO_KERNEL = """ + extern "C" __global__ void addtwo(float* a, float* b, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + b[idx] = a[idx] + 2.0; + } + } + """ + +REDUCE_KERNEL = """ + extern "C" __global__ void reduce(float *x, float *y, float *res, int n) { + __shared__ float cache[%d]; + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) { + cache[threadIdx.x] = x[i] + y[i]; + } + __syncthreads(); + + // Perform tree reduction; + i = %d / 2; + while (i > 0) { + if (threadIdx.x < i) { + cache[threadIdx.x] += cache[threadIdx.x + i]; + } + __syncthreads(); + i /= 2; + } + if (threadIdx.x == 0) { + atomicAdd(res, cache[0]); + } + } + """ % (NUM_THREADS_PER_BLOCK, NUM_THREADS_PER_BLOCK) + +############################## 
+############################## + + +class Benchmark2(Benchmark): + """ + Compute a complex graph of interconnected computations using GrCUDA. + Structure of the computation: + A: x^2 ──┐ + ├─> C: z=x-y ───┐ + B: x^2 ──┘ │ + ├-> F: sum(z+b) + │ + D: a^2 ────> E: b=a+2 ──┘ + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b2", benchmark, nvprof_profile) + self.size = 0 + self.x = None + self.y = None + self.z = None + self.res = None + self.a = None + self.b = None + self.num_blocks = 0 + self.square_kernel = None + self.diff_kernel = None + self.addtwo_kernel = None + self.reduce_kernel = None + self.cpu_result = 0 + + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.num_blocks = (size + NUM_THREADS_PER_BLOCK - 1) // NUM_THREADS_PER_BLOCK + + # Allocate 2 vectors; + start = time.time() + self.x = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.y = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.a = polyglot.eval(language="grcuda", string=f"float[{size}]") + + # Allocate support vectors; + self.z = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.b = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.res = polyglot.eval(language="grcuda", string=f"float[1]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.square_kernel = build_kernel(SQUARE_KERNEL, "square", "pointer, sint32") + self.diff_kernel = build_kernel(DIFF_KERNEL, "diff", "pointer, pointer, pointer, sint32") + self.addtwo_kernel = build_kernel(ADDTWO_KERNEL, "addtwo", "pointer, pointer, sint32") + self.reduce_kernel = build_kernel(REDUCE_KERNEL, "reduce", "pointer, pointer, pointer, sint32") + + end = time.time() + self.benchmark.add_phase({"name": "allocation", "time_sec": end - start}) + + def init(self): + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + start = time.time() 
+ for i in range(self.size): + if self.benchmark.random_init: + self.x[i] = random() + self.y[i] = 2 * random() + self.a[i] = 4 * random() + else: + self.x[i] = 1 / (i + 1) + self.y[i] = 2 / (i + 1) + self.a[i] = 4 / (i + 1) + end = time.time() + self.benchmark.add_phase({"name": "initialization", "time_sec": end - start}) + + def execute(self) -> object: + # This must be reset at every execution; + self.res[0] = 0 + + # Call the kernel. The 2 computations are independent, and can be done in parallel; + start = time.time() + self.square_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.x, self.size) + self.square_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.y, self.size) + end = time.time() + self.benchmark.add_phase({"name": "square", "time_sec": end - start}) + + # C. Compute the difference of the 2 vectors. This must be done after the 2 previous computations; + start = time.time() + self.diff_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.x, self.y, self.z, self.size) + end = time.time() + self.benchmark.add_phase({"name": "diff", "time_sec": end - start}) + + # D. Compute the other branch of the computation; + start = time.time() + self.square_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.a, self.size) + end = time.time() + self.benchmark.add_phase({"name": "square_other_branch", "time_sec": end - start}) + + # E. Continue computing the other branch; + start = time.time() + self.addtwo_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.a, self.b, self.size) + end = time.time() + self.benchmark.add_phase({"name": "add_two_other_branch", "time_sec": end - start}) + + # F. 
Compute the sum of the result; + start = time.time() + self.reduce_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.z, self.b, self.res, self.size) + end = time.time() + self.benchmark.add_phase({"name": "reduce", "time_sec": end - start}) + + result = self.res[0] + self.benchmark.add_to_benchmark("gpu_result", result) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: {result:.4f}") + + return result + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + # Recompute the CPU result only if necessary; + start = time.time() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + seed(self.random_seed) + if self.benchmark.random_init: + x_g = np.zeros(self.size) + y_g = np.zeros(self.size) + a_g = np.zeros(self.size) + for i in range(self.size): + x_g[i] = random() + y_g[i] = 2 * random() + a_g[i] = 4 * random() + else: + x_g = 1 / np.linspace(1, self.size, self.size) + y_g = 2 / np.linspace(1, self.size, self.size) + a_g = 4 / np.linspace(1, self.size, self.size) + + x_g = x_g ** 2 + y_g = y_g ** 2 + a_g = a_g ** 2 + x_g -= y_g + a_g += 2 + self.cpu_result = np.sum(x_g + a_g) + cpu_time = time.time() - start + difference = np.abs(self.cpu_result - gpu_result) + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", difference) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: {self.cpu_result:.4f}, " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_3.py b/projects/resources/python/benchmark/bench/single_gpu/bench_3.py new file mode 100644 index 00000000..ea9d628b --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_3.py @@ -0,0 +1,191 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
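The shared-memory tree reduction inside REDUCE_KERNEL above (a `__shared__` cache with one slot per thread, a halving stride between `__syncthreads()` barriers, and thread 0 issuing the final `atomicAdd`) can be modelled in a few lines of plain Python; `block_tree_reduce` is an illustrative name, not part of the suite:

```python
def block_tree_reduce(cache):
    # Model of one block's shared-memory phase in REDUCE_KERNEL: `cache`
    # plays the role of the __shared__ array, one slot per thread.
    # At each step, "thread" t (t < i) accumulates cache[t + i]; the active
    # range halves until cache[0] holds the block's partial sum.
    n = len(cache)
    assert n & (n - 1) == 0, "block size must be a power of two, e.g. 128"
    i = n // 2
    while i > 0:
        for t in range(i):      # stands in for threads running between barriers
            cache[t] += cache[t + i]
        i //= 2
    return cache[0]             # thread 0 atomicAdds this value into res
```

The inner loop over `t` is sequential here, while on the GPU those updates run concurrently between barriers; the partial sum produced is the same.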
+ +# coding=utf-8 +import polyglot +import time +import numpy as np +from random import random, randint, seed + +from benchmark import Benchmark +from benchmark_result import BenchmarkResult + +############################## +############################## + +NUM_ITER = 5 + +NUM_THREADS_PER_BLOCK = 128 + +SQUARE_KERNEL = """ + extern "C" __global__ void square(float* x, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + x[idx] = x[idx] * x[idx]; + } + } + """ + +REDUCE_KERNEL = """ + extern "C" __global__ void reduce(float *x, float *y, float *res, int n) { + __shared__ float cache[%d]; + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) { + cache[threadIdx.x] = x[i] + y[i]; + } + __syncthreads(); + + // Perform tree reduction; + i = %d / 2; + while (i > 0) { + if (threadIdx.x < i) { + cache[threadIdx.x] += cache[threadIdx.x + i]; + } + __syncthreads(); + i /= 2; + } + if (threadIdx.x == 0) { + atomicAdd(res, cache[0]); + } + } + """ % (NUM_THREADS_PER_BLOCK, NUM_THREADS_PER_BLOCK) + +############################## +############################## + + +class Benchmark3(Benchmark): + """ + Compute a pipeline of GrCUDA kernels using loops to build a dynamic graph. 
+ Structure of the computation: + A: x^2 ─ [5 times] ─┐ + ├─> C: res=sum(x+y) + B: x^2 ─ [5 times] ─┘ + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b3", benchmark, nvprof_profile) + self.size = 0 + self.x = None + self.y = None + self.res = None + self.num_blocks = 0 + self.square_kernel = None + self.reduce_kernel = None + self.cpu_result = 0 + self.num_iter = NUM_ITER + + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.num_blocks = (size + NUM_THREADS_PER_BLOCK - 1) // NUM_THREADS_PER_BLOCK + + # Allocate 2 vectors; + start = time.time() + self.x = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.y = polyglot.eval(language="grcuda", string=f"float[{size}]") + + # Allocate a support vector; + self.res = polyglot.eval(language="grcuda", string=f"float[1]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.square_kernel = build_kernel(SQUARE_KERNEL, "square", "pointer, sint32") + self.reduce_kernel = build_kernel(REDUCE_KERNEL, "reduce", "pointer, pointer, pointer, sint32") + + end = time.time() + self.benchmark.add_phase({"name": "allocation", "time_sec": end - start}) + + def init(self): + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + start = time.time() + for i in range(self.size): + if self.benchmark.random_init: + self.x[i] = random() + self.y[i] = random() + else: + self.x[i] = 1 / (i + 1) + self.y[i] = 1 / (i + 1) + end = time.time() + self.benchmark.add_phase({"name": "initialization", "time_sec": end - start}) + + def execute(self) -> object: + # This must be reset at every execution; + self.res[0] = 0 + + # A. B. Call the kernels. 
The 2 computations are independent, and can be done in parallel; + for i in range(self.num_iter): + start = time.time() + self.square_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.x, self.size) + self.square_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.y, self.size) + end = time.time() + self.benchmark.add_phase({"name": f"square_{i}", "time_sec": end - start}) + + # C. Compute the sum of the result; + start = time.time() + self.reduce_kernel(self.num_blocks, NUM_THREADS_PER_BLOCK)(self.x, self.y, self.res, self.size) + end = time.time() + self.benchmark.add_phase({"name": "reduce", "time_sec": end - start}) + + result = self.res[0] + self.benchmark.add_to_benchmark("gpu_result", result) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: {result:.4f}") + + return result + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + # Recompute the CPU result only if necessary; + start = time.time() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + seed(self.random_seed) + if self.benchmark.random_init: + x_g = np.zeros(self.size) + y_g = np.zeros(self.size) + for i in range(self.size): + x_g[i] = random() + y_g[i] = random() + else: + x_g = 1 / np.linspace(1, self.size, self.size) + y_g = 1 / np.linspace(1, self.size, self.size) + + for i in range(NUM_ITER): + x_g = x_g ** 2 + y_g = y_g ** 2 + self.cpu_result = np.sum(x_g + y_g) + cpu_time = time.time() - start + difference = np.abs(self.cpu_result - gpu_result) + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", difference) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: {self.cpu_result:.4f}, " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_4.py 
b/projects/resources/python/benchmark/bench/single_gpu/bench_4.py new file mode 100644 index 00000000..77973667 --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_4.py @@ -0,0 +1,147 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
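As a plain-NumPy reference for Benchmark3's dataflow above (two independent chains that square their vector NUM_ITER times, joined by a final sum-reduction), mirroring what its `cpu_validation` computes; the `bench3_reference` name and the vector size are illustrative:

```python
import numpy as np

NUM_ITER = 5  # matches the benchmark's default

def bench3_reference(x: np.ndarray, y: np.ndarray) -> float:
    # A, B: square each vector NUM_ITER times (two independent chains);
    for _ in range(NUM_ITER):
        x = x ** 2
        y = y ** 2
    # C: reduce the element-wise sum to a scalar, as in REDUCE_KERNEL;
    return float(np.sum(x + y))

size = 1000
x = 1 / np.linspace(1, size, size)  # same deterministic init as the benchmark
y = 1 / np.linspace(1, size, size)
result = bench3_reference(x, y)
```

With this deterministic initialization only the first element (value 1.0) survives repeated squaring, so the result is very close to 2.0; the GrCUDA version differs only in that the two squaring chains can be scheduled on the GPU in parallel.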
+ +# coding=utf-8 +import polyglot +import time +import numpy as np +from random import random, randint, seed + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D +from benchmark_result import BenchmarkResult +from java.lang import System + +############################## +############################## + +SUM_KERNEL = """ +extern "C" __global__ void sum(int* x, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + x[i] += 1; + } +} +""" + +############################## +############################## + + +class Benchmark4(Benchmark): + """ + A benchmark with 2 very simple independent computations, used to measure overheads and the impact of data transfer; + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b4", benchmark, nvprof_profile) + self.size = 0 + self.x = None + self.y = None + self.num_blocks = 64 + self.sum_kernel = None + self.cpu_result = 0 + self.block_size = DEFAULT_BLOCK_SIZE_1D + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.block_size = block_size["block_size_1d"] + + # Allocate 4 vectors; + self.x = polyglot.eval(language="grcuda", string=f"int[{size}]") + self.y = polyglot.eval(language="grcuda", string=f"int[{size}]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.sum_kernel = build_kernel(SUM_KERNEL, "sum", "pointer, sint32") + + @time_phase("initialization") + def init(self): + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + for i in range(self.size): + if self.benchmark.random_init: + self.x[i] = randint(0, 10) + self.y[i] = randint(0, 10) + else: + self.x[i] = 1 / (i + 1) + self.y[i] = 1 / (i + 1) + + def execute(self) -> object: + + # A. B. Call the kernels. 
The 2 computations are independent, and can be done in parallel; + start = System.nanoTime() + self.sum_kernel(self.num_blocks, self.block_size)(self.x, self.size) + end = System.nanoTime() + self.benchmark.add_phase({"name": "sum_1", "time_sec": (end - start) / 1_000_000_000}) + + start = System.nanoTime() + self.sum_kernel(self.num_blocks, self.block_size)(self.y, self.size) + end = System.nanoTime() + self.benchmark.add_phase({"name": "sum_2", "time_sec": (end - start) / 1_000_000_000}) + + start = System.nanoTime() + result_1 = self.x[0] + result_2 = self.y[0] + end = System.nanoTime() + self.benchmark.add_phase({"name": "read_result", "time_sec": (end - start) / 1_000_000_000}) + + self.benchmark.add_to_benchmark("gpu_result", result_1 + result_2) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: {result_1} {result_2}") + + return result_1 + result_2 + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + seed(self.random_seed) + if self.benchmark.random_init: + x_g = np.zeros(self.size) + y_g = np.zeros(self.size) + for i in range(self.size): + x_g[i] = randint(0, 10) + y_g[i] = randint(0, 10) + else: + x_g = 1 / np.linspace(1, self.size, self.size) + y_g = 1 / np.linspace(1, self.size, self.size) + + x_g += 1 + y_g += 1 + self.cpu_result = x_g[0] + y_g[0] + cpu_time = (System.nanoTime() - start) / 1_000_000_000 + difference = np.abs(self.cpu_result - gpu_result) + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", difference) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: {self.cpu_result:.4f}, " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git
a/projects/resources/python/benchmark/bench/single_gpu/bench_5.py b/projects/resources/python/benchmark/bench/single_gpu/bench_5.py new file mode 100644 index 00000000..3e5d6776 --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_5.py @@ -0,0 +1,231 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# coding=utf-8 +import polyglot +import numpy as np +from random import random, randint, seed + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D +from benchmark_result import BenchmarkResult +from java.lang import System +import math + +############################## +############################## + +R = 0.08 +V = 0.3 +T = 1.0 +K = 60.0 + +BS_KERNEL = """ +__device__ inline double cndGPU(double d) { + const double A1 = 0.31938153f; + const double A2 = -0.356563782f; + const double A3 = 1.781477937f; + const double A4 = -1.821255978f; + const double A5 = 1.330274429f; + const double RSQRT2PI = 0.39894228040143267793994605993438f; + + double + K = 1.0 / (1.0 + 0.2316419 * fabs(d)); + + double + cnd = RSQRT2PI * exp(- 0.5f * d * d) * + (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))); + + if (d > 0) + cnd = 1.0 - cnd; + + return cnd; +} + +extern "C" __global__ void bs(const double *x, double *y, int N, double R, double V, double T, double K) { + + double sqrtT = 1.0 / rsqrt(T); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + double expRT; + double d1, d2, CNDD1, CNDD2; + d1 = (log(x[i] / K) + (R + 0.5 * V * V) * T) / (V * sqrtT); + d2 = d1 - V * sqrtT; + + CNDD1 = cndGPU(d1); + CNDD2 = cndGPU(d2); + + //Calculate Call and Put simultaneously + expRT = exp(-R * T); + y[i] = x[i] * CNDD1 - K * expRT * CNDD2; + } +} +""" + +############################## 
+############################## + + +class Benchmark5(Benchmark): + """ + Compute the Black & Scholes equation for European call options, for 24 different underlying stocks, + and for each stock a vector of prices at time 0. + The main computation is taken from Nvidia's CUDA code samples (link), + and adapted to use double precision arithmetic to create a more computationally intensive kernel. + The idea of this benchmark is to simulate a streaming computation in which data-transfer + and computation of multiple kernels can be overlapped efficiently, without data-dependencies between kernels. + Unlike bench_1, computation, and not data transfer, is the main limiting factor for parallel execution. + + Structure of the computation: + + BS(x[1]) -> ... -> BS(x[24]) + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b5", benchmark, nvprof_profile) + self.size = 0 + + # self.num_blocks = DEFAULT_NUM_BLOCKS + self.sum_kernel = None + self.cpu_result = 0 + self.block_size = DEFAULT_BLOCK_SIZE_1D + + self.K = 24 + self.x = [[]] * self.K + self.x_tmp = None + self.y = [[]] * self.K + + self.bs_kernel = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.block_size = block_size["block_size_1d"] + self.x_tmp = None + # self.x_tmp = [0] * self.size + + # Allocate vectors; + for i in range(self.K): + self.x[i] = polyglot.eval(language="grcuda", string=f"double[{size}]") + self.y[i] = polyglot.eval(language="grcuda", string=f"double[{size}]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.bs_kernel = build_kernel(BS_KERNEL, "bs", "const pointer, pointer, sint32, double, double, double, double") + + @time_phase("initialization") + def init(self): + self.random_seed = randint(0, 10000000) + # seed(self.random_seed) + # if self.benchmark.random_init: + # self.x_tmp =
np.random.uniform(-0.5, 0.5, self.size).astype(np.float64) + K + # else: + # self.x_tmp = np.zeros(self.size, dtype=np.float64) + K + seed(self.random_seed) + self.x_tmp = [K] * self.size + if self.benchmark.random_init: + # self.x_tmp = np.random.uniform(-0.5, 0.5, self.size).astype(float) + K + for i in range(len(self.x_tmp)): + self.x_tmp[i] = random() - 0.5 + K + + @time_phase("reset_result") + def reset_result(self) -> None: + for i in range(self.K): + # self.x[i].copyFrom(int(np.int64(self.x_tmp.ctypes.data)), self.size) + for j in range(self.size): + self.x[i][j] = self.x_tmp[j] + + def execute(self) -> object: + self.block_size = self._block_size["block_size_1d"] + result = [0] * self.K + + # Call the kernels; + start_comp = System.nanoTime() + start = System.nanoTime() + for i in range(self.K): + self.execute_phase(f"bs_{i}", self.bs_kernel(self.num_blocks, self.block_size), self.x[i], self.y[i], self.size, R, V, T, K) + + if self.time_phases: + start = System.nanoTime() + for i in range(self.K): + result[i] = self.y[i][0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + + self.benchmark.add_to_benchmark("gpu_result", result[0]) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: {result[0]}") + + return result[0] + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + def CND(X): + """ + Cumulative normal distribution. + Helper function used by BS(...). + """ + + (a1, a2, a3, a4, a5) = (0.31938153, -0.356563782, 1.781477937, -1.821255978, 1.330274429) + L = np.absolute(X) + K = np.float64(1.0) / (1.0 + 0.2316419 * L) + w = 1.0 - 1.0 / math.sqrt(2 * np.pi) * np.exp(-L * L / 2.) 
* \ + (a1 * K + + a2 * (K ** 2) + + a3 * (K ** 3) + + a4 * (K ** 4) + + a5 * (K ** 5)) + + mask = X < 0 + w = w * ~mask + (1.0 - w) * mask + + return w + + def BS(X, R, V, T, K): + """Black Scholes Function.""" + d1_arr = (np.log(X / K) + (R + V * V / 2.) * T) / (V * math.sqrt(T)) + d2_arr = d1_arr - V * math.sqrt(T) + w_arr = CND(d1_arr) + w2_arr = CND(d2_arr) + return X * w_arr - K * math.exp(-R * T) * w2_arr + + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + res = BS(np.array(self.x_tmp), R, V, T, K) + self.cpu_result = res[0] + cpu_time = (System.nanoTime() - start) / 1_000_000_000 + difference = np.abs(self.cpu_result - gpu_result) + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", difference) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: {self.cpu_result:.4f}, " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_6.py b/projects/resources/python/benchmark/bench/single_gpu/bench_6.py new file mode 100644 index 00000000..7a2f33f1 --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_6.py @@ -0,0 +1,412 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution.
+# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
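For reference, the closed-form call price computed by the `bs` kernel above, S·N(d1) − K·e^(−R·T)·N(d2), can be reproduced in a few lines of NumPy. This sketch uses the benchmark's constants and an `erf`-based normal CDF in place of the kernel's polynomial approximation (the two agree to roughly 1e-7); the helper names are illustrative:

```python
import numpy as np
from math import erf, exp, sqrt

R, V, T, K = 0.08, 0.3, 1.0, 60.0  # constants from the benchmark

def cnd(d: np.ndarray) -> np.ndarray:
    # Standard normal CDF via erf; the CUDA kernel uses a polynomial
    # approximation of the same function instead;
    return 0.5 * (1.0 + np.array([erf(u / sqrt(2.0)) for u in np.atleast_1d(d)]))

def bs_call(x: np.ndarray) -> np.ndarray:
    # d1, d2 and the call price, exactly as in the "bs" kernel;
    d1 = (np.log(x / K) + (R + 0.5 * V * V) * T) / (V * sqrt(T))
    d2 = d1 - V * sqrt(T)
    return x * cnd(d1) - K * exp(-R * T) * cnd(d2)

prices = bs_call(np.array([59.5, 60.0, 60.5]))  # spot prices around the strike K
```

The call price grows monotonically with the spot price, which is an easy sanity check for both the CPU and GPU versions.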
+ +# coding=utf-8 +import polyglot +from java.lang import System +import numpy as np +from random import randint, seed + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D +from benchmark_result import BenchmarkResult + +############################## +############################## + +NB_KERNEL = """ + extern "C" __global__ void nb_1(const int* x, float* y, float* z, int size, int n_feat, int n_classes) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_classes; j++) { + for (int q = 0; q < n_feat; q++) { + z[i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q]; + } + } + } + } + + extern "C" __global__ void nb_2(float* x, float* y, int n_row_x, int n_col_x) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float curr_max = x[i * n_col_x]; + for (int j = 0; j < n_col_x; j++) { + curr_max = fmaxf(curr_max, x[i * n_col_x + j]); + } + y[i] = curr_max; + } + } + + extern "C" __global__ void nb_3(float* x, float* y, float* z, int n_row_x, int n_col_x) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float sum = 0; + for (int j = 0; j < n_col_x; j++) { + sum += expf(x[i * n_col_x + j] - y[i]); + } + z[i] = logf(sum) + y[i]; + } + } + + extern "C" __global__ void nb_4(float* x, float* y, int n_row_x, int n_col_x) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] = expf(x[i * n_col_x + j] - y[i]); + } + } + } + """ + +RR_KERNEL = """ + extern "C" __global__ void rr_1(const int* x, float *y, int n_row_x, int n_col_x) { + for(int j = blockIdx.x * blockDim.x + threadIdx.x; j < n_col_x; j += blockDim.x * gridDim.x) { + float feature_mean = 0; + float sum_sq = 0; + // Compute mean and variance; + for (int i = 0; i < n_row_x; i++) { + feature_mean += x[j * n_row_x + i]; + sum_sq 
+= x[j * n_row_x + i] * x[j * n_row_x + i]; + } + feature_mean /= n_row_x; + float std = sqrtf(sum_sq / n_row_x - feature_mean * feature_mean); + + // Update values; + for (int i = 0; i < n_row_x; i++) { + y[j * n_row_x + i] = ((float) x[j * n_row_x + i] - feature_mean) / std; + } + } + } + + extern "C" __global__ void rr_2(float* x, float* y, float* z, int size, int n_feat, int n_classes) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_classes; j++) { + for (int q = 0; q < n_feat; q++) { + z[i * n_classes + j] += x[i * n_feat + q] * y[j * n_feat + q]; + } + } + } + } + + extern "C" __global__ void rr_3(float* x, float *y, int n_row_x, int n_col_x) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] += y[j]; + } + } + } + """ + +ENSEMBLE_KERNEL = """ + extern "C" __global__ void softmax(float *x, int n_row_x, int n_col_x) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + float row_exp_sum = 0; + for (int j = 0; j < n_col_x; j++) { + row_exp_sum += expf( x[i * n_col_x + j]); + } + for (int j = 0; j < n_col_x; j++) { + x[i * n_col_x + j] = expf(x[i * n_col_x + j]) / row_exp_sum; + } + } + } + + extern "C" __global__ void argmax(float *x, float *y, int *z, int n_row_x, int n_col_x) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_row_x; i += blockDim.x * gridDim.x) { + int curr_best_index = 0; + float curr_best = x[i * n_col_x] + y[i * n_col_x]; + for (int j = 0; j < n_col_x; j++) { + float curr = x[i * n_col_x + j] + y[i * n_col_x + j]; + if (curr > curr_best) { + curr_best = curr; + curr_best_index = j; + } + } + z[i] = curr_best_index; + } + } + """ + +############################## +############################## + + +class Benchmark6(Benchmark): + """ + Compute an ensemble of Categorical Naive Bayes and Ridge Regression 
classifiers. + Predictions are aggregated averaging the class scores after softmax normalization. + The computation is done on mock data and parameters, but is conceptually identical to a real ML pipeline. + In the DAG below, input arguments that are not involved in the computation of dependencies are omitted. + + The size of the benchmark is the number of rows in the matrix (each representing a document with 200 features). + Predictions are done by choosing among 10 classes. + The Ridge Regression classifier takes about 2x the time of the Categorical Naive Bayes classifier. + + Structure of the computation: + + RR-1: standard normalization + RR-2: matrix multiplication + RR-3: add vector to matrix, row-wise + NB-1: matrix multiplication + NB-2: row-wise maximum + NB-3: log of sum of exponential, row-wise + NB-4: exponential, element-wise + + ┌─> RR-1(const X,Z) ─> RR-2(const Z,R2) ─> RR-3(R2) ─> SOFTMAX(R1) ─────────────┐ + ─┤ ├─> ARGMAX(const R1,const R2,R) + └─> NB-1(const X,R1) ─> NB-2(const R1,AMAX) ─> (...) │ + (...) 
-> NB-3(const R1,const AMAX,L) ─> NB-4(R1,const L) ─> SOFTMAX(R2) ──┘ + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b6", benchmark, nvprof_profile) + self.size = 0 + self.x = None + self.z = None + self.r1 = None + self.r2 = None + self.r = None + + self.nb_1 = None + self.nb_2 = None + self.nb_3 = None + self.nb_4 = None + self.rr_1 = None + self.rr_2 = None + self.rr_3 = None + self.softmax = None + self.argmax = None + + self.cpu_result = None + + # Load matrices from files; + # self.nb_feat_log_prob_np = np.loadtxt("../other/data/nb_feat_log_prob.csv", delimiter=",") + # self.nb_class_log_prior_np = np.loadtxt("../other/data/nb_class_log_prior.csv", delimiter=",") + # self.ridge_coeff_np = np.loadtxt("../other/data/ridge_coeff.csv", delimiter=",") + # self.ridge_intercept_np = np.loadtxt("../other/data/ridge_intercept.csv", delimiter=",") + + # Internal arrays used by the algorithms, they do not affect the DAG structure; + self.nb_feat_log_prob = None + self.nb_class_log_prior = None + self.ridge_coeff = None + self.ridge_intercept = None + self.nb_amax = None + self.nb_l = None + + self.num_features = 200 # self.nb_feat_log_prob_np.shape[1] + self.num_classes = 10 # self.nb_feat_log_prob_np.shape[0] + self.max_occurrence_of_ngram = 10 + + self.num_blocks_size = self.num_blocks # 64 # DEFAULT_NUM_BLOCKS + self.num_blocks_feat = self.num_blocks # 64 # DEFAULT_NUM_BLOCKS + self.block_size = DEFAULT_BLOCK_SIZE_1D + + self.x_cpu = None + self.nb_feat_log_prob_cpu = None + self.ridge_coeff_cpu = None + self.nb_class_log_prior_cpu = None + self.ridge_intercept_cpu = None + self.r1_cpu = None + self.r2_cpu = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.block_size = block_size["block_size_1d"] + + # Allocate vectors; + self.x = polyglot.eval(language="grcuda", string=f"int[{size * self.num_features}]") + self.z = 
polyglot.eval(language="grcuda", string=f"float[{size * self.num_features}]") + + self.nb_feat_log_prob = polyglot.eval(language="grcuda", string=f"float[{self.num_classes * self.num_features}]") + self.nb_class_log_prior = polyglot.eval(language="grcuda", string=f"float[{self.num_classes}]") + self.ridge_coeff = polyglot.eval(language="grcuda", string=f"float[{self.num_classes * self.num_features}]") + self.ridge_intercept = polyglot.eval(language="grcuda", string=f"float[{self.num_classes}]") + + self.nb_amax = polyglot.eval(language="grcuda", string=f"float[{self.size}]") + self.nb_l = polyglot.eval(language="grcuda", string=f"float[{self.size}]") + + self.r1 = polyglot.eval(language="grcuda", string=f"float[{self.size * self.num_classes}]") + self.r2 = polyglot.eval(language="grcuda", string=f"float[{self.size * self.num_classes}]") + self.r = polyglot.eval(language="grcuda", string=f"int[{self.size}]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.nb_1 = build_kernel(NB_KERNEL, "nb_1", "const pointer, pointer, pointer, sint32, sint32, sint32") + self.nb_2 = build_kernel(NB_KERNEL, "nb_2", "pointer, pointer, sint32, sint32") + self.nb_3 = build_kernel(NB_KERNEL, "nb_3", "pointer, pointer, pointer, sint32, sint32") + self.nb_4 = build_kernel(NB_KERNEL, "nb_4", "pointer, pointer, sint32, sint32") + + self.rr_1 = build_kernel(RR_KERNEL, "rr_1", "const pointer, pointer, sint32, sint32") + self.rr_2 = build_kernel(RR_KERNEL, "rr_2", "pointer, pointer, pointer, sint32, sint32, sint32") + self.rr_3 = build_kernel(RR_KERNEL, "rr_3", "pointer, pointer, sint32, sint32") + + self.softmax = build_kernel(ENSEMBLE_KERNEL, "softmax", "pointer, sint32, sint32") + self.argmax = build_kernel(ENSEMBLE_KERNEL, "argmax", "pointer, pointer, pointer, sint32, sint32") + self.initialize_rand = polyglot.eval(language="js", string="(x, m) => { for (let i = 0; i < x.length; i++) { x[i] = Math.floor(Math.random() * m) }}") + + 
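The final ensemble step of Benchmark6 (row-wise softmax normalization of each classifier's scores, then an argmax over the summed scores, as in the `softmax` and `argmax` kernels above) has a compact NumPy equivalent; the helper names and the small score matrices are illustrative:

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    # Row-wise softmax, as in the "softmax" CUDA kernel;
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_predict(r1: np.ndarray, r2: np.ndarray) -> np.ndarray:
    # Normalize each classifier's scores, then take the per-row argmax
    # of the summed scores, as in the "argmax" CUDA kernel;
    return np.argmax(softmax_rows(r1) + softmax_rows(r2), axis=1)

r1 = np.array([[0.1, 2.0, 0.3], [1.0, 0.2, 0.1]])  # e.g. Naive Bayes scores
r2 = np.array([[0.2, 0.1, 3.0], [0.9, 0.3, 0.2]])  # e.g. Ridge Regression scores
pred = ensemble_predict(r1, r2)
```

Each row of `softmax_rows` sums to 1, so the two classifiers contribute on a comparable scale before the argmax combines them.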
@time_phase("initialization") + def init(self): + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + + # Create a random input; + max_occurrence_of_ngram = 10 + # self.x_cpu = np.random.randint(0, max_occurrence_of_ngram, (self.size, self.num_features), dtype=np.int32) + + self.nb_feat_log_prob_cpu = np.random.random_sample((self.num_classes, self.num_features)).astype(dtype=np.float32) + self.ridge_coeff_cpu = np.random.random_sample((self.num_classes, self.num_features)).astype(dtype=np.float32) + self.nb_class_log_prior_cpu = np.random.random_sample(self.num_classes).astype(dtype=np.float32) + self.ridge_intercept_cpu = np.random.random_sample(self.num_classes).astype(dtype=np.float32) + + # self.r1_cpu = np.zeros((self.size, self.num_classes)) + # for j in range(self.num_classes): + # self.r1_cpu[:, j] = self.nb_class_log_prior_cpu[j] + # self.r2_cpu = np.zeros((self.size, self.num_classes)) + + self.initialize_rand(self.x, self.max_occurrence_of_ngram) + # self.x.copyFrom(int(np.int64(self.x_cpu.ctypes.data)), len(self.x)) + self.nb_feat_log_prob.copyFrom(int(np.int64(self.nb_feat_log_prob_cpu.ctypes.data)), len(self.nb_feat_log_prob)) + self.ridge_coeff.copyFrom(int(np.int64(self.ridge_coeff_cpu.ctypes.data)), len(self.ridge_coeff)) + self.nb_class_log_prior.copyFrom(int(np.int64(self.nb_class_log_prior_cpu.ctypes.data)), len(self.nb_class_log_prior)) + self.ridge_intercept.copyFrom(int(np.int64(self.ridge_intercept_cpu.ctypes.data)), len(self.ridge_intercept)) + # self.r1.copyFrom(int(np.int64(self.r1_cpu.ctypes.data)), len(self.r1)) + # self.r2.copyFrom(int(np.int64(self.r2_cpu.ctypes.data)), len(self.r2)) + + @time_phase("reset_result") + def reset_result(self) -> None: + for i in range(self.size): + for j in range(self.num_classes): + self.r1[i * self.num_classes + j] = self.nb_class_log_prior[j] + self.r2[i * self.num_classes + j] = 0 + + def execute(self) -> object: + self.num_blocks_size = self.num_blocks # 64 # DEFAULT_NUM_BLOCKS + 
self.num_blocks_feat = self.num_blocks # 64 # DEFAULT_NUM_BLOCKS + self.block_size = self._block_size["block_size_1d"] + # Schedule the categorical Naive Bayes and Ridge Regression kernels + start_comp = System.nanoTime() + start = 0 + + # RR - 1. + self.execute_phase("rr_1", self.rr_1(self.num_blocks_feat, self.block_size), + self.x, self.z, self.size, self.num_features) + + # NB - 1. + self.execute_phase("nb_1", self.nb_1(self.num_blocks_size, self.block_size), + self.x, self.nb_feat_log_prob, self.r1, self.size, self.num_features, self.num_classes) + + # RR - 2. + self.execute_phase("rr_2", self.rr_2(self.num_blocks_size, self.block_size), + self.z, self.ridge_coeff, self.r2, self.size, self.num_features, self.num_classes) + + # NB - 2. + self.execute_phase("nb_2", self.nb_2(self.num_blocks_size, self.block_size), + self.r1, self.nb_amax, self.size, self.num_classes) + + # NB - 3. + self.execute_phase("nb_3", self.nb_3(self.num_blocks_size, self.block_size), + self.r1, self.nb_amax, self.nb_l, self.size, self.num_classes) + + # RR - 3. + self.execute_phase("rr_3", self.rr_3(self.num_blocks_size, self.block_size), + self.r2, self.ridge_intercept, self.size, self.num_classes) + + # NB - 4. 
+ self.execute_phase("nb_4", self.nb_4(self.num_blocks_size, self.block_size), + self.r1, self.nb_l, self.size, self.num_classes) + + # Ensemble results; + + # Softmax normalization; + self.execute_phase("softmax_1", self.softmax(self.num_blocks_size, self.block_size), self.r1, self.size, self.num_classes) + self.execute_phase("softmax_2", self.softmax(self.num_blocks_size, self.block_size), self.r2, self.size, self.num_classes) + + # Prediction; + self.execute_phase("argmax", self.argmax(self.num_blocks_size, self.block_size), self.r1, self.r2, self.r, self.size, self.num_classes) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + tmp = self.r[0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + self.benchmark.add_to_benchmark("gpu_result", 0) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: [" + ", ".join([f"{x:.4f}" for x in self.r[:10]]) + "...]") + + return self.r + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + def softmax(X): + return np.exp(X) / np.sum(np.exp(X), axis=1).reshape(X.shape[0], 1) + + def logsumexp(X): + return np.log(np.sum(np.exp(X))) + + def naive_bayes_predict(X, feature_log_prob, log_class_prior): + jll = X.dot(feature_log_prob.T) + log_class_prior + amax = np.amax(jll, axis=1) + l = logsumexp(jll - np.atleast_2d(amax).T) + amax + + return np.exp(jll - np.atleast_2d(l).T) + + def normalize(X): + return (X - np.mean(X, axis=0)) / np.std(X, axis=0) + + def ridge_pred(X, coef, intercept): + return np.dot(X, coef.T) + intercept + + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + 
seed(self.random_seed) + + x_cpu = np.zeros((self.size, self.num_features), dtype=np.int32) + for i in range(self.size): + for j in range(self.num_features): + x_cpu[i, j] = self.x[i * self.num_features + j] + + r1_g = naive_bayes_predict(x_cpu, self.nb_feat_log_prob_cpu, self.nb_class_log_prior_cpu) + r2_g = ridge_pred(normalize(x_cpu), self.ridge_coeff_cpu, self.ridge_intercept_cpu) + r_g = np.argmax(softmax(r1_g) + softmax(r2_g), axis=1) + self.cpu_result = r_g + + cpu_time = System.nanoTime() - start + + # Compare GPU and CPU results; + difference = 0 + for i in range(self.size): + difference += np.abs(self.cpu_result[i] - gpu_result[i]) + + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", str(difference)) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: [" + ", ".join([f"{x:.4f}" for x in self.cpu_result[:10]]) + "...]; " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_7.py b/projects/resources/python/benchmark/bench/single_gpu/bench_7.py new file mode 100644 index 00000000..b7241dbf --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_7.py @@ -0,0 +1,419 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +# coding=utf-8 +import polyglot +from java.lang import System +import numpy as np +from random import random, randint, seed, sample +import os +import pickle +import json + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D, DEFAULT_NUM_BLOCKS +from benchmark_result import BenchmarkResult + +############################## +############################## + +NUM_THREADS_PER_BLOCK = 32 +THREADS_PER_VECTOR = 4 +MAX_NUM_VECTORS_PER_BLOCK = 1024 / THREADS_PER_VECTOR + +SPMV_KERNEL = """ +extern "C" __global__ void spmv(const int *ptr, const int *idx, const int *val, const float *vec, float *res, int num_rows, int num_nnz) { + + for(int n = blockIdx.x * blockDim.x + threadIdx.x; n < num_rows; n += blockDim.x * gridDim.x) { + float sum = 0; + for (int i = ptr[n]; i < ptr[n + 1]; i++) { + sum += val[i] * vec[idx[i]]; + } + res[n] = sum; + } +} + +extern "C" __global__ void spmv3(int* cudaRowCounter, int* d_ptr, int* d_cols, int* d_val, float* d_vector, float* d_out, int N) { + int i; + int thread_per_vector = %d; + float sum; + int row; + int rowStart, rowEnd; + int laneId = threadIdx.x %% thread_per_vector; //lane index in the vector + int vectorId = threadIdx.x / thread_per_vector; //vector index in the thread block + int warpLaneId = threadIdx.x & 31; //lane index in the warp + int warpVectorId = warpLaneId / thread_per_vector; //vector index in the warp + + __shared__ volatile int space[%d][2]; + + // Get the row index + if (warpLaneId == 0) { + row = atomicAdd(cudaRowCounter, 32 / thread_per_vector); + } + // Broadcast the value to other threads in the same warp and compute the row index of each vector + row = __shfl_sync(0xffffffff, row, 0) + warpVectorId; + + while (row < N) { + + // Use two threads to fetch the row offset + if (laneId < 2) { + space[vectorId][laneId] = d_ptr[row + laneId]; + } + rowStart = space[vectorId][0]; + rowEnd = space[vectorId][1]; + + sum = 0; + // Compute dot product + if (thread_per_vector == 32) { + + // Ensure 
aligned memory access + i = rowStart - (rowStart & (thread_per_vector - 1)) + laneId; + + // Process the unaligned part + if (i >= rowStart && i < rowEnd) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + + // Process the aligned part + for (i += thread_per_vector; i < rowEnd; i += thread_per_vector) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + } else { + for (i = rowStart + laneId; i < rowEnd; i += thread_per_vector) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + } + // Intra-vector reduction + for (i = thread_per_vector >> 1; i > 0; i >>= 1) { + sum += __shfl_down_sync(0xffffffff,sum, i); + } + + // Save the results + if (laneId == 0) { + d_out[row] = sum; + } + + // Get a new row index + if(warpLaneId == 0) { + row = atomicAdd(cudaRowCounter, 32 / thread_per_vector); + } + // Broadcast the row index to the other threads in the same warp and compute the row index of each vector + row = __shfl_sync(0xffffffff,row, 0) + warpVectorId; + } +} +""" % (THREADS_PER_VECTOR, MAX_NUM_VECTORS_PER_BLOCK) + + +SUM_KERNEL = """ +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +extern "C" __global__ void sum(float *x, float* z, int N) { + int warp_size = 32; + float sum = float(0); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} +""" + +DIVIDE_KERNEL = """ +extern "C" __global__ void divide(float* x, float *y, float *val, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = x[i] / val[0]; + } +} +""" + + +############################## 
+############################## + + +class Benchmark7(Benchmark): + """ + Compute the HITS algorithm on a graph. The algorithm is composed of repeated sparse matrix-vector multiplications + on a matrix and its transpose (outgoing and ingoing edges of a graph). + The 2 matrix multiplications, for each iteration, can be computed in parallel, and take most of the total computation time. + + The input graph has size vertices, degree 3 and uniform distribution. + Each execution of this algorithm is composed of 5 iterations. + + As the benchmark is composed of 2 independent branches, the maximum theoretical speedup is 2x, + although realistic speedup will be lower and mostly achieved through transfer-computation overlapping. + + Structure of the computation (read-only parameters that do not influence the DAG are omitted): + + ┌─> SPMV(const H1,A2) ┬─> SUM(const A2,A_norm) ┬─> DIVIDE(A1,const A2,const A_norm) ─> CPU: A_norm=0 ─> (repeat) + │ └─────────┐ │ + ─┤ ┌─────────│──────────────┘ + │ │ └──────────────┐ + └─> SPMV(const A1,H2) ┴─> SUM(const H2,H_norm) ┴─> DIVIDE(H1,const H2,const H_norm) ─> CPU: H_norm=0 ─> (repeat) + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b7", benchmark, nvprof_profile) + self.size = 0 + self.num_nnz = 0 + self.max_degree = 3 # Each vertex has 3 edges; + self.num_iterations = 5 + self.ptr = None + self.idx = None + self.val = None + self.ptr2 = None + self.idx2 = None + self.val2 = None + self.auth1 = None + self.auth2 = None + self.hub1 = None + self.hub2 = None + self.auth_norm = None + self.hub_norm = None + self.row_cnt_1 = None + self.row_cnt_2 = None + + self.ptr_cpu = None + self.idx_cpu = None + self.val_cpu = None + self.ptr2_cpu = None + self.idx2_cpu = None + self.val2_cpu = None + + self.cpu_result = None + self.gpu_result = None + + self.num_blocks_size = self.num_blocks + self.block_size = None + + self.spmv_kernel = None + self.sum_kernel = None + self.divide_kernel = 
None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.num_nnz = size * self.max_degree + self.block_size = block_size["block_size_1d"] + + self.gpu_result = np.zeros(self.size) + + # Allocate vectors; + self.ptr = polyglot.eval(language="grcuda", string=f"int[{size + 1}]") + self.ptr2 = polyglot.eval(language="grcuda", string=f"int[{size + 1}]") + self.idx = polyglot.eval(language="grcuda", string=f"int[{self.num_nnz}]") + self.idx2 = polyglot.eval(language="grcuda", string=f"int[{self.num_nnz}]") + self.val = polyglot.eval(language="grcuda", string=f"int[{self.num_nnz}]") + self.val2 = polyglot.eval(language="grcuda", string=f"int[{self.num_nnz}]") + + self.auth1 = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.auth2 = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.hub1 = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.hub2 = polyglot.eval(language="grcuda", string=f"float[{size}]") + + self.auth_norm = polyglot.eval(language="grcuda", string=f"float[1]") + self.hub_norm = polyglot.eval(language="grcuda", string=f"float[1]") + + self.row_cnt_1 = polyglot.eval(language="grcuda", string=f"int[1]") + self.row_cnt_2 = polyglot.eval(language="grcuda", string=f"int[1]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.spmv_kernel = build_kernel(SPMV_KERNEL, "spmv3", + "pointer, pointer, pointer, pointer, pointer, pointer, sint32") + self.sum_kernel = build_kernel(SUM_KERNEL, "sum", "pointer, pointer, sint32") + self.divide_kernel = build_kernel(DIVIDE_KERNEL, "divide", "pointer, pointer, pointer, sint32") + + @time_phase("initialization") + def init(self): + + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + + # Create a random COO graph; + self.ptr_cpu = [0] * (self.size + 1) + self.idx_cpu = [0] * self.size * self.max_degree + self.val_cpu = [1] * self.size * self.max_degree 
+ self.val2_cpu = self.val_cpu + csc_dict = {} + for i in range(self.size): + # Create degree random edges; + self.ptr_cpu[i + 1] = self.ptr_cpu[i] + self.max_degree + edges = sample(range(self.size), self.max_degree) + self.idx_cpu[(i * self.max_degree):((i + 1) * self.max_degree)] = edges + for y_i in edges: + if y_i in csc_dict: + csc_dict[y_i] += [i] + else: + csc_dict[y_i] = [i] + self.ptr2_cpu = [0] * (self.size + 1) + self.idx2_cpu = [0] * self.size * self.max_degree + for i in range(self.size): + if i in csc_dict: + edges = csc_dict[i] + self.ptr2_cpu[i + 1] = self.ptr2_cpu[i] + len(edges) + self.idx2_cpu[self.ptr2_cpu[i]:self.ptr2_cpu[i + 1]] = edges + + @time_phase("reset_result") + def reset_result(self) -> None: + # FIXME: using the same data for CSC and CSR, because ptr2 is giving data-dependent performance differences + for i in range(len(self.ptr_cpu)): + self.ptr[i] = self.ptr_cpu[i] + self.ptr2[i] = self.ptr_cpu[i] + for i in range(len(self.idx_cpu)): + self.idx[i] = self.idx_cpu[i] + self.idx2[i] = self.idx_cpu[i] + self.val[i] = self.val_cpu[i] + self.val2[i] = self.val_cpu[i] + for i in range(self.size): + self.auth1[i] = 1.0 + self.auth2[i] = 1.0 + self.hub1[i] = 1.0 + self.hub2[i] = 1.0 + self.auth_norm[0] = 0.0 + self.hub_norm[0] = 0.0 + self.row_cnt_1[0] = 0 + self.row_cnt_2[0] = 0 + + def execute(self) -> object: + self.block_size = self._block_size["block_size_1d"] + self.num_blocks_size = self.num_blocks + num_blocks_spmv = int(np.ceil(self.size / self.block_size)) + + start_comp = System.nanoTime() + start = 0 + + for i in range(self.num_iterations): + # Authorities; + self.execute_phase(f"spmv_a_{i}", self.spmv_kernel(num_blocks_spmv, self.block_size, 4 * self.block_size), self.row_cnt_1, self.ptr2, + self.idx2, self.val2, self.hub1, self.auth2, self.size) + + # Hubs; + self.execute_phase(f"spmv_h_{i}", self.spmv_kernel(num_blocks_spmv, self.block_size, 4 * self.block_size), self.row_cnt_2, self.ptr, + self.idx, self.val, self.auth1, 
self.hub2, self.size) + + # Normalize authorities; + self.execute_phase(f"sum_a_{i}", self.sum_kernel(self.num_blocks_size, self.block_size), self.auth2, + self.auth_norm, self.size) + + # Normalize hubs; + self.execute_phase(f"sum_h_{i}", self.sum_kernel(self.num_blocks_size, self.block_size), self.hub2, + self.hub_norm, self.size) + + self.execute_phase(f"divide_a_{i}", self.divide_kernel(self.num_blocks_size, self.block_size), self.auth2, + self.auth1, self.auth_norm, self.size) + + self.execute_phase(f"divide_h_{i}", self.divide_kernel(self.num_blocks_size, self.block_size), self.hub2, + self.hub1, self.hub_norm, self.size) + + if self.time_phases: + start = System.nanoTime() + self.auth_norm[0] = 0.0 + self.hub_norm[0] = 0.0 + self.row_cnt_1[0] = 0.0 + self.row_cnt_2[0] = 0.0 + if self.time_phases: + end = System.nanoTime() + self.benchmark.add_phase({"name": f"norm_reset_{i}", "time_sec": (end - start) / 1_000_000_000}) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + tmp1 = self.auth1[0] + tmp2 = self.hub1[0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + # Compute GPU result; + # for i in range(self.size): + # self.gpu_result[i] = self.auth1[i] + self.hub1[i] + + self.benchmark.add_to_benchmark("gpu_result", 0) + if self.benchmark.debug: + BenchmarkResult.log_message( + f"\tgpu result: [" + ", ".join([f"{x:.4f}" for x in self.gpu_result[:10]]) + "...]") + + return self.gpu_result + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + def spmv(ptr, idx, val, vec): + res = np.zeros(len(ptr) - 1) + for i in range(len(ptr) - 1): + curr_sum = 0 + start = int(ptr[i]) + end = int(ptr[i + 1]) + for j in range(start, end): + curr_sum += val[j] * vec[idx[j]] + res[i] = curr_sum + return res + + # Recompute the CPU 
result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + seed(self.random_seed) + # Initialize the support device arrays; + N = self.size + + auth1 = np.ones(N) + hub1 = np.ones(N) + + # Main iteration; + for i in range(self.num_iterations): + # Authority; + auth2 = spmv(self.ptr2_cpu, self.idx2_cpu, self.val2_cpu, hub1) + auth2 = auth2 / np.sum(auth2) + # Hubs + hub2 = spmv(self.ptr_cpu, self.idx_cpu, self.val_cpu, auth1) + hub2 = hub2 / np.sum(hub2) + + auth1 = auth2 + hub1 = hub2 + self.cpu_result = hub1 + auth1 + + cpu_time = System.nanoTime() - start + + # Compare GPU and CPU results; + difference = 0 + for i in range(self.size): + difference += np.abs(self.cpu_result[i] - gpu_result[i]) + + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", str(difference)) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: [" + ", ".join([f"{x:.4f}" for x in self.cpu_result[:10]]) + + "...]; " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_8.py b/projects/resources/python/benchmark/bench/single_gpu/bench_8.py new file mode 100644 index 00000000..ce4ee632 --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_8.py @@ -0,0 +1,564 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +# coding=utf-8 +import polyglot +from java.lang import System +import numpy as np +from random import randint, seed + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D, DEFAULT_BLOCK_SIZE_2D +from benchmark_result import BenchmarkResult + +############################## +############################## + +NUM_THREADS_PER_BLOCK_2D = 8 +NUM_THREADS_PER_BLOCK = 32 +WARP_SIZE = 32 + +GAUSSIAN_BLUR = """ +extern "C" __global__ void gaussian_blur(const float *image, float *result, int rows, int cols, const float* kernel, int diameter) { + extern __shared__ float kernel_local[]; + for(int i = threadIdx.x; i < diameter; i += blockDim.x) { + for(int j = threadIdx.y; j < diameter; j += blockDim.y) { + kernel_local[i * diameter + j] = kernel[i * diameter + j]; + } + } + __syncthreads(); + + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum = 0; + int radius = diameter / 2; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + sum += kernel_local[(x + radius) * diameter + (y + radius)] * image[nx * cols + ny]; + } + } + } + result[i * cols + j] = sum; + } + } +} +""" + + +SOBEL = """ + +extern "C" __global__ void sobel(float *image, float *result, int rows, int cols) { + // int SOBEL_X[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}}; + // int SOBEL_Y[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}; + __shared__ int SOBEL_X[9]; + __shared__ int SOBEL_Y[9]; + if (threadIdx.x == 0 && threadIdx.y == 0) { + SOBEL_X[0] = -1; + SOBEL_X[1] = -2; + SOBEL_X[2] = -1; + SOBEL_X[3] = 0; + SOBEL_X[4] = 0; + SOBEL_X[5] = 0; + SOBEL_X[6] = 1; + SOBEL_X[7] = 2; + SOBEL_X[8] = 1; + + SOBEL_Y[0] = -1; + SOBEL_Y[1] = 0; + SOBEL_Y[2] = 1; + SOBEL_Y[3] = -2; + SOBEL_Y[4] = 0; + SOBEL_Y[5] = 2; + SOBEL_Y[6] = -1; + 
SOBEL_Y[7] = 0; + SOBEL_Y[8] = 1; + } + __syncthreads(); + + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) { + for(int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) { + float sum_gradient_x = 0.0, sum_gradient_y = 0.0; + int radius = 1; + for (int x = -radius; x <= radius; ++x) { + for (int y = -radius; y <= radius; ++y) { + int nx = x + i; + int ny = y + j; + if (nx >= 0 && ny >= 0 && nx < rows && ny < cols) { + float neighbour = image[nx * cols + ny]; + int s = (x + radius) * 3 + y + radius; + sum_gradient_x += SOBEL_X[s] * neighbour; + sum_gradient_y += SOBEL_Y[s] * neighbour; + } + } + } + result[i * cols + j] = sqrt(sum_gradient_x * sum_gradient_x + sum_gradient_y * sum_gradient_y); + } + } +} +""" + +EXTEND_MASK = """ +__device__ float atomicMinf(float* address, float val) +{ + int *address_as_int =(int*)address; + int old = *address_as_int, assumed; + while (val < __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, + __float_as_int(val)); + } + return __int_as_float(old); +} + +__device__ float atomicMaxf(float* address, float val) +{ + int *address_as_int = (int*) address; + int old = *address_as_int, assumed; + // If val is smaller than current, don't do anything, else update the current value atomically; + while (val > __int_as_float(old)) { + assumed = old; + old = atomicCAS(address_as_int, assumed, __float_as_int(val)); + } + return __int_as_float(old); +} + +__inline__ __device__ float warp_reduce_max(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = max(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +__inline__ __device__ float warp_reduce_min(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset)); + return val; +} + +extern "C" __global__ void maximum(float *in, float* 
out, int N) { + int warp_size = 32; + float maximum = -1000; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + maximum = max(maximum, in[i]); + } + maximum = warp_reduce_max(maximum); // Obtain the max of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicMaxf(out, maximum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void minimum(float *in, float* out, int N) { + int warp_size = 32; + float minimum = 1000; + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + minimum = min(minimum, in[i]); + } + minimum = warp_reduce_min(minimum); // Obtain the min of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicMinf(out, minimum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void extend(float *x, const float *minimum, const float *maximum, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = 5 * (x[i] - *minimum) / (*maximum - *minimum); + x[i] = res_tmp > 1 ? 1 : res_tmp; + } +} +""" + +UNSHARPEN = """ +extern "C" __global__ void unsharpen(float *x, float *y, float *res, float amount, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + float res_tmp = x[i] * (1 + amount) - y[i] * amount; + res_tmp = res_tmp > 1 ? 1 : res_tmp; + res[i] = res_tmp < 0 ? 
0 : res_tmp; + } +} +""" + +COMBINE = """ +extern "C" __global__ void combine(const float *x, const float *y, const float *mask, float *res, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + res[i] = x[i] * mask[i] + y[i] * (1 - mask[i]); + } +} +""" + +RESET = """ +extern "C" __global__ void reset(float *x, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + x[i] = 0.0; + } +} +""" + +############################## +############################## + + +class Benchmark8(Benchmark): + """ + Compute an image processing pipeline in which we sharpen an image and combine it + with copies that have been blurred at low and medium frequencies. The result is an image sharper on the edges, + and softer everywhere else: this filter is common, for example, in portrait retouching, where a photographer desires + to enhance the clarity of facial features while smoothing the subject's skin and the background; + + The input is a random square single-channel image with floating-point values between 0 and 1, with side of length size. 
+ + BLUR(image,blur1) ─> SOBEL(blur1,mask1) ───────────────────────────────────────────────────────────────────────────────┐ + BLUR(image,blur2) ─> SOBEL(blur2,mask2) ┬─> MAX(mask2) ──┬─> EXTEND(mask2) ──┐ │ + └─> MIN(mask2) ──┘ │ │ + SHARPEN(image,blur3) ─> UNSHARPEN(image,blur3,sharpened) ────────────────────┴─> COMBINE(sharpened,blur2,mask2,image2) ┴─> COMBINE(image2,blur1,mask1,image3) + """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b8", benchmark, nvprof_profile) + self.size = 0 + + self.image = None + self.image2 = None + self.image3 = None + + self.blurred_small = None + self.mask_small = None + self.kernel_small = None + self.kernel_small_diameter = 3 + self.kernel_small_variance = 1 + + self.blurred_large = None + self.mask_large = None + self.kernel_large = None + self.kernel_large_diameter = 5 + self.kernel_large_variance = 10 + self.maximum = None + self.minimum = None + self.reset = None + + self.blurred_unsharpen = None + self.image_unsharpen = None + self.kernel_unsharpen = None + self.kernel_unsharpen_diameter = 3 + self.kernel_unsharpen_variance = 5 + self.unsharpen_amount = 0.5 + + # self.image_cpu = None + self.kernel_small_cpu = None + self.kernel_large_cpu = None + self.kernel_unsharpen_cpu = None + + self.cpu_result = None + self.gpu_result = None + + self.num_blocks_per_processor = self.num_blocks # 12 # 32 + + self.block_size_1d = DEFAULT_BLOCK_SIZE_1D + self.block_size_2d = DEFAULT_BLOCK_SIZE_2D + + self.gaussian_blur_kernel = None + self.sobel_kernel = None + self.extend_kernel = None + self.unsharpen_kernel = None + self.combine_mask_kernel = None + self.maximum_kernel = None + self.minimum_kernel = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.block_size_1d = block_size["block_size_1d"] + self.block_size_2d = block_size["block_size_2d"] + + # Allocate vectors; + self.image = 
polyglot.eval(language="grcuda", string=f"float[{size * size}]") + self.image2 = polyglot.eval(language="grcuda", string=f"float[{size}][{size}]") + self.image3 = polyglot.eval(language="grcuda", string=f"float[{size}][{size}]") + + self.kernel_small = polyglot.eval(language="grcuda", string=f"float[{self.kernel_small_diameter}][{self.kernel_small_diameter}]") + self.kernel_large = polyglot.eval(language="grcuda", string=f"float[{self.kernel_large_diameter}][{self.kernel_large_diameter}]") + self.kernel_unsharpen = polyglot.eval(language="grcuda", string=f"float[{self.kernel_unsharpen_diameter}][{self.kernel_unsharpen_diameter}]") + self.maximum = polyglot.eval(language="grcuda", string=f"float[1]") + self.minimum = polyglot.eval(language="grcuda", string=f"float[1]") + + self.mask_small = polyglot.eval(language="grcuda", string=f"float[{size}][{size}]") + self.mask_large = polyglot.eval(language="grcuda", string=f"float[{size}][{size}]") + self.image_unsharpen = polyglot.eval(language="grcuda", string=f"float[{size}][{size}]") + + self.blurred_small = polyglot.eval(language="grcuda", string=f"float[{size}][{size}]") + self.blurred_large = polyglot.eval(language="grcuda", string=f"float[{size}][{size}]") + self.blurred_unsharpen = polyglot.eval(language="grcuda", string=f"float[{size}][{size}]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.gaussian_blur_kernel = build_kernel(GAUSSIAN_BLUR, "gaussian_blur", "const pointer, pointer, sint32, sint32, const pointer, sint32") + self.sobel_kernel = build_kernel(SOBEL, "sobel", "pointer, pointer, sint32, sint32") + self.extend_kernel = build_kernel(EXTEND_MASK, "extend", "pointer, const pointer, const pointer, sint32") + self.maximum_kernel = build_kernel(EXTEND_MASK, "maximum", "const pointer, pointer, sint32") + self.minimum_kernel = build_kernel(EXTEND_MASK, "minimum", "const pointer, pointer, sint32") + self.unsharpen_kernel = build_kernel(UNSHARPEN, 
"unsharpen", "pointer, pointer, pointer, float, sint32") + self.combine_mask_kernel = build_kernel(COMBINE, "combine", "const pointer, const pointer, const pointer, pointer, sint32") + self.reset_kernel = build_kernel(RESET, "reset", "pointer, sint32") + self.initialize_rand = polyglot.eval(language="js", string="x => { for (let i = 0; i < x.length; i++) { x[i] = Math.random() }}") + + @time_phase("initialization") + def init(self): + + def gaussian_kernel(diameter, sigma): + kernel = np.zeros((diameter, diameter)) + mean = diameter / 2 + sum_tmp = 0 + for x in range(diameter): + for y in range(diameter): + kernel[x, y] = np.exp(-0.5 * ((x - mean) ** 2 + (y - mean) ** 2) / sigma ** 2) + sum_tmp += kernel[x, y] + for x in range(diameter): + for y in range(diameter): + kernel[x, y] /= sum_tmp + return kernel + + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + + # Create a random image; + self.initialize_rand(self.image) + self.gpu_result = [[0.0] * self.size for _ in range(self.size)] + # self.image_cpu = np.random.rand(self.size, self.size).astype(np.float32) # Create here the image used for validation; + # self.image.copyFrom(int(np.int64(self.image_cpu.ctypes.data)), len(self.image)) + # self.gpu_result = np.zeros((self.size, self.size)) + self.kernel_small_cpu = gaussian_kernel(self.kernel_small_diameter, self.kernel_small_variance) + self.kernel_large_cpu = gaussian_kernel(self.kernel_large_diameter, self.kernel_large_variance) + self.kernel_unsharpen_cpu = gaussian_kernel(self.kernel_unsharpen_diameter, self.kernel_unsharpen_variance) + for i in range(self.kernel_small_diameter): + for j in range(self.kernel_small_diameter): + self.kernel_small[i][j] = float(self.kernel_small_cpu[i, j]) + for i in range(self.kernel_large_diameter): + for j in range(self.kernel_large_diameter): + self.kernel_large[i][j] = float(self.kernel_large_cpu[i, j]) + for i in range(self.kernel_unsharpen_diameter): + for j in range(self.kernel_unsharpen_diameter): + 
self.kernel_unsharpen[i][j] = float(self.kernel_unsharpen_cpu[i, j]) + + @time_phase("reset_result") + def reset_result(self) -> None: + # for i in range(self.size): + # for j in range(self.size): + # self.image3[i][j] = 0.0 + # self.image3.copyFrom(int(np.int64(self.image_cpu.ctypes.data)), len(self.image3)) + self.maximum[0] = 0.0 + self.minimum[0] = 0.0 + + def execute(self) -> object: + self.block_size_1d = self._block_size["block_size_1d"] + self.block_size_2d = self._block_size["block_size_2d"] + self.num_blocks_per_processor = self.num_blocks # 12 # 32 + a = self.num_blocks_per_processor / 2 + + start_comp = System.nanoTime() + start = 0 + + self.reset_kernel(self.num_blocks_per_processor, self.block_size_1d)(self.image3, 0) + + self.reset_kernel((a, a), (self.block_size_2d, self.block_size_2d))(self.image3, 0) + + # Blur - Small; + self.execute_phase("blur_small", + self.gaussian_blur_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * self.kernel_small_diameter**2), + self.image, self.blurred_small, self.size, self.size, self.kernel_small, self.kernel_small_diameter) + + # Blur - Large; + self.execute_phase("blur_large", + self.gaussian_blur_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * self.kernel_large_diameter**2), + self.image, self.blurred_large, self.size, self.size, self.kernel_large, self.kernel_large_diameter) + + # Blur - Unsharpen; + self.execute_phase("blur_unsharpen", + self.gaussian_blur_kernel((a, a), (self.block_size_2d, self.block_size_2d), 4 * self.kernel_unsharpen_diameter**2), + self.image, self.blurred_unsharpen, self.size, self.size, self.kernel_unsharpen, self.kernel_unsharpen_diameter) + + # Sobel filter (edge detection); + self.execute_phase("sobel_small", + self.sobel_kernel((a, a), (self.block_size_2d, self.block_size_2d)), + self.blurred_small, self.mask_small, self.size, self.size) + + self.execute_phase("sobel_large", + self.sobel_kernel((a, a), (self.block_size_2d, self.block_size_2d)), + 
self.blurred_large, self.mask_large, self.size, self.size) + + # Extend large edge detection mask; + self.execute_phase("maximum", + self.maximum_kernel(self.num_blocks_per_processor, self.block_size_1d), self.mask_large, self.maximum, self.size**2) + self.execute_phase("minimum", + self.minimum_kernel(self.num_blocks_per_processor, self.block_size_1d), self.mask_large, self.minimum, self.size**2) + self.execute_phase("extend", + self.extend_kernel(self.num_blocks_per_processor, self.block_size_1d), self.mask_large, self.minimum, self.maximum, self.size**2) + + # Unsharpen; + self.execute_phase("unsharpen", + self.unsharpen_kernel(self.num_blocks_per_processor, self.block_size_1d), + self.image, self.blurred_unsharpen, self.image_unsharpen, self.unsharpen_amount, self.size * self.size) + + # Combine results; + self.execute_phase("combine", + self.combine_mask_kernel(self.num_blocks_per_processor, self.block_size_1d), + self.image_unsharpen, self.blurred_large, self.mask_large, self.image2, self.size * self.size) + self.execute_phase("combine_2", + self.combine_mask_kernel(self.num_blocks_per_processor, self.block_size_1d), + self.image2, self.blurred_small, self.mask_small, self.image3, self.size * self.size) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + tmp = self.image3[0][0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + + # Compute GPU result; + # for i in range(self.size): + # for j in range(self.size): + # self.gpu_result[i, j] = self.image3[i][j] + + self.gpu_result = sum(self.image3[-1]) + + self.benchmark.add_to_benchmark("gpu_result", 0) + if self.benchmark.debug: + BenchmarkResult.log_message( + f"\tgpu result: [" + ", ".join([f"{x:.4f}" for x in self.image3[0][:10]]) + "...]") + + return self.gpu_result + + def 
cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + sobel_filter_diameter = 3 + sobel_filter_x = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]]) + sobel_filter_y = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) + + def sobel_filter(image): + out = np.zeros(image.shape) + rows, cols = image.shape + radius = sobel_filter_diameter // 2 + + for i in range(rows): + for j in range(cols): + sum_gradient_x = 0 + sum_gradient_y = 0 + for x in range(-radius, radius + 1): + for y in range(-radius, radius + 1): + nx = x + i + ny = y + j + if (nx >= 0 and ny >= 0 and nx < rows and ny < cols): + gray_value_neigh = image[nx, ny] + gradient_x = sobel_filter_x[x + radius][y + radius] + gradient_y = sobel_filter_y[x + radius][y + radius] + sum_gradient_x += gray_value_neigh * gradient_x + sum_gradient_y += gray_value_neigh * gradient_y + out[i, j] = np.sqrt(sum_gradient_x ** 2 + sum_gradient_y ** 2) + return out + + def gaussian_blur(image, kernel): + out = np.zeros(image.shape) + rows, cols = image.shape + + # Blur radius; + diameter = kernel.shape[0] + radius = diameter // 2 + + # Flatten image and kernel; + image_1d = image.reshape(-1) + kernel_1d = kernel.reshape(-1) + + for i in range(rows): + for j in range(cols): + sum_tmp = 0 + for x in range(-radius, radius + 1): + for y in range(-radius, radius + 1): + nx = x + i + ny = y + j + if (nx >= 0 and ny >= 0 and nx < rows and ny < cols): + sum_tmp += kernel_1d[(x + radius) * diameter + (y + radius)] * image_1d[nx * cols + ny] + out[i, j] = sum_tmp + return out + + def normalize(image): + return (image - np.min(image)) / (np.max(image) - np.min(image)) + + def truncate(image, minimum=0, maximum=1): + out = image.copy() + out[out < minimum] = minimum + out[out > maximum] = maximum + return out + + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + + image_cpu = np.zeros((self.size, self.size)) + for i in range(self.size): + for j in range(self.size): + 
image_cpu[i, j] = self.image[i * self.size + j] + + # Part 1: Small blur on medium frequencies; + blurred_small = gaussian_blur(image_cpu, self.kernel_small_cpu) + edges_small = sobel_filter(blurred_small) + + # Part 2: High blur on low frequencies; + blurred_large = gaussian_blur(image_cpu, self.kernel_large_cpu) + edges_large = sobel_filter(blurred_large) + # Extend mask to cover a larger area; + edges_large = normalize(edges_large) * 5 + edges_large[edges_large > 1] = 1 + + # Part 3: Sharpen image; + unsharpen = gaussian_blur(image_cpu, self.kernel_unsharpen_cpu) + amount = 0.5 + sharpened = truncate(image_cpu * (1 + amount) - unsharpen * amount) + + # Part 4: Merge sharpened image and low frequencies; + image2 = normalize(sharpened * edges_large + blurred_large * (1 - edges_large)) + + # Part 5: Merge image and medium frequencies; + self.cpu_result = image2 * edges_small + blurred_small * (1 - edges_small) + + cpu_time = System.nanoTime() - start + + # Compare GPU and CPU results; + difference = sum(self.cpu_result[-1, :]) - gpu_result + # difference = 0 + # for i in range(self.size): + # for j in range(self.size): + # difference += np.abs(self.cpu_result[i, j] - gpu_result[i, j]) + + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", str(difference)) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: [" + ", ".join([f"{x:.4f}" for x in self.cpu_result[0, :10]]) + + "...]; " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") diff --git a/projects/resources/python/benchmark/bench/single_gpu/bench_9.py b/projects/resources/python/benchmark/bench/single_gpu/bench_9.py new file mode 100644 index 00000000..cfbaf7ea --- /dev/null +++ b/projects/resources/python/benchmark/bench/single_gpu/bench_9.py @@ -0,0 +1,497 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +# coding=utf-8 +import polyglot +from java.lang import System +import numpy as np +from random import random, randint, seed, sample + +from benchmark import Benchmark, time_phase, DEFAULT_BLOCK_SIZE_1D +from benchmark_result import BenchmarkResult + +############################## +############################## + +NUM_THREADS_PER_BLOCK = 32 +THREADS_PER_VECTOR = 4 +MAX_NUM_VECTORS_PER_BLOCK = 1024 / THREADS_PER_VECTOR + +SPMV_KERNEL = """ +extern "C" __global__ void spmv(const int *ptr, const int *idx, const int *val, const float *vec, float *res, int num_rows, int num_nnz) { + + for(int n = blockIdx.x * blockDim.x + threadIdx.x; n < num_rows; n += blockDim.x * gridDim.x) { + float sum = 0; + for (int i = ptr[n]; i < ptr[n + 1]; i++) { + sum += val[i] * vec[idx[i]]; + } + res[n] = sum; + } +} + +extern "C" __global__ void spmv2(int* cudaRowCounter, int* d_ptr, int* d_cols, float* d_val, float* d_vector, float* d_out, int N) { + int i; + int thread_per_vector = %d; + float sum; + int row; + int rowStart, rowEnd; + int laneId = threadIdx.x %% thread_per_vector; //lane index in the vector + int vectorId = threadIdx.x / thread_per_vector; //vector index in the thread block + int warpLaneId = threadIdx.x & 31; //lane index in the warp + int warpVectorId = warpLaneId / thread_per_vector; //vector index in the warp + + __shared__ volatile int space[%d][2]; + + // Get the row index + if (warpLaneId == 0) { + row = atomicAdd(cudaRowCounter, 32 / thread_per_vector); + } + // Broadcast the value to other threads in the same warp and compute the row index of each vector + row = __shfl_sync(0xffffffff, row, 0) + warpVectorId; + + while (row < N) { + + // Use two threads to fetch the row offset + if (laneId < 2) { + space[vectorId][laneId] = d_ptr[row + laneId]; + } + rowStart = space[vectorId][0]; + rowEnd = space[vectorId][1]; + + sum = 0; + // Compute dot product + if (thread_per_vector == 32) { + + // Ensure aligned memory access + i = rowStart - (rowStart & 
(thread_per_vector - 1)) + laneId; + + // Process the unaligned part + if (i >= rowStart && i < rowEnd) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + + // Process the aligned part + for (i += thread_per_vector; i < rowEnd; i += thread_per_vector) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + } else { + for (i = rowStart + laneId; i < rowEnd; i += thread_per_vector) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + } + // Intra-vector reduction + for (i = thread_per_vector >> 1; i > 0; i >>= 1) { + sum += __shfl_down_sync(0xffffffff,sum, i); + } + + // Save the results + if (laneId == 0) { + d_out[row] = sum; + } + + // Get a new row index + if(warpLaneId == 0) { + row = atomicAdd(cudaRowCounter, 32 / thread_per_vector); + } + // Broadcast the row index to the other threads in the same warp and compute the row index of each vector + row = __shfl_sync(0xffffffff,row, 0) + warpVectorId; + } +} + +// Compute d_out = y + alpha * A * d_vector; +extern "C" __global__ void spmv_full(int* cudaRowCounter, int* d_ptr, int* d_cols, float* d_val, float* d_vector, float* d_out, int N, float alpha, float* y) { + int i; + int thread_per_vector = %d; + float sum; + int row; + int rowStart, rowEnd; + int laneId = threadIdx.x %% thread_per_vector; //lane index in the vector + int vectorId = threadIdx.x / thread_per_vector; //vector index in the thread block + int warpLaneId = threadIdx.x & 31; //lane index in the warp + int warpVectorId = warpLaneId / thread_per_vector; //vector index in the warp + + __shared__ volatile int space[%d][2]; + + // Get the row index + if (warpLaneId == 0) { + row = atomicAdd(cudaRowCounter, 32 / thread_per_vector); + } + // Broadcast the value to other threads in the same warp and compute the row index of each vector + row = __shfl_sync(0xffffffff, row, 0) + warpVectorId; + + while (row < N) { + + // Use two threads to fetch the row offset + if (laneId < 2) { + space[vectorId][laneId] = d_ptr[row + laneId]; + } + rowStart = space[vectorId][0]; + 
rowEnd = space[vectorId][1]; + + sum = 0; + // Compute dot product + if (thread_per_vector == 32) { + + // Ensure aligned memory access + i = rowStart - (rowStart & (thread_per_vector - 1)) + laneId; + + // Process the unaligned part + if (i >= rowStart && i < rowEnd) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + + // Process the aligned part + for (i += thread_per_vector; i < rowEnd; i += thread_per_vector) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + } else { + for (i = rowStart + laneId; i < rowEnd; i += thread_per_vector) { + sum += d_val[i] * d_vector[d_cols[i]]; + } + } + // Intra-vector reduction + for (i = thread_per_vector >> 1; i > 0; i >>= 1) { + sum += __shfl_down_sync(0xffffffff,sum, i); + } + + // Save the results + if (laneId == 0) { + d_out[row] = y[row] + alpha * sum; + } + + // Get a new row index + if(warpLaneId == 0) { + row = atomicAdd(cudaRowCounter, 32 / thread_per_vector); + } + // Broadcast the row index to the other threads in the same warp and compute the row index of each vector + row = __shfl_sync(0xffffffff,row, 0) + warpVectorId; + } +} +""" % (THREADS_PER_VECTOR, MAX_NUM_VECTORS_PER_BLOCK, THREADS_PER_VECTOR, MAX_NUM_VECTORS_PER_BLOCK) + +SUM_KERNEL = """ +__inline__ __device__ float warp_reduce(float val) { + int warp_size = 32; + for (int offset = warp_size / 2; offset > 0; offset /= 2) + val += __shfl_down_sync(0xFFFFFFFF, val, offset); + return val; +} + +extern "C" __global__ void vector_norm(const float *x, float* z, int N) { + int warp_size = 32; + float sum = float(0); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] * x[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} + +extern "C" __global__ void dot_product(const float *x, const float *y, float* z, int N) { 
+ int warp_size = 32; + float sum = float(0); + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { + sum += x[i] * y[i]; + } + sum = warp_reduce(sum); // Obtain the sum of values in the current warp; + if ((threadIdx.x & (warp_size - 1)) == 0) // Same as (threadIdx.x % warp_size) == 0 but faster + atomicAdd(z, sum); // The first thread in the warp updates the output; +} +""" + +SAXPY_KERNEL = """ +// Compute y = val + alpha * x; +extern "C" __global__ void saxpy(float* y, float *val, float *x, float alpha, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = val[i] + alpha * x[i]; + } +} + +extern "C" __global__ void cpy(float *y, const float *x, int n) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { + y[i] = x[i]; + } +} +""" + +############################## +############################## + + +class Benchmark9(Benchmark): + """ + Compute the conjugate gradient algorithm on a sparse symmetric matrix. 
+ """ + + def __init__(self, benchmark: BenchmarkResult, nvprof_profile: bool = False): + super().__init__("b9", benchmark, nvprof_profile) + self.size = 0 + self.num_nnz = 0 + self.max_degree = 3 # Each row has 3 nnz entries (not counting symmetric entries); + self.num_iterations = 4 + self.ptr = None + self.idx = None + self.val = None + + self.x = None + self.b = None + self.p = None + self.r = None + self.t1 = None + self.t2 = None + + self.ptr_cpu = None + self.idx_cpu = None + self.val_cpu = None + self.b_cpu = None + + self.cpu_result = None + self.gpu_result = None + + self.num_blocks_size = 32 + self.block_size = None + + self.vspmv_kernel = None + self.spmv_kernel = None + self.norm_kernel = None + self.saxpy_kernel = None + + self.row_cnt_1 = None + self.row_cnt_2 = None + self.row_cnt_3 = None + + @time_phase("allocation") + def alloc(self, size: int, block_size: dict = None) -> None: + self.size = size + self.block_size = block_size["block_size_1d"] + + self.gpu_result = np.zeros(self.size) + + # Create a random symmetric COO matrix; + self.random_seed = randint(0, 10000000) + seed(self.random_seed) + + # Create a random COO symmetric matrix; + t = [(0,0,0)] * self.size * self.max_degree * 2 + for i in range(self.size): + # Create max_degree random edges; + edges = sample(range(0, self.size), self.max_degree) + for j, e in enumerate(edges): + while i == e: + e = randint(0, self.size - 1) + tmp = random() + t[i * self.max_degree + j] = (i, e, tmp) + t[i * self.max_degree + j + self.size * self.max_degree] = (e, i, tmp) + + x, self.idx_cpu, self.val_cpu = zip(*sorted(t, key=lambda l: (l[0], l[1]))) + self.num_nnz = len(self.idx_cpu) + self.ptr_cpu = [0] * (self.size + 1) + for i, x_i in enumerate(x): + self.ptr_cpu[x_i + 1] += 1 + for i in range(len(self.ptr_cpu) - 1): + self.ptr_cpu[i + 1] += self.ptr_cpu[i] + + self.b_cpu = [0] * self.size + + # Allocate vectors; + self.ptr = polyglot.eval(language="grcuda", string=f"int[{self.size + 1}]") + self.idx = 
polyglot.eval(language="grcuda", string=f"int[{self.num_nnz}]") + self.val = polyglot.eval(language="grcuda", string=f"float[{self.num_nnz}]") + + self.x = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.p = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.r = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.b = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.y = polyglot.eval(language="grcuda", string=f"float[{size}]") + self.t1 = polyglot.eval(language="grcuda", string=f"float[1]") + self.t2 = polyglot.eval(language="grcuda", string=f"float[1]") + + self.row_cnt_1 = polyglot.eval(language="grcuda", string=f"int[1]") + self.row_cnt_2 = polyglot.eval(language="grcuda", string=f"int[1]") + self.row_cnt_3 = polyglot.eval(language="grcuda", string=f"int[1]") + + # Build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + self.spmv_kernel = build_kernel(SPMV_KERNEL, "spmv2", "pointer, pointer, pointer, pointer, pointer, pointer, sint32") + self.spmv_full_kernel = build_kernel(SPMV_KERNEL, "spmv_full", "pointer, pointer, pointer, pointer, pointer, pointer, sint32, float, pointer") + self.norm_kernel = build_kernel(SUM_KERNEL, "vector_norm", "const pointer, pointer, sint32") + self.dp_kernel = build_kernel(SUM_KERNEL, "dot_product", "const pointer, const pointer, pointer, sint32") + self.saxpy_kernel = build_kernel(SAXPY_KERNEL, "saxpy", "pointer, const pointer, const pointer, float, sint32") + self.cpy_kernel = build_kernel(SAXPY_KERNEL, "cpy", "pointer, const pointer, sint32") + + @time_phase("initialization") + def init(self): + for i in range(len(self.ptr_cpu)): + self.ptr[i] = self.ptr_cpu[i] + for i in range(len(self.idx_cpu)): + self.idx[i] = self.idx_cpu[i] + self.val[i] = self.val_cpu[i] + for i in range(len(self.b)): + self.b_cpu[i] = random() + self.b[i] = self.b_cpu[i] + + @time_phase("reset_result") + def reset_result(self) -> None: + seed(self.random_seed) + 
# Random initial solution; + for i in range(self.size): + self.x[i] = 1.0 + self.t1[0] = 0.0 + self.t2[0] = 0.0 + self.row_cnt_1[0] = 0 + self.row_cnt_2[0] = 0 + + def execute(self) -> object: + num_blocks_spmv = int(np.ceil(self.size / self.block_size)) + start_comp = System.nanoTime() + start = 0 + alpha = 0 + self.t1[0] = 0 + + # Initialization phase; + # r = b - A * x + self.execute_phase("spmv_init", self.spmv_full_kernel(num_blocks_spmv, self.block_size, 4 * self.block_size), + self.row_cnt_1, self.ptr, self.idx, self.val, self.x, self.r, self.size, -1, self.b) + # p = r + self.execute_phase("cpy_init", self.cpy_kernel(self.num_blocks_size, self.block_size), self.p, self.r, self.size) + # t1 = r^t * r + self.execute_phase("norm_init", self.norm_kernel(self.num_blocks_size, self.block_size), self.r, self.t1, self.size) + + for i in range(self.num_iterations): + # t2 = p^t * A * p + self.execute_phase(f"spmv_{i}", self.spmv_kernel(num_blocks_spmv, self.block_size, 4 * self.block_size), + self.row_cnt_2, self.ptr, self.idx, self.val, self.p, self.y, self.size) + self.t2[0] = 0 + self.execute_phase(f"dp_{i}", self.dp_kernel(self.num_blocks_size, self.block_size), self.p, self.y, self.t2, self.size) + + if self.time_phases: + start = System.nanoTime() + alpha = self.t1[0] / self.t2[0] + old_r_norm_squared = self.t1[0] + self.t1[0] = 0 + self.row_cnt_1[0] = 0.0 + self.row_cnt_2[0] = 0.0 + if self.time_phases: + end = System.nanoTime() + self.benchmark.add_phase({"name": f"alpha_{i}", "time_sec": (end - start) / 1_000_000_000}) + + # Update x: x = x + alpha * p + self.execute_phase(f"saxpy_x_{i}", self.saxpy_kernel(self.num_blocks_size, self.block_size), + self.x, self.x, self.p, alpha, self.size) + # r = r - alpha * y + self.execute_phase(f"saxpy_r_{i}", self.saxpy_kernel(self.num_blocks_size, self.block_size), + self.r, self.r, self.y, -1 * alpha, self.size) + # t1 = r^t * r + self.execute_phase(f"norm_{i}", self.norm_kernel(self.num_blocks_size, self.block_size), 
self.r, self.t1, self.size) + + if self.time_phases: + start = System.nanoTime() + beta = self.t1[0] / old_r_norm_squared + if self.time_phases: + end = System.nanoTime() + self.benchmark.add_phase({"name": f"beta_{i}", "time_sec": (end - start) / 1_000_000_000}) + + self.execute_phase(f"saxpy_p_{i}", self.saxpy_kernel(self.num_blocks_size, self.block_size), + self.p, self.r, self.p, beta, self.size) + + # Add a final sync step to measure the real computation time; + if self.time_phases: + start = System.nanoTime() + tmp1 = self.x[0] + end = System.nanoTime() + if self.time_phases: + self.benchmark.add_phase({"name": "sync", "time_sec": (end - start) / 1_000_000_000}) + self.benchmark.add_computation_time((end - start_comp) / 1_000_000_000) + # Compute GPU result; + for i in range(self.size): + self.gpu_result[i] = self.x[i] + + self.benchmark.add_to_benchmark("gpu_result", 0) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tgpu result: [" + ", ".join([f"{x:.4f}" for x in self.gpu_result[:10]]) + "...]") + + return self.gpu_result + + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + + def spmv(ptr, idx, val, vec): + res = np.zeros(len(ptr) - 1) + for i in range(len(ptr) - 1): + curr_sum = 0 + start = int(ptr[i]) + end = int(ptr[i + 1]) + for j in range(start, end): + curr_sum += val[j] * vec[idx[j]] + res[i] = curr_sum + return res + + # Recompute the CPU result only if necessary; + start = System.nanoTime() + if self.current_iter == 0 or reinit: + # Re-initialize the random number generator with the same seed as the GPU to generate the same values; + seed(self.random_seed) + # Initialize the support device arrays; + N = self.size + + x = np.ones(N) + # r = b - A * x + r = np.array(self.b_cpu) - np.array(spmv(self.ptr_cpu, self.idx_cpu, self.val_cpu, x)) + p = r.copy() + t1 = r.T.dot(r) + + # Main iteration; + for i in range(self.num_iterations): + y = spmv(self.ptr_cpu, self.idx_cpu, self.val_cpu, p) + t2 = p.dot(y) + alpha = t1 / 
t2 + t1_old = t1 + x += alpha * p + r -= alpha * y + t1 = r.T.dot(r) + beta = t1 / t1_old + p = r + beta * p + + self.cpu_result = x + + cpu_time = System.nanoTime() - start + + # Compare GPU and CPU results; + difference = 0 + for i in range(self.size): + difference += np.abs(self.cpu_result[i] - gpu_result[i]) + + self.benchmark.add_to_benchmark("cpu_time_sec", cpu_time) + self.benchmark.add_to_benchmark("cpu_gpu_res_difference", str(difference)) + if self.benchmark.debug: + BenchmarkResult.log_message(f"\tcpu result: [" + ", ".join([f"{x:.4f}" for x in self.cpu_result[:10]]) + + "...]; " + + f"difference: {difference:.4f}, time: {cpu_time:.4f} sec") + + diff --git a/projects/resources/python/benchmark/benchmark.py b/projects/resources/python/benchmark/benchmark.py new file mode 100644 index 00000000..e2fbbf98 --- /dev/null +++ b/projects/resources/python/benchmark/benchmark.py @@ -0,0 +1,212 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+ +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +from benchmark_result import BenchmarkResult +from abc import ABC, abstractmethod +from java.lang import System +from typing import Callable +import polyglot + +DEFAULT_BLOCK_SIZE_1D = 32 +DEFAULT_BLOCK_SIZE_2D = 8 +DEFAULT_NUM_BLOCKS = 64 # GTX 960, 8 SM +DEFAULT_NUM_BLOCKS = 448 # P100, 56 SM +DEFAULT_NUM_BLOCKS = 176 # GTX 1660 Super, 22 SM + +def time_phase(phase_name: str) -> Callable: + """ + Decorator that simplifies timing a function call and storing the result in the benchmark log; + :param phase_name: name of the benchmark phase + :return: the output of the wrapped function + """ + def inner_func(func) -> Callable: + def func_call(self, *args, **kwargs) -> object: + start = System.nanoTime() + result = func(self, *args, **kwargs) + end = System.nanoTime() + self.benchmark.add_phase({"name": phase_name, "time_sec": (end - start) / 1_000_000_000}) + return result + return func_call + return inner_func + + +class Benchmark(ABC): + """ + Base class for all benchmarks, it provides the general control flow of the benchmark execution; + :param name: name of the benchmark + :param benchmark: instance of BenchmarkResult, used to store results + :param nvprof_profile: if present activate profiling for 
nvprof when running the benchmark + """ + + def __init__(self, name: str, benchmark: BenchmarkResult, nvprof_profile: bool = False): + self.name = name + self.benchmark = benchmark + self.nvprof_profile = nvprof_profile + self.time_phases = False + self.tot_iter = 0 + self.current_iter = 0 + self.random_seed = 42 # Default random seed, it will be overwritten with a random one; + self.block_size_1d = DEFAULT_BLOCK_SIZE_1D + self.block_size_2d = DEFAULT_BLOCK_SIZE_2D + self.num_blocks = DEFAULT_NUM_BLOCKS + self._block_size = {} + + @abstractmethod + def alloc(self, size: int, block_size: dict = None) -> None: + """ + Allocate new memory on GPU used for the benchmark; + :param size: base factor used in the memory allocation, e.g. size of each array + :param block_size: optional dictionary containing block size for 1D and 2D kernels + """ + pass + + @abstractmethod + def init(self) -> None: + """ + Initialize the content of the input data of the benchmark; + """ + pass + + @abstractmethod + def reset_result(self) -> None: + """ + Reset the values that hold the GPU result + """ + pass + + @abstractmethod + def cpu_validation(self, gpu_result: object, reinit: bool) -> None: + """ + Run an equivalent benchmark on CPU to obtain the correct result of the benchmark, + and compute the distance w.r.t. 
the GPU result;
+        :param gpu_result: the output of the GPU computation
+        :param reinit: if the GPU data was re-initialized in this computation
+        """
+        pass
+
+    @abstractmethod
+    def execute(self) -> object:
+        """
+        Execute the main computation of this benchmark;
+        :return: the result of the GPU computation; it can be a scalar numeric value or an arbitrary data structure
+        """
+        pass
+
+    def execute_phase(self, phase_name, function, *args) -> object:
+        """
+        Execute a single step of the benchmark, possibly measuring the time it takes;
+        :param phase_name: name of this benchmark step
+        :param function: a function to execute
+        :param args: arguments of the function
+        :return: the result of the function
+        """
+        if self.time_phases:
+            start = System.nanoTime()
+            res = function(*args)
+            end = System.nanoTime()
+            self.benchmark.add_phase({"name": phase_name, "time_sec": (end - start) / 1_000_000_000})
+            return res
+        else:
+            return function(*args)
+
+    def run(self, num_iter: int, size: int, number_of_gpus: int,
+            block_size: dict, exec_policy: str,
+            dependency_policy: str, new_stream_policy: str, parent_stream_policy: str,
+            device_selection: str, mem_advise: str, prefetch: str,
+            stream_attach: str, timing: bool,
+            time_phases: bool, realloc: bool, reinit: bool, prevent_reinit=False, number_of_blocks=DEFAULT_NUM_BLOCKS) -> None:
+
+        # Fix missing block size;
+        if "block_size_1d" not in block_size:
+            block_size["block_size_1d"] = DEFAULT_BLOCK_SIZE_1D
+        if "block_size_2d" not in block_size:
+            block_size["block_size_2d"] = DEFAULT_BLOCK_SIZE_2D
+        if number_of_blocks:
+            self.num_blocks = number_of_blocks
+
+        self.benchmark.start_new_benchmark(
+            name=self.name,
+            size=size,
+            number_of_gpus=number_of_gpus,
+            block_size=block_size,
+            num_blocks=number_of_blocks,
+            exec_policy=exec_policy,
+            dependency_policy=dependency_policy,
+            new_stream_policy=new_stream_policy,
+            parent_stream_policy=parent_stream_policy,
+            device_selection=device_selection,
+            mem_advise=mem_advise,
+            
prefetch=prefetch, + stream_attach=stream_attach, + timing=timing, + realloc=realloc, + reinit=reinit, + iteration=num_iter, + time_phases=time_phases) + self.current_iter = num_iter + self.time_phases = time_phases + self._block_size = block_size + # TODO: set the execution policy; + + # Start a timer to monitor the total GPU execution time; + start = System.nanoTime() + + # Allocate memory for the benchmark; + if (num_iter == 0 or realloc) and not prevent_reinit: + self.alloc(size, block_size) + # Initialize memory for the benchmark; + if (num_iter == 0 or reinit) and not prevent_reinit: + self.init() + + # Reset the result; + self.reset_result() + + # Start nvprof profiling if required; + if self.nvprof_profile: + polyglot.eval(language="grcuda", string="cudaProfilerStart")() + + # Execute the benchmark; + gpu_result = self.execute() + + # Stop nvprof profiling if required; + if self.nvprof_profile: + polyglot.eval(language="grcuda", string="cudaProfilerStop")() + + # Stop the timer; + end = System.nanoTime() + self.benchmark.add_total_time((end - start) / 1_000_000_000) + + # Perform validation on CPU; + if self.benchmark.cpu_validation: + self.cpu_validation(gpu_result, reinit) + + # Write to file the current result; + self.benchmark.save_to_file() + # Book-keeping; + self.tot_iter += 1 diff --git a/projects/resources/python/benchmark/benchmark_main.py b/projects/resources/python/benchmark/benchmark_main.py new file mode 100644 index 00000000..8941796b --- /dev/null +++ b/projects/resources/python/benchmark/benchmark_main.py @@ -0,0 +1,264 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
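The `execute_phase` helper defined above wraps each benchmark step with nanosecond timing via `System.nanoTime`, a Java API reached through GraalVM interop. A minimal standalone sketch of the same pattern in plain CPython, with `time.perf_counter_ns` standing in for `System.nanoTime` and a plain list standing in for the `BenchmarkResult` phase log (names hypothetical):

```python
import time

def execute_phase(phases, phase_name, function, *args):
    # Time a single benchmark step and record its duration in seconds,
    # mirroring Benchmark.execute_phase when time_phases is enabled.
    start = time.perf_counter_ns()
    res = function(*args)
    end = time.perf_counter_ns()
    phases.append({"name": phase_name, "time_sec": (end - start) / 1_000_000_000})
    return res

phases = []
total = execute_phase(phases, "sum", sum, range(1_000_000))
# total == 499999500000; phases holds one {"name", "time_sec"} entry
```

Returning the wrapped function's result makes the timing transparent to callers, so the same call site works whether or not phase timing is enabled.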
+ +import argparse +from distutils.util import strtobool + +from bench.single_gpu.bench_1 import Benchmark1 +from bench.single_gpu.bench_2 import Benchmark2 +from bench.single_gpu.bench_3 import Benchmark3 +from bench.single_gpu.bench_4 import Benchmark4 +from bench.single_gpu.bench_5 import Benchmark5 +from bench.single_gpu.bench_6 import Benchmark6 +from bench.single_gpu.bench_7 import Benchmark7 +from bench.single_gpu.bench_8 import Benchmark8 +from bench.single_gpu.bench_9 import Benchmark9 +from bench.single_gpu.bench_10 import Benchmark10 +from bench.multi_gpu.bench_1 import Benchmark1M +from bench.multi_gpu.bench_5 import Benchmark5M +from bench.multi_gpu.bench_6 import Benchmark6M +from bench.multi_gpu.bench_9 import Benchmark9M +from bench.multi_gpu.bench_11 import Benchmark11M +from benchmark_result import BenchmarkResult + +############################## +############################## + +# Benchmark settings; +benchmarks = { + # Single GPU; + "b1": Benchmark1, + "b2": Benchmark2, + "b3": Benchmark3, + "b4": Benchmark4, + "b5": Benchmark5, + "b6": Benchmark6, + "b7": Benchmark7, + "b8": Benchmark8, + "b9": Benchmark9, + "b10": Benchmark10, + # Multi GPU; + "b1m": Benchmark1M, + "b5m": Benchmark5M, + "b6m": Benchmark6M, + "b9m": Benchmark9M, + "b11m": Benchmark11M, +} + +num_elem = { + # Single GPU; + "b1": [100], + "b2": [100], + "b3": [100], + "b4": [100], + "b5": [100], + "b6": [100], + "b7": [100], + "b8": [100], + "b9": [100], + "b10": [100], + # Multi GPU; + "b1m": [100], + "b5m": [100], + "b6m": [100], + "b9m": [100], + "b11m": [100], +} + +policies = { + # Single GPU; + "b1": ["async"], + "b2": ["async"], + "b3": ["async"], + "b4": ["async"], + "b5": ["async"], + "b6": ["async"], + "b7": ["async"], + "b8": ["async"], + "b9": ["async"], + "b10": ["async"], + # Multi GPU; + "b1m": ["async"], + "b5m": ["async"], + "b6m": ["async"], + "b9m": ["async"], + "b11m": ["async"], +} + +############################## +############################## + + +def 
create_block_size_list(block_size_1d, block_size_2d) -> list: + if (not block_size_1d) and block_size_2d: # Only 2D block size; + block_size = [{"block_size_2d": b} for b in block_size_2d] + elif (not block_size_2d) and block_size_1d: # Only 1D block size; + block_size = [{"block_size_1d": b} for b in block_size_1d] + elif block_size_1d and block_size_2d: # Both 1D and 2D size; + # Ensure they have the same size; + if len(block_size_2d) > len(block_size_1d): + block_size_1d = block_size_1d + [block_size_1d[-1]] * (len(block_size_2d) - len(block_size_1d)) + elif len(block_size_1d) > len(block_size_2d): + block_size_2d = block_size_2d + [block_size_2d[-1]] * (len(block_size_1d) - len(block_size_2d)) + block_size = [{"block_size_1d": x[0], "block_size_2d": x[1]} for x in zip(block_size_1d, block_size_2d)] + else: + block_size = [{}] + return block_size + +############################## +############################## + + +if __name__ == "__main__": + + parser = argparse.ArgumentParser(description="measure GrCUDA execution time") + + parser.add_argument("-d", "--debug", action="store_true", + help="If present, print debug messages") + parser.add_argument("-i", "--num_iter", metavar="N", type=int, default=BenchmarkResult.DEFAULT_NUM_ITER, + help="Number of times each benchmark is executed") + parser.add_argument("-o", "--output_path", metavar="path/to/output.json", + help="Path to the file where results will be stored") + parser.add_argument("--realloc", metavar="[True|False]", type=lambda x: bool(strtobool(x)), nargs="*", + help="If True, allocate new memory and rebuild the GPU code at each iteration") + parser.add_argument("--reinit", metavar="[True|False]", type=lambda x: bool(strtobool(x)), nargs="*", + help="If True, re-initialize the values used in each benchmark at each iteration") + parser.add_argument("-c", "--cpu_validation", action="store_true", dest="cpu_validation", + help="Validate the result of each benchmark using the CPU") + 
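`create_block_size_list` zips the 1D and 2D block sizes into per-configuration dictionaries, padding the shorter list with its last element so the two always pair up. A self-contained sketch of the same logic, with a usage example (behavior inferred from the function above):

```python
def create_block_size_list(block_size_1d, block_size_2d):
    # Same pairing logic as benchmark_main.create_block_size_list:
    # pad the shorter list with its last element, then zip pairwise.
    if (not block_size_1d) and block_size_2d:  # Only 2D block sizes;
        return [{"block_size_2d": b} for b in block_size_2d]
    if (not block_size_2d) and block_size_1d:  # Only 1D block sizes;
        return [{"block_size_1d": b} for b in block_size_1d]
    if block_size_1d and block_size_2d:  # Both 1D and 2D sizes;
        if len(block_size_2d) > len(block_size_1d):
            block_size_1d = block_size_1d + [block_size_1d[-1]] * (len(block_size_2d) - len(block_size_1d))
        elif len(block_size_1d) > len(block_size_2d):
            block_size_2d = block_size_2d + [block_size_2d[-1]] * (len(block_size_1d) - len(block_size_2d))
        return [{"block_size_1d": x, "block_size_2d": y} for x, y in zip(block_size_1d, block_size_2d)]
    return [{}]  # Neither given: one empty configuration;

pairs = create_block_size_list([32, 64, 128], [8])
# [{'block_size_1d': 32, 'block_size_2d': 8},
#  {'block_size_1d': 64, 'block_size_2d': 8},
#  {'block_size_1d': 128, 'block_size_2d': 8}]
```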
parser.add_argument("--no_cpu_validation", action="store_false", dest="cpu_validation", + help="Do not validate the result of each benchmark using the CPU") + parser.add_argument("-b", "--benchmark", nargs="*", + help="If present, run the benchmark only for the specified kernel") + parser.add_argument("--execution_policy", + help="If present, run the benchmark only with the selected execution policy") + parser.add_argument("--dependency_policy", + help="If present, run the benchmark only with the selected dependency policy") + parser.add_argument("--new_stream", + help="If present, run the benchmark only with the selected new stream policy") + parser.add_argument("--parent_stream", + help="If present, run the benchmark only with the selected parent stream policy") + parser.add_argument("--device_selection", + help="If present and the parent stream policy is data-aware, run the benchmark only with the selected device selection policy") + parser.add_argument("--memory_advise_policy", + help="Select a managed memory memAdvise flag, if multiple GPUs are available") + parser.add_argument("--prefetch", + help="If present, run the benchmark only with the selected prefetcher") + parser.add_argument("-n", "--size", metavar="N", type=int, nargs="*", + help="Override the input data size used for the benchmarks") + parser.add_argument("--number_of_gpus", metavar="N", type=int, nargs="*", + help="Number of GPUs employed for the computation") + parser.add_argument("--block_size_1d", metavar="N", type=int, nargs="*", + help="Number of threads per block when using 1D kernels") + parser.add_argument("--block_size_2d", metavar="N", type=int, nargs="*", + help="Number of threads per block when using 2D kernels") + parser.add_argument("-g", "--number_of_blocks", metavar="N", type=int, nargs="?", + help="Number of blocks in the computation") + parser.add_argument("-r", "--random", action="store_true", + help="Initialize benchmarks randomly whenever possible") + parser.add_argument("--force_stream_attach", 
action="store_true", + help="If present, force stream attachment of managed memory arrays") + parser.add_argument("--timing", action="store_true", + help="Measure the execution time of each kernel") + parser.add_argument("-p", "--time_phases", action="store_true", + help="Measure the execution time of each phase of the benchmark;" + " note that this introduces overheads, and might influence the total execution time") + parser.add_argument("--nvprof", action="store_true", + help="If present, enable profiling when using nvprof." + " For this option to have effect, run graalpython using nvprof, with flag '--profile-from-start off'") + parser.set_defaults(cpu_validation=BenchmarkResult.DEFAULT_CPU_VALIDATION) + + # Parse the input arguments; + args = parser.parse_args() + + debug = args.debug if args.debug else BenchmarkResult.DEFAULT_DEBUG + num_iter = args.num_iter if args.num_iter else BenchmarkResult.DEFAULT_NUM_ITER + output_path = args.output_path if args.output_path else "" + realloc = args.realloc if args.realloc else [BenchmarkResult.DEFAULT_REALLOC] + reinit = args.reinit if args.reinit else [BenchmarkResult.DEFAULT_REINIT] + random_init = args.random if args.random else BenchmarkResult.DEFAULT_RANDOM_INIT + cpu_validation = args.cpu_validation + time_phases = args.time_phases + nvprof_profile = args.nvprof + timing = args.timing + prefetch = args.prefetch + stream_attach = args.force_stream_attach + new_stream_policy = args.new_stream + parent_stream_policy = args.parent_stream + device_selection = args.device_selection + dependency_policy = args.dependency_policy + number_of_gpus = args.number_of_gpus if args.number_of_gpus else [BenchmarkResult.DEFAULT_NUM_GPU] + exec_policy = args.execution_policy if args.execution_policy else BenchmarkResult.DEFAULT_EXEC_POLICY + mem_advise = args.memory_advise_policy if args.memory_advise_policy else BenchmarkResult.DEFAULT_MEM_ADVISE + + # Create a new benchmark result instance; + benchmark_res = BenchmarkResult(debug=debug, num_iterations=num_iter, 
output_path=output_path, + cpu_validation=cpu_validation, random_init=random_init) + if benchmark_res.debug: + BenchmarkResult.log_message(f"using CPU validation: {cpu_validation}") + + if args.benchmark: + if benchmark_res.debug: + BenchmarkResult.log_message(f"using only benchmark: {args.benchmark}") + benchmarks = {b: benchmarks[b] for b in args.benchmark} + + if args.size: + if benchmark_res.debug: + BenchmarkResult.log_message(f"using only size: {args.size}") + num_elem = {n: args.size for n in num_elem.keys()} + + # Setup the block size for each benchmark; + block_sizes = create_block_size_list(args.block_size_1d, args.block_size_2d) + number_of_blocks = args.number_of_blocks + if (args.block_size_1d or args.block_size_2d) and benchmark_res.debug: + BenchmarkResult.log_message(f"using block sizes: {block_sizes}") + if number_of_blocks: + BenchmarkResult.log_message(f"using number of blocks: {number_of_blocks}") + + # Execute each test; + for b_name, b in benchmarks.items(): + benchmark = b(benchmark_res, nvprof_profile=nvprof_profile) + for p in policies[b_name]: + for n in num_elem[b_name]: + prevent_reinit = False + for re in realloc: + for ri in reinit: + for block_size in block_sizes: + for i in range(num_iter): + benchmark.run(num_iter=i, size=n, number_of_gpus=number_of_gpus[0], block_size=block_size, exec_policy=exec_policy, + dependency_policy=dependency_policy, new_stream_policy=new_stream_policy, parent_stream_policy=parent_stream_policy, device_selection=device_selection, + mem_advise=mem_advise, prefetch=prefetch, stream_attach=stream_attach, timing=timing, + realloc=re, reinit=ri, time_phases=time_phases, prevent_reinit=prevent_reinit, + number_of_blocks=number_of_blocks) + prevent_reinit = True + # Print the summary of this block; + if benchmark_res.debug: + benchmark_res.print_current_summary(name=b_name, size=n, number_of_gpus=number_of_gpus[0], block_size=block_size, exec_policy=exec_policy, + 
dependency_policy=dependency_policy, new_stream_policy=new_stream_policy, parent_stream_policy=parent_stream_policy, device_selection=device_selection, + mem_advise=mem_advise, prefetch=prefetch, stream_attach=stream_attach, timing=timing, + realloc=re, reinit=ri, time_phases=time_phases, num_blocks=number_of_blocks, skip=3) diff --git a/projects/resources/python/benchmark/benchmark_nvprof_wrapper.py b/projects/resources/python/benchmark/benchmark_nvprof_wrapper.py new file mode 100644 index 00000000..14d3c924 --- /dev/null +++ b/projects/resources/python/benchmark/benchmark_nvprof_wrapper.py @@ -0,0 +1,274 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import argparse +import subprocess +import time +import os +from datetime import datetime +from benchmark_result import BenchmarkResult +from benchmark_main import create_block_size_list +from java.lang import System +from pathlib import Path + +############################## +############################## + +# True if using GPUs with compute capability >= 7.5. If so, nvprof is no longer supported; +POST_TURING = True + +# DEFAULT_NUM_BLOCKS = 64 # GTX 960, 8 SM +# DEFAULT_NUM_BLOCKS = 448 # P100, 56 SM +DEFAULT_NUM_BLOCKS = 176 # GTX 1660 Super, 22 SM + +HEAP_SIZE = 26 +#HEAP_SIZE = 140 # P100 + +# Benchmark settings; +benchmarks = [ + # Single GPU; + # "b1", + # "b5", + # "b6", + # "b7", + # "b8", + # "b10", + # Multi GPU; + "b1m", + "b5m", + "b6m", + "b9m", + "b11m", +] + + +num_elem = { + # Single GPU; + "b1": [160_000_000], + "b5": [10_000_000], + "b6": [1_600_000], + "b7": [25_000_000], + "b8": [6400], + "b10": [12000], + # Multi GPU; + "b1m": [160_000_000], + "b5m": [10_000_000], + "b6m": [1_000_000], + "b9m": [20000], + "b11m": [20000], +} + +exec_policies = ["async", "sync"] + +new_stream_policies = ["always-new"] + +parent_stream_policies = ["disjoint"] + +dependency_policies = ["with-const"] + +block_sizes_1d_dict = { + "b1": 32, + "b5": 256, + "b6": 32, + "b7": 32, + "b8": 1024, + "b10": 32, +} + +block_sizes_2d_dict = { + "b1": 8, + "b5": 8, + "b6": 8, + "b7": 8, + "b8": 8, + "b10": 8, +} + +block_dim_dict = { + # 
Single GPU; + "b1": DEFAULT_NUM_BLOCKS, + "b5": DEFAULT_NUM_BLOCKS, + "b6": 64, + "b7": DEFAULT_NUM_BLOCKS, + "b8": 32, + "b10": DEFAULT_NUM_BLOCKS, + "b11": DEFAULT_NUM_BLOCKS, + # Multi GPU; + "b1m": 64, + "b5m": 64, + "b6m": 64, + "b9m": 64, + "b11m": 64, +} + +prefetch = [False] + +use_metrics = [True, False] + +############################## +############################## + +LOG_FOLDER = f"{os.getenv('GRCUDA_HOME')}/grcuda-data/results/scheduling_multi_gpu/nvprof_log" +if POST_TURING: + METRICS = "--metrics 'dram__bytes_read.sum.per_second,dram__bytes_write.sum.per_second,dram__bytes_read.sum,dram__bytes_write.sum,lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_atom.sum,lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_ld.sum,lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_st.sum,lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_st.sum,lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_ld.sum,lts__t_sectors_op_read.sum.per_second,lts__t_sectors_op_atom.sum.per_second,lts__t_sectors_op_red.sum.per_second,lts__t_sectors_op_write.sum.per_second,lts__t_sectors_op_atom.sum.per_second,lts__t_sectors_op_red.sum.per_second,smsp__inst_executed.sum,smsp__sass_thread_inst_executed_op_dadd_pred_on.sum,smsp__sass_thread_inst_executed_op_dmul_pred_on.sum,smsp__sass_thread_inst_executed_op_dfma_pred_on.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__inst_executed.avg.per_cycle_active,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,sm__inst_executed.sum'" +else: + METRICS = "--metrics 'dram_read_throughput,dram_write_throughput,dram_read_bytes,dram_write_bytes,l2_global_atomic_store_bytes,l2_global_load_bytes,l2_global_reduction_bytes,l2_local_global_store_bytes,l2_local_load_bytes,l2_read_throughput,l2_write_throughput,inst_executed,ipc,flop_count_dp,flop_count_sp'" + +# This path is hard-coded because nvprof is executed as root, +# and the superuser doesn't have Graalpython in 
its environment; +GRAALPYTHON_FOLDER = "/home/users/ubuntu/graalpython_venv/bin" +GRCUDA_HOME = f"{os.getenv('GRCUDA_HOME')}" + +if POST_TURING: + GRAALPYTHON_CMD_METRICS = """/usr/local/cuda/bin/ncu -f --print-units base --csv --log-file "{}" --profile-from-start off --target-processes all {} \ + {}/graalpython --vm.XX:MaxHeapSize={}G --jvm --polyglot --experimental-options --grcuda.RetrieveNewStreamPolicy={} {} --grcuda.ForceStreamAttach \ + --grcuda.ExecutionPolicy={} --grcuda.DependencyPolicy={} --grcuda.RetrieveParentStreamPolicy={} benchmark_main.py \ + -i {} -n {} --reinit false --realloc false -g {} -b {} --block_size_1d {} --block_size_2d {} --no_cpu_validation {} {} --nvprof + """ + GRAALPYTHON_CMD_TRACE = """/usr/local/cuda/bin/nvprof --csv --log-file "{}" --print-gpu-trace {} --profile-from-start off --profile-child-processes \ + {}/graalpython --vm.XX:MaxHeapSize={}G --jvm --polyglot --experimental-options --grcuda.RetrieveNewStreamPolicy={} {} --grcuda.ForceStreamAttach \ + --grcuda.ExecutionPolicy={} --grcuda.DependencyPolicy={} --grcuda.RetrieveParentStreamPolicy={} benchmark_main.py \ + -i {} -n {} --reinit false --realloc false -g {} -b {} --block_size_1d {} --block_size_2d {} --no_cpu_validation {} {} --nvprof + """ +else: + GRAALPYTHON_CMD = """/usr/local/cuda/bin/nvprof --csv --log-file "{}" --print-gpu-trace {} --profile-from-start off --profile-child-processes \ + {}/graalpython --vm.XX:MaxHeapSize={}G --jvm --polyglot --experimental-options --grcuda.RetrieveNewStreamPolicy={} {} --grcuda.ForceStreamAttach \ + --grcuda.ExecutionPolicy={} --grcuda.DependencyPolicy={} --grcuda.RetrieveParentStreamPolicy={} benchmark_main.py \ + -i {} -n {} --reinit false --realloc false -g {} -b {} --block_size_1d {} --block_size_2d {} --no_cpu_validation {} {} --nvprof + """ + +def execute_grcuda_benchmark(benchmark, size, exec_policy, new_stream_policy, + parent_stream_policy, dependency_policy, num_iter, debug, time_phases, num_blocks=DEFAULT_NUM_BLOCKS, 
prefetch=False): + block_size = (block_sizes_1d_dict[benchmark], block_sizes_2d_dict[benchmark]) + for m in use_metrics: + if debug: + BenchmarkResult.log_message("") + BenchmarkResult.log_message("") + BenchmarkResult.log_message("#" * 30) + BenchmarkResult.log_message(f"Benchmark {i + 1}/{tot_benchmarks}") + BenchmarkResult.log_message(f"benchmark={benchmark}, size={size}, " + f"block size={block_size}, " + f"num blocks={num_blocks}, " + f"exec policy={exec_policy}, " + f"new stream policy={new_stream_policy}, " + f"parent stream policy={parent_stream_policy}, " + f"dependency policy={dependency_policy}, " + f"prefetch={prefetch}, " + f"time_phases={time_phases}, " + f"collect metrics={m}") + BenchmarkResult.log_message("#" * 30) + BenchmarkResult.log_message("") + BenchmarkResult.log_message("") + + log_folder = f"{datetime.now().strftime('%Y_%m_%d')}" + # Create a folder if it doesn't exist; + output_folder_path = os.path.join(LOG_FOLDER, log_folder) + if not os.path.exists(output_folder_path): + if debug: + BenchmarkResult.log_message(f"creating result folder: {output_folder_path}") + Path(output_folder_path).mkdir(parents=True, exist_ok=True) + file_name = f"{benchmark}_{exec_policy}_{'metric' if m else 'nometric'}_{prefetch}{'' if (POST_TURING and m) else '_%p'}.csv" + output_path = os.path.join(output_folder_path, file_name) + + if POST_TURING: + if m: + benchmark_cmd = GRAALPYTHON_CMD_METRICS.format(output_path, METRICS, GRAALPYTHON_FOLDER, HEAP_SIZE, + new_stream_policy, "--grcuda.InputPrefetch" if prefetch else "", exec_policy, dependency_policy, parent_stream_policy, + num_iter, size, num_blocks, benchmark, block_size[0], block_size[1], + "-d" if debug else "", "-p" if time_phases else "") + else: + benchmark_cmd = GRAALPYTHON_CMD_TRACE.format(output_path, "", GRAALPYTHON_FOLDER, HEAP_SIZE, + new_stream_policy, "--grcuda.InputPrefetch" if prefetch else "", exec_policy, dependency_policy, parent_stream_policy, + num_iter, size, num_blocks, benchmark, block_size[0], block_size[1], + "-d" if 
debug else "", "-p" if time_phases else "") + else: + benchmark_cmd = GRAALPYTHON_CMD.format(output_path, METRICS if m else "", GRAALPYTHON_FOLDER, HEAP_SIZE, + new_stream_policy, "--grcuda.InputPrefetch" if prefetch else "", exec_policy, dependency_policy, parent_stream_policy, + num_iter, size, num_blocks, benchmark, block_size[0], block_size[1], + "-d" if debug else "", "-p" if time_phases else "") + start = System.nanoTime() + result = subprocess.run(benchmark_cmd, + shell=True, + stderr=subprocess.STDOUT, + cwd=f"{GRCUDA_HOME}/projects/resources/python/benchmark") + result.check_returncode() + end = System.nanoTime() + if debug: + BenchmarkResult.log_message(f"Benchmark total execution time: {(end - start) / 1_000_000_000:.2f} seconds") + +############################## +############################## + + +if __name__ == "__main__": + + parser = argparse.ArgumentParser(description="Wrap the GrCUDA benchmark to specify additional settings, and run nvprof to collect metrics") + + parser.add_argument("-d", "--debug", action="store_true", + help="If present, print debug messages") + parser.add_argument("-i", "--num_iter", metavar="N", type=int, default=BenchmarkResult.DEFAULT_NUM_ITER, + help="Number of times each benchmark is executed") + parser.add_argument("-g", "--num_blocks", metavar="N", type=int, + help="Number of blocks in each kernel, when applicable") + parser.add_argument("-p", "--time_phases", action="store_true", + help="Measure the execution time of each phase of the benchmark;" + " note that this introduces overheads, and might influence the total execution time") + + # Parse the input arguments; + args = parser.parse_args() + + debug = args.debug if args.debug else BenchmarkResult.DEFAULT_DEBUG + num_iter = args.num_iter if args.num_iter else BenchmarkResult.DEFAULT_NUM_ITER + time_phases = args.time_phases + num_blocks = args.num_blocks + + def tot_benchmark_count(): + tot = 0 + for b in benchmarks: + tot += len(num_elem[b]) * len(exec_policies) * 
len(prefetch) * len(use_metrics) + return tot + + output_date = datetime.now().strftime("%Y_%m_%d_%H_%M_%S") + + # Execute each test; + i = 0 + tot_benchmarks = tot_benchmark_count() + for b in benchmarks: + for n in num_elem[b]: + for exec_policy in exec_policies: + # GrCUDA Benchmarks; + for new_stream_policy in new_stream_policies: + for parent_stream_policy in parent_stream_policies: + for dependency_policy in dependency_policies: + for p in prefetch: + nb = num_blocks if num_blocks else block_dim_dict[b] + execute_grcuda_benchmark(b, n, exec_policy, new_stream_policy, + parent_stream_policy, dependency_policy, num_iter, + debug, time_phases, num_blocks=nb, prefetch=p) + i += 1 diff --git a/projects/resources/python/benchmark/benchmark_result.py b/projects/resources/python/benchmark/benchmark_result.py new file mode 100644 index 00000000..b0b6eb2f --- /dev/null +++ b/projects/resources/python/benchmark/benchmark_result.py @@ -0,0 +1,278 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+ +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import os +from datetime import datetime +import json +import numpy as np + +class BenchmarkResult: + + DEFAULT_RES_FOLDER = "../../../../grcuda-data/results/scheduling_multi_gpu" + DEFAULT_NUM_ITER = 20 + DEFAULT_DEBUG = True + DEFAULT_CPU_VALIDATION = False + DEFAULT_REALLOC = False + DEFAULT_REINIT = False + DEFAULT_RANDOM_INIT = False + DEFAULT_NUM_GPU = 1 + DEFAULT_MEM_ADVISE = "none" + DEFAULT_EXEC_POLICY = "async" + + def __init__(self, + num_iterations: int = DEFAULT_NUM_ITER, + cpu_validation: bool = DEFAULT_CPU_VALIDATION, + debug: bool = DEFAULT_DEBUG, + random_init: bool = DEFAULT_RANDOM_INIT, + output_path: str = "", + ): + self.debug = debug + self.random_init = random_init + self.num_iterations = num_iterations + self.cpu_validation = cpu_validation + self._results = {"num_iterations": num_iterations, + "cpu_validation": cpu_validation, + "random_init": random_init, + "benchmarks": {}} + # Used to store the results of the benchmark currently being executed; + self._dict_current = {} + + # If true, use the provided output path as it is, without adding extensions or creating folders; + self._output_path = output_path if output_path else self.default_output_file_name() + output_folder = 
os.path.dirname(output_path) if output_path else self.DEFAULT_RES_FOLDER + if not os.path.exists(output_folder): + if self.debug: + BenchmarkResult.log_message(f"creating result folder: {output_folder}") + os.makedirs(output_folder) + if self.debug: + BenchmarkResult.log_message(f"storing results in {self._output_path}") + + @staticmethod + def create_block_size_key(block_size: dict) -> str: + return f"{block_size['block_size_1d']},{block_size['block_size_2d']}" + + def default_output_file_name(self) -> str: + output_date = datetime.now().strftime("%Y_%m_%d_%H_%M_%S") + file_name = f"{output_date}_{self.num_iterations}.json" + return os.path.join(self.DEFAULT_RES_FOLDER, file_name) + + def start_new_benchmark(self, name: str, size: int, number_of_gpus: int, + block_size: dict, num_blocks: int, exec_policy: str, + dependency_policy: str, new_stream_policy: str, parent_stream_policy: str, + device_selection: str, mem_advise: str, prefetch: str, + stream_attach: str, timing: bool, iteration: int, + time_phases: bool, realloc: bool, reinit: bool) -> None: + """ + Benchmark results are stored in a nested dictionary with the following structure. 
+ self._results["benchmarks"]->{name}->{size}->{number_of_gpus}->{num_blocks}->{exec_policy}->{dependency_policy}-> + {new_stream_policy}->{parent_stream_policy}->{device_selection}->{mem_advise}->{prefetch}->{stream_attach}->{timing}->{realloc}->{reinit}->{block_size}->{actual result} + + :param name: name of the benchmark + :param size: size of the input data + :param number_of_gpus: number of GPUs used in the benchmark + :param num_blocks: number of GPU thread blocks + :param exec_policy: current execution policy used in the benchmark + :param dependency_policy: current dependency policy used in the benchmark + :param new_stream_policy: current new stream policy used in the benchmark + :param parent_stream_policy: current parent stream policy used in the benchmark + :param device_selection: device selection policy used in the benchmark + :param mem_advise: memory advise flag used in the computation + :param prefetch: current prefetcher used in the benchmark + :param stream_attach: if stream attachment is forced + :param timing: if kernel timing is enabled + :param realloc: if reallocation is performed + :param reinit: if re-initialization is performed + :param block_size: dictionary that specifies the number of threads per block + :param iteration: current iteration + :param time_phases: if True, measure the execution time of each phase of the benchmark. 
+ Note that this introduces overheads, and might influence the total execution time + """ + + # First dictionary: benchmark name; + if name in self._results["benchmarks"]: + dict_size = self._results["benchmarks"][name] + else: + dict_size = {} + self._results["benchmarks"][name] = dict_size + # Add intermediate dictionaries; + curr_dict = dict_size + for x in [size, number_of_gpus, num_blocks, exec_policy, dependency_policy, new_stream_policy, parent_stream_policy, device_selection, mem_advise, prefetch, stream_attach, timing, realloc, reinit]: + if x in curr_dict: + new_dict = curr_dict[x] + else: + new_dict = {} + curr_dict[x] = new_dict + curr_dict = new_dict + # Final dictionary: block size options; + dict_block = curr_dict + self._dict_current = {"phases": [], "iteration": iteration, "time_phases": time_phases} + if BenchmarkResult.create_block_size_key(block_size) in dict_block: + dict_block[BenchmarkResult.create_block_size_key(block_size)] += [self._dict_current] + else: + dict_block[BenchmarkResult.create_block_size_key(block_size)] = [self._dict_current] + + if self.debug: + BenchmarkResult.log_message( + f"starting benchmark={name}, iter={iteration + 1}/{self.num_iterations}, size={size}, number_of_gpus={number_of_gpus}, num_blocks={num_blocks}, " + f"exec_policy={exec_policy}, dependency_policy={dependency_policy}, new_stream_policy={new_stream_policy}, parent_stream_policy={parent_stream_policy}, " + f"device_selection={device_selection}, realloc={realloc}, reinit={reinit}, prefetch={prefetch}, stream_attach={stream_attach}, mem_advise={mem_advise}, " + f"block_size={BenchmarkResult.create_block_size_key(block_size)}, timing={timing}, time_phases={time_phases}") + + def add_to_benchmark(self, key: str, message: object) -> None: + """ + Add a key-value pair to the current benchmark entry, e.g. ("allocation_time_ms", 10); + :param key: the key used to identify the message, e.g. 
"allocation_time_ms" + :param message: the value of the message, possibly a string, a number, + or any object that can be represented as JSON + """ + self._dict_current[key] = message + + def add_total_time(self, total_time: float) -> None: + """ + Add to the current benchmark entry the execution time of a benchmark iteration, + and compute the amount of overhead w.r.t. the single phases + :param total_time: execution time of the benchmark iteration + """ + self._dict_current["total_time_sec"] = total_time + + # Keep only phases related to GPU computation; + blacklisted_phases = ["allocation", "initialization", "reset_result"] + filtered_phases = [x for x in self._dict_current["phases"] if x["name"] not in blacklisted_phases] + tot_time_phases = sum([x["time_sec"] if "time_sec" in x else 0 for x in filtered_phases]) + self._dict_current["overhead_sec"] = total_time - tot_time_phases + self._dict_current["computation_sum_phases_sec"] = tot_time_phases + if self.debug: + BenchmarkResult.log_message(f"\ttotal execution time: {total_time:.4f} sec," + + f" overhead: {total_time - tot_time_phases:.4f} sec, " + + f" computation: {self._dict_current['computation_sec']:.4f} sec") + + def add_computation_time(self, computation_time: float) -> None: + """ + Add to the current benchmark entry the GPU computation time of the benchmark iteration + :param computation_time: execution time of the GPU computation in the benchmark iteration, in seconds + """ + self._dict_current["computation_sec"] = computation_time + + def add_phase(self, phase: dict) -> None: + """ + Add a dictionary that represents a phase of a benchmark, to provide fine-grained profiling; + :param phase: a dictionary that contains information about a phase of the algorithm, + with information such as name, duration, description, etc... 
+ """ + self._dict_current["phases"] += [phase] + if self.debug and "name" in phase and "time_sec" in phase: + BenchmarkResult.log_message(f"\t\t{phase['name']}: {phase['time_sec']:.4f} sec") + + def print_current_summary(self, name: str, size: int, number_of_gpus: int, + num_blocks: int, exec_policy: str, + dependency_policy: str, new_stream_policy: str, parent_stream_policy: str, + device_selection: str, mem_advise: str, prefetch: str, + stream_attach: str, timing: bool, block_size: dict, + time_phases: bool, realloc: bool, reinit: bool, skip: int = 0) -> None: + """ + Print a summary of the benchmark with the provided settings; + + :param name: name of the benchmark + :param size: size of the input data + :param number_of_gpus: number of GPUs used in the benchmark + :param num_blocks: number of GPU thread blocks + :param exec_policy: current execution policy used in the benchmark + :param dependency_policy: current dependency policy used in the benchmark + :param new_stream_policy: current new stream policy used in the benchmark + :param parent_stream_policy: current parent stream policy used in the benchmark + :param device_selection: current device selection policy used in the benchmark + :param mem_advise: memory advise flag used in the computation + :param prefetch: current prefetcher used in the benchmark + :param stream_attach: if stream attachments are forced + :param timing: if kernel timing is enabled + :param realloc: if reallocation is performed + :param reinit: if re-initialization is performed + :param block_size: dictionary that specifies the number of threads per block + :param time_phases: if True, measure the execution time of each phase of the benchmark. 
+ :param skip: skip the first N iterations when computing the summary statistics + """ + try: + var_list = [size, number_of_gpus, num_blocks, exec_policy, dependency_policy, new_stream_policy, parent_stream_policy, device_selection, mem_advise, prefetch, stream_attach, timing, realloc, reinit] + results_filtered = self._results["benchmarks"][name] + for x in var_list: + results_filtered = results_filtered[x] + results_filtered = results_filtered[BenchmarkResult.create_block_size_key(block_size)] + except KeyError as e: + results_filtered = [] + BenchmarkResult.log_message(f"WARNING: benchmark with signature" + f" [{name}]" + "".join([f"[{x}]" for x in var_list]) + f"[{BenchmarkResult.create_block_size_key(block_size)}] not found, exception {e}") + print(self._results) + # Retrieve execution times; + exec_times = [x["total_time_sec"] for x in results_filtered][skip:] + mean_time = np.mean(exec_times) if exec_times else np.nan + std_time = np.std(exec_times) if exec_times else np.nan + + comp_exec_times = [x["computation_sec"] for x in results_filtered][skip:] + comp_mean_time = np.mean(comp_exec_times) if comp_exec_times else np.nan + comp_std_time = np.std(comp_exec_times) if comp_exec_times else np.nan + + BenchmarkResult.log_message(f"summary of benchmark={name}, size={size}, number_of_gpus={number_of_gpus}, " + + f" num_blocks={num_blocks}, exec_policy={exec_policy}, dependency_policy={dependency_policy}, " + + f" new_stream_policy={new_stream_policy}, parent_stream_policy={parent_stream_policy}, device_selection={device_selection}, " + + f" prefetch={prefetch}, stream_attach={stream_attach}, timing={timing}, mem_advise={mem_advise}, " + + f" realloc={realloc}, reinit={reinit}, block_size=({BenchmarkResult.create_block_size_key(block_size)});" + + f" mean total time={mean_time:.4f}±{std_time:.4f} sec;" + + f" mean computation time={comp_mean_time:.4f}±{comp_std_time:.4f} sec") + + def save_to_file(self) -> None: + with open(self._output_path, "w+") as f: + 
json_result = json.dumps(self._results, ensure_ascii=False, indent=4) + f.write(json_result) + + @staticmethod + def create_block_size_list(block_size_1d, block_size_2d) -> list: + """ + Utility method used to create a list of dictionaries {"block_size_1d": N, "block_size_2d": N} to pass to the benchmark execution. + The method ensures that the output is a valid list of dictionaries even if one list is missing or if the two lists have different lengths + """ + if (not block_size_1d) and block_size_2d: # Only 2D block size; + block_size = [{"block_size_2d": b} for b in block_size_2d] + elif (not block_size_2d) and block_size_1d: # Only 1D block size; + block_size = [{"block_size_1d": b} for b in block_size_1d] + elif block_size_1d and block_size_2d: # Both 1D and 2D size; + # Ensure they have the same size; + if len(block_size_2d) > len(block_size_1d): + block_size_1d = block_size_1d + [block_size_1d[-1]] * (len(block_size_2d) - len(block_size_1d)) + elif len(block_size_1d) > len(block_size_2d): + block_size_2d = block_size_2d + [block_size_2d[-1]] * (len(block_size_1d) - len(block_size_2d)) + block_size = [{"block_size_1d": x[0], "block_size_2d": x[1]} for x in zip(block_size_1d, block_size_2d)] + else: + block_size = [{}] + return block_size + + @staticmethod + def log_message(message: str) -> None: + date = datetime.now() + date_str = date.strftime("%Y-%m-%d-%H-%M-%S-%f") + print(f"[{date_str} grcuda-python] {message}") diff --git a/projects/resources/python/benchmark/benchmark_wrapper.py b/projects/resources/python/benchmark/benchmark_wrapper.py new file mode 100644 index 00000000..256c0b3e --- /dev/null +++ b/projects/resources/python/benchmark/benchmark_wrapper.py @@ -0,0 +1,502 @@ +# Copyright (c) 2020, 2021, 2022, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
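The nested result dictionary built by `start_new_benchmark` above walks (and creates, if needed) one dictionary level per configuration key, then appends the current iteration's entry to a list at the leaf. The same bookkeeping can be sketched compactly with `dict.setdefault`; this is an illustrative sketch, not part of the patch, and the helper name `nested_insert` is made up:

```python
def nested_insert(root: dict, keys: list, value) -> None:
    # Walk (creating, if missing) one dictionary level per key,
    # then append the value to the list stored at the last key;
    curr = root
    for k in keys[:-1]:
        curr = curr.setdefault(k, {})
    curr.setdefault(keys[-1], []).append(value)

# Mimic the structure used by BenchmarkResult (keys abbreviated);
results = {"benchmarks": {}}
entry = {"phases": [], "iteration": 0, "time_phases": False}
nested_insert(results["benchmarks"], ["b1", 1000, 2, "async"], entry)
print(results["benchmarks"]["b1"][1000][2]["async"])
```

Calling `nested_insert` again with the same keys appends a second entry to the same leaf list, which is exactly the behavior needed to accumulate one record per iteration.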
+ +import argparse +import subprocess +import time +import os +from datetime import datetime +from benchmark_result import BenchmarkResult +from pathlib import Path + +############################## +############################## + +V100 = "V100" +A100 = "A100" +GPU = V100 + +BANDWIDTH_MATRIX = f"{os.getenv('GRCUDA_HOME')}/projects/resources/connection_graph/datasets/connection_graph.csv" + +DEFAULT_NUM_BLOCKS = 32 # GTX 960, 8 SM +# DEFAULT_NUM_BLOCKS = 176 # GTX 1660 Super, 22 SM +# DEFAULT_NUM_BLOCKS = 448 # P100, 56 SM +if GPU == V100: + DEFAULT_NUM_BLOCKS = 640 # V100, 80 SM +elif GPU == A100: + DEFAULT_NUM_BLOCKS = 640 + +HEAP_SIZE = 26 +# HEAP_SIZE = 140 # P100 +if GPU == V100: + HEAP_SIZE = 470 # 2 x V100 +elif GPU == A100: + HEAP_SIZE = 470 + +# Benchmark settings; +benchmarks = [ + # Single GPU; + # "b1", + # "b5", + # "b6", + # "b7", + # "b8", + # "b10", + # Multi GPU; + "b1m", + "b5m", + "b6m", + "b9m", + "b11m", +] + +# GTX 960 +num_elem = { + "b1": [20_000_000, 60_000_000, 80_000_000, 100_000_000, 120_000_000], + "b5": [2_000_000, 6_000_000, 8_000_000, 10_000_000, 12_000_000], + "b6": [200_000, 500_000, 800_000, 1_000_000, 1_200_000], + "b7": [4_000_000, 7_000_000, 10_000_000, 15_000_000, 20_000_000], + "b8": [1600, 2400, 3200, 4000, 4800], + "b10": [3000, 4000, 5000, 6000, 7000], +} + +# GTX 1660 Super +# num_elem = { +# "b1": [60_000_000, 80_000_000, 100_000_000, 120_000_000, 200_000_000], +# "b5": [6_000_000, 8_000_000, 10_000_000, 12_000_000, 20_000_000], +# "b6": [500_000, 800_000, 1_000_000, 1_200_000, 2_000_000], +# "b7": [7_000_000, 10_000_000, 15_000_000, 20_000_000, 40_000_000], +# "b8": [3200, 4000, 4800, 8000, 10000], +# "b10": [6000, 7000, 10000, 12000, 14000], +# } + +# P100 +# num_elem = { +# "b1": [120_000_000, 200_000_000, 500_000_000, 600_000_000, 700_000_000], +# "b5": [12_000_000, 20_000_000, 50_000_000, 60_000_000, 70_000_000], +# "b6": [1_200_000, 2_000_000, 4_000_000, 5_000_000, 6_000_000], +# "b7": [20_000_000, 40_000_000, 
60_000_000, 100_000_000, 140_000_000], +# "b8": [4800, 8000, 10000, 12000, 16000], +# "b10": [7000, 10000, 12000, 14000, 16000], +#} + +# V100 +if GPU == V100 or GPU == A100: + num_elem = { + # Single GPU; + "b1": [160_000_000, 250_000_000, 500_000_000, 800_000_000, 950_000_000], + "b5": [10_000_000, 16_000_000, 21_000_000, 28_000_000, 35_000_000], # out of core 50_000_000, 80_000_000, 95_000_000], + "b6": [1_600_000, 2_500_000, 5_000_000, 6_500_000, 8_000_000], + "b7": [25_000_000, 50_000_000, 80_000_000, 130_000_000, 180_000_000], + "b8": [6400, 10000, 13000, 16000, 20000], + "b10": [12000, 16000, 18000, 20000, 22000], + # Multi GPU; + "b1m": [160_000_000, 250_000_000, 500_000_000, 800_000_000, 950_000_000], + "b5m": [10_000_000, 16_000_000, 21_000_000, 28_000_000, 35_000_000], # out of core 50_000_000, 80_000_000, 95_000_000] + "b6m": [1_000_000, 1_200_000, 1_400_000, 1_600_000, 1_800_000], + "b9m": [20000, 30000, 40000, 50000, 60000], + "b11m": [20000, 30000, 40000, 50000, 60000], + } +# num_elem = {k: [int(v[0] / 100)] for (k, v) in num_elem.items()} # Use this for small sizes, for debugging; + +# 960 +block_dim_dict = { + # Single GPU; + "b1": DEFAULT_NUM_BLOCKS, + "b5": DEFAULT_NUM_BLOCKS, + "b6": 32, + "b7": DEFAULT_NUM_BLOCKS, + "b8": 12, + "b10": 16, + "b11": DEFAULT_NUM_BLOCKS, + # Multi GPU; + "b1m": 64, + "b5m": 64, + "b6m": 64, + "b9m": 64, + "b11m": 64, +} + +# P100 +# block_dim_dict = { +# "b1": DEFAULT_NUM_BLOCKS, +# "b5": DEFAULT_NUM_BLOCKS, +# "b6": 64, +# "b7": DEFAULT_NUM_BLOCKS, +# "b8": 32, +# "b10": DEFAULT_NUM_BLOCKS, +# "b11": DEFAULT_NUM_BLOCKS, +# } + +# V100 +if GPU == V100 or GPU == A100: + block_dim_dict = { + # Single GPU; + "b1": DEFAULT_NUM_BLOCKS, + "b5": DEFAULT_NUM_BLOCKS, + "b6": 64, + "b7": DEFAULT_NUM_BLOCKS, + "b8": 32, + "b10": DEFAULT_NUM_BLOCKS, + "b11": DEFAULT_NUM_BLOCKS, + # Multi GPU; + "b1m": 64, + "b5m": 64, + "b6m": 64, + "b9m": 64, + "b11m": 64, + } + +# 1660 +# block_dim_dict = { +# "b1": DEFAULT_NUM_BLOCKS, +# 
"b5": DEFAULT_NUM_BLOCKS, +# "b6": 32, +# "b7": DEFAULT_NUM_BLOCKS, +# "b8": 16, +# "b10": DEFAULT_NUM_BLOCKS, +# "b11": DEFAULT_NUM_BLOCKS, +# } + +cuda_exec_policies = ["sync", "async"] # ["sync", "async", "cudagraph", "cudagraphmanual", "cudagraphsingle"] + +exec_policies = ["sync", "async"] + +dependency_policies = ["with-const"] #, "no-const"] + +new_stream_policies = ["always-new"] #, "reuse"] + +parent_stream_policies = ["disjoint", "multigpu-disjoint"] # ["same-as-parent", "disjoint", "multigpu-early-disjoint", "multigpu-disjoint"] + +choose_device_policies = ["round-robin", "stream-aware", "min-transfer-size", "minmax-transfer-time"] # ["single-gpu", "round-robin", "stream-aware", "min-transfer-size", "minmin-transfer-time", "minmax-transfer-time"] + +memory_advise = ["none"] + +prefetch = ["false"] + +stream_attach = [False] + +time_computation = [False] + +num_gpus = [1, 2, 4, 8] + +block_sizes1d_dict = { + "b1": 32, + "b5": 1024, + "b6": 32, + "b7": 32, + "b8": 32, + "b10": 32, + "b11": 256, + # Multi GPU; + "b1m": 32, + "b5m": 1024, + "b6m": 32, + "b9m": 32, + "b11m": 256, +} + +block_sizes2d_dict = { + "b1": 8, + "b5": 8, + "b6": 8, + "b7": 8, + "b8": 16, + "b10": 8, + "b11": 16, + # Multi GPU; + "b1m": 8, + "b5m": 8, + "b6m": 8, + "b9m": 8, + "b11m": 8, +} + +############################## +############################## + +CUDA_CMD = "./b -k {} -p {} -n {} -b {} -c {} -t {} -m {} -g {} {} {} | tee {}" + +def execute_cuda_benchmark(benchmark, size, block_size, exec_policy, num_iter, debug, prefetch=False, stream_attach=False, num_blocks=DEFAULT_NUM_BLOCKS, num_gpus=1, output_date=None, mock=False): + if debug: + BenchmarkResult.log_message("") + BenchmarkResult.log_message("") + BenchmarkResult.log_message("#" * 30) + BenchmarkResult.log_message(f"Benchmark {i + 1}/{tot_benchmarks}") + BenchmarkResult.log_message(f"benchmark={benchmark}, size={size}," + f" block size={block_size}, " + f" prefetch={prefetch}, " + f" num blocks={num_blocks}, " + f" num 
GPUs={num_gpus}, " + f" exec policy={exec_policy}") + BenchmarkResult.log_message("#" * 30) + BenchmarkResult.log_message("") + BenchmarkResult.log_message("") + + do_prefetch = prefetch is not None and prefetch and prefetch != "none" and prefetch != "false" + + if not output_date: + output_date = datetime.now().strftime("%Y_%m_%d_%H_%M_%S") + file_name = f"cuda_{output_date}_{benchmark}_{exec_policy}_{size}_gpu{num_gpus}_{block_size['block_size_1d']}_{block_size['block_size_2d']}_{prefetch}_{num_iter}_{num_blocks}.csv" + # Create a folder if it doesn't exist; + output_folder_path = os.path.join(BenchmarkResult.DEFAULT_RES_FOLDER, output_date + "_cuda") + if not os.path.exists(output_folder_path): + if debug: + BenchmarkResult.log_message(f"creating result folder: {output_folder_path}") + Path(output_folder_path).mkdir(parents=True, exist_ok=True) + output_path = os.path.join(output_folder_path, file_name) + + benchmark_cmd = CUDA_CMD.format(benchmark, exec_policy, size, block_size["block_size_1d"], + block_size["block_size_2d"], num_iter, num_gpus, num_blocks, "-r" if do_prefetch else "", "-a" if stream_attach else "", output_path) + if not mock: + start = time.time() + result = subprocess.run(benchmark_cmd, + shell=True, + stdout=None, + cwd=f"{os.getenv('GRCUDA_HOME')}/projects/resources/cuda/bin") + result.check_returncode() + end = time.time() + if debug: + BenchmarkResult.log_message(f"Benchmark total execution time: {(end - start):.2f} seconds") + else: + # Just print the command, for debugging; + if debug: + BenchmarkResult.log_message(benchmark_cmd) + + +############################## +############################## + +GRAALPYTHON_CMD = "graalpython --vm.XX:MaxHeapSize={}G --jvm --polyglot --experimental-options " \ + "--grcuda.ExecutionPolicy={} --grcuda.DependencyPolicy={} --grcuda.RetrieveNewStreamPolicy={} " \ + "--grcuda.NumberOfGPUs={} --grcuda.RetrieveParentStreamPolicy={} " \ + "--grcuda.DeviceSelectionPolicy={} --grcuda.MemAdvisePolicy={} 
--grcuda.InputPrefetch={} --grcuda.BandwidthMatrix={} {} {} " \ + "benchmark_main.py -i {} -n {} -g {} --number_of_gpus {} --reinit false --realloc false " \ + "-b {} --block_size_1d {} --block_size_2d {} --execution_policy {} --dependency_policy {} --new_stream {} "\ + "--parent_stream {} --device_selection {} --memory_advise_policy {} --prefetch {} --no_cpu_validation {} {} {} {} -o {}" + +def execute_grcuda_benchmark(benchmark, size, num_gpus, block_sizes, exec_policy, dependency_policy, new_stream_policy, + parent_stream_policy, choose_device_policy, memory_advise, prefetch, num_iter, bandwidth_matrix, time_phases, debug, stream_attach=False, + time_computation=False, num_blocks=DEFAULT_NUM_BLOCKS, output_date=None, mock=False): + if debug: + BenchmarkResult.log_message("#" * 30) + BenchmarkResult.log_message(f"Benchmark {i + 1}/{tot_benchmarks}") + BenchmarkResult.log_message(f"benchmark={benchmark}, size={size}," + f"gpus={num_gpus}, " + f"block-sizes={block_sizes}, " + f"num-blocks={num_blocks}, " + f"exec-policy={exec_policy}, " + f"dependency-policy={dependency_policy}, " + f"new-stream-policy={new_stream_policy}, " + f"parent-stream-policy={parent_stream_policy}, " + f"choose-device-policy={choose_device_policy}, " + f"mem-advise={memory_advise}, " + f"prefetch={prefetch}, " + f"stream-attachment={stream_attach}, " + f"time-computation={time_computation}, " + f"bandwidth-matrix={bandwidth_matrix}, " + f"time-phases={time_phases}") + BenchmarkResult.log_message("") + + if not output_date: + output_date = datetime.now().strftime("%Y_%m_%d_%H_%M_%S") + file_name = f"{output_date}_{benchmark}_{size}_{num_gpus}_{num_blocks}_{exec_policy}_{dependency_policy}_" \ + f"{new_stream_policy}_{parent_stream_policy}_{choose_device_policy}_" \ + f"{memory_advise}_{prefetch}_{stream_attach}.json" + # Create a folder if it doesn't exist; + output_folder_path = os.path.join(BenchmarkResult.DEFAULT_RES_FOLDER, output_date + "_grcuda") + if not 
os.path.exists(output_folder_path): + if debug: + BenchmarkResult.log_message(f"creating result folder: {output_folder_path}") + if not mock: + Path(output_folder_path).mkdir(parents=True, exist_ok=True) + output_path = os.path.join(output_folder_path, file_name) + b1d_size = " ".join([str(b['block_size_1d']) for b in block_sizes]) + b2d_size = " ".join([str(b['block_size_2d']) for b in block_sizes]) + + benchmark_cmd = GRAALPYTHON_CMD.format(HEAP_SIZE, exec_policy, dependency_policy, new_stream_policy, + num_gpus, parent_stream_policy, choose_device_policy, memory_advise, prefetch, bandwidth_matrix, + "--grcuda.ForceStreamAttach" if stream_attach else "", + "--grcuda.EnableComputationTimers" if time_computation else "", + num_iter, size, num_blocks, num_gpus, benchmark, b1d_size, b2d_size, exec_policy, dependency_policy, + new_stream_policy, parent_stream_policy, choose_device_policy, memory_advise, prefetch, + "-d" if debug else "", + "-p" if time_phases else "", + "--force_stream_attach" if stream_attach else "", + "--timing" if time_computation else "", + output_path) + if debug: + BenchmarkResult.log_message(benchmark_cmd) + BenchmarkResult.log_message("#" * 30) + BenchmarkResult.log_message("") + BenchmarkResult.log_message("") + if not mock: + start = time.time() + result = subprocess.run(benchmark_cmd, + shell=True, + stdout=None, #subprocess.STDOUT, + cwd=f"{os.getenv('GRCUDA_HOME')}/projects/resources/python/benchmark") + result.check_returncode() + end = time.time() + if debug: + BenchmarkResult.log_message(f"Benchmark total execution time: {(end - start):.2f} seconds") + +############################## +############################## + + +if __name__ == "__main__": + + parser = argparse.ArgumentParser(description="Wrap the GrCUDA benchmark to specify additional settings") + + parser.add_argument("-d", "--debug", action="store_true", + help="If present, print debug messages") + parser.add_argument("-c", "--cuda_test", action="store_true", + help="If 
present, run performance tests using CUDA") + parser.add_argument("-i", "--num_iter", metavar="N", type=int, default=BenchmarkResult.DEFAULT_NUM_ITER, + help="Number of times each benchmark is executed") + parser.add_argument("-g", "--num_blocks", metavar="N", type=int, + help="Number of blocks in each kernel, when applicable") + parser.add_argument("-p", "--time_phases", action="store_true", + help="Measure the execution time of each phase of the benchmark;" + " note that this introduces overheads, and might influence the total execution time") + parser.add_argument("-m", "--mock", action="store_true", + help="If present, simply print the benchmark CMD without executing it") + parser.add_argument("--gpus", metavar="N", type=int, nargs="*", + help="Specify the maximum number of GPUs to use in the computation") + + # Parse the input arguments; + args = parser.parse_args() + + debug = args.debug if args.debug else BenchmarkResult.DEFAULT_DEBUG + num_iter = args.num_iter if args.num_iter else BenchmarkResult.DEFAULT_NUM_ITER + use_cuda = args.cuda_test + time_phases = args.time_phases + num_blocks = args.num_blocks + mock = args.mock + gpus = args.gpus + + if gpus is not None: + num_gpus = gpus + + if debug: + BenchmarkResult.log_message(f"using block sizes: {block_sizes1d_dict} {block_sizes2d_dict}; using low-level CUDA benchmarks: {use_cuda}") + + def tot_benchmark_count(): + tot = 0 + if use_cuda: + for b in benchmarks: + for e in cuda_exec_policies: + if e == "sync": + tot += len(num_elem[b]) * len(prefetch) * len(stream_attach) + else: + tot += len(num_elem[b]) * len(prefetch) * len(num_gpus) * len(stream_attach) + else: + for b in benchmarks: + for e in exec_policies: + if e == "sync": + tot += len(num_elem[b]) * len(memory_advise) * len(prefetch) * len(stream_attach) * len(time_computation) + else: + for n in num_gpus: + if n == 1: + tot += len(num_elem[b]) * len(memory_advise) * len(prefetch) * len(stream_attach) * len(time_computation) + else: + tot += 
len(num_elem[b]) * len(dependency_policies) * len(new_stream_policies) * len(parent_stream_policies) * len(choose_device_policies) * len(memory_advise) * len(prefetch) * len(stream_attach) * len(time_computation) + return tot + + output_date = datetime.now().strftime("%Y_%m_%d_%H_%M_%S") + + # Execute each test; + i = 0 + tot_benchmarks = tot_benchmark_count() + for b in benchmarks: + for n in num_elem[b]: + if use_cuda: + # CUDA Benchmarks; + for e in cuda_exec_policies: + if e == "sync": + ng = [1] + else: + ng = num_gpus + block_sizes = BenchmarkResult.create_block_size_list([block_sizes1d_dict[b]], [block_sizes2d_dict[b]]) + for block_size in block_sizes: + for p in prefetch: + for a in stream_attach: + for num_gpu in ng: + nb = num_blocks if num_blocks else block_dim_dict[b] + execute_cuda_benchmark(b, n, block_size, e, num_iter, debug, num_gpus=num_gpu, num_blocks=nb, prefetch=p, stream_attach=a, mock=mock, output_date=output_date) + i += 1 + # GrCUDA Benchmarks; + else: + for exec_policy in exec_policies: + if exec_policy == "sync": + dp = [dependency_policies[0]] + nsp = [new_stream_policies[0]] + psp = [parent_stream_policies[0]] + cdp = [choose_device_policies[0]] + ng = [1] + else: + dp = dependency_policies + nsp = new_stream_policies + psp = parent_stream_policies + cdp = choose_device_policies + ng = num_gpus + for num_gpu in ng: + if exec_policy == "async" and num_gpu == 1: + dp = [dependency_policies[0]] + nsp = [new_stream_policies[0]] + psp = [parent_stream_policies[0]] + cdp = [choose_device_policies[0]] + else: + dp = dependency_policies + nsp = new_stream_policies + psp = parent_stream_policies + cdp = choose_device_policies + for m in memory_advise: + for p in prefetch: + for s in stream_attach: + for t in time_computation: + # Select the correct connection graph; + # if GPU == V100: + # BANDWIDTH_MATRIX = f"{os.getenv('GRCUDA_HOME')}/projects/resources/connection_graph/datasets/connection_graph_{num_gpu}_v100.csv" + # elif GPU == A100: + # 
BANDWIDTH_MATRIX = f"{os.getenv('GRCUDA_HOME')}/projects/resources/connection_graph/datasets/connection_graph_8_a100.csv" + BANDWIDTH_MATRIX = f"{os.getenv('GRCUDA_HOME')}/projects/resources/connection_graph/datasets/connection_graph.csv" + + for dependency_policy in dp: + for new_stream_policy in nsp: + for parent_stream_policy in psp: + for choose_device_policy in cdp: + nb = num_blocks if num_blocks else block_dim_dict[b] + block_sizes = BenchmarkResult.create_block_size_list([block_sizes1d_dict[b]], [block_sizes2d_dict[b]]) + execute_grcuda_benchmark(b, n, num_gpu, block_sizes, + exec_policy, dependency_policy, new_stream_policy, parent_stream_policy, choose_device_policy, + m, p, num_iter, BANDWIDTH_MATRIX, time_phases, debug, s, t, nb, output_date=output_date, mock=mock) + i += 1 + diff --git a/projects/resources/python/examples/__init__.py b/projects/resources/python/examples/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/projects/resources/python/examples/pipeline_1.py b/projects/resources/python/examples/pipeline_1.py new file mode 100644 index 00000000..8c8c7766 --- /dev/null +++ b/projects/resources/python/examples/pipeline_1.py @@ -0,0 +1,160 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# coding=utf-8 +import polyglot +import time +import math + +NUM_THREADS_PER_BLOCK = 128 + +SQUARE_KERNEL = """ + extern "C" __global__ void square(float* x, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + x[idx] = x[idx] * x[idx]; + } + } + """ + +DIFF_KERNEL = """ + extern "C" __global__ void diff(float* x, float* y, float* z, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + z[idx] = x[idx] - y[idx]; + } + } + """ + +REDUCE_KERNEL = """ + extern "C" __global__ void reduce(float *x, float *res, int n) { + __shared__ float cache[%d]; + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) { + cache[threadIdx.x] = x[i]; + } + __syncthreads(); + + // Perform tree reduction; + i = %d / 2; + while (i > 0) { + if (threadIdx.x < i) { + cache[threadIdx.x] += cache[threadIdx.x + i]; + } + __syncthreads(); + i /= 2; + } + if (threadIdx.x == 0) { + atomicAdd(res, cache[0]); + } + } + """ % (NUM_THREADS_PER_BLOCK, NUM_THREADS_PER_BLOCK) + +# Compute 
the sum of difference of squares of 2 vectors, using multiple GrCUDA kernels. +# Structure of the computation: +# A: x^2 ──┐ +# ├─> C: z=x-y ──> D: sum(z) +# B: x^2 ──┘ +if __name__ == "__main__": + N = 1000000 + NUM_BLOCKS = (N + NUM_THREADS_PER_BLOCK - 1) // NUM_THREADS_PER_BLOCK + + start_tot = time.time() + time_cumulative = 0 + + # Allocate 2 vectors; + start = time.time() + x = polyglot.eval(language="grcuda", string=f"float[{N}]") + y = polyglot.eval(language="grcuda", string=f"float[{N}]") + + # Allocate a support vector; + z = polyglot.eval(language="grcuda", string=f"float[{N}]") + res = polyglot.eval(language="grcuda", string=f"float[1]") + end = time.time() + time_cumulative += end - start + print(f"time to allocate arrays: {end - start:.4f} sec") + + # Fill the 2 vectors; + start = time.time() + for i in range(N): + x[i] = 1 / (i + 1) + y[i] = 2 / (i + 1) + res[0] = 0 + end = time.time() + time_cumulative += end - start + print(f"time to fill arrays: {end - start:.4f} sec") + + # A. B. Compute the squares of each vector; + + # First, build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + square_kernel = build_kernel(SQUARE_KERNEL, "square", "pointer, sint32") + diff_kernel = build_kernel(DIFF_KERNEL, "diff", "pointer, pointer, pointer, sint32") + reduce_kernel = build_kernel(REDUCE_KERNEL, "reduce", "pointer, pointer, sint32") + + # Call the kernel. The 2 computations are independent, and can be done in parallel; + start = time.time() + square_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(x, N) + square_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(y, N) + end = time.time() + time_cumulative += end - start + print(f"square, time: {end - start:.4f} sec") + + # C. Compute the difference of the 2 vectors. 
This must be done after the 2 previous computations; + start = time.time() + diff_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(x, y, z, N) + end = time.time() + time_cumulative += end - start + print(f"diff, time: {end - start:.4f} sec") + + # D. Compute the sum of the result; + start = time.time() + reduce_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(z, res, N) + end = time.time() + time_cumulative += end - start + print(f"reduce, time: {end - start:.4f} sec") + print(f"overheads, time: {end - start_tot - time_cumulative:.4f} sec") + print(f"total time: {end - start_tot:.4f} sec") + + result = res[0] + print(f"result={result:.4f}") + + ############################## + # Validate the result ######## + ############################## + import numpy as np + x_g = 1 / np.linspace(1, N, N) + y_g = 2 / np.linspace(1, N, N) + + x_g = x_g**2 + y_g = y_g**2 + x_g -= y_g + res_g = np.sum(x_g) + + print(f"result in python={res_g:.4f}, difference={np.abs(res_g - result)}") diff --git a/projects/resources/python/examples/pipeline_2.py b/projects/resources/python/examples/pipeline_2.py new file mode 100644 index 00000000..a0362711 --- /dev/null +++ b/projects/resources/python/examples/pipeline_2.py @@ -0,0 +1,193 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# coding=utf-8 +import polyglot +import time +import math + +NUM_THREADS_PER_BLOCK = 128 + +SQUARE_KERNEL = """ + extern "C" __global__ void square(float* x, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + x[idx] = x[idx] * x[idx]; + } + } + """ + +DIFF_KERNEL = """ + extern "C" __global__ void diff(float* x, float* y, float* z, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + z[idx] = x[idx] - y[idx]; + } + } + """ + +ADDTWO_KERNEL = """ + extern "C" __global__ void addtwo(float* a, float* b, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + b[idx] = a[idx] + 2.0; + } + } + """ + +REDUCE_KERNEL = """ + extern "C" __global__ void reduce(float *x, float *y, float *res, int n) { + __shared__ float cache[%d]; + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) { + cache[threadIdx.x] = x[i] + y[i]; + } + __syncthreads(); + + // Perform tree reduction; + i = %d / 2; + while (i > 0) { + if (threadIdx.x < i) { + 
cache[threadIdx.x] += cache[threadIdx.x + i]; + } + __syncthreads(); + i /= 2; + } + if (threadIdx.x == 0) { + atomicAdd(res, cache[0]); + } + } + """ % (NUM_THREADS_PER_BLOCK, NUM_THREADS_PER_BLOCK) + +# Compute a complex graph of interconnected computations using GrCUDA. +# Structure of the computation: +# A: x^2 ──┐ +# ├─> C: z=x-y ───┐ +# B: y^2 ──┘ │ +# ├─> F: sum(z+b) +# │ +# D: a^2 ────> E: b=a+2 ──┘ +if __name__ == "__main__": + N = 1000000 + NUM_BLOCKS = (N + NUM_THREADS_PER_BLOCK - 1) // NUM_THREADS_PER_BLOCK + + start_tot = time.time() + time_cumulative = 0 + + # Allocate 3 vectors; + start = time.time() + x = polyglot.eval(language="grcuda", string=f"float[{N}]") + y = polyglot.eval(language="grcuda", string=f"float[{N}]") + a = polyglot.eval(language="grcuda", string=f"float[{N}]") + + # Allocate support vectors; + z = polyglot.eval(language="grcuda", string=f"float[{N}]") + b = polyglot.eval(language="grcuda", string=f"float[{N}]") + res = polyglot.eval(language="grcuda", string=f"float[1]") + end = time.time() + time_cumulative += end - start + print(f"time to allocate arrays: {end - start:.4f} sec") + + # Fill the 3 vectors; + start = time.time() + for i in range(N): + x[i] = 1 / (i + 1) + y[i] = 2 / (i + 1) + a[i] = 4 / (i + 1) + res[0] = 0 + end = time.time() + time_cumulative += end - start + print(f"time to fill arrays: {end - start:.4f} sec") + + # First, build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + square_kernel = build_kernel(SQUARE_KERNEL, "square", "pointer, sint32") + diff_kernel = build_kernel(DIFF_KERNEL, "diff", "pointer, pointer, pointer, sint32") + addtwo_kernel = build_kernel(ADDTWO_KERNEL, "addtwo", "pointer, pointer, sint32") + reduce_kernel = build_kernel(REDUCE_KERNEL, "reduce", "pointer, pointer, pointer, sint32") + + # A. B. Compute the squares of each vector; + + # Call the kernels.
The 2 computations are independent, and can be done in parallel; + start = time.time() + square_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(x, N) + square_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(y, N) + end = time.time() + time_cumulative += end - start + print(f"square, time: {end - start:.4f} sec") + + # D. Compute the other branch of the computation; + start = time.time() + square_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(a, N) + end = time.time() + time_cumulative += end - start + print(f"square - other branch, time: {end - start:.4f} sec") + + # C. Compute the difference of the 2 vectors. This must be done after the 2 previous computations; + start = time.time() + diff_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(x, y, z, N) + end = time.time() + time_cumulative += end - start + print(f"diff, time: {end - start:.4f} sec") + + # E. Continue computing the other branch; + start = time.time() + addtwo_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(a, b, N) + end = time.time() + time_cumulative += end - start + print(f"add two - other branch, time: {end - start:.4f} sec") + + # F. 
Join the two branches and compute the sum of the result; + start = time.time() + reduce_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(z, b, res, N) + end = time.time() + time_cumulative += end - start + print(f"reduce, time: {end - start:.4f} sec") + print(f"overheads, time: {end - start_tot - time_cumulative:.4f} sec") + print(f"total time: {end - start_tot:.4f} sec") + + result = res[0] + print(f"result={result:.4f}") + + ############################## + # Validate the result ######## + ############################## + import numpy as np + x_g = 1 / np.linspace(1, N, N) + y_g = 2 / np.linspace(1, N, N) + a_g = 4 / np.linspace(1, N, N) + + x_g = x_g**2 + y_g = y_g**2 + a_g = a_g**2 + x_g -= y_g + a_g += 2 + res_g = np.sum(x_g + a_g) + + print(f"result in python={res_g:.4f}, difference={np.abs(res_g - result)}") diff --git a/projects/resources/python/examples/pipeline_3.py b/projects/resources/python/examples/pipeline_3.py new file mode 100644 index 00000000..505b97fd --- /dev/null +++ b/projects/resources/python/examples/pipeline_3.py @@ -0,0 +1,159 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# coding=utf-8 +import polyglot +import time +import math + +NUM_THREADS_PER_BLOCK = 128 + +SQUARE_KERNEL = """ + extern "C" __global__ void square(float* x, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + x[idx] = x[idx] * x[idx]; + } + } + """ + +DIFF_KERNEL = """ + extern "C" __global__ void diff(float* x, float* y, float* z, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + z[idx] = x[idx] - y[idx]; + } + } + """ + +REDUCE_KERNEL = """ + extern "C" __global__ void reduce(float *x, float *y, float *res, int n) { + __shared__ float cache[%d]; + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) { + cache[threadIdx.x] = x[i] + y[i]; + } + __syncthreads(); + + // Perform tree reduction; + i = %d / 2; + while (i > 0) { + if (threadIdx.x < i) { + cache[threadIdx.x] += cache[threadIdx.x + i]; + } + __syncthreads(); + i /= 2; + } + if (threadIdx.x == 0) { + atomicAdd(res, cache[0]); + } + } + """ % (NUM_THREADS_PER_BLOCK, 
NUM_THREADS_PER_BLOCK) + +# Compute a pipeline of GrCUDA kernels using loops to build a dynamic graph. +# Structure of the computation: +# A: x^2 ─ [5 times] ─┐ +# ├─> C: res=sum(x+y) +# B: y^2 ─ [5 times] ─┘ +if __name__ == "__main__": + N = 1000000 + NUM_ITER = 5 + NUM_BLOCKS = (N + NUM_THREADS_PER_BLOCK - 1) // NUM_THREADS_PER_BLOCK + + start_tot = time.time() + time_cumulative = 0 + + # Allocate 2 vectors; + start = time.time() + x = polyglot.eval(language="grcuda", string=f"float[{N}]") + y = polyglot.eval(language="grcuda", string=f"float[{N}]") + + # Allocate a support vector; + res = polyglot.eval(language="grcuda", string=f"float[1]") + end = time.time() + time_cumulative += end - start + print(f"time to allocate arrays: {end - start:.4f} sec") + + # Fill the 2 vectors; + start = time.time() + for i in range(N): + x[i] = 1 / (i + 1) + y[i] = 2 / (i + 1) + res[0] = 0 + end = time.time() + time_cumulative += end - start + print(f"time to fill arrays: {end - start:.4f} sec") + + # A. B. Compute the squares of each vector; + + # First, build the kernels; + build_kernel = polyglot.eval(language="grcuda", string="buildkernel") + square_kernel = build_kernel(SQUARE_KERNEL, "square", "pointer, sint32") + reduce_kernel = build_kernel(REDUCE_KERNEL, "reduce", "pointer, pointer, pointer, sint32") + + # Call the kernels. The 2 computations are independent, and can be done in parallel; + for i in range(NUM_ITER): + start = time.time() + square_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(x, N) + square_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(y, N) + end = time.time() + time_cumulative += end - start + print(f"iter {i}) time: {end - start:.4f} sec") + + # C.
Compute the sum of the result; + start = time.time() + reduce_kernel(NUM_BLOCKS, NUM_THREADS_PER_BLOCK)(x, y, res, N) + end = time.time() + time_cumulative += end - start + print(f"reduce, time: {end - start:.4f} sec") + print(f"overheads, time: {end - start_tot - time_cumulative:.4f} sec") + print(f"total time: {end - start_tot:.4f} sec") + + result = res[0] + print(f"result={result:.4f}") + + ############################## + # Validate the result ######## + ############################## + import numpy as np + x_g = 1 / np.linspace(1, N, N) + y_g = 2 / np.linspace(1, N, N) + + for i in range(NUM_ITER): + x_g = x_g**2 + y_g = y_g**2 + res_g = np.sum(x_g + y_g) + + print(f"result in python={res_g:.4f}, difference={np.abs(res_g - result)}") + diff --git a/projects/resources/python/other/data/nb_class_log_prior.csv b/projects/resources/python/other/data/nb_class_log_prior.csv new file mode 100644 index 00000000..da1b0bc0 --- /dev/null +++ b/projects/resources/python/other/data/nb_class_log_prior.csv @@ -0,0 +1,5 @@ +-1.635610786738123323e+00 +-1.592908610482888676e+00 +-1.621173168652520147e+00 +-1.597838776590747578e+00 +-1.600313004657099469e+00 diff --git a/projects/resources/python/other/data/nb_feat_log_prob.csv b/projects/resources/python/other/data/nb_feat_log_prob.csv new file mode 100644 index 00000000..3dcf0d04 --- /dev/null +++ b/projects/resources/python/other/data/nb_feat_log_prob.csv @@ -0,0 +1,5 @@ 
+-6.921954055381462467e+00,-6.897021752413513340e+00,-6.924655716615632173e+00,-6.906464693213795414e+00,-6.908364029017446839e+00,-6.886550452546217116e+00,-6.909886099326515208e+00,-6.904190241892125712e+00,-6.930081035096447906e+00,-6.903622436444864618e+00,-6.877477156340543019e+00,-6.914466260005090348e+00,-6.871595867562971804e+00,-6.946733393644134225e+00,-6.920798431215875013e+00,-6.888039656097685537e+00,-6.896645883211569128e+00,-6.903244078513752768e+00,-6.898150208429443708e+00,-6.919259673458899584e+00,-6.904568958048896476e+00,-6.893644004784732715e+00,-6.909886099326515208e+00,-6.896082344171576395e+00,-6.914848888668645444e+00,-6.901354433204677008e+00,-6.923111016559495567e+00,-6.877108560685798011e+00,-6.938666453685231161e+00,-6.920413519740760577e+00,-6.904190241892125712e+00,-6.911982735947036716e+00,-6.907413910181126937e+00,-6.924076174437720610e+00,-6.893082154409432505e+00,-6.906464693213795414e+00,-6.884692061377657879e+00,-6.888039656097685537e+00,-6.867572453210787131e+00,-6.891959399800471786e+00,-6.929692534703061568e+00,-6.938862436047445570e+00,-6.924462498568466629e+00,-6.899656800051518601e+00,-6.910648004125278021e+00,-6.932415212386544567e+00,-6.913510328376995417e+00,-6.909886099326515208e+00,-6.933389396584388464e+00,-6.914657556036280184e+00,-6.900599573761397210e+00,-6.933779336148697681e+00,-6.954666795031672066e+00,-6.911410489865613016e+00,-6.910648004125278021e+00,-6.875635535193893588e+00,-6.893082154409432505e+00,-6.894393630007243701e+00,-6.930081035096447906e+00,-6.892707762791431492e+00,-6.933194483807238129e+00,-6.921568698921973350e+00,-6.909124774584755713e+00,-6.912555309681545523e+00,-6.906654464539917271e+00,-6.886736481753338524e+00,-6.879506864574956637e+00,-6.900976932256440932e+00,-6.897209740006886136e+00,-6.937295651551453446e+00,-6.873246486919853737e+00,-6.906274957893996813e+00,-6.894956217926441866e+00,-6.923496967934415380e+00,-6.916572531726810524e+00,-6.939450613694047476e+00,-6.902865863683222969e+
00,-6.912937207719412669e+00,-6.903054953217679923e+00,-6.915040257916190924e+00,-6.914466260005090348e+00,-6.883949670846160274e+00,-6.897962043973153712e+00,-6.928333970690124133e+00,-6.911219813918805244e+00,-6.885063463424639707e+00,-6.909695713795240835e+00,-6.892333511290058112e+00,-6.907413910181126937e+00,-6.912364415340983115e+00,-6.914275000561072915e+00,-6.915806101338651146e+00,-6.904947817685933131e+00,-6.913128211444565352e+00,-6.898714914326804504e+00,-6.914466260005090348e+00,-6.933974362965502181e+00,-6.906844271886031450e+00,-6.907983873148964449e+00,-6.903054953217679923e+00,-6.920028756365487865e+00,-6.893644004784732715e+00,-6.890837904355821664e+00,-6.924076174437720610e+00,-6.935145323286652896e+00,-6.878768312104162064e+00,-6.908173933018396440e+00,-6.902865863683222969e+00,-6.919067495116806299e+00,-6.904758369925586337e+00,-6.922725214085575729e+00,-6.906464693213795414e+00,-6.904568958048896476e+00,-6.931052946923367486e+00,-6.888412303763633560e+00,-6.936904338691832095e+00,-6.900599573761397210e+00,-6.922146789312868975e+00,-6.871595867562971804e+00,-6.897021752413513340e+00,-6.901165664917188636e+00,-6.886178497920392161e+00,-6.901732076713695818e+00,-6.917147740255755295e+00,-6.913701441612168708e+00,-6.909124774584755713e+00,-6.885249216188245924e+00,-6.879322175314111831e+00,-6.926783582782670479e+00,-6.921954055381462467e+00,-6.910266979163534273e+00,-6.923303973627147556e+00,-6.918491181586304606e+00,-6.935926725944110416e+00,-6.893082154409432505e+00,-6.903054953217679923e+00,-6.884320797218947519e+00,-6.914083777690233745e+00,-6.914466260005090348e+00,-6.912746240469715531e+00,-6.923303973627147556e+00,-6.915423106312903911e+00,-6.921954055381462467e+00,-6.928333970690124133e+00,-6.894018747153742410e+00,-6.943971505914925402e+00,-6.893082154409432505e+00,-6.944168530839215592e+00,-6.896082344171576395e+00,-6.894206171013349760e+00,-6.868120148340862485e+00,-6.921761358589268553e+00,-6.911791950865850964e+00,-6.905326820911998453
e+00,-6.890464351867354509e+00,-6.883578682157059703e+00,-6.900599573761397210e+00,-6.881355635669113013e+00,-6.889531080684355047e+00,-6.902109862891215641e+00,-6.922146789312868975e+00,-6.950097401961492949e+00,-6.892520619532721327e+00,-6.924076174437720610e+00,-6.891211596437898379e+00,-6.914083777690233745e+00,-6.933779336148697681e+00,-6.929110066965012749e+00,-6.894393630007243701e+00,-6.884877745158714646e+00,-6.916572531726810524e+00,-6.904758369925586337e+00,-6.913319251659110876e+00,-6.886736481753338524e+00,-6.882466541171252672e+00,-6.878952899091304474e+00,-6.914657556036280184e+00,-6.891772396527368727e+00,-6.941413708160370177e+00,-6.903811669106987736e+00,-6.897585821244863524e+00,-6.923689999495673320e+00,-6.880061137106467939e+00,-6.949304851890287793e+00,-6.905326820911998453e+00,-6.900222357612072699e+00,-6.927170954470643238e+00,-6.911601202176669290e+00,-6.922532368650593071e+00,-6.914848888668645444e+00,-6.873980970977846994e+00,-6.905705967835970682e+00,-6.915997653872517859e+00,-6.913892591378591135e+00,-6.882096102196836540e+00,-6.935145323286652896e+00,-6.900599573761397210e+00,-6.917147740255755295e+00,-6.894956217926441866e+00,-6.891959399800471786e+00,-6.920990942526509926e+00,-6.902676809896858501e+00,-6.907223994692948565e+00,-6.888785090347633755e+00,-6.913510328376995417e+00,-6.958457323260997640e+00,-6.927364696600475824e+00,-6.931052946923367486e+00,-6.872329139823142086e+00,-6.877845887908142331e+00,-6.904568958048896476e+00,-6.916955967314329712e+00,-6.919067495116806299e+00,-6.885992572475959861e+00,-6.901732076713695818e+00,-6.914466260005090348e+00,-6.866113397690227771e+00,-6.900976932256440932e+00,-6.904190241892125712e+00,-6.933389396584388464e+00,-6.894768653450602969e+00,-6.899845283701118603e+00,-6.907413910181126937e+00,-6.955862252436446624e+00,-6.887294777106074406e+00,-6.908364029017446839e+00,-6.925235594857454657e+00,-6.923689999495673320e+00,-6.910266979163534273e+00,-6.896645883211569128e+00,-6.9079838731489644
49e+00,-6.926589953195463778e+00,-6.906654464539917271e+00,-6.926589953195463778e+00,-6.889904284644648769e+00,-6.894393630007243701e+00,-6.956859559789478453e+00,-6.912746240469715531e+00,-6.924655716615632173e+00,-6.876924313794443577e+00,-6.885435003462351489e+00,-6.896833800152833760e+00,-6.912937207719412669e+00,-6.887294777106074406e+00,-6.908554161159859319e+00,-6.904000937584887865e+00,-6.893082154409432505e+00,-6.892707762791431492e+00,-6.882651812130559321e+00,-6.915806101338651146e+00,-6.912937207719412669e+00,-6.891211596437898379e+00,-6.900976932256440932e+00,-6.948315046801063488e+00,-6.935145323286652896e+00,-6.908934533929754096e+00,-6.893082154409432505e+00,-6.891772396527368727e+00,-6.895143817589312718e+00,-6.930275341907513820e+00,-6.893644004784732715e+00,-6.910838571062491553e+00,-6.903622436444864618e+00,-6.923689999495673320e+00,-6.907983873148964449e+00,-6.927170954470643238e+00,-6.896082344171576395e+00,-6.881910934156310944e+00,-6.898903220513219026e+00,-6.935340616715649276e+00,-6.868485445148579061e+00,-6.898714914326804504e+00,-6.886736481753338524e+00,-6.887853384326632522e+00,-6.943774519801808154e+00,-6.933974362965502181e+00,-6.917915200001790055e+00,-6.908554161159859319e+00,-6.922918096717092240e+00,-6.894206171013349760e+00,-6.908364029017446839e+00,-6.920028756365487865e+00,-6.933779336148697681e+00,-6.908934533929754096e+00,-6.908173933018396440e+00,-6.929886766033185097e+00,-6.925428962352981443e+00,-6.927752293504552128e+00,-6.906274957893996813e+00,-6.875635535193893588e+00,-6.895331452452420962e+00,-6.923496967934415380e+00,-6.912364415340983115e+00,-6.877845887908142331e+00,-6.912937207719412669e+00,-6.890277627938139915e+00,-6.904000937584887865e+00,-6.882466541171252672e+00,-6.884135216815705505e+00,-6.896458001576444730e+00,-6.929692534703061568e+00,-6.878768312104162064e+00,-6.935340616715649276e+00,-6.920221119547766975e+00,-6.907793849395423180e+00,-6.895331452452420962e+00,-6.893644004784732715e+00,-6.90494781768593
3131e+00,-6.923883068325311640e+00,-6.920990942526509926e+00,-6.920028756365487865e+00,-6.908934533929754096e+00,-6.920221119547766975e+00,-6.885806681593372502e+00,-6.899845283701118603e+00,-6.912937207719412669e+00,-6.887480944843742847e+00,-6.884506412068668979e+00,-6.916764231142714436e+00,-6.909505364503717217e+00,-6.913701441612168708e+00,-6.902865863683222969e+00,-6.922532368650593071e+00,-6.889531080684355047e+00,-6.866660294066997849e+00,-6.926783582782670479e+00,-6.939842924409161284e+00,-6.902676809896858501e+00,-6.906464693213795414e+00,-6.937687117596773589e+00,-6.891024732941138708e+00,-6.895331452452420962e+00,-6.906085258566859508e+00,-6.933974362965502181e+00,-6.905326820911998453e+00,-6.920605956958713278e+00,-6.902109862891215641e+00,-6.897397762946237521e+00,-6.930081035096447906e+00,-6.916189243105788975e+00,-6.919451888740605128e+00,-6.894206171013349760e+00,-6.919259673458899584e+00,-6.915997653872517859e+00,-6.903622436444864618e+00,-6.919451888740605128e+00,-6.871412633480918686e+00,-6.907603861744050278e+00,-6.888412303763633560e+00,-6.909505364503717217e+00,-6.917723279839988137e+00,-6.919644140976130231e+00,-6.920413519740760577e+00,-6.904000937584887865e+00,-6.920028756365487865e+00,-6.913510328376995417e+00,-6.908744329459375777e+00,-6.935535948291619235e+00,-6.889717665254352141e+00,-6.898526643592932217e+00,-6.909695713795240835e+00,-6.911791950865850964e+00,-6.893831358415258848e+00,-6.915997653872517859e+00,-6.934364530742593402e+00,-6.926977249869553432e+00,-6.954467690999157625e+00,-6.915423106312903911e+00,-6.925815809553164115e+00,-6.951684389003553122e+00,-6.909124774584755713e+00,-6.905705967835970682e+00,-6.892894941079296345e+00,-6.910076521111344405e+00,-6.893456686249017196e+00,-6.870863132597049727e+00,-6.936122172050469104e+00,-6.932025804281934711e+00,-6.919451888740605128e+00,-6.922146789312868975e+00,-6.896270155234200061e+00,-6.894768653450602969e+00,-6.920798431215875013e+00,-6.920028756365487865e+00,-6.913892591378
591135e+00,-6.915231663792935635e+00,-6.892333511290058112e+00,-6.936708739669203538e+00,-6.913319251659110876e+00,-6.915231663792935635e+00,-6.867572453210787131e+00,-6.906844271886031450e+00,-6.910648004125278021e+00,-6.917339549981097235e+00,-6.908934533929754096e+00,-6.897585821244863524e+00,-6.959257163335632157e+00,-6.907223994692948565e+00,-6.949106812479797668e+00,-6.874899835326242936e+00,-6.925428962352981443e+00,-6.916764231142714436e+00,-6.885063463424639707e+00,-6.889717665254352141e+00,-6.883578682157059703e+00,-6.929886766033185097e+00,-6.914657556036280184e+00,-6.921761358589268553e+00,-6.907223994692948565e+00,-6.934559671732577257e+00,-6.895894568375325662e+00,-6.885620825259783473e+00,-6.893644004784732715e+00,-6.921376076365277186e+00,-6.904190241892125712e+00,-6.912555309681545523e+00,-6.924848972003228909e+00,-6.900788235208990429e+00,-6.875451559478168662e+00,-6.927946148307912111e+00,-6.874899835326242936e+00,-6.908554161159859319e+00,-6.897962043973153712e+00,-6.921761358589268553e+00,-6.913892591378591135e+00,-6.894206171013349760e+00,-6.906274957893996813e+00,-6.900222357612072699e+00,-6.904190241892125712e+00,-6.932609973318117014e+00,-6.933974362965502181e+00,-6.891211596437898379e+00,-6.887853384326632522e+00,-6.909124774584755713e+00,-6.930275341907513820e+00,-6.932025804281934711e+00,-6.902298809514356037e+00,-6.906085258566859508e+00,-6.896458001576444730e+00,-6.881540700905137697e+00,-6.887294777106074406e+00,-6.892707762791431492e+00,-6.928527938298142175e+00,-6.887667147246292032e+00,-6.893644004784732715e+00,-6.867207489703085699e+00,-6.885063463424639707e+00,-6.880800645367198598e+00,-6.892146438050339441e+00,-6.894956217926441866e+00,-6.886178497920392161e+00,-6.852537299955292482e+00,-6.884877745158714646e+00,-6.894018747153742410e+00,-6.879322175314111831e+00,-6.922725214085575729e+00,-6.881725800396642612e+00,-6.957857862757483858e+00,-6.920221119547766975e+00,-6.942396702439857847e+00,-6.908554161159859319e+00,-6.8942061710
13349760e+00,-6.908934533929754096e+00,-6.889158015953297465e+00,-6.940039137497203470e+00,-6.916380869052529690e+00,-6.915997653872517859e+00,-6.915040257916190924e+00,-6.899468351921308695e+00,-6.934559671732577257e+00,-6.871779135225905222e+00,-6.914848888668645444e+00,-6.911601202176669290e+00,-6.878768312104162064e+00,-6.928721943536892525e+00,-6.899279939297104036e+00,-6.895706827832203345e+00,-6.921183490904883939e+00,-6.883207831049528380e+00,-6.893082154409432505e+00,-6.909315051438150945e+00,-6.927364696600475824e+00,-6.890090938868242532e+00,-6.899656800051518601e+00,-6.910076521111344405e+00,-6.905326820911998453e+00,-6.913510328376995417e+00,-6.903811669106987736e+00,-6.911219813918805244e+00,-6.890277627938139915e+00,-6.905137301343534872e+00,-6.897585821244863524e+00,-6.912746240469715531e+00,-6.913701441612168708e+00,-6.909886099326515208e+00,-6.900976932256440932e+00,-6.906274957893996813e+00,-6.922339560397801961e+00,-6.907793849395423180e+00,-6.925235594857454657e+00,-6.922725214085575729e+00,-6.926977249869553432e+00,-6.929304185183617903e+00,-6.898338408298259949e+00,-6.956061634353167378e+00,-6.896645883211569128e+00,-6.886550452546217116e+00,-6.904758369925586337e+00,-6.875635535193893588e+00,-6.886550452546217116e+00,-6.924655716615632173e+00,-6.926977249869553432e+00,-6.899845283701118603e+00,-6.934950067989731792e+00,-6.908744329459375777e+00,-6.912364415340983115e+00,-6.893831358415258848e+00,-6.902865863683222969e+00,-6.893644004784732715e+00,-6.892707762791431492e+00,-6.940039137497203470e+00,-6.896645883211569128e+00,-6.926783582782670479e+00,-6.897209740006886136e+00,-6.908554161159859319e+00,-6.943380663947957387e+00,-6.885435003462351489e+00,-6.897397762946237521e+00,-6.911219813918805244e+00,-6.908364029017446839e+00,-6.910076521111344405e+00,-6.875819544762910240e+00,-6.921568698921973350e+00,-6.884320797218947519e+00,-6.904947817685933131e+00,-6.939646749813142890e+00,-6.904190241892125712e+00,-6.917147740255755295e+00,-6.89233351
1290058112e+00,-6.938666453685231161e+00,-6.892146438050339441e+00,-6.887108644020381476e+00,-6.912746240469715531e+00,-6.925042264745687959e+00,-6.914466260005090348e+00,-6.924076174437720610e+00,-6.927170954470643238e+00,-6.901732076713695818e+00,-6.911982735947036716e+00,-6.929110066965012749e+00,-6.914848888668645444e+00,-6.904000937584887865e+00,-6.893269402794961920e+00,-6.899091562165532210e+00,-6.904190241892125712e+00,-6.906844271886031450e+00,-6.873980970977846994e+00,-6.897021752413513340e+00,-6.919259673458899584e+00,-6.897585821244863524e+00,-6.925428962352981443e+00,-6.921183490904883939e+00,-6.906274957893996813e+00,-6.911219813918805244e+00,-6.855598434964734977e+00,-6.913510328376995417e+00,-6.926589953195463778e+00,-6.892520619532721327e+00,-6.916572531726810524e+00,-6.885435003462351489e+00,-6.893456686249017196e+00,-6.908934533929754096e+00,-6.899279939297104036e+00,-6.893082154409432505e+00,-6.901354433204677008e+00,-6.938862436047445570e+00,-6.909695713795240835e+00,-6.904947817685933131e+00,-6.913892591378591135e+00,-6.919259673458899584e+00,-6.924655716615632173e+00,-6.930469686481053770e+00,-6.915614585490128974e+00,-6.889158015953297465e+00,-6.894581124148601248e+00,-6.919836430179683973e+00,-6.909505364503717217e+00,-6.914848888668645444e+00,-6.927946148307912111e+00,-6.909886099326515208e+00,-6.907603861744050278e+00,-6.893082154409432505e+00,-6.908364029017446839e+00,-6.917339549981097235e+00,-6.907034115265814123e+00,-6.905137301343534872e+00,-6.912937207719412669e+00,-6.914848888668645444e+00,-6.878768312104162064e+00,-6.921761358589268553e+00,-6.891398494859149793e+00,-6.923496967934415380e+00,-6.927170954470643238e+00,-6.865384667470330271e+00,-6.939058456826275645e+00,-6.929886766033185097e+00,-6.921954055381462467e+00,-6.925815809553164115e+00,-6.898714914326804504e+00,-6.889904284644648769e+00,-6.891398494859149793e+00,-6.899279939297104036e+00,-6.934559671732577257e+00,-6.910266979163534273e+00,-6.888971535766277299e+00,-6.908364
029017446839e+00,-6.888039656097685537e+00,-6.904000937584887865e+00,-6.913128211444565352e+00,-6.912173557434115878e+00,-6.899845283701118603e+00,-6.915614585490128974e+00,-6.928721943536892525e+00,-6.962061652353265018e+00,-6.909505364503717217e+00,-6.897397762946237521e+00,-6.907793849395423180e+00,-6.911982735947036716e+00,-6.902676809896858501e+00,-6.944365594589974933e+00,-6.917915200001790055e+00,-6.896833800152833760e+00,-6.934364530742593402e+00,-6.883578682157059703e+00,-6.901920951962155826e+00,-6.886736481753338524e+00,-6.915614585490128974e+00,-6.911601202176669290e+00,-6.896833800152833760e+00,-6.913128211444565352e+00,-6.849485506903674192e+00,-6.922725214085575729e+00,-6.937491365418404499e+00,-6.868668143605285437e+00,-6.905326820911998453e+00,-6.903622436444864618e+00,-6.903811669106987736e+00,-6.910076521111344405e+00,-6.902865863683222969e+00,-6.905326820911998453e+00,-6.907983873148964449e+00,-6.926589953195463778e+00,-6.892146438050339441e+00,-6.931831157079374250e+00,-6.866477962039871485e+00,-6.912364415340983115e+00,-6.922725214085575729e+00,-6.878399240315511420e+00,-6.909505364503717217e+00,-6.944957018954184491e+00,-6.907034115265814123e+00,-6.900599573761397210e+00,-6.913701441612168708e+00,-6.897773914916067284e+00,-6.910838571062491553e+00,-6.869582136905362901e+00,-6.920028756365487865e+00,-6.872145771343669551e+00,-6.904758369925586337e+00,-6.947523907519292052e+00,-6.873430057365528256e+00,-6.943577572484578297e+00,-6.923111016559495567e+00,-6.905516376404936096e+00,-6.915997653872517859e+00,-6.934559671732577257e+00,-6.916189243105788975e+00,-6.876187665510670399e+00,-6.916189243105788975e+00,-6.866113397690227771e+00,-6.949106812479797668e+00,-6.888785090347633755e+00,-6.899845283701118603e+00,-6.928527938298142175e+00,-6.937099975980922650e+00,-6.871229432967444595e+00,-6.911601202176669290e+00,-6.885435003462351489e+00,-6.909886099326515208e+00,-6.924848972003228909e+00,-6.925815809553164115e+00,-6.877108560685798011e+00,-6.9015
43237132358399e+00,-6.916764231142714436e+00,-6.925235594857454657e+00,-6.888039656097685537e+00,-6.896833800152833760e+00,-6.898526643592932217e+00,-6.916189243105788975e+00,-6.894768653450602969e+00,-6.926589953195463778e+00,-6.912746240469715531e+00,-6.905895595218732197e+00,-6.920221119547766975e+00,-6.897773914916067284e+00,-6.885435003462351489e+00,-6.924462498568466629e+00,-6.899468351921308695e+00,-6.901920951962155826e+00,-6.883207831049528380e+00,-6.907223994692948565e+00,-6.934754850810000093e+00,-6.905895595218732197e+00,-6.898714914326804504e+00,-6.926202806462018913e+00,-6.921954055381462467e+00,-6.903811669106987736e+00,-6.915806101338651146e+00,-6.890277627938139915e+00,-6.885620825259783473e+00,-6.911219813918805244e+00,-6.928140040698243851e+00,-6.903622436444864618e+00,-6.892146438050339441e+00,-6.895331452452420962e+00,-6.932415212386544567e+00,-6.918683249194693019e+00,-6.941610229729244352e+00,-6.925042264745687959e+00,-6.897585821244863524e+00,-6.932220489379405493e+00,-6.915614585490128974e+00,-6.890277627938139915e+00,-6.917339549981097235e+00,-6.887853384326632522e+00,-6.915997653872517859e+00,-6.904758369925586337e+00,-6.875083709556786715e+00,-6.865566800245627377e+00,-6.910266979163534273e+00,-6.927558476273594223e+00,-6.943577572484578297e+00,-6.897962043973153712e+00,-6.924848972003228909e+00,-6.918683249194693019e+00,-6.905705967835970682e+00,-6.904190241892125712e+00,-6.895706827832203345e+00,-6.954069601821926128e+00,-6.887294777106074406e+00,-6.910076521111344405e+00,-6.873980970977846994e+00,-6.953870616645664526e+00,-6.926009289286765735e+00,-6.933584347359934341e+00,-6.908554161159859319e+00,-6.914083777690233745e+00,-6.932025804281934711e+00,-6.936904338691832095e+00,-6.894581124148601248e+00,-6.881725800396642612e+00,-6.931052946923367486e+00,-6.913319251659110876e+00,-6.897397762946237521e+00,-6.893456686249017196e+00,-6.926202806462018913e+00,-6.909695713795240835e+00,-6.917339549981097235e+00,-6.888971535766277299e+00,-6.88
5063463424639707e+00,-6.888412303763633560e+00,-6.938274604150436176e+00,-6.887108644020381476e+00,-6.914083777690233745e+00,-6.904190241892125712e+00,-6.936317656363476303e+00,-6.920413519740760577e+00,-6.901920951962155826e+00,-6.885249216188245924e+00,-6.895894568375325662e+00,-6.918299150860796942e+00,-6.859754942259103316e+00,-6.907413910181126937e+00,-6.878583759183088731e+00,-6.923303973627147556e+00,-6.888412303763633560e+00,-6.914848888668645444e+00,-6.912746240469715531e+00,-6.904947817685933131e+00,-6.928721943536892525e+00,-6.945351496276852998e+00,-6.915806101338651146e+00,-6.914657556036280184e+00,-6.933194483807238129e+00,-6.922725214085575729e+00,-6.907034115265814123e+00,-6.904568958048896476e+00,-6.894768653450602969e+00,-6.890464351867354509e+00,-6.922146789312868975e+00,-6.938274604150436176e+00,-6.927558476273594223e+00,-6.927364696600475824e+00,-6.895894568375325662e+00,-6.918491181586304606e+00,-6.908934533929754096e+00,-6.895143817589312718e+00,-6.890464351867354509e+00,-6.900599573761397210e+00,-6.897962043973153712e+00,-6.912555309681545523e+00,-6.873980970977846994e+00,-6.913128211444565352e+00,-6.891959399800471786e+00,-6.879876345458523446e+00,-6.878768312104162064e+00,-6.903054953217679923e+00,-6.897585821244863524e+00,-6.913701441612168708e+00,-6.929886766033185097e+00,-6.911982735947036716e+00,-6.860298371309701437e+00,-6.899845283701118603e+00,-6.901354433204677008e+00,-6.861748960865966751e+00,-6.904379582042269092e+00,-6.901543237132358399e+00,-6.917915200001790055e+00,-6.934364530742593402e+00,-6.911410489865613016e+00,-6.943577572484578297e+00,-6.918683249194693019e+00,-6.926977249869553432e+00,-6.898526643592932217e+00,-6.899279939297104036e+00,-6.923496967934415380e+00,-6.891211596437898379e+00,-6.895706827832203345e+00,-6.897397762946237521e+00,-6.876371776714357509e+00,-6.938666453685231161e+00,-6.909695713795240835e+00,-6.908173933018396440e+00,-6.882281304530915023e+00,-6.904758369925586337e+00,-6.903244078513752768e+00,-6.
897585821244863524e+00,-6.885063463424639707e+00,-6.925235594857454657e+00,-6.944759838632140969e+00,-6.918491181586304606e+00,-6.897021752413513340e+00,-6.917339549981097235e+00,-6.916764231142714436e+00,-6.913510328376995417e+00,-6.897021752413513340e+00,-6.917531396504466912e+00,-6.889904284644648769e+00,-6.915231663792935635e+00,-6.920028756365487865e+00,-6.924076174437720610e+00,-6.900599573761397210e+00,-6.918299150860796942e+00,-6.909124774584755713e+00,-6.941217225204633579e+00,-6.898150208429443708e+00,-6.907793849395423180e+00,-6.921954055381462467e+00,-6.910838571062491553e+00,-6.935535948291619235e+00,-6.926396361093416942e+00,-6.921376076365277186e+00,-6.917915200001790055e+00,-6.926589953195463778e+00,-6.911029174322383284e+00,-6.889531080684355047e+00,-6.898150208429443708e+00,-6.905516376404936096e+00,-6.928333970690124133e+00,-6.869399271402430784e+00,-6.906274957893996813e+00,-6.916572531726810524e+00,-6.925042264745687959e+00,-6.914466260005090348e+00,-6.888785090347633755e+00,-6.926589953195463778e+00,-6.891398494859149793e+00,-6.899656800051518601e+00,-6.922146789312868975e+00,-6.882466541171252672e+00,-6.919451888740605128e+00,-6.916572531726810524e+00,-6.902109862891215641e+00,-6.909315051438150945e+00,-6.883578682157059703e+00,-6.895331452452420962e+00,-6.886922545573765930e+00,-6.890651110668905233e+00,-6.950692227001336931e+00,-6.908744329459375777e+00,-6.921568698921973350e+00,-6.915040257916190924e+00,-6.935926725944110416e+00,-6.892333511290058112e+00,-6.933779336148697681e+00,-6.917147740255755295e+00,-6.908934533929754096e+00,-6.880430822877775299e+00,-6.950890580666571950e+00,-6.933194483807238129e+00,-6.926009289286765735e+00,-6.921761358589268553e+00,-6.907223994692948565e+00,-6.898714914326804504e+00,-6.919451888740605128e+00,-6.905705967835970682e+00,-6.935340616715649276e+00,-6.896833800152833760e+00,-6.913892591378591135e+00,-6.894581124148601248e+00,-6.917915200001790055e+00,-6.897773914916067284e+00,-6.893269402794961920e+00,-
6.908173933018396440e+00,-6.899468351921308695e+00,-6.911601202176669290e+00,-6.911601202176669290e+00,-6.894206171013349760e+00,-6.876187665510670399e+00,-6.876924313794443577e+00,-6.911601202176669290e+00,-6.903433239584968462e+00,-6.882281304530915023e+00,-6.870863132597049727e+00,-6.923496967934415380e+00,-6.944562697182512068e+00,-6.909124774584755713e+00,-6.898714914326804504e+00,-6.886550452546217116e+00,-6.913701441612168708e+00,-6.906085258566859508e+00,-6.860842095836011012e+00,-6.919259673458899584e+00,-6.931052946923367486e+00,-6.896082344171576395e+00,-6.896082344171576395e+00,-6.920798431215875013e+00,-6.907603861744050278e+00,-6.974170781143014253e+00,-6.882466541171252672e+00,-6.911029174322383284e+00,-6.934754850810000093e+00,-6.888598679684404757e+00,-6.899279939297104036e+00,-6.889904284644648769e+00,-6.904758369925586337e+00,-6.921568698921973350e+00,-6.896270155234200061e+00,-6.936122172050469104e+00,-6.917531396504466912e+00,-6.892520619532721327e+00,-6.894768653450602969e+00,-6.912937207719412669e+00,-6.907034115265814123e+00,-6.932999609013675624e+00,-6.930275341907513820e+00,-6.943380663947957387e+00,-6.877845887908142331e+00,-6.895143817589312718e+00,-6.923496967934415380e+00,-6.892707762791431492e+00,-6.895519122528977363e+00,-6.929498341091420599e+00,-6.929692534703061568e+00,-6.888598679684404757e+00,-6.896270155234200061e+00,-6.876555921821221773e+00,-6.926783582782670479e+00,-6.882466541171252672e+00,-6.920221119547766975e+00,-6.911601202176669290e+00,-6.874532188263277277e+00,-6.897397762946237521e+00,-6.906654464539917271e+00,-6.883022457056968690e+00,-6.904947817685933131e+00,-6.927946148307912111e+00,-6.866842659345154587e+00,-6.895519122528977363e+00,-6.906274957893996813e+00,-6.903622436444864618e+00,-6.927946148307912111e+00,-6.919451888740605128e+00,-6.935340616715649276e+00,-6.906274957893996813e+00,-6.906844271886031450e+00,-6.872879447091913008e+00,-6.896833800152833760e+00,-6.878030304690591734e+00,-6.888225962572374073e+00
 +-6.888608149554835691e+00,-6.947849305793029728e+00,-6.869148841033506869e+00,-6.919815326411356438e+00,-6.892904611588340558e+00,-6.895599304676764518e+00,-6.910094938091818406e+00,-6.892187251442800644e+00,-6.917606194437817990e+00,-6.912287520735235447e+00,-6.892366543240679988e+00,-6.938229312713234620e+00,-6.929817261987105681e+00,-6.879893109957397002e+00,-6.925544941937999965e+00,-6.932800112819171190e+00,-6.868973663665974172e+00,-6.896679218388626964e+00,-6.904452917483274277e+00,-6.878123824287078136e+00,-6.909365144226185862e+00,-6.943499417137472207e+00,-6.941802424178135311e+00,-6.915401931972944283e+00,-6.871428943748950857e+00,-6.909547542767503714e+00,-6.891828764254878337e+00,-6.885041811966045699e+00,-6.900648909635556549e+00,-6.894161231446727101e+00,-6.897039448940072148e+00,-6.891291274319874205e+00,-6.909365144226185862e+00,-6.872834672323435257e+00,-6.897580038195512131e+00,-6.884152211338031790e+00,-6.910094938091818406e+00,-6.896499151763489266e+00,-6.893622486708904873e+00,-6.874066307915036589e+00,-6.883796592557613891e+00,-6.921660005038123842e+00,-6.923138201907434564e+00,-6.911738924214207103e+00,-6.923323130248229873e+00,-6.928886942820556172e+00,-6.910642633221893760e+00,-6.878300612053825702e+00,-6.919262585724782255e+00,-6.903908601040550153e+00,-6.906269448409672052e+00,-6.917974044359199937e+00,-6.915035026813811214e+00,-6.887715372064949548e+00,-6.901191454182727725e+00,-6.960821635741828217e+00,-6.930003429724774122e+00,-6.895059784836323757e+00,-6.920183990010002617e+00,-6.884152211338031790e+00,-6.915952542246559531e+00,-6.906087646754775378e+00,-6.926101167038090978e+00,-6.895059784836323757e+00,-6.920921725196542695e+00,-6.926472155727191549e+00,-6.915218462565952962e+00,-6.888429530281340973e+00,-6.897039448940072148e+00,-6.927214546258689154e+00,-6.921475383972335749e+00,-6.889501724808646088e+00,-6.912470453142221416e+00,-6.933547217822169983e+00,-6.860948498474250812e+00,-6.908271451080345571e+00,-6.918342029643941515e
+00,-6.894340877544397728e+00,-6.909729974584116974e+00,-6.898842550762884329e+00,-6.891649568841801354e+00,-6.910460034846511945e+00,-6.894700266586900383e+00,-6.945388348564254244e+00,-6.916136146396528517e+00,-6.951647259465786988e+00,-6.903545887942630799e+00,-6.943876918085708283e+00,-6.919999641221574294e+00,-6.930375869207663797e+00,-6.894520555920589544e+00,-6.908818148132684911e+00,-6.920921725196542695e+00,-6.913202517596584329e+00,-6.916503455858878269e+00,-6.917790102484312698e+00,-6.943310720090021704e+00,-6.911738924214207103e+00,-6.952599005992375680e+00,-6.913385617478081002e+00,-6.864953030682171331e+00,-6.898481670210429684e+00,-6.885041811966045699e+00,-6.910825264945533064e+00,-6.886645090121897184e+00,-6.899384115899326986e+00,-6.933360389236852939e+00,-6.899023039888904307e+00,-6.895059784836323757e+00,-6.922953307758806574e+00,-6.913202517596584329e+00,-6.899203561597129664e+00,-6.887358484123973668e+00,-6.928515057356991136e+00,-6.898120919845766252e+00,-6.901010573292301942e+00,-6.912470453142221416e+00,-6.915768971800885012e+00,-6.913385617478081002e+00,-6.934294881408400002e+00,-6.908453650224512188e+00,-6.913935118361949961e+00,-6.935604639290463780e+00,-6.894880009554940514e+00,-6.906269448409672052e+00,-6.926657701696736780e+00,-6.889501724808646088e+00,-6.909365144226185862e+00,-6.894880009554940514e+00,-6.883263401395954162e+00,-6.915401931972944283e+00,-6.899925974487496561e+00,-6.883618830580431336e+00,-6.928515057356991136e+00,-6.910094938091818406e+00,-6.921660005038123842e+00,-6.940672693310474983e+00,-6.916136146396528517e+00,-6.911007930029610336e+00,-6.920368372789173605e+00,-6.897760299574999365e+00,-6.955459692600443944e+00,-6.935417425960327620e+00,-6.902277427140134591e+00,-6.878477431080012749e+00,-6.923508092793841939e+00,-6.893981617615978053e+00,-6.908089285126658652e+00,-6.901553314146871898e+00,-6.912653419019594736e+00,-6.923323130248229873e+00,-6.938604829052607670e+00,-6.943876918085708283e+00,-6.90936514422618586
2e+00,-6.883263401395954162e+00,-6.907907152351361546e+00,-6.864604173186918956e+00,-6.906996985728047989e+00,-6.901372367796971830e+00,-6.886466821113492287e+00,-6.926843282099978794e+00,-6.927585948305670982e+00,-6.910642633221893760e+00,-6.877593648433006024e+00,-6.892545867189873832e+00,-6.871077820149839255e+00,-6.912470453142221416e+00,-6.904997530367753100e+00,-6.926843282099978794e+00,-6.902458537346189971e+00,-6.916870900287042545e+00,-6.889501724808646088e+00,-6.932800112819171190e+00,-6.894161231446727101e+00,-6.915401931972944283e+00,-6.895959146350312707e+00,-6.913385617478081002e+00,-6.922953307758806574e+00,-6.922768447789707125e+00,-6.884685876742496546e+00,-6.918158020074924863e+00,-6.905542440017054417e+00,-6.964862045278833236e+00,-6.938604829052607670e+00,-6.899925974487496561e+00,-6.924433419037342219e+00,-6.911190628486316712e+00,-6.913202517596584329e+00,-6.890575070171085059e+00,-6.896319117556503286e+00,-6.917790102484312698e+00,-6.932240150135383416e+00,-6.911921756283462059e+00,-6.904452917483274277e+00,-6.913751917848475870e+00,-6.880424506657529804e+00,-6.885397873924587131e+00,-6.948228452717001957e+00,-6.933360389236852939e+00,-6.926286644178565766e+00,-6.931121164565436032e+00,-6.911556125566407971e+00,-6.880601701661094083e+00,-6.895779209327784542e+00,-6.922029349455987912e+00,-6.939356285033865035e+00,-6.907725052742373251e+00,-6.908271451080345571e+00,-6.906633150905042484e+00,-6.914118352444003079e+00,-6.910825264945533064e+00,-6.911373360327846882e+00,-6.915218462565952962e+00,-6.867923243401323319e+00,-6.904815959781688051e+00,-6.938792640115231336e+00,-6.911921756283462059e+00,-6.902458537346189971e+00,-6.887893863810180761e+00,-6.927957488343382764e+00,-6.927585948305670982e+00,-6.900829725113853286e+00,-6.877416985593574950e+00,-6.914301620106936497e+00,-6.901191454182727725e+00,-6.875475751418271386e+00,-6.897580038195512131e+00,-6.934855996171089387e+00,-6.899384115899326986e+00,-6.903727228046491859e+00,-6.903908601040550
153e+00,-6.914668256224700826e+00,-6.883085734166208169e+00,-6.870200550157615993e+00,-6.896859317443594151e+00,-6.942744842493103974e+00,-6.909365144226185862e+00,-6.927585948305670982e+00,-6.881665531548659231e+00,-6.941425705394250301e+00,-6.908089285126658652e+00,-6.902096349729045954e+00,-6.921106244064120006e+00,-6.905179133927681789e+00,-6.920552789571623009e+00,-6.913751917848475870e+00,-6.907178952792511595e+00,-6.945577438098711198e+00,-6.912653419019594736e+00,-6.895059784836323757e+00,-6.894161231446727101e+00,-6.898481670210429684e+00,-6.935979171130048471e+00,-6.898842550762884329e+00,-6.905724142570475621e+00,-6.904271445746998026e+00,-6.911007930029610336e+00,-6.905360770473453869e+00,-6.908271451080345571e+00,-6.922214072833256182e+00,-6.895239592442671039e+00,-6.914484921363062142e+00,-6.902639680359092367e+00,-6.918526073078711036e+00,-6.881310795823052828e+00,-6.885397873924587131e+00,-6.908089285126658652e+00,-6.905179133927681789e+00,-6.921106244064120006e+00,-6.909912439688172370e+00,-6.922214072833256182e+00,-6.926472155727191549e+00,-6.902639680359092367e+00,-6.938604829052607670e+00,-6.905905878145714993e+00,-6.911556125566407971e+00,-6.912287520735235447e+00,-6.906451283122420293e+00,-6.897399809304332052e+00,-6.906815051769573444e+00,-6.909547542767503714e+00,-6.911373360327846882e+00,-6.931680500834328740e+00,-6.903183306357803417e+00,-6.908089285126658652e+00,-6.947849305793029728e+00,-6.907725052742373251e+00,-6.896139115756000137e+00,-6.949366756767062725e+00,-6.891649568841801354e+00,-6.917238479780245086e+00,-6.889322945868329384e+00,-6.909912439688172370e+00,-6.903002064853001940e+00,-6.895419432385606839e+00,-6.909365144226185862e+00,-6.914668256224700826e+00,-6.909000446920902760e+00,-6.896139115756000137e+00,-6.901553314146871898e+00,-6.871780190679151801e+00,-6.901191454182727725e+00,-6.914118352444003079e+00,-6.913019451234516666e+00,-6.893802036040565184e+00,-6.905905878145714993e+00,-6.906633150905042484e+00,-6.9128364183796
08716e+00,-6.928143310140814748e+00,-6.897760299574999365e+00,-6.897580038195512131e+00,-6.951076646040890594e+00,-6.913935118361949961e+00,-6.904090006936739243e+00,-6.902277427140134591e+00,-6.909365144226185862e+00,-6.901915305101050890e+00,-6.934481884681503061e+00,-6.921660005038123842e+00,-6.885754062708405954e+00,-6.871604551792250604e+00,-6.895059784836323757e+00,-6.907907152351361546e+00,-6.881842946611662626e+00,-6.895779209327784542e+00,-6.881488147956176249e+00,-6.919999641221574294e+00,-6.927400230039745921e+00,-6.911921756283462059e+00,-6.909912439688172370e+00,-6.897940593454510605e+00,-6.917606194437817990e+00,-6.909182778948029124e+00,-6.916319784263173176e+00,-6.924803789411946298e+00,-6.894340877544397728e+00,-6.909000446920902760e+00,-6.917422320207274211e+00,-6.917422320207274211e+00,-6.888786800738873950e+00,-6.917606194437817990e+00,-6.932053565565386322e+00,-6.883441100196955986e+00,-6.892366543240679988e+00,-6.906633150905042484e+00,-6.931867015802696486e+00,-6.926657701696736780e+00,-6.892187251442800644e+00,-6.907542986287614539e+00,-6.927214546258689154e+00,-6.906269448409672052e+00,-6.901372367796971830e+00,-6.934107913098984000e+00,-6.901553314146871898e+00,-6.917238479780245086e+00,-6.924063185786168972e+00,-6.880424506657529804e+00,-6.921290796985193339e+00,-6.904090006936739243e+00,-6.896859317443594151e+00,-6.903364580717042287e+00,-6.914484921363062142e+00,-6.894700266586900383e+00,-6.914851624704173361e+00,-6.915401931972944283e+00,-6.889322945868329384e+00,-6.921290796985193339e+00,-6.903364580717042287e+00,-6.889680535716587784e+00,-6.910094938091818406e+00,-6.899203561597129664e+00,-6.914118352444003079e+00,-6.934481884681503061e+00,-6.912470453142221416e+00,-6.896139115756000137e+00,-6.919631045566829286e+00,-6.917422320207274211e+00,-6.867048735256391367e+00,-6.897940593454510605e+00,-6.924248285277673887e+00,-6.941237399207835779e+00,-6.912653419019594736e+00,-6.903727228046491859e+00,-6.868973663665974172e+00,-6.91871015039
1701674e+00,-6.887893863810180761e+00,-6.919078406702253048e+00,-6.896679218388626964e+00,-6.900468126845584749e+00,-6.907907152351361546e+00,-6.903183306357803417e+00,-6.893622486708904873e+00,-6.933920979740181068e+00,-6.903183306357803417e+00,-6.895779209327784542e+00,-6.935230247672462767e+00,-6.909000446920902760e+00,-6.903183306357803417e+00,-6.919262585724782255e+00,-6.893442969609418824e+00,-6.922768447789707125e+00,-6.925544941937999965e+00,-6.890038253481089114e+00,-6.918158020074924863e+00,-6.910642633221893760e+00,-6.897399809304332052e+00,-6.939168368092600403e+00,-6.893802036040565184e+00,-6.896139115756000137e+00,-6.927400230039745921e+00,-6.931680500834328740e+00,-6.907542986287614539e+00,-6.914484921363062142e+00,-6.883263401395954162e+00,-6.897039448940072148e+00,-6.924063185786168972e+00,-6.945577438098711198e+00,-6.876710646168127994e+00,-6.921290796985193339e+00,-6.890933108126933604e+00,-6.935791887675993195e+00,-6.866699146020177835e+00,-6.884685876742496546e+00,-6.924063185786168972e+00,-6.927957488343382764e+00,-6.885575952457690718e+00,-6.866699146020177835e+00,-6.901553314146871898e+00,-6.881488147956176249e+00,-6.910460034846511945e+00,-6.903364580717042287e+00,-6.868098236839514925e+00,-6.913385617478081002e+00,-6.895059784836323757e+00,-6.937291138331634244e+00,-6.924063185786168972e+00,-6.899564702807266414e+00,-6.898842550762884329e+00,-6.901734293244279783e+00,-6.924803789411946298e+00,-6.909912439688172370e+00,-6.887536912173336745e+00,-6.912287520735235447e+00,-6.881488147956176249e+00,-6.889501724808646088e+00,-6.918158020074924863e+00,-6.897940593454510605e+00,-6.914851624704173361e+00,-6.890396099253381479e+00,-6.918158020074924863e+00,-6.922214072833256182e+00,-6.914118352444003079e+00,-6.931121164565436032e+00,-6.906087646754775378e+00,-6.933920979740181068e+00,-6.902820856190732712e+00,-6.898842550762884329e+00,-6.894340877544397728e+00,-6.911921756283462059e+00,-6.916136146396528517e+00,-6.900468126845584749e+00,-6.907542986
287614539e+00,-6.918342029643941515e+00,-6.959862034862128510e+00,-6.924803789411946298e+00,-6.918526073078711036e+00,-6.915218462565952962e+00,-6.894161231446727101e+00,-6.920552789571623009e+00,-6.914301620106936497e+00,-6.898842550762884329e+00,-6.924618587077867815e+00,-6.882552921792356670e+00,-6.903183306357803417e+00,-6.901010573292301942e+00,-6.904815959781688051e+00,-6.874770781350301618e+00,-6.906269448409672052e+00,-6.903364580717042287e+00,-6.905542440017054417e+00,-6.911190628486316712e+00,-6.944443436843187101e+00,-6.939544237294544615e+00,-6.917422320207274211e+00,-6.910277469807210693e+00,-6.894340877544397728e+00,-6.900468126845584749e+00,-6.904090006936739243e+00,-6.931867015802696486e+00,-6.870200550157615993e+00,-6.926843282099978794e+00,-6.900287376732125111e+00,-6.915401931972944283e+00,-6.929817261987105681e+00,-6.899023039888904307e+00,-6.922768447789707125e+00,-6.910642633221893760e+00,-6.907725052742373251e+00,-6.917606194437817990e+00,-6.944821294395387312e+00,-6.905179133927681789e+00,-6.941425705394250301e+00,-6.929631128901412751e+00,-6.907907152351361546e+00,-6.915401931972944283e+00,-6.918894261595388784e+00,-6.890754073125112456e+00,-6.922583621987499214e+00,-6.915035026813811214e+00,-6.887715372064949548e+00,-6.905724142570475621e+00,-6.874770781350301618e+00,-6.918894261595388784e+00,-6.895599304676764518e+00,-6.902277427140134591e+00,-6.892904611588340558e+00,-6.891470405533967281e+00,-6.929631128901412751e+00,-6.910460034846511945e+00,-6.908818148132684911e+00,-6.894700266586900383e+00,-6.892545867189873832e+00,-6.880247343046471897e+00,-6.892725223301914284e+00,-6.887180087905500514e+00,-6.876357663460614944e+00,-6.883974386139733781e+00,-6.876710646168127994e+00,-6.934294881408400002e+00,-6.899203561597129664e+00,-6.904815959781688051e+00,-6.887715372064949548e+00,-6.894340877544397728e+00,-6.926657701696736780e+00,-6.877240353958388397e+00,-6.907178952792511595e+00,-6.909365144226185862e+00,-6.902458537346189971e+00,-6.9166871
61196033884e+00,-6.903364580717042287e+00,-6.931307575228665030e+00,-6.904271445746998026e+00,-6.892007991784705467e+00,-6.889680535716587784e+00,-6.890038253481089114e+00,-6.885219827097801115e+00,-6.888250942906992691e+00,-6.920368372789173605e+00,-6.933734081318929654e+00,-6.927400230039745921e+00,-6.893622486708904873e+00,-6.906087646754775378e+00,-6.901734293244279783e+00,-6.867923243401323319e+00,-6.922583621987499214e+00,-6.896319117556503286e+00,-6.894161231446727101e+00,-6.916319784263173176e+00,-6.862165572431319305e+00,-6.890396099253381479e+00,-6.952789464044565548e+00,-6.889501724808646088e+00,-6.941049128473963492e+00,-6.913935118361949961e+00,-6.922029349455987912e+00,-6.885397873924587131e+00,-6.919078406702253048e+00,-6.888429530281340973e+00,-6.891470405533967281e+00,-6.942367768582149878e+00,-6.916136146396528517e+00,-6.896499151763489266e+00,-6.881665531548659231e+00,-6.926657701696736780e+00,-6.919815326411356438e+00,-6.929072937427248391e+00,-6.920368372789173605e+00,-6.917790102484312698e+00,-6.909547542767503714e+00,-6.926101167038090978e+00,-6.918342029643941515e+00,-6.895959146350312707e+00,-6.941802424178135311e+00,-6.901191454182727725e+00,-6.899564702807266414e+00,-6.884685876742496546e+00,-6.935230247672462767e+00,-6.938041607410008638e+00,-6.882020393156357940e+00,-6.926472155727191549e+00,-6.916687161196033884e+00,-6.922768447789707125e+00,-6.897399809304332052e+00,-6.927771701069277199e+00,-6.908818148132684911e+00,-6.941237399207835779e+00,-6.914668256224700826e+00,-6.875299462303667397e+00,-6.894880009554940514e+00,-6.893802036040565184e+00,-6.908089285126658652e+00,-6.902639680359092367e+00,-6.903727228046491859e+00,-6.913568750891283088e+00,-6.932613423749273807e+00,-6.952599005992375680e+00,-6.899384115899326986e+00,-6.924989026052283947e+00,-6.923138201907434564e+00,-6.946144921325895893e+00,-6.891470405533967281e+00,-6.896319117556503286e+00,-6.910825264945533064e+00,-6.911556125566407971e+00,-6.904452917483274277e+00,-6.88539
7873924587131e+00,-6.918158020074924863e+00,-6.906269448409672052e+00,-6.936728655894381035e+00,-6.926286644178565766e+00,-6.897219612889754714e+00,-6.894161231446727101e+00,-6.884685876742496546e+00,-6.892366543240679988e+00,-6.890038253481089114e+00,-6.908271451080345571e+00,-6.886823390915807863e+00,-6.902458537346189971e+00,-6.915035026813811214e+00,-6.885754062708405954e+00,-6.898481670210429684e+00,-6.893981617615978053e+00,-6.892725223301914284e+00,-6.877063753516429401e+00,-6.879893109957397002e+00,-6.899023039888904307e+00,-6.904997530367753100e+00,-6.899203561597129664e+00,-6.911921756283462059e+00,-6.925915724293007614e+00,-6.917790102484312698e+00,-6.893442969609418824e+00,-6.886823390915807863e+00,-6.925544941937999965e+00,-6.938604829052607670e+00,-6.901553314146871898e+00,-6.926657701696736780e+00,-6.908271451080345571e+00,-6.913019451234516666e+00,-6.897039448940072148e+00,-6.885932204688034020e+00,-6.901553314146871898e+00,-6.919999641221574294e+00,-6.917422320207274211e+00,-6.874242379752802279e+00,-6.913751917848475870e+00,-6.893084032060697197e+00,-6.900106659283363086e+00,-6.909547542767503714e+00,-6.898662094207310247e+00,-6.906633150905042484e+00,-6.913935118361949961e+00,-6.897940593454510605e+00,-6.894161231446727101e+00,-6.879539002304971618e+00,-6.912653419019594736e+00,-6.945199294777889776e+00,-6.913202517596584329e+00,-6.899384115899326986e+00,-6.911738924214207103e+00,-6.881842946611662626e+00,-6.916870900287042545e+00,-6.905905878145714993e+00,-6.890217160360535331e+00,-6.883263401395954162e+00,-6.926472155727191549e+00,-6.899023039888904307e+00,-6.902639680359092367e+00,-6.934294881408400002e+00,-6.890217160360535331e+00,-6.915218462565952962e+00,-6.912836418379608716e+00,-6.902639680359092367e+00,-6.912836418379608716e+00,-6.948607743447890783e+00,-6.907360952975015067e+00,-6.894161231446727101e+00,-6.908818148132684911e+00,-6.880247343046471897e+00,-6.901734293244279783e+00,-6.908089285126658652e+00,-6.887893863810180761e+00,-6.915
401931972944283e+00,-6.916687161196033884e+00,-6.915585435047136187e+00,-6.896859317443594151e+00,-6.899384115899326986e+00,-6.936353843296290123e+00,-6.904997530367753100e+00,-6.925359602302588513e+00,-6.889859378603588880e+00,-6.929072937427248391e+00,-6.893442969609418824e+00,-6.877063753516429401e+00,-6.914668256224700826e+00,-6.914301620106936497e+00,-6.928700982801423436e+00,-6.878831162954943679e+00,-6.911007930029610336e+00,-6.925730315930559655e+00,-6.871780190679151801e+00,-6.905905878145714993e+00,-6.892007991784705467e+00,-6.904634422157517548e+00,-6.918526073078711036e+00,-6.898301278760495592e+00,-6.910277469807210693e+00,-6.929258966634369798e+00,-6.911921756283462059e+00,-6.935791887675993195e+00,-6.905542440017054417e+00,-6.928143310140814748e+00,-6.916319784263173176e+00,-6.926472155727191549e+00,-6.890217160360535331e+00,-6.942556287764535483e+00,-6.899925974487496561e+00,-6.900829725113853286e+00,-6.927957488343382764e+00,-6.906996985728047989e+00,-6.936916114888274976e+00,-6.911007930029610336e+00,-6.902458537346189971e+00,-6.915218462565952962e+00,-6.918710150391701674e+00,-6.913751917848475870e+00,-6.874418482597317137e+00,-6.883085734166208169e+00,-6.898842550762884329e+00,-6.929258966634369798e+00,-6.905179133927681789e+00,-6.956223926493199983e+00,-6.902639680359092367e+00,-6.907725052742373251e+00,-6.923693089556925528e+00,-6.876534139239772969e+00,-6.940672693310474983e+00,-6.904452917483274277e+00,-6.924803789411946298e+00,-6.895059784836323757e+00,-6.903908601040550153e+00,-6.885932204688034020e+00,-6.895419432385606839e+00,-6.898301278760495592e+00,-6.892545867189873832e+00,-6.902820856190732712e+00,-6.892725223301914284e+00,-6.875828422908435655e+00,-6.921660005038123842e+00,-6.919078406702253048e+00,-6.924248285277673887e+00,-6.888608149554835691e+00,-6.926101167038090978e+00,-6.919631045566829286e+00,-6.916503455858878269e+00,-6.894880009554940514e+00,-6.922768447789707125e+00,-6.911007930029610336e+00,-6.931867015802696486e+00,-6.9
25915724293007614e+00,-6.911190628486316712e+00,-6.922953307758806574e+00,-6.908271451080345571e+00,-6.885397873924587131e+00,-6.901915305101050890e+00,-6.906269448409672052e+00,-6.908271451080345571e+00,-6.897039448940072148e+00,-6.900287376732125111e+00,-6.890933108126933604e+00,-6.906633150905042484e+00,-6.902639680359092367e+00,-6.897760299574999365e+00,-6.924803789411946298e+00,-6.901734293244279783e+00,-6.922583621987499214e+00,-6.888965483844861737e+00,-6.904090006936739243e+00,-6.899203561597129664e+00,-6.909729974584116974e+00,-6.902277427140134591e+00,-6.906087646754775378e+00,-6.914484921363062142e+00,-6.913202517596584329e+00,-6.916687161196033884e+00,-6.898842550762884329e+00,-6.901915305101050890e+00,-6.942367768582149878e+00,-6.868273260905771949e+00,-6.921660005038123842e+00,-6.905179133927681789e+00,-6.914301620106936497e+00,-6.883441100196955986e+00,-6.909912439688172370e+00,-6.960245764721019412e+00,-6.896319117556503286e+00,-6.881133475138129896e+00,-6.914668256224700826e+00,-6.896319117556503286e+00,-6.880247343046471897e+00,-6.901553314146871898e+00,-6.920921725196542695e+00,-6.938980486457476005e+00,-6.893442969609418824e+00,-6.916319784263173176e+00,-6.914484921363062142e+00,-6.899023039888904307e+00,-6.896139115756000137e+00,-6.876181218819663599e+00,-6.893622486708904873e+00,-6.910825264945533064e+00,-6.918710150391701674e+00,-6.913202517596584329e+00,-6.947470302566964406e+00,-6.870200550157615993e+00,-6.916319784263173176e+00,-6.919446798675474852e+00,-6.932053565565386322e+00,-6.917974044359199937e+00,-6.899745322332728747e+00,-6.934668922931370716e+00,-6.916687161196033884e+00,-6.896139115756000137e+00,-6.914668256224700826e+00,-6.874242379752802279e+00,-6.925730315930559655e+00,-6.909365144226185862e+00,-6.921475383972335749e+00,-6.915401931972944283e+00,-6.919999641221574294e+00,-6.927214546258689154e+00,-6.898481670210429684e+00,-6.911007930029610336e+00,-6.909365144226185862e+00,-6.901372367796971830e+00,-6.923138201907434564e+00,-6
.913751917848475870e+00,-6.901915305101050890e+00,-6.908271451080345571e+00,-6.899564702807266414e+00,-6.922214072833256182e+00,-6.917974044359199937e+00,-6.888250942906992691e+00,-6.892904611588340558e+00,-6.893802036040565184e+00,-6.882730494375616459e+00,-6.922214072833256182e+00,-6.895239592442671039e+00,-6.910094938091818406e+00,-6.901191454182727725e+00,-6.935791887675993195e+00,-6.900648909635556549e+00,-6.909365144226185862e+00,-6.901372367796971830e+00,-6.901372367796971830e+00,-6.889144198884208592e+00,-6.922768447789707125e+00,-6.876887184256679220e+00,-6.905179133927681789e+00,-6.901372367796971830e+00,-6.903908601040550153e+00,-6.903545887942630799e+00,-6.910460034846511945e+00,-6.869324049093526341e+00,-6.885754062708405954e+00,-6.925174297011590596e+00,-6.884685876742496546e+00,-6.878300612053825702e+00,-6.937103609029632523e+00,-6.908818148132684911e+00,-6.910094938091818406e+00,-6.904634422157517548e+00,-6.908271451080345571e+00,-6.888072387420402620e+00,-6.934107913098984000e+00,-6.878831162954943679e+00,-6.962936004621791852e+00,-6.922398830339554721e+00,-6.918710150391701674e+00,-6.895419432385606839e+00,-6.963128441839744553e+00,-6.899564702807266414e+00,-6.920368372789173605e+00,-6.877593648433006024e+00,-6.898120919845766252e+00,-6.918894261595388784e+00,-6.918158020074924863e+00,-6.865127505078497450e+00,-6.906087646754775378e+00,-6.910094938091818406e+00,-6.915401931972944283e+00,-6.910277469807210693e+00,-6.902639680359092367e+00,-6.938417053256356937e+00,-6.906269448409672052e+00,-6.947280854806617612e+00,-6.917422320207274211e+00,-6.917790102484312698e+00,-6.915768971800885012e+00,-6.896679218388626964e+00,-6.883974386139733781e+00,-6.919999641221574294e+00,-6.912287520735235447e+00,-6.897399809304332052e+00,-6.900106659283363086e+00,-6.914118352444003079e+00,-6.906451283122420293e+00,-6.909547542767503714e+00,-6.904815959781688051e+00,-6.927400230039745921e+00,-6.922583621987499214e+00,-6.944443436843187101e+00,-6.919262585724782255e+00,
-6.896859317443594151e+00,-6.883618830580431336e+00,-6.870902304572380714e+00,-6.876534139239772969e+00,-6.936353843296290123e+00,-6.889859378603588880e+00,-6.888072387420402620e+00,-6.896139115756000137e+00,-6.902458537346189971e+00,-6.912104621786394176e+00,-6.923138201907434564e+00,-6.898662094207310247e+00,-6.882908098496498539e+00,-6.900648909635556549e+00,-6.911921756283462059e+00,-6.875123204261438303e+00,-6.896499151763489266e+00,-6.888786800738873950e+00,-6.910642633221893760e+00,-6.921844660195143106e+00,-6.884330068163757588e+00,-6.897940593454510605e+00,-6.897580038195512131e+00,-6.918894261595388784e+00,-6.882552921792356670e+00,-6.887180087905500514e+00,-6.929631128901412751e+00,-6.886466821113492287e+00,-6.901010573292301942e+00,-6.925915724293007614e+00,-6.900829725113853286e+00,-6.925730315930559655e+00,-6.911190628486316712e+00,-6.919815326411356438e+00,-6.919446798675474852e+00,-6.912104621786394176e+00,-6.921290796985193339e+00,-6.914668256224700826e+00,-6.900829725113853286e+00,-6.895239592442671039e+00,-6.880424506657529804e+00,-6.942556287764535483e+00,-6.933173595549936508e+00,-6.918710150391701674e+00,-6.957371373549676719e+00,-6.891649568841801354e+00,-6.877770342487711019e+00,-6.900829725113853286e+00,-6.897760299574999365e+00,-6.893263484730537627e+00,-6.885932204688034020e+00,-6.915035026813811214e+00,-6.900468126845584749e+00,-6.963898561246308461e+00,-6.919446798675474852e+00,-6.879539002304971618e+00,-6.921106244064120006e+00,-6.912653419019594736e+00,-6.900106659283363086e+00,-6.907360952975015067e+00,-6.894880009554940514e+00,-6.898120919845766252e+00,-6.931307575228665030e+00,-6.902458537346189971e+00,-6.886288583879261793e+00,-6.885219827097801115e+00,-6.878654281376698876e+00,-6.918894261595388784e+00,-6.887536912173336745e+00,-6.897760299574999365e+00,-6.912104621786394176e+00,-6.908635882571259046e+00,-6.911373360327846882e+00,-6.920552789571623009e+00,-6.944632347772246916e+00,-6.888250942906992691e+00,-6.917790102484312698e+0
0 +-6.892322400755601208e+00,-6.888457483825305872e+00,-6.901585156867840354e+00,-6.945335001821746346e+00,-6.923981877351176095e+00,-6.901585156867840354e+00,-6.931432480161788590e+00,-6.896757817378373900e+00,-6.932967694433763484e+00,-6.924172226642699712e+00,-6.904380797492107646e+00,-6.949426580837190670e+00,-6.899539976272098585e+00,-6.897498969904427568e+00,-6.893983377422415515e+00,-6.879679080708800853e+00,-6.874231455106562194e+00,-6.915642177764647514e+00,-6.919613814190993750e+00,-6.910371081222784539e+00,-6.932775663708255820e+00,-6.914133312898977479e+00,-6.917909752432427339e+00,-6.913756452144562914e+00,-6.950403238791569294e+00,-6.886072380410430682e+00,-6.915642177764647514e+00,-6.917909752432427339e+00,-6.906248909374827605e+00,-6.933928401588064006e+00,-6.920941206061254292e+00,-6.910183340679662223e+00,-6.897313630269016116e+00,-6.925696326766264121e+00,-6.907933199096476073e+00,-6.904754140785598793e+00,-6.931240743990173314e+00,-6.939712107704913535e+00,-6.904754140785598793e+00,-6.896387447003769822e+00,-6.933928401588064006e+00,-6.890296057610369118e+00,-6.889192507746672689e+00,-6.913944864768767573e+00,-6.890296057610369118e+00,-6.906435912647930664e+00,-6.925696326766264121e+00,-6.931816062828556113e+00,-6.901399058421224808e+00,-6.893060272030547608e+00,-6.901585156867840354e+00,-6.908870142854702578e+00,-6.903261603195092633e+00,-6.903448048613736177e+00,-6.880589910537686649e+00,-6.915830946052135886e+00,-6.923030674007318197e+00,-6.919613814190993750e+00,-6.949231363657458971e+00,-6.915076086608856087e+00,-6.937201726933034607e+00,-6.881866467654599973e+00,-6.863603596808228957e+00,-6.860383558247768576e+00,-6.917909752432427339e+00,-6.886805652670600963e+00,-6.893614033004551445e+00,-6.906061941065411602e+00,-6.891032434668680651e+00,-6.909807965299879839e+00,-6.926840928188441993e+00,-6.922650445865855318e+00,-6.926077715024128167e+00,-6.909620330436771596e+00,-6.904754140785598793e+00,-6.899911516309810366e+00,-6.870435698176857287
e+00,-6.895832148516571891e+00,-6.915642177764647514e+00,-6.893798688161570709e+00,-6.914133312898977479e+00,-6.934120653823589109e+00,-6.881684002550544577e+00,-6.880407678190939791e+00,-6.915264748056449307e+00,-6.882414062812939548e+00,-6.925887002713071894e+00,-6.902516168945144415e+00,-6.925696326766264121e+00,-6.884973479200944269e+00,-6.888090174362956120e+00,-6.921510628113273000e+00,-6.898982924916127857e+00,-6.909807965299879839e+00,-6.888457483825305872e+00,-6.924362612173974085e+00,-6.900283194440831380e+00,-6.906810024137516990e+00,-6.914133312898977479e+00,-6.900469085323418739e+00,-6.895277158214657476e+00,-6.896757817378373900e+00,-6.943780698031076781e+00,-6.895462120760269542e+00,-6.895462120760269542e+00,-6.914510315730963086e+00,-6.892322400755601208e+00,-6.888273812229600779e+00,-6.943198456384351402e+00,-6.920561771414318386e+00,-6.916775322361814915e+00,-6.922270362242882058e+00,-6.927031822529004401e+00,-6.907745915642420798e+00,-6.903634528800756343e+00,-6.887722999767312615e+00,-6.905501245788597586e+00,-6.877859906112142596e+00,-6.910371081222784539e+00,-6.896943054018711550e+00,-6.935660003752342817e+00,-6.931432480161788590e+00,-6.947281285036359222e+00,-6.927413720566871547e+00,-6.939132229463091051e+00,-6.898611729663164382e+00,-6.925887002713071894e+00,-6.906061941065411602e+00,-6.897313630269016116e+00,-6.892875753162970298e+00,-6.901399058421224808e+00,-6.904007593531813924e+00,-6.908307871262717725e+00,-6.872060687249790689e+00,-6.920372108066191075e+00,-6.919424330533392009e+00,-6.889560222404245593e+00,-6.885889146328377564e+00,-6.888824928253470148e+00,-6.911686252854345014e+00,-6.905127623516364110e+00,-6.886622284191128429e+00,-6.891400826641902455e+00,-6.890848289561816387e+00,-6.897684343896987258e+00,-6.853619559515086834e+00,-6.904754140785598793e+00,-6.921130977387376149e+00,-6.913944864768767573e+00,-6.935852589212736063e+00,-6.923981877351176095e+00,-6.912438556820612590e+00,-6.922650445865855318e+00,-6.9006550107678510
39e+00,-6.899911516309810366e+00,-6.895832148516571891e+00,-6.893983377422415515e+00,-6.894537649953926817e+00,-6.879861180317789149e+00,-6.925505687169842162e+00,-6.919803333759457331e+00,-6.906248909374827605e+00,-6.908495260001201288e+00,-6.903634528800756343e+00,-6.910371081222784539e+00,-6.923411046777212974e+00,-6.919424330533392009e+00,-6.890664178358129277e+00,-6.870255306726923195e+00,-6.903261603195092633e+00,-6.910558857019035273e+00,-6.928560290537692623e+00,-6.891032434668680651e+00,-6.881866467654599973e+00,-6.896572615044295418e+00,-6.942034989121053101e+00,-6.887355959939371886e+00,-6.901957457691201725e+00,-6.920561771414318386e+00,-6.864141279409228247e+00,-6.886805652670600963e+00,-6.861276973834756987e+00,-6.917909752432427339e+00,-6.867193620409098642e+00,-6.931049044574269402e+00,-6.931049044574269402e+00,-6.895462120760269542e+00,-6.924933986344361969e+00,-6.935082469806172156e+00,-6.886805652670600963e+00,-6.903448048613736177e+00,-6.928177954459627585e+00,-6.924172226642699712e+00,-6.918098949292323496e+00,-6.914698870459531577e+00,-6.872964601258729544e+00,-6.932391712849248933e+00,-6.914321796548577481e+00,-6.888273812229600779e+00,-6.918098949292323496e+00,-6.913568075012991088e+00,-6.875499915909058402e+00,-6.888824928253470148e+00,-6.905501245788597586e+00,-6.890664178358129277e+00,-6.919045470896355354e+00,-6.874593708325519970e+00,-6.926840928188441993e+00,-6.901771289953533284e+00,-6.898982924916127857e+00,-6.907371453926755223e+00,-6.919613814190993750e+00,-6.910183340679662223e+00,-6.870796578729311932e+00,-6.913944864768767573e+00,-6.894168100799683785e+00,-6.855217429362381765e+00,-6.919424330533392009e+00,-6.902143660093750910e+00,-6.935467455373968804e+00,-6.878405311088847895e+00,-6.945723955541156514e+00,-6.903261603195092633e+00,-6.918666754739584590e+00,-6.908682683860808638e+00,-6.930857381899988567e+00,-6.919234882773045214e+00,-6.914133312898977479e+00,-6.905688109285357257e+00,-6.902702475419832950e+00,-6.90980796529987
9839e+00,-6.903821043769124088e+00,-6.893798688161570709e+00,-6.929708176640394512e+00,-6.910746668081658939e+00,-6.910558857019035273e+00,-6.873145482149155328e+00,-6.898611729663164382e+00,-6.913944864768767573e+00,-6.918666754739584590e+00,-6.922650445865855318e+00,-6.901957457691201725e+00,-6.897684343896987258e+00,-6.889560222404245593e+00,-6.922650445865855318e+00,-6.886622284191128429e+00,-6.896757817378373900e+00,-6.891769354377784040e+00,-6.890664178358129277e+00,-6.932583669851469566e+00,-6.903075192531863635e+00,-6.882231497773638296e+00,-6.896017213752596575e+00,-6.869353837270759655e+00,-6.908682683860808638e+00,-6.917342376530681847e+00,-6.955693738052092456e+00,-6.914510315730963086e+00,-6.895832148516571891e+00,-6.901212994600797401e+00,-6.915830946052135886e+00,-6.924553033958803283e+00,-6.868453179729916869e+00,-6.911498265260972218e+00,-6.924933986344361969e+00,-6.923981877351176095e+00,-6.932583669851469566e+00,-6.926268463713309842e+00,-6.926650070281574756e+00,-6.923601287432214590e+00,-6.920182480683429560e+00,-6.878223476376099654e+00,-6.902516168945144415e+00,-6.902329897174091400e+00,-6.922650445865855318e+00,-6.897684343896987258e+00,-6.902329897174091400e+00,-6.889376348173701814e+00,-6.899539976272098585e+00,-6.916019749979817277e+00,-6.936623302160327853e+00,-6.912062334092322402e+00,-6.909807965299879839e+00,-6.905127623516364110e+00,-6.919613814190993750e+00,-6.904754140785598793e+00,-6.904567451715701409e+00,-6.936430568228921345e+00,-6.901585156867840354e+00,-6.940872873940875820e+00,-6.920561771414318386e+00,-6.906435912647930664e+00,-6.903261603195092633e+00,-6.932391712849248933e+00,-6.887722999767312615e+00,-6.915642177764647514e+00,-6.917720591361211646e+00,-6.882596661188321363e+00,-6.899168574225116757e+00,-6.922840541864905717e+00,-6.890664178358129277e+00,-6.918288181954446614e+00,-6.907558667256891383e+00,-6.891032434668680651e+00,-6.916586375738674519e+00,-6.917342376530681847e+00,-6.947670996654697007e+00,-6.932583669851
469566e+00,-6.905314417203280541e+00,-6.904940864714813387e+00,-6.891216613691209858e+00,-6.987024246362922852e+00,-6.905688109285357257e+00,-6.919803333759457331e+00,-6.886438949329489745e+00,-6.911310313000292638e+00,-6.858242611845689396e+00,-6.934697632395225853e+00,-6.901212994600797401e+00,-6.910371081222784539e+00,-6.945918489147450359e+00,-6.895092229873862166e+00,-6.891032434668680651e+00,-6.908307871262717725e+00,-6.874593708325519970e+00,-6.896572615044295418e+00,-6.925505687169842162e+00,-6.937780486474606434e+00,-6.920372108066191075e+00,-6.895462120760269542e+00,-6.904940864714813387e+00,-6.928369104226050013e+00,-6.880225479046773174e+00,-6.927795764506569753e+00,-6.878587178871470087e+00,-6.910558857019035273e+00,-6.945529459770826364e+00,-6.909245166298061847e+00,-6.894722475756134727e+00,-6.925696326766264121e+00,-6.905875007706608670e+00,-6.905127623516364110e+00,-6.895277158214657476e+00,-6.884973479200944269e+00,-6.909057636996060126e+00,-6.884241548701663049e+00,-6.916208589561154696e+00,-6.913756452144562914e+00,-6.919234882773045214e+00,-6.899168574225116757e+00,-6.880407678190939791e+00,-6.903261603195092633e+00,-6.915264748056449307e+00,-6.927795764506569753e+00,-6.897498969904427568e+00,-6.919803333759457331e+00,-6.945918489147450359e+00,-6.902888816611092437e+00,-6.912062334092322402e+00,-6.938359581172770518e+00,-6.912250427763526162e+00,-6.897869752259435217e+00,-6.898982924916127857e+00,-6.925124516972736899e+00,-6.857529980424118321e+00,-6.864499895156301434e+00,-6.910558857019035273e+00,-6.904754140785598793e+00,-6.925887002713071894e+00,-6.905875007706608670e+00,-6.916019749979817277e+00,-6.886438949329489745e+00,-6.880954474887330363e+00,-6.918098949292323496e+00,-6.884058649752821779e+00,-6.893060272030547608e+00,-6.892875753162970298e+00,-6.900097338107242351e+00,-6.889376348173701814e+00,-6.954908192057258631e+00,-6.904194178101811019e+00,-6.923220842306834655e+00,-6.897128324978018199e+00,-6.929899619160362789e+00,-6.9217005075
40407443e+00,-6.881319172192613465e+00,-6.892138017976430220e+00,-6.896572615044295418e+00,-6.916775322361814915e+00,-6.910746668081658939e+00,-6.922840541864905717e+00,-6.903261603195092633e+00,-6.911686252854345014e+00,-6.920182480683429560e+00,-6.904380797492107646e+00,-6.914698870459531577e+00,-6.929516770763649802e+00,-6.938939011415925506e+00,-6.911498265260972218e+00,-6.878951013694475591e+00,-6.885889146328377564e+00,-6.894722475756134727e+00,-6.931816062828556113e+00,-6.911874275793696398e+00,-6.918477450432346743e+00,-6.902888816611092437e+00,-6.891216613691209858e+00,-6.874231455106562194e+00,-6.892506817538050612e+00,-6.884424481108649019e+00,-6.889192507746672689e+00,-6.922650445865855318e+00,-6.898055195004518581e+00,-6.925124516972736899e+00,-6.879679080708800853e+00,-6.884973479200944269e+00,-6.922270362242882058e+00,-6.885889146328377564e+00,-6.873145482149155328e+00,-6.933544007964265177e+00,-6.893060272030547608e+00,-6.906061941065411602e+00,-6.877496467983482020e+00,-6.913756452144562914e+00,-6.905875007706608670e+00,-6.927222753317174408e+00,-6.898982924916127857e+00,-6.905314417203280541e+00,-6.887539463013563790e+00,-6.839352474714404551e+00,-6.918477450432346743e+00,-6.911686252854345014e+00,-6.874593708325519970e+00,-6.947476121861134502e+00,-6.915642177764647514e+00,-6.921890423028585815e+00,-6.920372108066191075e+00,-6.880043313093086255e+00,-6.880407678190939791e+00,-6.914133312898977479e+00,-6.879497014254042142e+00,-6.903075192531863635e+00,-6.910746668081658939e+00,-6.908307871262717725e+00,-6.930091098337587852e+00,-6.911874275793696398e+00,-6.919234882773045214e+00,-6.909245166298061847e+00,-6.909620330436771596e+00,-6.966558042443999454e+00,-6.914698870459531577e+00,-6.896757817378373900e+00,-6.922270362242882058e+00,-6.908120517632191593e+00,-6.888090174362956120e+00,-6.896572615044295418e+00,-6.883510153532835574e+00,-6.898426183693619151e+00,-6.932583669851469566e+00,-6.922840541864905717e+00,-6.889192507746672689e+00,-6.89103243
4668680651e+00,-6.889928072325627539e+00,-6.954908192057258631e+00,-6.913568075012991088e+00,-6.898982924916127857e+00,-6.903821043769124088e+00,-6.916019749979817277e+00,-6.918477450432346743e+00,-6.906061941065411602e+00,-6.912626721276902586e+00,-6.914887460747690895e+00,-6.901399058421224808e+00,-6.890848289561816387e+00,-6.896017213752596575e+00,-6.888090174362956120e+00,-6.916586375738674519e+00,-6.897128324978018199e+00,-6.935467455373968804e+00,-6.884058649752821779e+00,-6.898982924916127857e+00,-6.909995635376436240e+00,-6.880772176099112514e+00,-6.894907335725234176e+00,-6.925505687169842162e+00,-6.904567451715701409e+00,-6.911686252854345014e+00,-6.918666754739584590e+00,-6.909995635376436240e+00,-6.900840970786983775e+00,-6.910183340679662223e+00,-6.900097338107242351e+00,-6.905875007706608670e+00,-6.860740828705301553e+00,-6.897684343896987258e+00,-6.953534969673734523e+00,-6.907558667256891383e+00,-6.904194178101811019e+00,-6.918856094889727970e+00,-6.931624253103214173e+00,-6.890296057610369118e+00,-6.886255648073364100e+00,-6.911686252854345014e+00,-6.920751470741455691e+00,-6.893798688161570709e+00,-6.934890032588219455e+00,-6.883875784249889662e+00,-6.895832148516571891e+00,-6.911498265260972218e+00,-6.899168574225116757e+00,-6.896017213752596575e+00,-6.924362612173974085e+00,-6.905875007706608670e+00,-6.931049044574269402e+00,-6.885339645444508605e+00,-6.888824928253470148e+00,-6.905688109285357257e+00,-6.920182480683429560e+00,-6.901585156867840354e+00,-6.908307871262717725e+00,-6.896943054018711550e+00,-6.904567451715701409e+00,-6.913003156440391095e+00,-6.889928072325627539e+00,-6.918288181954446614e+00,-6.941841209447934702e+00,-6.903448048613736177e+00,-6.885889146328377564e+00,-6.907371453926755223e+00,-6.929708176640394512e+00,-6.905501245788597586e+00,-6.900469085323418739e+00,-6.927222753317174408e+00,-6.918477450432346743e+00,-6.921890423028585815e+00,-6.914698870459531577e+00,-6.923791564285609823e+00,-6.912250427763526162e+00,-6.929325
401516104321e+00,-6.940679319309477791e+00,-6.942616553545702729e+00,-6.933544007964265177e+00,-6.928560290537692623e+00,-6.903634528800756343e+00,-6.901957457691201725e+00,-6.882596661188321363e+00,-6.916019749979817277e+00,-6.940485802134224613e+00,-6.914510315730963086e+00,-6.956086742576703230e+00,-6.922270362242882058e+00,-6.911498265260972218e+00,-6.893429411938763351e+00,-6.884607446986022339e+00,-6.883875784249889662e+00,-6.846196407719229882e+00,-6.896387447003769822e+00,-6.901585156867840354e+00,-6.907745915642420798e+00,-6.919803333759457331e+00,-6.927222753317174408e+00,-6.921700507540407443e+00,-6.879132980758939198e+00,-6.898055195004518581e+00,-6.899539976272098585e+00,-6.877496467983482020e+00,-6.909807965299879839e+00,-6.943198456384351402e+00,-6.885889146328377564e+00,-6.900469085323418739e+00,-6.909432730773900744e+00,-6.897498969904427568e+00,-6.900840970786983775e+00,-6.912814921145718827e+00,-6.888090174362956120e+00,-6.895277158214657476e+00,-6.923981877351176095e+00,-6.879132980758939198e+00,-6.898426183693619151e+00,-6.929134068883739062e+00,-6.925124516972736899e+00,-6.910371081222784539e+00,-6.884424481108649019e+00,-6.906622950897798319e+00,-6.917153322744317379e+00,-6.924743492010993151e+00,-6.931240743990173314e+00,-6.917531466065138801e+00,-6.902516168945144415e+00,-6.906248909374827605e+00,-6.886805652670600963e+00,-6.874593708325519970e+00,-6.883144656452744314e+00,-6.924933986344361969e+00,-6.901026965393675994e+00,-6.932583669851469566e+00,-6.927031822529004401e+00,-6.904567451715701409e+00,-6.931432480161788590e+00,-6.920751470741455691e+00,-6.903261603195092633e+00,-6.948841043590052280e+00,-6.892138017976430220e+00,-6.896387447003769822e+00,-6.931240743990173314e+00,-6.909057636996060126e+00,-6.908870142854702578e+00,-6.898055195004518581e+00,-6.925315083909950431e+00,-6.911498265260972218e+00,-6.886255648073364100e+00,-6.941066466042922656e+00,-6.888273812229600779e+00,-6.896943054018711550e+00,-6.888273812229600779e+00,-6.9025
16168945144415e+00,-6.935082469806172156e+00,-6.893060272030547608e+00,-6.906248909374827605e+00,-6.909245166298061847e+00,-6.901399058421224808e+00,-6.933736186306358462e+00,-6.904007593531813924e+00,-6.894722475756134727e+00,-6.895277158214657476e+00,-6.907558667256891383e+00,-6.927413720566871547e+00,-6.892138017976430220e+00,-6.919803333759457331e+00,-6.889928072325627539e+00,-6.898426183693619151e+00,-6.931432480161788590e+00,-6.916775322361814915e+00,-6.884241548701663049e+00,-6.895277158214657476e+00,-6.918288181954446614e+00,-6.918098949292323496e+00,-6.914133312898977479e+00,-6.937394609564551118e+00,-6.901957457691201725e+00,-6.887539463013563790e+00,-6.900840970786983775e+00,-6.940872873940875820e+00,-6.862887136093361207e+00,-6.898611729663164382e+00,-6.903261603195092633e+00,-6.922650445865855318e+00,-6.904940864714813387e+00,-6.909995635376436240e+00,-6.896757817378373900e+00,-6.907558667256891383e+00,-6.896943054018711550e+00,-6.885339645444508605e+00,-6.928369104226050013e+00,-6.926268463713309842e+00,-6.867013812802751360e+00,-6.896572615044295418e+00,-6.922840541864905717e+00,-6.902329897174091400e+00,-6.938166512343132197e+00,-6.934505269212946743e+00,-6.898797310066406396e+00,-6.907371453926755223e+00,-6.892875753162970298e+00,-6.888824928253470148e+00,-6.884607446986022339e+00,-6.889376348173701814e+00,-6.930665755953247853e+00,-6.897684343896987258e+00,-6.951576488828381528e+00,-6.925696326766264121e+00,-6.879314980941442670e+00,-6.883692952180634705e+00,-6.912814921145718827e+00,-6.908682683860808638e+00,-6.881136806914456727e+00,-6.923981877351176095e+00,-6.909432730773900744e+00,-6.923030674007318197e+00,-6.910746668081658939e+00,-6.903261603195092633e+00,-6.912250427763526162e+00,-6.915642177764647514e+00,-6.904007593531813924e+00,-6.903821043769124088e+00,-6.882779292911960667e+00,-6.909432730773900744e+00,-6.879497014254042142e+00,-6.953927126541506354e+00,-6.917531466065138801e+00,-6.911498265260972218e+00,-6.902516168945144415e+00,-6.86
8093143722427740e+00,-6.894907335725234176e+00,-6.926077715024128167e+00,-6.910371081222784539e+00,-6.890112048041352466e+00,-6.953731028884243770e+00,-6.890112048041352466e+00,-6.911122396059028006e+00,-6.930091098337587852e+00,-6.885522778857710691e+00,-6.893983377422415515e+00,-6.922840541864905717e+00,-6.925124516972736899e+00,-6.920182480683429560e+00,-6.890664178358129277e+00,-6.932199792687447015e+00,-6.899354258006173524e+00,-6.892506817538050612e+00,-6.894722475756134727e+00,-6.926650070281574756e+00,-6.901399058421224808e+00,-6.910934514423903607e+00,-6.928560290537692623e+00,-6.914887460747690895e+00,-6.915830946052135886e+00,-6.884058649752821779e+00,-6.883327388294274485e+00,-6.907933199096476073e+00,-6.915642177764647514e+00,-6.889192507746672689e+00,-6.891953669188001896e+00,-6.933736186306358462e+00,-6.926077715024128167e+00,-6.889008701110736155e+00,-6.877133161894109392e+00,-6.937780486474606434e+00,-6.917153322744317379e+00,-6.897128324978018199e+00,-6.942616553545702729e+00,-6.915642177764647514e+00,-6.916775322361814915e+00,-6.915264748056449307e+00,-6.907933199096476073e+00,-6.938359581172770518e+00,-6.919992889252394974e+00,-6.901026965393675994e+00,-6.897498969904427568e+00,-6.903261603195092633e+00,-6.920561771414318386e+00,-6.943004451145601053e+00,-6.933928401588064006e+00,-6.937587529406954445e+00,-6.904567451715701409e+00,-6.932967694433763484e+00,-6.890296057610369118e+00,-6.906997132380180204e+00,-6.916208589561154696e+00,-6.900283194440831380e+00,-6.930857381899988567e+00,-6.875681256012919462e+00,-6.928942772852549226e+00,-6.897498969904427568e+00,-6.939712107704913535e+00,-6.859669400031377151e+00,-6.942422661155370989e+00,-6.889376348173701814e+00,-6.952947022572036673e+00,-6.891032434668680651e+00,-6.907371453926755223e+00,-6.888273812229600779e+00,-6.885339645444508605e+00,-6.889928072325627539e+00,-6.927795764506569753e+00,-6.913003156440391095e+00,-6.903075192531863635e+00,-6.898611729663164382e+00,-6.911874275793696398e+00,-6.
937587529406954445e+00,-6.938166512343132197e+00,-6.897498969904427568e+00,-6.874050377695473557e+00,-6.909057636996060126e+00,-6.904940864714813387e+00,-6.932199792687447015e+00,-6.888090174362956120e+00,-6.901585156867840354e+00,-6.927031822529004401e+00,-6.893244824951620942e+00,-6.946891725234003445e+00,-6.919234882773045214e+00,-6.904380797492107646e+00,-6.906810024137516990e+00,-6.909995635376436240e+00,-6.933736186306358462e+00,-6.879861180317789149e+00,-6.915076086608856087e+00,-6.878223476376099654e+00,-6.927604724292024230e+00,-6.892322400755601208e+00,-6.919234882773045214e+00,-6.930474166719976736e+00,-6.927222753317174408e+00,-6.904940864714813387e+00,-6.910371081222784539e+00,-6.902516168945144415e+00,-6.914698870459531577e+00,-6.942422661155370989e+00,-6.946307669926833128e+00,-6.895647117523353131e+00,-6.910746668081658939e+00,-6.881501570733931317e+00,-6.888824928253470148e+00,-6.909620330436771596e+00,-6.925887002713071894e+00,-6.919424330533392009e+00,-6.908307871262717725e+00,-6.897869752259435217e+00,-6.924553033958803283e+00,-6.922840541864905717e+00,-6.915264748056449307e+00,-6.905314417203280541e+00,-6.938552687285179488e+00,-6.902516168945144415e+00,-6.950207830876927062e+00,-6.892875753162970298e+00,-6.848664674134555597e+00,-6.882779292911960667e+00,-6.913568075012991088e+00,-6.924362612173974085e+00,-6.883875784249889662e+00,-6.898797310066406396e+00,-6.933736186306358462e+00,-6.926650070281574756e+00,-6.908307871262717725e+00,-6.923601287432214590e+00,-6.914510315730963086e+00,-6.876588450123945151e+00,-6.943780698031076781e+00,-6.908682683860808638e+00,-6.879679080708800853e+00,-6.904754140785598793e+00,-6.905688109285357257e+00,-6.887172490532380564e+00,-6.908870142854702578e+00,-6.903261603195092633e+00,-6.920941206061254292e+00,-6.896572615044295418e+00,-6.911874275793696398e+00,-6.912814921145718827e+00,-6.896387447003769822e+00,-6.890296057610369118e+00,-6.893429411938763351e+00,-6.890112048041352466e+00,-6.904940864714813387e+00,-
6.901212994600797401e+00,-6.914133312898977479e+00,-6.893983377422415515e+00,-6.947865909431847342e+00,-6.909995635376436240e+00,-6.911686252854345014e+00,-6.912062334092322402e+00,-6.891585073533256889e+00,-6.894537649953926817e+00,-6.910183340679662223e+00,-6.904940864714813387e+00,-6.877678170536903224e+00,-6.886805652670600963e+00,-6.916775322361814915e+00,-6.907371453926755223e+00,-6.912250427763526162e+00,-6.910558857019035273e+00,-6.950794169210935181e+00,-6.908120517632191593e+00,-6.897869752259435217e+00,-6.910558857019035273e+00,-6.896387447003769822e+00,-6.913191427174263382e+00,-6.916775322361814915e+00,-6.941260095630129356e+00,-6.945918489147450359e+00,-6.894722475756134727e+00,-6.907371453926755223e+00,-6.936623302160327853e+00,-6.898611729663164382e+00,-6.895092229873862166e+00,-6.923220842306834655e+00,-6.931816062828556113e+00,-6.927604724292024230e+00,-6.888824928253470148e+00,-6.918288181954446614e+00,-6.912062334092322402e+00,-6.883875784249889662e+00,-6.911686252854345014e+00,-6.923220842306834655e+00,-6.923791564285609823e+00,-6.925315083909950431e+00,-6.931240743990173314e+00,-6.919613814190993750e+00,-6.896017213752596575e+00,-6.915453445103899810e+00,-6.937587529406954445e+00,-6.902329897174091400e+00,-6.888457483825305872e+00,-6.933736186306358462e+00,-6.918856094889727970e+00,-6.904754140785598793e+00,-6.870435698176857287e+00,-6.893244824951620942e+00,-6.891953669188001896e+00,-6.915264748056449307e+00,-6.869353837270759655e+00,-6.922840541864905717e+00,-6.886989054780238817e+00,-6.904194178101811019e+00,-6.924362612173974085e+00,-6.921510628113273000e+00,-6.912062334092322402e+00,-6.937008881498051949e+00,-6.929708176640394512e+00,-6.902888816611092437e+00,-6.941841209447934702e+00,-6.944946199328512648e+00,-6.944557547943906783e+00,-6.910183340679662223e+00,-6.904754140785598793e+00,-6.919045470896355354e+00,-6.881866467654599973e+00,-6.919992889252394974e+00,-6.891769354377784040e+00,-6.911874275793696398e+00,-6.917342376530681847e+00
,-6.902329897174091400e+00,-6.881684002550544577e+00,-6.883692952180634705e+00,-6.919613814190993750e+00,-6.856284096130185191e+00,-6.920941206061254292e+00,-6.893983377422415515e+00,-6.904194178101811019e+00,-6.884241548701663049e+00,-6.906435912647930664e+00,-6.902702475419832950e+00,-6.912062334092322402e+00,-6.902329897174091400e+00,-6.896202313244101489e+00,-6.924553033958803283e+00,-6.904007593531813924e+00,-6.914321796548577481e+00,-6.903075192531863635e+00,-6.888090174362956120e+00,-6.921130977387376149e+00,-6.903634528800756343e+00,-6.927222753317174408e+00,-6.930474166719976736e+00,-6.901212994600797401e+00,-6.939132229463091051e+00,-6.890664178358129277e+00,-6.882961957996037938e+00,-6.909807965299879839e+00,-6.934312943027142850e+00,-6.908307871262717725e+00,-6.923601287432214590e+00,-6.890848289561816387e+00,-6.897313630269016116e+00,-6.924362612173974085e+00,-6.876225473713425629e+00,-6.882779292911960667e+00,-6.869534066161939734e+00,-6.881501570733931317e+00,-6.887355959939371886e+00,-6.874412565312617573e+00,-6.958251032649267032e+00,-6.901957457691201725e+00,-6.936045211769432228e+00,-6.913756452144562914e+00,-6.932391712849248933e+00,-6.906997132380180204e+00,-6.907745915642420798e+00,-6.891953669188001896e+00,-6.935274944063333891e+00,-6.869534066161939734e+00,-6.880954474887330363e+00,-6.902702475419832950e+00,-6.927222753317174408e+00,-6.909245166298061847e+00,-6.891769354377784040e+00,-6.898055195004518581e+00,-6.912626721276902586e+00,-6.923981877351176095e+00,-6.921700507540407443e+00,-6.855928414106161384e+00,-6.916208589561154696e+00,-6.911310313000292638e+00,-6.933159762042151897e+00,-6.900469085323418739e+00,-6.891216613691209858e+00,-6.886622284191128429e+00,-6.907745915642420798e+00,-6.913568075012991088e+00,-6.947086486165575892e+00,-6.917531466065138801e+00,-6.908870142854702578e+00,-6.928560290537692623e+00,-6.918288181954446614e+00,-6.918477450432346743e+00,-6.907371453926755223e+00,-6.919613814190993750e+00,-6.908870142854702578e+
00 +-6.928560902152458567e+00,-6.913154564945870106e+00,-6.910211763484261027e+00,-6.917030181128522415e+00,-6.902345629445600039e+00,-6.906545398240682587e+00,-6.943828374283246063e+00,-6.943448579367933249e+00,-6.951646127895054761e+00,-6.892193257981583443e+00,-6.899252749694541720e+00,-6.906362432363309267e+00,-6.933624204109005262e+00,-6.927065574771024359e+00,-6.882142922128080542e+00,-6.907827097583037812e+00,-6.900525130126130335e+00,-6.905813735504549911e+00,-6.910946652365396403e+00,-6.916475601208587065e+00,-6.903074758169116976e+00,-6.903986917312906257e+00,-6.906545398240682587e+00,-6.897075285578891268e+00,-6.913154564945870106e+00,-6.905448104787495822e+00,-6.937957701234477526e+00,-6.967078532933866697e+00,-6.875912372377445791e+00,-6.913338777896562704e+00,-6.912418052299798887e+00,-6.908560235445788678e+00,-6.897075285578891268e+00,-6.908010331665090931e+00,-6.918510566298955666e+00,-6.920735261321066645e+00,-6.884646052346200307e+00,-6.869191441524755248e+00,-6.900525130126130335e+00,-6.873608019678243863e+00,-6.884467049392172910e+00,-6.908193599328024348e+00,-6.900343262343508144e+00,-6.906728397600696567e+00,-6.909660951021972863e+00,-6.915736639416230958e+00,-6.910211763484261027e+00,-6.934188379018186410e+00,-6.910946652365396403e+00,-6.915367363193423600e+00,-6.867078392666686071e+00,-6.921849467564470615e+00,-6.898707939002775902e+00,-6.937391396358560058e+00,-6.917955165007256824e+00,-6.908927006034899065e+00,-6.884467049392172910e+00,-6.907643897069563721e+00,-6.892915019109992159e+00,-6.956635583649886101e+00,-6.879289853145674982e+00,-6.910946652365396403e+00,-6.919807703514095465e+00,-6.889851125571400559e+00,-6.917585068778013380e+00,-6.890571197609714815e+00,-6.904717244166620915e+00,-6.892373649431517535e+00,-6.888592245807988235e+00,-6.927626060540017505e+00,-6.919251581523676364e+00,-6.901617031963461102e+00,-6.910028125617616368e+00,-6.891111592110842565e+00,-6.926318748746767895e+00,-6.928373863902590912e+00,-6.90489990925069818
7e+00,-6.877333079418043837e+00,-6.870955732737517252e+00,-6.926692092040259041e+00,-6.949924792479114544e+00,-6.917030181128522415e+00,-6.930245822517377974e+00,-6.897437867163718650e+00,-6.920364134948279400e+00,-6.905082607707404563e+00,-6.888053210667814952e+00,-6.910762879508130396e+00,-6.893998638504450938e+00,-6.922407036578078987e+00,-6.908010331665090931e+00,-6.892373649431517535e+00,-6.896169406361222443e+00,-6.903439521988591565e+00,-6.930808094109362827e+00,-6.890931428161159999e+00,-6.890031094977087989e+00,-6.924454120199804663e+00,-6.915367363193423600e+00,-6.874316485878617655e+00,-6.913707305632444289e+00,-6.904899909250698187e+00,-6.899616121791563472e+00,-6.916290809560642572e+00,-6.872192591274913553e+00,-6.890751296664682002e+00,-6.920364134948279400e+00,-6.890031094977087989e+00,-6.909844521467647382e+00,-6.903804418909260221e+00,-6.908193599328024348e+00,-6.894360106066672600e+00,-6.884646052346200307e+00,-6.955289817802253083e+00,-6.909477414268224038e+00,-6.919436921159087817e+00,-6.873962190037888220e+00,-6.913523024787917137e+00,-6.925013143786523884e+00,-6.922407036578078987e+00,-6.924454120199804663e+00,-6.899071113148769641e+00,-6.895083433403815576e+00,-6.899071113148769641e+00,-6.898889509588840951e+00,-6.900343262343508144e+00,-6.895988328950133806e+00,-6.878577855963584398e+00,-6.933060347313688254e+00,-6.879824183909121871e+00,-6.901434965508702390e+00,-6.910211763484261027e+00,-6.891652278796087217e+00,-6.924267848428751648e+00,-6.943828374283246063e+00,-6.932872465678563856e+00,-6.887694015261653036e+00,-6.889311411606694691e+00,-6.901981264347746503e+00,-6.905630903435294954e+00,-6.910579140417121735e+00,-6.901617031963461102e+00,-6.904534612442981611e+00,-6.902892426141990612e+00,-6.924454120199804663e+00,-6.907827097583037812e+00,-6.893095540818217515e+00,-6.909293911194032134e+00,-6.891291788525419904e+00,-6.899797857366802845e+00,-6.901252932196102918e+00,-6.916290809560642572e+00,-6.915736639416230958e+00,-6.944968625261978
445e+00,-6.929122226893550618e+00,-6.944968625261978445e+00,-6.928373863902590912e+00,-6.886617202523002135e+00,-6.922778922041644023e+00,-6.923150945855457650e+00,-6.934188379018186410e+00,-6.909844521467647382e+00,-6.896169406361222443e+00,-6.887873596837065904e+00,-6.907277596699168853e+00,-6.888412535141677395e+00,-6.896169406361222443e+00,-6.886258522461767839e+00,-6.911130459001332937e+00,-6.899071113148769641e+00,-6.913523024787917137e+00,-6.927812958961268919e+00,-6.938902255947191122e+00,-6.918881005273371798e+00,-6.872546260597786727e+00,-6.901617031963461102e+00,-6.964554504800362977e+00,-6.941551765445653999e+00,-6.932684619336319187e+00,-6.898707939002775902e+00,-6.898889509588840951e+00,-6.901799131572449397e+00,-6.926878815969473635e+00,-6.934941107695051343e+00,-6.878222047384845439e+00,-6.895445293367959749e+00,-6.906179499956323298e+00,-6.909477414268224038e+00,-6.918140264498761738e+00,-6.905630903435294954e+00,-6.903804418909260221e+00,-6.918325398258430070e+00,-6.898344896704362128e+00,-6.930808094109362827e+00,-6.901252932196102918e+00,-6.899979625975863229e+00,-6.937957701234477526e+00,-6.918325398258430070e+00,-6.921477927526758833e+00,-6.874670907289379684e+00,-6.914075969231090468e+00,-6.919622295151647506e+00,-6.875202775044140679e+00,-6.909293911194032134e+00,-6.886437846410961683e+00,-6.876267359956608871e+00,-6.907827097583037812e+00,-6.909844521467647382e+00,-6.903621953805204825e+00,-6.925013143786523884e+00,-6.894540888856644401e+00,-6.926692092040259041e+00,-6.941172834027705463e+00,-6.890751296664682002e+00,-6.909293911194032134e+00,-6.927626060540017505e+00,-6.926505402970361658e+00,-6.914260352010261457e+00,-6.928373863902590912e+00,-6.890211096777591138e+00,-6.918695768633034149e+00,-6.886079230663888495e+00,-6.890391130984577117e+00,-6.901434965508702390e+00,-6.879111806318888966e+00,-6.895445293367959749e+00,-6.892193257981583443e+00,-6.926692092040259041e+00,-6.873076999221478189e+00,-6.892554073428398098e+00,-6.8921932579815
83443e+00,-6.901617031963461102e+00,-6.916660427010794976e+00,-6.888053210667814952e+00,-6.905813735504549911e+00,-6.885899971005793319e+00,-6.890391130984577117e+00,-6.901252932196102918e+00,-6.944968625261978445e+00,-6.903439521988591565e+00,-6.933060347313688254e+00,-6.901070932013599446e+00,-6.929122226893550618e+00,-6.901070932013599446e+00,-6.916845286979894425e+00,-6.924826767865752686e+00,-6.906362432363309267e+00,-6.885899971005793319e+00,-6.893998638504450938e+00,-6.940415401687006991e+00,-6.892193257981583443e+00,-6.897981986157827095e+00,-6.904352014067599796e+00,-6.911498173658905841e+00,-6.929683866897081046e+00,-6.873608019678243863e+00,-6.910579140417121735e+00,-6.936825412002351143e+00,-6.896894044074089791e+00,-6.946110177897359961e+00,-6.894721704334941137e+00,-6.876800077717586390e+00,-6.913338777896562704e+00,-6.894360106066672600e+00,-6.903074758169116976e+00,-6.943828374283246063e+00,-6.908010331665090931e+00,-6.934564672531562834e+00,-6.919066276232678447e+00,-6.868486595680595386e+00,-6.893998638504450938e+00,-6.901434965508702390e+00,-6.921477927526758833e+00,-6.921663680290365050e+00,-6.892734529983972180e+00,-6.890211096777591138e+00,-6.922407036578078987e+00,-6.896350516567277822e+00,-6.900343262343508144e+00,-6.887694015261653036e+00,-6.917770099771232140e+00,-6.908376900584149993e+00,-6.894179355953212962e+00,-6.919622295151647506e+00,-6.934564672531562834e+00,-6.923150945855457650e+00,-6.890031094977087989e+00,-6.929122226893550618e+00,-6.920549680917824631e+00,-6.915551984259211693e+00,-6.893276095120414837e+00,-6.874848165111345111e+00,-6.909477414268224038e+00,-6.880358800334580138e+00,-6.903804418909260221e+00,-6.901799131572449397e+00,-6.901252932196102918e+00,-6.904534612442981611e+00,-6.905265339548934733e+00,-6.901799131572449397e+00,-6.919066276232678447e+00,-6.903986917312906257e+00,-6.919436921159087817e+00,-6.903986917312906257e+00,-6.930433211255861536e+00,-6.901434965508702390e+00,-6.896712835411820564e+00,-6.89997962597
5863229e+00,-6.952603707207908101e+00,-6.901981264347746503e+00,-6.902163430301433422e+00,-6.889491283897852369e+00,-6.910762879508130396e+00,-6.928935083634840453e+00,-6.912602129612789525e+00,-6.927999892320071851e+00,-6.878933791187133551e+00,-6.910395435079966120e+00,-6.904169449028298544e+00,-6.908376900584149993e+00,-6.931745916554540088e+00,-6.895988328950133806e+00,-6.897981986157827095e+00,-6.889131571663758891e+00,-6.901434965508702390e+00,-6.908927006034899065e+00,-6.921106525479777005e+00,-6.874139322267559749e+00,-6.903621953805204825e+00,-6.897981986157827095e+00,-6.892012899066854104e+00,-6.912049999296012714e+00,-6.932121291934322471e+00,-6.907094496817672180e+00,-6.900161427630759903e+00,-6.916290809560642572e+00,-6.903621953805204825e+00,-6.907827097583037812e+00,-6.916290809560642572e+00,-6.886617202523002135e+00,-6.914629219590983311e+00,-6.902345629445600039e+00,-6.905082607707404563e+00,-6.909293911194032134e+00,-6.902163430301433422e+00,-6.889491283897852369e+00,-6.912234008865029367e+00,-6.884825087348021455e+00,-6.928186860629487853e+00,-6.903257123447273713e+00,-6.909844521467647382e+00,-6.901252932196102918e+00,-6.872723142176031530e+00,-6.895264347018059681e+00,-6.885362384755055132e+00,-6.930058468886851841e+00,-6.931370682028560992e+00,-6.947062468227397147e+00,-6.902163430301433422e+00,-6.911682081705400549e+00,-6.889131571663758891e+00,-6.901617031963461102e+00,-6.924454120199804663e+00,-6.903804418909260221e+00,-6.899797857366802845e+00,-6.936259747803237730e+00,-6.926692092040259041e+00,-6.897800580261638004e+00,-6.920178623399653617e+00,-6.941551765445653999e+00,-6.898526401378605399e+00,-6.926692092040259041e+00,-6.904717244166620915e+00,-6.925758995023784337e+00,-6.912970385923340899e+00,-6.934941107695051343e+00,-6.874139322267559749e+00,-6.942120431938089808e+00,-6.906179499956323298e+00,-6.917770099771232140e+00,-6.897256559938130138e+00,-6.899979625975863229e+00,-6.898889509588840951e+00,-6.884825087348021455e+00,-6.912786240
816476635e+00,-6.920735261321066645e+00,-6.945539238686874839e+00,-6.884109139581623182e+00,-6.904169449028298544e+00,-6.921477927526758833e+00,-6.919436921159087817e+00,-6.937768897306796134e+00,-6.902163430301433422e+00,-6.920178623399653617e+00,-6.880715370136895714e+00,-6.906179499956323298e+00,-6.940604705994244839e+00,-6.902710127353772762e+00,-6.912602129612789525e+00,-6.871662321708798871e+00,-6.947634278020924370e+00,-6.915921328677075763e+00,-6.915367363193423600e+00,-6.886796590809428409e+00,-6.901799131572449397e+00,-6.922407036578078987e+00,-6.902345629445600039e+00,-6.920735261321066645e+00,-6.901434965508702390e+00,-6.904717244166620915e+00,-6.926692092040259041e+00,-6.914260352010261457e+00,-6.908560235445788678e+00,-6.908193599328024348e+00,-6.885541548062889206e+00,-6.888412535141677395e+00,-6.914629219590983311e+00,-6.914075969231090468e+00,-6.880537069342985035e+00,-6.902163430301433422e+00,-6.917030181128522415e+00,-6.921292209260833772e+00,-6.912970385923340899e+00,-6.878399935849246916e+00,-6.905082607707404563e+00,-6.886079230663888495e+00,-6.897075285578891268e+00,-6.895264347018059681e+00,-6.899797857366802845e+00,-6.893276095120414837e+00,-6.878222047384845439e+00,-6.885183253540962056e+00,-6.944778493119565965e+00,-6.906179499956323298e+00,-6.900161427630759903e+00,-6.911682081705400549e+00,-6.901617031963461102e+00,-6.902710127353772762e+00,-6.920178623399653617e+00,-6.889851125571400559e+00,-6.904717244166620915e+00,-6.906179499956323298e+00,-6.925013143786523884e+00,-6.921106525479777005e+00,-6.901617031963461102e+00,-6.895264347018059681e+00,-6.906362432363309267e+00,-6.909844521467647382e+00,-6.939847703687087588e+00,-6.917585068778013380e+00,-6.898889509588840951e+00,-6.917030181128522415e+00,-6.888053210667814952e+00,-6.893276095120414837e+00,-6.924826767865752686e+00,-6.947824953967732142e+00,-6.925758995023784337e+00,-6.900707030990661295e+00,-6.931370682028560992e+00,-6.906362432363309267e+00,-6.878933791187133551e+00,-6.8850041
54409113397e+00,-6.908010331665090931e+00,-6.886258522461767839e+00,-6.899071113148769641e+00,-6.876444901013444522e+00,-6.921849467564470615e+00,-6.914260352010261457e+00,-6.925945544786474173e+00,-6.927812958961268919e+00,-6.887694015261653036e+00,-6.924640426674493199e+00,-6.876800077717586390e+00,-6.885362384755055132e+00,-6.885720743475966188e+00,-6.972341190046229542e+00,-6.914444768792710860e+00,-6.914260352010261457e+00,-6.914813704417630547e+00,-6.928747975392177239e+00,-6.903986917312906257e+00,-6.899071113148769641e+00,-6.911682081705400549e+00,-6.881607351286037400e+00,-6.935317684615338152e+00,-6.906728397600696567e+00,-6.908560235445788678e+00,-6.905630903435294954e+00,-6.904899909250698187e+00,-6.884288078474469330e+00,-6.898526401378605399e+00,-6.930058468886851841e+00,-6.951454722018310051e+00,-6.892373649431517535e+00,-6.911314299428362062e+00,-6.877688571778701743e+00,-6.889491283897852369e+00,-6.931370682028560992e+00,-6.908560235445788678e+00,-6.887694015261653036e+00,-6.855014254739828061e+00,-6.940794046144388219e+00,-6.900525130126130335e+00,-6.913154564945870106e+00,-6.908193599328024348e+00,-6.898344896704362128e+00,-6.893817953708584412e+00,-6.910395435079966120e+00,-6.923895408945861973e+00,-6.922592962022511287e+00,-6.899434419238142269e+00,-6.946300563428634334e+00,-6.892915019109992159e+00,-6.936071264153637728e+00,-6.893817953708584412e+00,-6.908927006034899065e+00,-6.906545398240682587e+00,-6.907460730112370939e+00,-6.943448579367933249e+00,-6.925945544786474173e+00,-6.940604705994244839e+00,-6.911498173658905841e+00,-6.929122226893550618e+00,-6.930995588250720374e+00,-6.921106525479777005e+00,-6.904352014067599796e+00,-6.883751357824676731e+00,-6.886258522461767839e+00,-6.913707305632444289e+00,-6.883036178105296443e+00,-6.920549680917824631e+00,-6.897800580261638004e+00,-6.889851125571400559e+00,-6.924267848428751648e+00,-6.880180563100349644e+00,-6.883036178105296443e+00,-6.909110441787040813e+00,-6.888592245807988235e+00,-6.91094
6652365396403e+00,-6.914813704417630547e+00,-6.898526401378605399e+00,-6.920549680917824631e+00,-6.904717244166620915e+00,-6.899616121791563472e+00,-6.893998638504450938e+00,-6.905448104787495822e+00,-6.927252368457940790e+00,-6.894179355953212962e+00,-6.892012899066854104e+00,-6.886617202523002135e+00,-6.925199554449752881e+00,-6.926132129356471268e+00,-6.952795333154648816e+00,-6.873962190037888220e+00,-6.913707305632444289e+00,-6.917215109469317724e+00,-6.949542675546684478e+00,-6.905630903435294954e+00,-6.926878815969473635e+00,-6.930245822517377974e+00,-6.883930232702176966e+00,-6.919436921159087817e+00,-6.914629219590983311e+00,-6.918140264498761738e+00,-6.902710127353772762e+00,-6.884288078474469330e+00,-6.914260352010261457e+00,-6.898707939002775902e+00,-6.907094496817672180e+00,-6.945919828605836344e+00,-6.914075969231090468e+00,-6.911314299428362062e+00,-6.908560235445788678e+00,-6.895626272465367634e+00,-6.903621953805204825e+00,-6.915736639416230958e+00,-6.952603707207908101e+00,-6.860765904582793340e+00,-6.912602129612789525e+00,-6.901070932013599446e+00,-6.889131571663758891e+00,-6.923895408945861973e+00,-6.905448104787495822e+00,-6.937014037863516336e+00,-6.896531659580180218e+00,-6.859193989142536196e+00,-6.908743603925261212e+00,-6.880358800334580138e+00,-6.903074758169116976e+00,-6.925572480055416591e+00,-6.911682081705400549e+00,-6.895264347018059681e+00,-6.924454120199804663e+00,-6.927065574771024359e+00,-6.888592245807988235e+00,-6.914075969231090468e+00,-6.920178623399653617e+00,-6.917030181128522415e+00,-6.924640426674493199e+00,-6.871662321708798871e+00,-6.906728397600696567e+00,-6.896169406361222443e+00,-6.905082607707404563e+00,-6.903804418909260221e+00,-6.903074758169116976e+00,-6.913891620442662145e+00,-6.907643897069563721e+00,-6.906362432363309267e+00,-6.930058468886851841e+00,-6.932496808273695521e+00,-6.928186860629487853e+00,-6.908193599328024348e+00,-6.896894044074089791e+00,-6.917955165007256824e+00,-6.916475601208587065e+00,-6.927
812958961268919e+00,-6.969024437420236140e+00,-6.922035289361902599e+00,-6.906545398240682587e+00,-6.895988328950133806e+00,-6.917030181128522415e+00,-6.888592245807988235e+00,-6.928560902152458567e+00,-6.881607351286037400e+00,-6.899797857366802845e+00,-6.897800580261638004e+00,-6.966884150583172897e+00,-6.889491283897852369e+00,-6.906362432363309267e+00,-6.892915019109992159e+00,-6.931745916554540088e+00,-6.911314299428362062e+00,-6.903621953805204825e+00,-6.901252932196102918e+00,-6.912602129612789525e+00,-6.925385999868396425e+00,-6.895083433403815576e+00,-6.907094496817672180e+00,-6.908376900584149993e+00,-6.898707939002775902e+00,-6.912049999296012714e+00,-6.907094496817672180e+00,-6.908193599328024348e+00,-6.932309032477444788e+00,-6.935882816023427822e+00,-6.893637301553816599e+00,-6.938713273616475163e+00,-6.923337009675885056e+00,-6.910762879508130396e+00,-6.879646041929493805e+00,-6.921849467564470615e+00,-6.868310461818404988e+00,-6.917215109469317724e+00,-6.902892426141990612e+00,-6.920549680917824631e+00,-6.900343262343508144e+00,-6.896894044074089791e+00,-6.929122226893550618e+00,-6.886796590809428409e+00,-6.908376900584149993e+00,-6.887694015261653036e+00,-6.921663680290365050e+00,-6.918695768633034149e+00,-6.900525130126130335e+00,-6.889851125571400559e+00,-6.914075969231090468e+00,-6.899797857366802845e+00,-6.915551984259211693e+00,-6.910762879508130396e+00,-6.886079230663888495e+00,-6.903804418909260221e+00,-6.917585068778013380e+00,-6.896712835411820564e+00,-6.905265339548934733e+00,-6.922778922041644023e+00,-6.871662321708798871e+00,-6.888951764057411609e+00,-6.901617031963461102e+00,-6.865847839641579853e+00,-6.915736639416230958e+00,-6.924640426674493199e+00,-6.923709241208193532e+00,-6.908560235445788678e+00,-6.935129378428923630e+00,-6.874139322267559749e+00,-6.943638458795067692e+00,-6.908193599328024348e+00,-6.918325398258430070e+00,-6.880002357628971055e+00,-6.933248264254952886e+00,-6.905448104787495822e+00,-6.881607351286037400e+00,-6.8
91832572675598456e+00,-6.895807284322138742e+00,-6.931933586631096489e+00,-6.916845286979894425e+00,-6.895083433403815576e+00,-6.919251581523676364e+00,-6.901799131572449397e+00,-6.888592245807988235e+00,-6.913707305632444289e+00,-6.890571197609714815e+00,-6.864618799016321660e+00,-6.894902552513389793e+00,-6.863566536555145703e+00,-6.910028125617616368e+00,-6.921106525479777005e+00,-6.910395435079966120e+00,-6.920364134948279400e+00,-6.943638458795067692e+00,-6.891472017416599982e+00,-6.901799131572449397e+00,-6.916845286979894425e+00,-6.916290809560642572e+00,-6.913154564945870106e+00,-6.900161427630759903e+00,-6.931183117552722095e+00,-6.933060347313688254e+00,-6.907277596699168853e+00,-6.924081611348411158e+00,-6.912049999296012714e+00,-6.885183253540962056e+00,-6.932496808273695521e+00,-6.897981986157827095e+00,-6.910395435079966120e+00,-6.882857463065949588e+00,-6.902710127353772762e+00,-6.931370682028560992e+00,-6.899434419238142269e+00,-6.892734529983972180e+00,-6.909477414268224038e+00,-6.903621953805204825e+00,-6.894179355953212962e+00,-6.880893702727648886e+00,-6.909110441787040813e+00,-6.903439521988591565e+00,-6.932121291934322471e+00,-6.897800580261638004e+00,-6.905448104787495822e+00,-6.879111806318888966e+00,-6.871308964814662801e+00,-6.903439521988591565e+00,-6.919622295151647506e+00,-6.922407036578078987e+00,-6.936636821714191825e+00,-6.913523024787917137e+00,-6.945158793561494903e+00,-6.871485627654093875e+00,-6.931370682028560992e+00,-6.901981264347746503e+00,-6.882142922128080542e+00,-6.927999892320071851e+00,-6.908376900584149993e+00,-6.897256559938130138e+00,-6.905813735504549911e+00,-6.927065574771024359e+00,-6.881964366641490471e+00,-6.888951764057411609e+00,-6.899071113148769641e+00,-6.913338777896562704e+00,-6.887155463951625478e+00,-6.879646041929493805e+00,-6.899252749694541720e+00,-6.937202699311109555e+00,-6.907643897069563721e+00,-6.925572480055416591e+00,-6.953945860606586038e+00,-6.882678779959961801e+00,-6.870955732737517252e+00,-6
.888412535141677395e+00,-6.898526401378605399e+00,-6.904169449028298544e+00,-6.883930232702176966e+00,-6.913154564945870106e+00,-6.911498173658905841e+00,-6.912602129612789525e+00,-6.927252368457940790e+00,-6.898163424968085877e+00,-6.898163424968085877e+00,-6.908010331665090931e+00,-6.915921328677075763e+00,-6.917400072014929790e+00,-6.898526401378605399e+00,-6.924454120199804663e+00,-6.877688571778701743e+00,-6.910395435079966120e+00,-6.906911430455604517e+00,-6.944588397120515566e+00,-6.905813735504549911e+00,-6.899071113148769641e+00,-6.899979625975863229e+00,-6.895264347018059681e+00,-6.912602129612789525e+00,-6.918325398258430070e+00,-6.921106525479777005e+00,-6.905996601007482028e+00,-6.886437846410961683e+00,-6.885541548062889206e+00,-6.917955165007256824e+00,-6.895807284322138742e+00,-6.916106052054344033e+00,-6.885183253540962056e+00,-6.903621953805204825e+00,-6.887873596837065904e+00,-6.919251581523676364e+00,-6.920364134948279400e+00,-6.930058468886851841e+00,-6.928747975392177239e+00,-6.926878815969473635e+00,-6.894540888856644401e+00,-6.929496618511551631e+00,-6.894902552513389793e+00,-6.907460730112370939e+00,-6.937014037863516336e+00,-6.946490985213463532e+00,-6.895807284322138742e+00,-6.913891620442662145e+00,-6.918325398258430070e+00,-6.873962190037888220e+00,-6.917955165007256824e+00,-6.920549680917824631e+00,-6.919993146259178829e+00,-6.918510566298955666e+00,-6.890571197609714815e+00,-6.912049999296012714e+00,-6.897619207267579711e+00,-6.920549680917824631e+00,-6.891111592110842565e+00,-6.881607351286037400e+00,-6.906179499956323298e+00,-6.899979625975863229e+00,-6.907277596699168853e+00,-6.901070932013599446e+00,-6.886437846410961683e+00,-6.884646052346200307e+00,-6.895445293367959749e+00,-6.925013143786523884e+00,-6.932309032477444788e+00,-6.912418052299798887e+00,-6.920364134948279400e+00,-6.909844521467647382e+00,-6.905082607707404563e+00,-6.910579140417121735e+00,-6.873962190037888220e+00,-6.912049999296012714e+00,-6.915182776206281190e+00,
-6.913891620442662145e+00,-6.938524326993334768e+00,-6.930620635115468886e+00,-6.891291788525419904e+00,-6.879111806318888966e+00,-6.895626272465367634e+00,-6.930433211255861536e+00,-6.914629219590983311e+00,-6.900343262343508144e+00,-6.885720743475966188e+00,-6.921849467564470615e+00,-6.904899909250698187e+00,-6.879646041929493805e+00,-6.917585068778013380e+00,-6.898344896704362128e+00,-6.900707030990661295e+00,-6.893998638504450938e+00,-6.908560235445788678e+00,-6.942120431938089808e+00,-6.912602129612789525e+00,-6.934188379018186410e+00,-6.899434419238142269e+00,-6.897437867163718650e+00,-6.934752872400379076e+00,-6.936259747803237730e+00,-6.885362384755055132e+00,-6.885004154409113397e+00,-6.906179499956323298e+00,-6.887873596837065904e+00,-6.900343262343508144e+00,-6.935317684615338152e+00,-6.903621953805204825e+00,-6.893276095120414837e+00,-6.897437867163718650e+00,-6.907094496817672180e+00,-6.918140264498761738e+00,-6.904534612442981611e+00,-6.943638458795067692e+00,-6.911866023580287788e+00,-6.918140264498761738e+00,-6.923709241208193532e+00,-6.907094496817672180e+00,-6.948015666278788416e+00,-6.940604705994244839e+00,-6.930620635115468886e+00,-6.910946652365396403e+00,-6.911314299428362062e+00,-6.898889509588840951e+00,-6.879824183909121871e+00,-6.892734529983972180e+00,-6.917585068778013380e+00,-6.909293911194032134e+00,-6.903621953805204825e+00,-6.926132129356471268e+00,-6.900343262343508144e+00,-6.873076999221478189e+00,-6.885899971005793319e+00,-6.897981986157827095e+00,-6.922407036578078987e+00,-6.895807284322138742e+00,-6.890391130984577117e+00,-6.893637301553816599e+00,-6.904352014067599796e+00,-6.934188379018186410e+00,-6.925945544786474173e+00,-6.908743603925261212e+00,-6.930058468886851841e+00,-6.911682081705400549e+00,-6.917215109469317724e+00,-6.894540888856644401e+00,-6.902345629445600039e+00,-6.901252932196102918e+00,-6.896894044074089791e+00,-6.909660951021972863e+00,-6.872723142176031530e+00,-6.896712835411820564e+00,-6.943828374283246063e+0
0,-6.900888964949135840e+00,-6.935317684615338152e+00,-6.950689464663192041e+00,-6.955481959218925425e+00,-6.926318748746767895e+00,-6.910762879508130396e+00,-6.922778922041644023e+00,-6.879111806318888966e+00,-6.882142922128080542e+00,-6.894721704334941137e+00,-6.969803860686507591e+00,-6.914260352010261457e+00,-6.898707939002775902e+00,-6.903621953805204825e+00,-6.898344896704362128e+00,-6.902527861792346897e+00,-6.935506026267651336e+00,-6.902527861792346897e+00,-6.915182776206281190e+00,-6.914813704417630547e+00,-6.904352014067599796e+00,-6.933060347313688254e+00,-6.893456682028354265e+00,-6.902527861792346897e+00,-6.904717244166620915e+00,-6.901799131572449397e+00,-6.892554073428398098e+00,-6.905813735504549911e+00,-6.859543090210840077e+00,-6.926505402970361658e+00,-6.893998638504450938e+00,-6.870602625389215845e+00,-6.864443345028680810e+00,-6.897437867163718650e+00,-6.908193599328024348e+00,-6.919807703514095465e+00,-6.897075285578891268e+00,-6.905265339548934733e+00,-6.893276095120414837e+00,-6.894360106066672600e+00,-6.900888964949135840e+00,-6.914998223285207857e+00,-6.911866023580287788e+00,-6.889851125571400559e+00,-6.932121291934322471e+00,-6.853972045161384585e+00,-6.916106052054344033e+00,-6.906911430455604517e+00,-6.912049999296012714e+00,-6.866375034100673957e+00,-6.893637301553816599e+00,-6.917030181128522415e+00,-6.925945544786474173e+00,-6.949733715761230002e+00,-6.944208313497542306e+00,-6.922221145695491629e+00,-6.891291788525419904e+00,-6.911314299428362062e+00,-6.912418052299798887e+00,-6.883393704029733939e+00,-6.919993146259178829e+00,-6.903986917312906257e+00,-6.903804418909260221e+00,-6.883572514937675635e+00,-6.895264347018059681e+00,-6.906362432363309267e+00,-6.911130459001332937e+00,-6.933812227048356647e+00,-6.912602129612789525e+00,-6.885899971005793319e+00,-6.919993146259178829e+00,-6.902345629445600039e+00,-6.900343262343508144e+00,-6.913154564945870106e+00,-6.886258522461767839e+00,-6.923150945855457650e+00,-6.916845286979894425e
+00 +-6.937971695702792374e+00,-6.919619120673425883e+00,-6.880676791041638651e+00,-6.965564938954168284e+00,-6.894152873159416828e+00,-6.899412702099109396e+00,-6.926508636798859087e+00,-6.889280536533192389e+00,-6.892887393464185664e+00,-6.892706741309417851e+00,-6.942518019123534501e+00,-6.928566058267152883e+00,-6.897232864723687129e+00,-6.895238846116823694e+00,-6.903786683922222167e+00,-6.893971992268991045e+00,-6.927817415147778490e+00,-6.914067663040809109e+00,-6.935329187558838981e+00,-6.932693643864606514e+00,-6.900504405264303642e+00,-6.921662401778112539e+00,-6.911119439051613966e+00,-6.922778680963794784e+00,-6.886224903707226730e+00,-6.910199898756934189e+00,-6.905431872118910519e+00,-6.902326563202874965e+00,-6.888560723653453621e+00,-6.948803155516831254e+00,-6.906713336825164973e+00,-6.953399103859510433e+00,-6.889100534732689241e+00,-6.917209704254362990e+00,-6.910751521461001801e+00,-6.895238846116823694e+00,-6.891082338822455355e+00,-6.891262697737184695e+00,-6.927069332075673103e+00,-6.907263039083625600e+00,-6.921848361797245275e+00,-6.911303448620630618e+00,-6.896325999693731390e+00,-6.911119439051613966e+00,-6.902873858664861473e+00,-6.902508961744192817e+00,-6.884431824510656384e+00,-6.925948255725074887e+00,-6.878359292901276234e+00,-6.911671569368390777e+00,-6.889100534732689241e+00,-6.885686642278603387e+00,-6.897595841134206651e+00,-6.895601099335781470e+00,-6.909281203239862279e+00,-6.893610328612245652e+00,-6.926695500295618757e+00,-6.909281203239862279e+00,-6.909464874835567372e+00,-6.896507306919319902e+00,-6.931566248029296773e+00,-6.891803969739573432e+00,-6.863688238771922911e+00,-6.927817415147778490e+00,-6.912961060198263397e+00,-6.909832319263731648e+00,-6.915175491809945285e+00,-6.885507286166562935e+00,-6.944228233317096155e+00,-6.905431872118910519e+00,-6.895782275167421815e+00,-6.936460836114161310e+00,-6.898503858993743521e+00,-6.927256300385089105e+00,-6.867379901574006240e+00,-6.892526121783955517e+00,-6.8974143364599633
80e+00,-6.877291487140446691e+00,-6.904334779304535985e+00,-6.871615700353387979e+00,-6.916099620884123667e+00,-6.890000867916761251e+00,-6.900140371769200698e+00,-6.883178579337224434e+00,-6.905066040763083279e+00,-6.932693643864606514e+00,-6.923896207621353938e+00,-6.910199898756934189e+00,-6.902691393560806077e+00,-6.903421453823201048e+00,-6.941379499076452575e+00,-6.917950445028973050e+00,-6.925761531795860293e+00,-6.911487492055400139e+00,-6.905066040763083279e+00,-6.897777378758377154e+00,-6.901232870057034674e+00,-6.907996445790500317e+00,-6.887122650423416204e+00,-6.898685561547164724e+00,-6.899049065731464481e+00,-6.890361228281021155e+00,-6.901961865897591863e+00,-6.908730390777574115e+00,-6.917580006054556918e+00,-6.905980870211205769e+00,-6.876580249557120439e+00,-6.907996445790500317e+00,-6.871615700353387979e+00,-6.890721718551688468e+00,-6.919990315926389357e+00,-6.891623513183999350e+00,-6.885327962217369091e+00,-6.881569568531524794e+00,-6.896144725334492520e+00,-6.906713336825164973e+00,-6.896688647023180962e+00,-6.892345534876016089e+00,-6.904517544543097074e+00,-6.939674145749846090e+00,-6.913329791765862709e+00,-6.911119439051613966e+00,-6.902326563202874965e+00,-6.914621424014812945e+00,-6.928940590106737574e+00,-6.921104729117503851e+00,-6.914990768432677015e+00,-6.942328175743751828e+00,-6.906530169867972191e+00,-6.934387124370939404e+00,-6.901779567109374014e+00,-6.947085106034389668e+00,-6.943087765601770656e+00,-6.921848361797245275e+00,-6.929127908642453093e+00,-6.910383739183963314e+00,-6.910751521461001801e+00,-6.882999672457778217e+00,-6.928566058267152883e+00,-6.912961060198263397e+00,-6.901415069201201291e+00,-6.899412702099109396e+00,-6.914806079171832209e+00,-6.907996445790500317e+00,-6.910567613414507093e+00,-6.886943036592667156e+00,-6.940621205201255250e+00,-6.912776745388045541e+00,-6.912408217652163955e+00,-6.900322371951704170e+00,-6.914621424014812945e+00,-6.912224004701471358e+00,-6.907629675201389929e+00,-6.90946487483556
7372e+00,-6.890000867916761251e+00,-6.899230867386361155e+00,-6.897414336459963380e+00,-6.895419956322879074e+00,-6.916469511770531042e+00,-6.897595841134206651e+00,-6.923896207621353938e+00,-6.868966224282619493e+00,-6.895782275167421815e+00,-6.902691393560806077e+00,-6.913698659346584563e+00,-6.895963483829691043e+00,-6.913514208548312112e+00,-6.904517544543097074e+00,-6.894333786773660933e+00,-6.870908486745410215e+00,-6.899958404704737092e+00,-6.896688647023180962e+00,-6.902508961744192817e+00,-6.900504405264303642e+00,-6.922778680963794784e+00,-6.914806079171832209e+00,-6.920361649016435024e+00,-6.890181031866443817e+00,-6.899594569881731587e+00,-6.905248939711924550e+00,-6.878003230942734803e+00,-6.870731761464400122e+00,-6.916839539526833391e+00,-6.910567613414507093e+00,-6.902691393560806077e+00,-6.853562453931113296e+00,-6.904517544543097074e+00,-6.887481974897278647e+00,-6.890902012431199708e+00,-6.929315262272979226e+00,-6.901232870057034674e+00,-6.919990315926389357e+00,-6.937404855819876204e+00,-6.890902012431199708e+00,-6.904334779304535985e+00,-6.891623513183999350e+00,-6.912224004701471358e+00,-6.896688647023180962e+00,-6.919619120673425883e+00,-6.895419956322879074e+00,-6.917024604762858075e+00,-6.902873858664861473e+00,-6.949950163862810726e+00,-6.914990768432677015e+00,-6.900140371769200698e+00,-6.905431872118910519e+00,-6.887841428531629617e+00,-6.911119439051613966e+00,-6.913883144173231798e+00,-6.933822312155980327e+00,-6.902326563202874965e+00,-6.892164980573818767e+00,-6.919248063155254869e+00,-6.901597301547948149e+00,-6.920733120045966302e+00,-6.915914726735495677e+00,-6.916839539526833391e+00,-6.909648580172722987e+00,-6.914806079171832209e+00,-6.909648580172722987e+00,-6.911855680572077887e+00,-6.924455439623997677e+00,-6.904883175260151162e+00,-6.901779567109374014e+00,-6.907446340339751245e+00,-6.930065028006321626e+00,-6.924828434779385589e+00,-6.917209704254362990e+00,-6.958402000574812618e+00,-6.906163936573273432e+00,-6.909281203239
862279e+00,-6.932129787069289506e+00,-6.920918907320071867e+00,-6.922406449431486308e+00,-6.937027140990078777e+00,-6.876580249557120439e+00,-6.923896207621353938e+00,-6.943467777006684827e+00,-6.898140552904370892e+00,-6.899412702099109396e+00,-6.874626950525348335e+00,-6.908363350949633386e+00,-6.918506360914689068e+00,-6.917765208388635401e+00,-6.937593766748936019e+00,-6.881033806397091723e+00,-6.917950445028973050e+00,-6.900868571328050649e+00,-6.891803969739573432e+00,-6.904517544543097074e+00,-6.944418437787474474e+00,-6.905614837996283839e+00,-6.916099620884123667e+00,-6.884431824510656384e+00,-6.902691393560806077e+00,-6.941569162424579886e+00,-6.888560723653453621e+00,-6.888560723653453621e+00,-6.912592464543518389e+00,-6.913145408986691720e+00,-6.935140703909238979e+00,-6.935329187558838981e+00,-6.897414336459963380e+00,-6.890000867916761251e+00,-6.888740628304473645e+00,-6.911119439051613966e+00,-6.909832319263731648e+00,-6.937782713372076415e+00,-6.912776745388045541e+00,-6.909648580172722987e+00,-6.878359292901276234e+00,-6.896870020017239256e+00,-6.912408217652163955e+00,-6.927069332075673103e+00,-6.897958949344442203e+00,-6.940052861906616855e+00,-6.907446340339751245e+00,-6.903604052198582863e+00,-6.885507286166562935e+00,-6.927069332075673103e+00,-6.900868571328050649e+00,-6.903421453823201048e+00,-6.916654508533614631e+00,-6.864214785515111217e+00,-6.896325999693731390e+00,-6.936272139066710807e+00,-6.913698659346584563e+00,-6.888920565327001810e+00,-6.881926902821550840e+00,-6.892887393464185664e+00,-6.912039825678942151e+00,-6.885148670419489747e+00,-6.918877143269696717e+00,-6.907446340339751245e+00,-6.909281203239862279e+00,-6.870025172493118504e+00,-6.912592464543518389e+00,-6.906163936573273432e+00,-6.888920565327001810e+00,-6.920361649016435024e+00,-6.917209704254362990e+00,-6.931378472233046040e+00,-6.926695500295618757e+00,-6.903969349006299439e+00,-6.888560723653453621e+00,-6.886943036592667156e+00,-6.926882398716870171e+00,-6.9005044052
64303642e+00,-6.914806079171832209e+00,-6.880141506882189617e+00,-6.904700343190896206e+00,-6.884073594164714649e+00,-6.922220385611058902e+00,-6.905066040763083279e+00,-6.869495558216462072e+00,-6.917394838014031322e+00,-6.924082583542125136e+00,-6.906530169867972191e+00,-6.940621205201255250e+00,-6.915175491809945285e+00,-6.915914726735495677e+00,-6.924082583542125136e+00,-6.888021203813012860e+00,-6.903238888783899796e+00,-6.921290585451092880e+00,-6.890361228281021155e+00,-6.904883175260151162e+00,-6.895419956322879074e+00,-6.913329791765862709e+00,-6.899594569881731587e+00,-6.911487492055400139e+00,-6.885866030565029661e+00,-6.932881666803957899e+00,-6.909464874835567372e+00,-6.876047153142897272e+00,-6.934952255779029073e+00,-6.913514208548312112e+00,-6.928004523390441705e+00,-6.883178579337224434e+00,-6.912408217652163955e+00,-6.888920565327001810e+00,-6.879428240090181390e+00,-6.876402519173645089e+00,-6.929127908642453093e+00,-6.902873858664861473e+00,-6.899594569881731587e+00,-6.909464874835567372e+00,-6.941758861751717191e+00,-6.919990315926389357e+00,-6.901779567109374014e+00,-6.888740628304473645e+00,-6.939484841442608243e+00,-6.901597301547948149e+00,-6.893791144090542389e+00,-6.899958404704737092e+00,-6.903421453823201048e+00,-6.916839539526833391e+00,-6.904517544543097074e+00,-6.924641919811017843e+00,-6.909464874835567372e+00,-6.885686642278603387e+00,-6.874449566932865352e+00,-6.885686642278603387e+00,-6.908730390777574115e+00,-6.916099620884123667e+00,-6.895057768705735057e+00,-6.891262697737184695e+00,-6.887661685563589486e+00,-6.900686471719062354e+00,-6.906713336825164973e+00,-6.899776470746262547e+00,-6.904152047463005815e+00,-6.914067663040809109e+00,-6.895782275167421815e+00,-6.911487492055400139e+00,-6.881569568531524794e+00,-6.904517544543097074e+00,-6.890541457172201234e+00,-6.917580006054556918e+00,-6.907996445790500317e+00,-6.915545040964188317e+00,-6.904700343190896206e+00,-6.906530169867972191e+00,-6.893248795708814214e+00,-6.93006502
8006321626e+00,-6.926135014526625611e+00,-6.891984458865593410e+00,-6.868613490592895943e+00,-6.883715492101801559e+00,-6.887841428531629617e+00,-6.933069725102583902e+00,-6.930815356310141340e+00,-6.910016092120997655e+00,-6.869848603233368323e+00,-6.912408217652163955e+00,-6.903969349006299439e+00,-6.925388188502369147e+00,-6.901232870057034674e+00,-6.876047153142897272e+00,-6.865971946395978520e+00,-6.916654508533614631e+00,-6.920361649016435024e+00,-6.925948255725074887e+00,-6.927817415147778490e+00,-6.891443089187118787e+00,-6.919433574703880652e+00,-6.894514733123561001e+00,-6.936272139066710807e+00,-6.920175965235378257e+00,-6.925388188502369147e+00,-6.919990315926389357e+00,-6.895782275167421815e+00,-6.881212361883681794e+00,-6.929127908642453093e+00,-6.913329791765862709e+00,-6.892345534876016089e+00,-6.884969410761394570e+00,-6.893971992268991045e+00,-6.901232870057034674e+00,-6.895419956322879074e+00,-6.933257818773787662e+00,-6.909832319263731648e+00,-6.893429545822273852e+00,-6.900322371951704170e+00,-6.926321808213542042e+00,-6.895057768705735057e+00,-6.911303448620630618e+00,-6.919804701076667897e+00,-6.915175491809945285e+00,-6.878537371434379821e+00,-6.919804701076667897e+00,-6.894876724077739993e+00,-6.907629675201389929e+00,-6.915729866766396228e+00,-6.877469375604848167e+00,-6.914252215961882442e+00,-6.902873858664861473e+00,-6.910199898756934189e+00,-6.876580249557120439e+00,-6.876402519173645089e+00,-6.881748219715563053e+00,-6.919248063155254869e+00,-6.907446340339751245e+00,-6.919433574703880652e+00,-6.870908486745410215e+00,-6.914067663040809109e+00,-6.850613878384042010e+00,-6.954551398974526677e+00,-6.917394838014031322e+00,-6.912039825678942151e+00,-6.918506360914689068e+00,-6.902873858664861473e+00,-6.905066040763083279e+00,-6.893429545822273852e+00,-6.923709866430094451e+00,-6.894152873159416828e+00,-6.923896207621353938e+00,-6.927256300385089105e+00,-6.909648580172722987e+00,-6.863688238771922911e+00,-6.934575466023252588e+00,-6.885507
286166562935e+00,-6.903238888783899796e+00,-6.897958949344442203e+00,-6.911303448620630618e+00,-6.910751521461001801e+00,-6.881926902821550840e+00,-6.931378472233046040e+00,-6.906896537338639064e+00,-6.914621424014812945e+00,-6.921104729117503851e+00,-6.905248939711924550e+00,-6.930440121784162244e+00,-6.894152873159416828e+00,-6.908363350949633386e+00,-6.914252215961882442e+00,-6.939484841442608243e+00,-6.914990768432677015e+00,-6.910383739183963314e+00,-6.931754059091920439e+00,-6.935706261469793077e+00,-6.916839539526833391e+00,-6.922406449431486308e+00,-6.896325999693731390e+00,-6.928566058267152883e+00,-6.916284549224918976e+00,-6.921476476333680239e+00,-6.912039825678942151e+00,-6.897958949344442203e+00,-6.913883144173231798e+00,-6.914252215961882442e+00,-6.891984458865593410e+00,-6.862635976310746955e+00,-6.925014984542075425e+00,-6.899776470746262547e+00,-6.903421453823201048e+00,-6.919990315926389357e+00,-6.896688647023180962e+00,-6.914621424014812945e+00,-6.910199898756934189e+00,-6.909832319263731648e+00,-6.878181246074490218e+00,-6.895238846116823694e+00,-6.892164980573818767e+00,-6.922034356403937494e+00,-6.909464874835567372e+00,-6.912408217652163955e+00,-6.907629675201389929e+00,-6.925948255725074887e+00,-6.960332875860949287e+00,-6.897595841134206651e+00,-6.888380851362295942e+00,-6.932317704010554138e+00,-6.909832319263731648e+00,-6.892526121783955517e+00,-6.923709866430094451e+00,-6.918321021279277616e+00,-6.880855282786869864e+00,-6.906530169867972191e+00,-6.899412702099109396e+00,-6.924828434779385589e+00,-6.906530169867972191e+00,-6.877291487140446691e+00,-6.926695500295618757e+00,-6.887122650423416204e+00,-6.923896207621353938e+00,-6.896688647023180962e+00,-6.966925880157711859e+00,-6.920918907320071867e+00,-6.907079771420692182e+00,-6.885686642278603387e+00,-6.908546854023825290e+00,-6.919619120673425883e+00,-6.909832319263731648e+00,-6.900868571328050649e+00,-6.949567681547954123e+00,-6.904334779304535985e+00,-6.891443089187118787e+00,-6.8679
08396257320192e+00,-6.929690074871070138e+00,-6.895057768705735057e+00,-6.892345534876016089e+00,-6.883178579337224434e+00,-6.910567613414507093e+00,-6.925201569112072519e+00,-6.922406449431486308e+00,-6.887661685563589486e+00,-6.884610987818490457e+00,-6.926695500295618757e+00,-6.900868571328050649e+00,-6.911855680572077887e+00,-6.912408217652163955e+00,-6.935517706741224586e+00,-6.916099620884123667e+00,-6.897414336459963380e+00,-6.903421453823201048e+00,-6.902873858664861473e+00,-6.915914726735495677e+00,-6.862986607450384469e+00,-6.914806079171832209e+00,-6.910751521461001801e+00,-6.895057768705735057e+00,-6.878893623664723123e+00,-6.937027140990078777e+00,-6.897051425913428346e+00,-6.930815356310141340e+00,-6.878715481685095057e+00,-6.896507306919319902e+00,-6.908363350949633386e+00,-6.921476476333680239e+00,-6.925201569112072519e+00,-6.898685561547164724e+00,-6.907079771420692182e+00,-6.878003230942734803e+00,-6.916099620884123667e+00,-6.890000867916761251e+00,-6.909832319263731648e+00,-6.952631644113475673e+00,-6.898867297122404096e+00,-6.880319903100662771e+00,-6.917950445028973050e+00,-6.902326563202874965e+00,-6.924268994205354133e+00,-6.902144197924718227e+00,-6.910199898756934189e+00,-6.917580006054556918e+00,-6.890902012431199708e+00,-6.888740628304473645e+00,-6.879963142483250138e+00,-6.904334779304535985e+00,-6.938538857075400301e+00,-6.917209704254362990e+00,-6.895057768705735057e+00,-6.926882398716870171e+00,-6.899412702099109396e+00,-6.903969349006299439e+00,-6.891443089187118787e+00,-6.955320334037404351e+00,-6.915545040964188317e+00,-6.926135014526625611e+00,-6.904700343190896206e+00,-6.896507306919319902e+00,-6.881569568531524794e+00,-6.909281203239862279e+00,-6.892526121783955517e+00,-6.863337361575341689e+00,-6.908179881542642065e+00,-6.895963483829691043e+00,-6.901779567109374014e+00,-6.913145408986691720e+00,-6.925761531795860293e+00,-6.897595841134206651e+00,-6.915360249316243824e+00,-6.882820797580277983e+00,-6.884431824510656384e+00,-6.92
1848361797245275e+00,-6.886943036592667156e+00,-6.902873858664861473e+00,-6.916284549224918976e+00,-6.890361228281021155e+00,-6.911487492055400139e+00,-6.948612115302285730e+00,-6.943467777006684827e+00,-6.919433574703880652e+00,-6.918135715988279699e+00,-6.922034356403937494e+00,-6.915729866766396228e+00,-6.875336799712210123e+00,-6.887841428531629617e+00,-6.912408217652163955e+00,-6.885866030565029661e+00,-6.894333786773660933e+00,-6.908730390777574115e+00,-6.883715492101801559e+00,-6.927256300385089105e+00,-6.929690074871070138e+00,-6.938917143442688840e+00,-6.918877143269696717e+00,-6.891262697737184695e+00,-6.869142637796352702e+00,-6.885327962217369091e+00,-6.921476476333680239e+00,-6.931190731689923723e+00,-6.915729866766396228e+00,-6.899594569881731587e+00,-6.904700343190896206e+00,-6.919990315926389357e+00,-6.921848361797245275e+00,-6.920733120045966302e+00,-6.913698659346584563e+00,-6.912039825678942151e+00,-6.906713336825164973e+00,-6.880676791041638651e+00,-6.928566058267152883e+00,-6.893068078260052189e+00,-6.888740628304473645e+00,-6.911487492055400139e+00,-6.908179881542642065e+00,-6.912961060198263397e+00,-6.923896207621353938e+00,-6.905248939711924550e+00,-6.878181246074490218e+00,-6.909097565373217620e+00,-6.946131907982998399e+00,-6.929127908642453093e+00,-6.875514340769045774e+00,-6.900504405264303642e+00,-6.917209704254362990e+00,-6.920175965235378257e+00,-6.916284549224918976e+00,-6.912408217652163955e+00,-6.937404855819876204e+00,-6.935706261469793077e+00,-6.899049065731464481e+00,-6.929690074871070138e+00,-6.919248063155254869e+00,-6.927630341908059819e+00,-6.943467777006684827e+00,-6.913329791765862709e+00,-6.889100534732689241e+00,-6.912776745388045541e+00,-6.922964848701463225e+00,-6.926508636798859087e+00,-6.888021203813012860e+00,-6.895963483829691043e+00,-6.932505656271233718e+00,-6.907629675201389929e+00,-6.893791144090542389e+00,-6.896325999693731390e+00,-6.896688647023180962e+00,-6.935140703909238979e+00,-6.871792581931632782e+00,-6.
925761531795860293e+00,-6.902144197924718227e+00,-6.884969410761394570e+00,-6.913145408986691720e+00,-6.908913961223248634e+00,-6.895782275167421815e+00,-6.919990315926389357e+00,-6.919619120673425883e+00,-6.905431872118910519e+00,-6.918691734907248758e+00,-6.934010547450652595e+00,-6.902326563202874965e+00,-6.871969494802504741e+00,-6.864390362725639960e+00,-6.891984458865593410e+00,-6.896144725334492520e+00,-6.876580249557120439e+00,-6.907629675201389929e+00,-6.905431872118910519e+00,-6.925388188502369147e+00,-6.913145408986691720e+00,-6.920547367282360085e+00,-6.890902012431199708e+00,-6.890541457172201234e+00,-6.912408217652163955e+00,-6.908363350949633386e+00,-6.905980870211205769e+00,-6.897595841134206651e+00,-6.912592464543518389e+00,-6.902873858664861473e+00,-6.907446340339751245e+00,-6.889640637365316067e+00,-6.889100534732689241e+00,-6.906713336825164973e+00,-6.924455439623997677e+00,-6.884610987818490457e+00,-6.898685561547164724e+00,-6.893791144090542389e+00,-6.898685561547164724e+00,-6.868613490592895943e+00,-6.900322371951704170e+00,-6.944989268361437595e+00,-6.906163936573273432e+00,-6.915729866766396228e+00,-6.891262697737184695e+00,-6.902508961744192817e+00,-6.928004523390441705e+00,-6.881033806397091723e+00,-6.892164980573818767e+00,-6.915360249316243824e+00,-6.929690074871070138e+00,-6.936272139066710807e+00,-6.913698659346584563e+00,-6.927069332075673103e+00,-6.906530169867972191e+00,-6.941948597071515792e+00,-6.894514733123561001e+00,-6.912039825678942151e+00,-6.870908486745410215e+00,-6.908730390777574115e+00,-6.862460706833477531e+00,-6.902508961744192817e+00,-6.889640637365316067e+00,-6.902873858664861473e+00,-6.894152873159416828e+00,-6.929127908642453093e+00,-6.905066040763083279e+00,-6.898867297122404096e+00,-6.889100534732689241e+00,-6.945941377354623469e+00,-6.890361228281021155e+00,-6.939863485899989470e+00,-6.929690074871070138e+00,-6.909281203239862279e+00,-6.936838337062397386e+00,-6.890361228281021155e+00,-6.895057768705735057e+00,-
6.913883144173231798e+00,-6.908913961223248634e+00,-6.906347036454770105e+00,-6.892345534876016089e+00,-6.914621424014812945e+00,-6.901961865897591863e+00,-6.898503858993743521e+00,-6.905797837356297819e+00,-6.885866030565029661e+00,-6.916284549224918976e+00,-6.928566058267152883e+00,-6.904517544543097074e+00,-6.942328175743751828e+00,-6.932505656271233718e+00,-6.905431872118910519e+00,-6.881926902821550840e+00,-6.928191666649151870e+00,-6.921104729117503851e+00,-6.903056357068507509e+00,-6.920547367282360085e+00,-6.890181031866443817e+00,-6.883357518230070582e+00,-6.908363350949633386e+00,-6.938727982371473146e+00,-6.900140371769200698e+00,-6.931941905434165108e+00,-6.925761531795860293e+00,-6.908546854023825290e+00,-6.920733120045966302e+00,-6.922592547878101854e+00,-6.916839539526833391e+00,-6.922406449431486308e+00,-6.911303448620630618e+00,-6.933634112287164086e+00,-6.889820736420283254e+00,-6.922406449431486308e+00,-6.920918907320071867e+00,-6.893429545822273852e+00,-6.899412702099109396e+00,-6.881748219715563053e+00,-6.909281203239862279e+00,-6.903056357068507509e+00,-6.900686471719062354e+00,-6.886224903707226730e+00,-6.890902012431199708e+00,-6.914621424014812945e+00,-6.919804701076667897e+00,-6.918506360914689068e+00,-6.903056357068507509e+00,-6.901779567109374014e+00,-6.916469511770531042e+00,-6.899230867386361155e+00,-6.905797837356297819e+00,-6.883715492101801559e+00,-6.909648580172722987e+00,-6.909832319263731648e+00,-6.917765208388635401e+00,-6.922778680963794784e+00,-6.908546854023825290e+00,-6.895601099335781470e+00,-6.890721718551688468e+00,-6.880855282786869864e+00,-6.902144197924718227e+00,-6.874804365588351729e+00,-6.919990315926389357e+00,-6.915914726735495677e+00,-6.915729866766396228e+00,-6.923523559955405915e+00,-6.886404388586107927e+00,-6.916469511770531042e+00,-6.939674145749846090e+00,-6.931378472233046040e+00,-6.885327962217369091e+00,-6.933822312155980327e+00,-6.944418437787474474e+00,-6.908179881542642065e+00,-6.900504405264303642e+00
,-6.912961060198263397e+00,-6.902144197924718227e+00,-6.879250002855950896e+00,-6.929315262272979226e+00,-6.914806079171832209e+00,-6.895601099335781470e+00,-6.943087765601770656e+00,-6.932129787069289506e+00,-6.908730390777574115e+00,-6.919804701076667897e+00,-6.947466639804757094e+00,-6.908546854023825290e+00,-6.910935463335889040e+00,-6.903238888783899796e+00,-6.919433574703880652e+00,-6.889280536533192389e+00,-6.905614837996283839e+00,-6.912592464543518389e+00,-6.915360249316243824e+00,-6.898867297122404096e+00,-6.912961060198263397e+00,-6.934387124370939404e+00,-6.904517544543097074e+00,-6.920361649016435024e+00,-6.904700343190896206e+00,-6.921104729117503851e+00,-6.930815356310141340e+00,-6.897595841134206651e+00,-6.932881666803957899e+00,-6.886763455017254287e+00,-6.939863485899989470e+00,-6.931754059091920439e+00,-6.912039825678942151e+00,-6.924268994205354133e+00,-6.913329791765862709e+00,-6.910567613414507093e+00,-6.907629675201389929e+00,-6.888920565327001810e+00,-6.898503858993743521e+00,-6.927817415147778490e+00,-6.899594569881731587e+00,-6.923709866430094451e+00,-6.896144725334492520e+00,-6.917580006054556918e+00,-6.913329791765862709e+00,-6.905066040763083279e+00,-6.876758011534302995e+00,-6.913145408986691720e+00,-6.921476476333680239e+00,-6.893248795708814214e+00,-6.912776745388045541e+00,-6.949376495236311513e+00,-6.915175491809945285e+00,-6.902691393560806077e+00,-6.892526121783955517e+00,-6.917394838014031322e+00,-6.905980870211205769e+00,-6.948230144327435909e+00,-6.934575466023252588e+00,-6.920175965235378257e+00,-6.937215980571416196e+00,-6.877291487140446691e+00,-6.918321021279277616e+00,-6.909464874835567372e+00,-6.937593766748936019e+00,-6.940052861906616855e+00,-6.942328175743751828e+00,-6.899776470746262547e+00,-6.892706741309417851e+00,-6.912408217652163955e+00,-6.915729866766396228e+00,-6.925761531795860293e+00,-6.891984458865593410e+00,-6.878715481685095057e+00,-6.907263039083625600e+00,-6.961299713410884493e+00,-6.928566058267152883e+
00,-6.917394838014031322e+00,-6.892887393464185664e+00,-6.901415069201201291e+00,-6.919990315926389357e+00,-6.915175491809945285e+00,-6.904152047463005815e+00,-6.916654508533614631e+00,-6.938349767540943347e+00,-6.879606509098586287e+00,-6.915175491809945285e+00,-6.909648580172722987e+00,-6.896144725334492520e+00,-6.927256300385089105e+00,-6.903238888783899796e+00,-6.924828434779385589e+00,-6.917765208388635401e+00,-6.906896537338639064e+00,-6.929877533864964079e+00,-6.914621424014812945e+00,-6.903604052198582863e+00,-6.867027726891725692e+00,-6.917024604762858075e+00,-6.910751521461001801e+00,-6.910567613414507093e+00,-6.900686471719062354e+00,-6.907996445790500317e+00,-6.872854528934086105e+00,-6.888021203813012860e+00,-6.876402519173645089e+00,-6.896870020017239256e+00,-6.892345534876016089e+00,-6.886045451037386300e+00,-6.931941905434165108e+00,-6.956667394762604317e+00,-6.921476476333680239e+00,-6.944418437787474474e+00,-6.893429545822273852e+00,-6.881390949258030076e+00,-6.906347036454770105e+00,-6.930815356310141340e+00,-6.897051425913428346e+00,-6.891803969739573432e+00,-6.921662401778112539e+00,-6.907996445790500317e+00,-6.914990768432677015e+00,-6.918506360914689068e+00,-6.902144197924718227e+00,-6.886224903707226730e+00,-6.919804701076667897e+00,-6.940810724769718831e+00,-6.899958404704737092e+00,-6.920547367282360085e+00,-6.908363350949633386e+00,-6.940810724769718831e+00,-6.896870020017239256e+00,-6.882463143785335191e+00,-6.896870020017239256e+00,-6.926321808213542042e+00,-6.896144725334492520e+00,-6.884073594164714649e+00,-6.909464874835567372e+00,-6.908546854023825290e+00,-6.921662401778112539e+00,-6.876580249557120439e+00,-6.896507306919319902e+00,-6.914252215961882442e+00,-6.879963142483250138e+00,-6.937971695702792374e+00,-6.896325999693731390e+00,-6.927256300385089105e+00,-6.936838337062397386e+00,-6.905614837996283839e+00,-6.898685561547164724e+00,-6.878715481685095057e+00,-6.915175491809945285e+00,-6.909832319263731648e+00,-6.907629675201389929
e+00
diff --git a/projects/resources/python/other/data/ridge_coeff.csv b/projects/resources/python/other/data/ridge_coeff.csv
new file mode 100644
index 00000000..48740ddc
--- /dev/null
+++ b/projects/resources/python/other/data/ridge_coeff.csv
@@ -0,0 +1,5 @@
+-2.479909147170334272e-03,4.628334258116833036e-03,-5.079606890835244512e-03,5.222978465457811699e-03,1.145519601222120924e-03,4.016689893436099779e-03,2.290936974855060752e-03,1.857033644726054783e-03,-1.769935701580454147e-03,2.042857104467621972e-03,6.674232964954971391e-03,4.190793900292542722e-03,1.030080343823243545e-02,-9.167956050435476650e-03,-1.970681889783699580e-03,3.815056904913152486e-03,-1.479941233091099182e-03,-1.130554613717112914e-03,2.164652050259801945e-03,-1.628748698094924924e-03,2.237873226153147457e-03,3.543972305957072445e-03,2.860289462412693440e-03,1.535368554006642057e-03,-2.827208781088372258e-03,-4.139330578145890630e-04,-1.620224495447969581e-05,5.544384763372533879e-03,-8.278567069455389921e-03,-1.755890169951782704e-04,-1.059541548869450891e-04,-1.076404661248349251e-03,1.370345068442426015e-03,-2.153257006611755849e-03,2.855551191510983296e-03,-4.549231518519523802e-04,4.951509993829600852e-03,3.252011552657545993e-03,6.145497878171295864e-03,-6.515376032622312861e-04,-4.214168037242137232e-03,-3.429801111616453176e-03,-3.235214326807736400e-03,1.812585302916501419e-04,-1.317060911730074139e-03,-4.699609776204835207e-03,-8.326351429678193047e-04,3.792067110821019122e-03,-4.282418215093820868e-03,1.112500523501755607e-03,2.622115977240143140e-04,-3.702755998817886755e-03,-8.934987356736711489e-03,-4.822083961535189589e-03,-2.417521936395233221e-03,8.131781653933545587e-03,3.381898944304107934e-03,-3.332198717078463639e-04,-1.493106845745394628e-03,2.332780568104433298e-03,-4.294981038662559064e-03,-5.909562772439637575e-03,-2.022018237592226268e-03,-3.925386194508956143e-03,1.277299364250101838e-03,2.567207129608717157e-03,4.083507953151393859e-03,-2.586691612025698254e-03,1.
027195555685066382e-03,-5.656161568735707662e-03,7.278046005395701935e-03,1.854229088648799663e-03,2.726532915909003416e-03,7.468012970653752905e-04,-6.736556159708625112e-03,-6.072664882654933705e-03,7.999907128832984432e-05,-4.912944004586750936e-03,1.504324753822174629e-03,-1.909074061823143147e-03,-4.319980188059990281e-04,2.819652476505884412e-03,9.847856338551573378e-04,-5.983760603605419413e-03,-8.133125577941874209e-04,5.895643656361504968e-03,1.039062187152369523e-04,4.733676745667288362e-03,1.343248576748986905e-03,-1.452748477956497530e-03,-8.032263359439230539e-04,-1.193960855792710772e-03,2.765571040795620009e-03,-3.865517803911298162e-04,3.430337291363750717e-03,-1.824198242583732798e-03,-2.937765644595465714e-03,1.002184345285251276e-03,3.929869575713459550e-04,7.060802430322660672e-04,-2.257234906424343025e-03,-2.321523378791341317e-04,1.610995605579654244e-06,-2.311435067352834820e-03,-7.962243825727750857e-03,6.746915405656984596e-03,2.003070350526800969e-04,3.150595856705653425e-04,1.594876069342808468e-03,-1.006455156347864409e-03,-2.022558338334732014e-03,-1.956114856396928316e-03,-3.531784812219152664e-03,-2.721910513437293677e-03,2.745658932863594311e-03,-7.071432146736457945e-03,-1.250785692829398154e-03,-3.325691325143749575e-03,5.829722010286226587e-03,5.944391851982201193e-03,5.302743704690917985e-03,5.338594336883203083e-03,3.009241925351635767e-03,-1.228366269593677944e-03,3.277682109984686367e-03,-8.766469417264879374e-04,5.679875913370372419e-03,5.366732853364235105e-03,-2.153509688842038597e-03,-2.846149643292359994e-03,-1.399689375043841437e-03,-2.747716429981271816e-03,-2.142907503377523577e-03,-2.802001856816438256e-03,6.477121539682076552e-03,2.697720700599251162e-03,3.841619757932801266e-03,2.732569811258741079e-04,-7.628306381644602268e-04,2.649455908888456029e-04,-3.756560788003044261e-03,-1.396545046687510533e-03,-1.864634905697348279e-03,-5.760594958267100177e-03,2.370699735864862084e-03,-8.121847291090283569e-03,3.7092877348
28195168e-03,-6.346370375693548803e-03,-2.279749404227272500e-03,3.550819462005474440e-03,7.599738776761176193e-03,1.885498637573060528e-03,1.341481162414997087e-03,3.936652729601524390e-04,3.250988444755177902e-04,1.218963531086150092e-03,-1.486413471287891322e-03,5.645868648250900075e-03,3.409641584627932584e-03,2.955853454389993553e-04,6.754669065468254553e-04,-6.839734361000278373e-03,1.537378368708764527e-03,-5.969448037218546416e-03,8.029894670805021975e-05,-4.019545632452086598e-03,-4.039543097014939091e-03,-8.034783199135911089e-03,1.615228224127959918e-03,3.736617722574606289e-04,-3.215194053384099002e-03,2.632329933108993377e-03,-1.680603648453771496e-03,1.646545164813357966e-03,2.321168296840134260e-03,4.148998313965541161e-03,3.938511944843259630e-03,2.636170184181325319e-03,-1.833187050751273854e-03,-3.808372673734551194e-03,-1.044137927594941933e-03,-9.055009132569895314e-03,9.131524437155072796e-03,-8.669879748429185073e-03,-9.658584487897986124e-04,2.583618329986627524e-04,-2.644933744501084399e-03,-1.758081920625977675e-04,-3.358343604588043003e-03,1.380476829475078739e-04,7.027516333746610354e-03,3.058550487941428739e-03,-8.574818632764552020e-04,-2.502366850522910139e-03,7.428529475540861585e-03,-7.696681264851280152e-03,6.013164334125025053e-03,6.233114250031916931e-04,3.951988583727945349e-03,2.970340190352204618e-03,-8.375325406109718858e-04,-1.527034590553750029e-03,-1.339440770526383268e-03,2.198424826245562024e-03,-3.183154732806250248e-03,-1.004331539151531207e-02,-6.445069109540214891e-03,-2.203797769547500337e-03,4.191993520174941076e-03,5.933364707984382057e-03,3.794551894753868285e-04,-2.252746067405403494e-04,2.196216486405323538e-03,-1.397990414278837794e-03,1.868723562030482135e-03,3.475071537632849414e-03,8.509249401572535620e-03,1.595610410374473462e-03,-3.779149122610691496e-05,-1.666692166845319028e-04,3.266681335398647235e-03,4.151474864908329468e-03,-1.812099842676501670e-03,-9.615043754081609856e-03,3.252812624887813260e-03,7.
324025979592572219e-05,-2.523797788961374029e-03,-3.437232127620954331e-05,1.988624459614606933e-04,7.888838684216530452e-04,1.227287190277017532e-04,-1.443947426839937206e-03,-1.884277766956260610e-03,-2.476428642951405360e-03,3.166237468999684706e-03,1.666425105077479728e-03,-5.423011684424974312e-03,9.480270973869329857e-04,-2.417934886525865360e-03,8.127382912252043706e-03,2.022996928513659618e-03,-5.542613887777718146e-04,1.160046078641387103e-03,2.039415039326137116e-03,5.358187629149340631e-03,-9.633528494295824467e-04,1.021577870288140457e-03,2.511612494461035790e-03,4.546540776825837987e-03,6.392556917924293626e-04,-1.488302511792899234e-03,7.868576597846352024e-04,4.094553612880929942e-03,-7.311783742780623985e-03,-2.154514704808304770e-03,9.316057542930143969e-04,4.780566225796806581e-03,1.839788570295330461e-03,2.503443866893149006e-03,-2.048241409926208336e-04,5.601999741854876008e-03,2.103533215242468259e-03,1.863026410260652972e-03,-4.460718720725031540e-03,-8.613723909115825151e-04,-1.058312932798086007e-03,-1.241722328204020393e-03,8.144398871508747723e-03,3.416181421952828486e-03,-5.408030455320496019e-03,2.569552917071585459e-03,-3.612317663975937146e-04,2.780979329355265980e-03,7.273506820417934872e-03,-6.645284080601925741e-03,-3.473800692578720480e-03,6.413162324208718386e-05,-1.060715340563820903e-03,-5.984049815033061839e-03,2.387094023978375470e-03,1.873592585275961038e-03,-4.952271194429932501e-04,1.453511260144677930e-03,-2.994961778240220508e-03,2.772465333359839731e-03,-1.977318753952836134e-04,-4.227820684511209189e-03,-3.232448437711807083e-03,7.190280546188863734e-04,5.467382347683278666e-03,7.328886527175105820e-04,-7.138310883094475291e-03,-1.676219276294037242e-04,3.623204575232621613e-03,-7.834980960757128004e-04,2.380371027042465972e-03,1.429349586301240094e-03,1.765050185690025851e-03,1.083771393271498349e-03,3.307290534802292785e-03,-8.624331412442962433e-03,5.592282279962453896e-03,-3.934970229230744070e-03,-2.2172427994369445
05e-03,-2.812299020732709329e-06,2.527052355866412844e-03,3.152889923620271709e-03,2.850057549990133322e-03,1.276692839940257599e-03,-5.797318127500881470e-03,-3.476005403492601203e-03,2.970939203048693876e-03,8.367766346815018116e-04,2.881977804441538098e-03,2.042622273755194742e-03,1.734659518036001189e-03,-6.327954189663422121e-04,5.911430266829669974e-03,1.878986817628728202e-03,1.467142730540162273e-03,1.785359157357422266e-03,1.115527605553961413e-03,-2.714795497504141498e-03,1.995264574752585434e-03,5.930008841200193337e-03,-6.370200795126185721e-03,-3.892860198822800395e-03,1.675904137691919477e-03,9.435099911940807334e-05,-7.551576364972605038e-03,1.228459143242951676e-03,4.052864697324480917e-03,2.671349710266962378e-03,2.640780173838774879e-04,-2.170985119965186064e-03,-4.621911668405991394e-03,2.955548628421564902e-03,1.685519872887532343e-03,-4.126433466940094760e-03,-2.748588506925268415e-03,-3.678825141123881395e-03,1.882927794480209437e-03,-4.032287748152924012e-03,9.450745775994965096e-04,-5.012524248357679013e-04,-1.392150197193394878e-05,1.159095053540319491e-02,1.829427446840571929e-03,4.719984151726430884e-03,-2.713367222388368845e-04,-1.759848243846673377e-03,-1.116066899355088208e-03,-4.602279883594558550e-03,-2.876354811901943916e-03,-1.497660388056902417e-03,1.598027239833340165e-03,4.716529311631405263e-03,-1.854896381638068996e-03,3.344891691919900725e-03,7.408348545627518357e-03,5.880429389380254120e-04,-3.673087435162146561e-03,2.905213567587692364e-03,-2.299500783887819351e-04,-7.698273827577459087e-03,-4.128140540212234082e-03,-9.266066895443077212e-03,-2.312671697641041375e-03,-2.258797745650992662e-03,-5.917787361717660906e-03,-9.963390686536460848e-04,1.090512452437602435e-03,4.208926968751415879e-03,1.473843834912221462e-03,-2.566384720828437959e-04,4.475279294952923606e-03,-5.627533640310111891e-03,-1.860416567036133695e-03,-2.915673589487891362e-03,5.751875888526313766e-04,5.053321887976720307e-03,4.607050123194351976e-03,-2.4612
26251601969679e-03,-4.151811195918118577e-03,1.008693269529300530e-03,-2.330728757988683331e-03,1.882075711673266523e-03,-6.563735188532697495e-03,-1.789839278889103318e-03,1.206361348887842640e-03,1.188054944360199955e-02,-5.806243894194338361e-04,2.009117630633247380e-03,-2.460150997341422480e-03,-2.722598961524289561e-03,5.253346753961263965e-03,-6.090560169170049568e-03,4.857277738562195898e-04,-8.507825042034407315e-03,-6.155465034917714058e-05,-1.280999268527639474e-03,1.743573168563495408e-03,3.784499576176610959e-03,2.323421937073790085e-03,4.258154902988896436e-03,-2.291963884674822274e-03,-5.669273870553997226e-04,-2.175867153670926410e-03,-1.448707442776940362e-03,-4.473239955556791193e-03,4.368257113425080482e-03,5.099070757097693384e-03,-6.246979627450263998e-04,-1.984987834445422605e-03,-3.117636038863893892e-03,1.543238976721156427e-04,2.352418093613183828e-04,-2.285531063993896921e-03,6.756239038789162458e-04,-3.562616752298866651e-03,4.341700330187684038e-03,1.078165857230597340e-03,9.835451554854778217e-05,-3.539408534869890963e-03,2.000110194200006158e-03,1.443601818938943003e-03,-1.814690093103889278e-03,2.524450274440879008e-03,-2.580895292011141547e-03,-9.283842332377173143e-03,-5.419945762038679352e-03,1.438385578906511194e-03,7.044777894492629602e-03,-8.251433271275780343e-04,-4.867706002790599638e-03,-4.396031383306108972e-03,1.785784051404363084e-03,-2.198540184353624076e-03,-3.251963232920460772e-03,4.668840965070443320e-03,5.018280804851956366e-05,2.175728689015670944e-03,-2.542531977101962059e-03,3.946444385518544512e-03,1.694717155528517030e-03,8.107291299042813926e-03,1.286752937921206144e-03,6.243786717491395591e-03,3.516756086967199289e-03,3.637927097747668467e-03,2.243055165539915852e-03,1.321011003151509496e-02,2.619709251450757365e-03,2.101253741799585708e-03,5.206139251077592997e-03,-3.374578288124057204e-03,3.590082173003422434e-03,-1.122436046783220967e-02,3.252257150980714627e-03,-6.854221496568258486e-03,1.526456529548639784e
-03,-6.601607412633165922e-04,5.361515437810957917e-04,6.022770515427527255e-03,-2.161324886805993280e-03,-3.948176258969552689e-03,-1.892322267190435460e-03,-9.432261288387024274e-04,4.284060484020241547e-03,-4.560247778977265416e-03,3.895724573375898992e-03,-1.669247122643967036e-03,1.285679013386234687e-03,2.892996039581696537e-03,-3.930120462025032436e-03,5.917815185319324871e-03,3.261231210745032822e-03,-1.243185790235076544e-03,6.912253523457229099e-03,2.059944918122104136e-03,-2.017460779503764488e-03,-1.799074922033285125e-03,5.663304294249296017e-03,-2.779317977350140387e-03,1.602051012959119190e-03,1.020528766646110674e-03,2.325784149359022990e-03,2.494173286412894840e-03,1.299945529442433738e-03,7.499741008707817109e-04,-1.662938589456178619e-03,2.808159219981532732e-03,-1.511570179874348980e-03,-6.178454419524878269e-03,-3.924786058450260677e-03,3.436451972321041912e-04,8.295085179985717304e-04,-1.458756852156599696e-03,1.275766995487237574e-03,-3.081883411785105685e-03,-4.788607311803816553e-03,-5.302680142347636003e-03,-3.887993614508041852e-03,3.067695650590403222e-03,-6.014542981400730105e-03,4.964066619511096554e-03,9.671999544260170270e-04,-1.946709827042448224e-03,8.414768145557402490e-03,3.056171025650282567e-03,-4.296715672568809236e-03,-1.543135875416836193e-03,3.874557576532559802e-03,-2.948516106143466396e-03,-6.274427703940554241e-03,-2.447931910993749599e-03,1.158867383790611327e-03,1.445930113548554025e-03,2.504265524760701349e-03,2.508392813854675406e-03,-2.604898639884842618e-03,-1.269471979763627640e-04,-2.999035889189205929e-03,8.504021815832923416e-04,2.347510016561796720e-04,-4.095560240264506033e-03,1.187384335074016372e-03,1.755091132835374500e-03,1.231424686236911974e-03,-4.807882464514240742e-03,-1.590483405977299912e-03,4.537360404657490506e-03,-6.925124698239717731e-04,6.769796556192178051e-03,1.085909397848051731e-03,-5.244085939730444394e-03,1.278977544124892214e-03,-5.563646576288194031e-04,2.616919970250342845e-03,-4.275324
257898379353e-03,5.088579223218505504e-03,5.786647755462717355e-03,-1.819026322647595119e-03,-3.002658539172064697e-03,5.819351211679203197e-04,-4.093682829394902981e-03,-4.239307625395842202e-03,3.385171605333386034e-03,4.737971505108921379e-03,-9.431698219045513577e-04,-2.916873148555332047e-03,5.052079610881618846e-04,3.960872861394506769e-03,-1.214769184492916045e-03,-2.531161271251795310e-04,-1.658698311191417803e-03,4.401742749887437890e-03,3.766098163932745258e-03,-5.109825614986756963e-03,2.346228350941656927e-03,1.273712178733361206e-03,-1.181108733634330743e-03,6.384901613402715675e-04,1.881708044238400651e-03,9.130634934089642760e-03,-1.789062562031954731e-03,-2.592605526128914737e-03,1.275527195843857231e-03,-2.101049084282519564e-03,6.351963672966203130e-03,4.667032763446036202e-03,-2.127041739284802734e-03,-1.339092969713611986e-03,-1.756285281144781074e-03,-1.027046533182398615e-03,-5.803884285446272127e-03,5.158866334683249663e-03,-1.391086833493408552e-03,7.408058482270795915e-04,-1.455726444228458124e-03,-1.680402275592706753e-03,-9.678952586296900133e-04,1.658559879599135704e-03,1.129822501816849954e-03,4.709212281360451702e-04,-2.907988541981489442e-03,2.789278583403667134e-04,-1.479112231489869899e-03,-5.230983493977423016e-03,9.814986748030688109e-04,1.596494227031105455e-03,2.793760263944547375e-03,-1.590894645330433220e-03,1.736776649307246196e-03,-4.341659705693355631e-03,2.452229638830024250e-03,1.719537842458482898e-03,-2.802190501609441459e-03,5.058648411096376349e-03,-1.149803741567762081e-03,2.185971258524016417e-04,-6.687875727675404334e-04,-3.122710424829838738e-03,8.079524827800598288e-03,-5.801035954947014218e-03,-1.523642000678661085e-03,-1.514233608321234777e-03,-2.198445286504056850e-03,2.930234601275146303e-03,6.354437272858141569e-03,2.801863950858117427e-03,5.487573202500363630e-03,-4.352434389774924653e-03,-2.403111101317361928e-03,1.492850492661215852e-03,-1.320550837229454363e-03,5.052651995654855455e-03,2.12364081644942498
4e-03,-1.611955330779247836e-04,-1.962380920765930683e-06,8.571534638548736033e-04,2.543405409289698806e-03,-2.427293779946437368e-03,-1.058110383276058067e-02,-3.113070917589459610e-03,2.978538038054686331e-03,3.123775532534317934e-03,-2.879764298623802704e-03,2.972361745024095989e-03,-9.566089910424329518e-03,-4.111919179147758598e-03,2.433129364742808026e-03,-7.094681576953337787e-03,2.922038192661675741e-03,-1.514759265405424867e-04,3.186488053825641334e-03,8.828635078782463017e-04,1.976527447136838789e-06,-7.843832676285358841e-05,-1.420180825800399036e-03,6.217730517168135011e-03,-7.987604241835836874e-04,-1.825499532433477474e-03,6.812658206861758707e-03,4.100357286825873603e-03,2.563736105779507602e-03,3.529884448512184917e-03,2.846697191809194425e-03,1.019650959077257482e-03,-7.620711862260234415e-04,7.957572704599026202e-05,-5.416945161579458413e-03,2.459852318284897355e-03,-2.209676696037313984e-03,5.264758009971622944e-03,-4.739988372995122302e-03,-6.449014167667340390e-03,6.493390247352627022e-03,-1.983933952929099569e-03,-2.829716368527475379e-03,-3.743353895635722099e-04,-2.923620850488645694e-03,-2.028632505193276410e-06,7.344519423424333935e-05,-1.695205596182961265e-03,5.052341548146403391e-03,-4.717641085785169100e-05,4.757643868640179519e-03,1.479775981836500571e-03,-4.095967954187080748e-03,5.257393534360746586e-03,-3.114906242043757618e-03,-2.107330638803713192e-03,2.552063207991425695e-05,4.185230022157215613e-05,-4.390819433056683194e-03,-3.767910165387542386e-03,5.892943497632687107e-03,-2.410606624770446349e-03,3.998057226623307966e-03,-5.732527686967734716e-03,7.615793766354595276e-03,1.994966210181482822e-03,-5.792463920793234928e-04,-5.440462392376754427e-03,6.956502840195275157e-03,-4.835325506802981327e-04,9.245600277315534337e-04,-4.222624654549766618e-04,-1.424570540346153024e-03,-1.505097886776389244e-03,-1.463803034787389289e-04,7.505145256945126376e-04,-1.843060113069396825e-03,9.222253914978671523e-04,1.854754274823636846e-03,1.4
10473355614523050e-03,5.982083758523564290e-03,-1.027470637494033501e-04,2.072801124208002374e-03,4.705906770929722989e-04,-3.661287161369356728e-03,-2.336358654024067727e-03,-2.722692003504271173e-03,-1.645289576264411706e-03,2.264718714152119151e-03,-2.639331544367973844e-03,1.084379790446112976e-03,1.009575281665463491e-03,6.114803223854785655e-03,-3.013825680531448392e-03,-3.478867467200904770e-03,2.046806471871929958e-03,3.991994253758830366e-03,-4.896697803487667228e-03,-5.721718778840253561e-04,4.066333859628830025e-03,2.260098165399837100e-04,1.417131691923264420e-03,5.575170402799476017e-03,-1.354057703051385862e-03,-5.121522340418817147e-03,1.714995314174450752e-03,2.747969206866557150e-03,1.608516215177393637e-03,-5.239863231143575620e-03,-1.360657078438680823e-03,-4.007672807853018306e-03,-2.129179373527475150e-03,4.871788640509664130e-04,-1.266161277432302631e-03,-1.526359753163690621e-03,3.007881053757823723e-03,-2.470373397987378396e-04,5.290849028278689532e-03,8.842901519218338696e-04,-7.210613882612076355e-04,3.999553180968748976e-03,6.655654555101243181e-03,-1.129377987538978311e-03,-5.029056683989259827e-03,-9.819194059122397355e-04,5.352288003357344101e-03,3.209308337174644667e-03,3.371024960300655860e-04,-1.645431166700875693e-03,4.089709920446746132e-03,4.410364954462817218e-04,-8.384891455917241679e-03,2.292830474456910000e-03,-2.826652951712250051e-03,2.493267855321295960e-03,-8.936783685530677590e-03,-2.691209056099560723e-03,-3.316634455028072675e-03,-1.234151078540763691e-03,-1.262961768265463344e-03,-5.189729712466368584e-03,1.314024299594804485e-03,-1.101840044730431389e-04,4.190816780752733284e-03,-3.681657927885963449e-03,-2.720537835155940800e-03,2.258027295803280866e-03,4.420680440384101330e-03,-3.923993747362261239e-03,4.816038542115457476e-03,-5.013243588184784424e-03,2.890656773341122171e-03,4.508323303913751648e-03,4.291042909988574197e-03,-5.318594406176466866e-03,6.743322608113123603e-04,-8.075313317263266147e-05,8.096672436058
473254e-04,-2.770334048265654225e-03,-1.268946452808440229e-03,2.422809036711428762e-03,5.443767193459355704e-03,-4.002517920738781800e-04,-4.404586840048878489e-04,9.227898576889487572e-03,7.821819403546950017e-04,6.648928930765487383e-03,-6.261870917717061308e-03,6.958591938959535206e-03,-1.053779882639536676e-03,-7.612732771449041676e-04,-2.767128907321800366e-03,-4.056923645345076462e-03,-9.975869084634849532e-03,1.355106770279807839e-03,2.109451279194168687e-03,-5.057501999176882562e-03,3.953187129221904242e-05,-1.108916001515378039e-03,-2.048047734778905143e-03,3.897334586654006752e-03,1.485241946540530626e-03,6.418164417607257876e-04,-4.750780471514942128e-03,-4.741074699128645237e-03,-3.355276228785146996e-03,-4.117807974517191140e-03,7.278642994019127090e-04,2.692279507931581890e-03,1.541056151513404103e-03,2.204863807568028557e-03,3.385051204556619939e-03,7.165435647355909059e-04,2.065897070277990157e-03,4.328810442245590619e-03,-1.598043845008595160e-03,2.850705225684955352e-03,6.055214777775074983e-03,6.276858850644124327e-03,2.918072329549726337e-03,7.712859477404344975e-05,-2.630318740326897855e-03,-4.331710117889141058e-03,-9.675067369964650797e-04,5.756115407163342121e-03,3.913556109961045804e-03,5.901720214586504221e-04,5.841786609855528778e-03,1.149086133443743152e-03,5.675759652627564676e-03,4.099545001230569532e-05,-7.381374551369202473e-03,-2.494912404904006762e-03,-5.139454953377090712e-03,-2.768881406033836274e-03,-3.924923294175998098e-03,2.003851441041884507e-03,-1.688805692655803844e-03,-2.012426033659288349e-03,2.503803308490488554e-03,6.171406269874635600e-03,9.026540911902060154e-04,8.539736241477483003e-03,-3.977914185714718706e-03,1.062357776615797814e-03,9.701556199831821948e-04,3.175997866166127819e-03,2.360946520617717391e-03,4.509479782422387562e-03,1.228878415626374985e-03,5.921127927585994113e-03,-3.933432882350955916e-03,-2.120690026432405471e-03,-1.018301440165188363e-04,3.218602236571367192e-03,1.354721385279673157e-04,1.16810
4252745091670e-03,-4.801980088595616050e-03,8.830134089963302766e-04,-4.769922148457042863e-05,2.152193282778306575e-03,-1.421503367654778109e-03,-2.056586695265261600e-03,-8.065868862772624954e-04,3.864325367925286643e-03,-3.203895105518334094e-03,4.735156114300523821e-03,-3.681987995656348447e-03,1.069820723746534843e-04,1.877424417789321518e-03,-4.129232567338666539e-03,1.419854102828446837e-03,-1.815143070498167381e-03,8.766764766844212719e-04,-7.098383834317948489e-04,-4.114575057602131163e-04,-3.347029850223623339e-03,1.774269422206859077e-03,1.385349781312438704e-03,2.198662032367599151e-03,1.271986397382523712e-03,-1.668978669028276570e-03,1.177089580241454450e-02,-1.613628153646823760e-03,-2.148487889989270788e-03,-3.479659392439200024e-03,-2.301177520198838982e-04,4.624648923308506340e-03,-3.610887166637358984e-03,5.985511169725722842e-03,2.570703825831435491e-03,-3.163778231543369531e-05,5.817986226107543096e-03,-5.780413552627388816e-04,-2.647331767135462088e-03,1.431189634196811632e-03,-2.939549692826374464e-03,1.920723266277309495e-03,3.648091726919398380e-03,7.378033300422998965e-03,3.344498053337280775e-03,-1.012400005561663388e-02,1.652426834760625449e-03,-4.904760129792235315e-03,-2.528876028449818670e-03,-8.440483917434076985e-03,8.590413004396588864e-03,-4.283798883050967538e-03,-1.479802468729599736e-03,3.596724370499413629e-03,1.735803451929582254e-03,-2.815020181293878456e-03,-1.005567176315618969e-03,-1.560459098458967476e-03,3.938526283601083400e-04,-7.145694152098117048e-04,2.768026738727437982e-03,-3.045559896244979907e-03,1.029734865834583532e-03,-2.723172986535879296e-03,1.800776816913347541e-03,-4.067618531504667810e-03,1.804025514333990011e-03,-2.487138867723306251e-03,5.288381796608475349e-03,5.330586303062053756e-03,-3.073919186276505115e-03,-7.941887905103008072e-04,-2.476421609484489271e-03,-8.727035051791271912e-05,5.266294338082082435e-04,2.063284820468008502e-03,4.209147157646316029e-03,-2.818833035030399777e-03,-3.0532524014382
76507e-03,5.882168342943419441e-03,1.053630372722640617e-02,-8.765650233551289684e-04,-5.783908590822078692e-03,8.094020576120679756e-04,3.524591829004281984e-03,-3.460322698179080572e-05,-1.451879359497933912e-03,4.747388364822900035e-04,7.629802477721181822e-03,-2.169155609050252374e-03,-6.714985286765499072e-03,3.103160850482472848e-03,3.122688761240293657e-03,-3.876886708106737703e-03,1.959264759891858696e-03,-1.174484615616637820e-02,4.623487459472778771e-05,-2.426097513315835039e-03,-8.804265828430265714e-03,5.872013239087746708e-03,1.916122279407507902e-03,1.654267195700684021e-03,-1.153598887983622650e-03,-2.519121968560064000e-04,3.563179552018631412e-03,-5.117722879861434475e-03,-5.857739059810595540e-03,7.466287360581132715e-04,7.015200280072142155e-05,-3.890633859567645906e-03,1.247302139449326264e-03,-1.616251847820794878e-03,2.327266574337907238e-03,-6.452976755836550490e-03,8.688748576264228654e-03,1.187070455863996647e-03,-2.554602120882163655e-03,5.183968649141199947e-04,2.815479277976089267e-03,-7.483374521082082967e-03,-6.532902251330684716e-03,6.679602038894771714e-03,2.952635304454095510e-03,1.184543899473516210e-02,-3.333058171809203003e-03,5.455606943413131467e-03,-2.074422855066665721e-03,4.914100532225642269e-04,5.652812379361226575e-03,-1.998140772795790714e-03,5.815970035374609009e-04,3.538233720239032914e-03,1.243274019747276238e-03,-7.023874342353561562e-03,4.182229431939718349e-03,-2.686460708079808869e-04,-1.896979123706414496e-03,1.607305677447465591e-03,-1.678819539951535863e-04,-2.870628572659059130e-03,-2.952768142162603884e-03,-5.826753114713951716e-04,1.460259042011885066e-03,9.508955007239515769e-03,3.354797071072339928e-03,3.259463247673499969e-03,2.994677745504094931e-03 
+5.984741835854044852e-03,-9.616071770604235278e-03,5.016923604499034552e-03,2.411189479036791351e-03,4.348191068712575942e-03,-1.743202007711496925e-03,4.919436383703203880e-03,3.474024871070178263e-03,-1.905091775164753165e-03,-3.078415379212803230e-03,1.279124318974609069e-03,-3.010716340974314842e-03,-4.267351123944299225e-03,3.241731071252205824e-03,-4.456819449548205837e-03,-6.723705867683745993e-03,5.284789583145162173e-03,3.860317560219120488e-03,5.337309260307587268e-03,7.171672011266852498e-03,-8.876112007083230724e-04,-4.437403379619596529e-03,-5.822369253796511566e-03,-3.957949724708131715e-03,7.255171744730166787e-03,-8.942164099704680335e-04,7.223144336735389491e-03,7.063333757931944401e-03,-6.519645017837472512e-05,8.026028758878006597e-03,1.141864079768726160e-03,5.502227743979857047e-03,-1.203462392820219690e-03,5.840963329058839436e-03,5.052968679302062213e-03,4.147832798475316850e-03,-1.719884027267911536e-03,-1.361054186469799450e-03,1.120588060705736064e-03,4.732650465998793077e-03,8.755756625285134390e-03,-1.989575348666213026e-03,-4.578013703971127511e-04,-2.167271656268942963e-03,-4.991460956420429480e-03,-2.486064528158703039e-03,-8.570074861486962937e-04,7.680527287157184677e-03,-5.617025699225406624e-04,1.463910591659070603e-03,-3.275454088463320562e-03,-5.113975909179969579e-04,-2.484881360406376600e-03,8.458768337593618181e-03,-4.176590065352223623e-04,-1.167054805570489143e-02,-3.621057408368227697e-03,6.064380690398160807e-03,-1.729420572691511577e-03,4.979300551654473710e-03,1.472848081006948671e-03,6.696551075980698969e-04,-7.119780659349897629e-03,1.037561929923373760e-03,-1.084324851489194591e-04,-4.437409630427810447e-03,-2.420206805262874010e-03,3.756466701165596042e-03,2.546442679311766450e-03,-7.082536387051132634e-04,-3.271073195261703717e-03,4.581642612979068899e-03,-2.083039541143099586e-03,-5.455093513358987532e-03,9.551330519409753292e-03,1.513675475788653834e-03,-3.048128829332747650e-03,5.135968627745224595e-04,-3.693423
019349943021e-03,-4.651237730069430272e-04,3.751536801093640402e-03,-3.389169471490333899e-04,1.689185072197593013e-04,-1.135675212287851066e-02,-1.207093061091169678e-03,-6.924955497145674453e-03,3.586166103152149589e-03,-8.446670420016054795e-03,-1.162354695077016768e-03,-3.072655791991421321e-03,3.783121138955834652e-03,2.353807017806827922e-03,-3.088504606222331177e-03,1.023889267984454799e-04,-1.545751026974185845e-03,-1.154188039893995878e-03,-8.073248485251522874e-03,1.442841811359925472e-03,-3.839119126397172455e-03,-2.061312894320979725e-03,7.568534333195118223e-03,2.248440587531316500e-04,5.094807684509428114e-03,3.363056939061728669e-03,3.491474488788752148e-03,-1.478110790117632706e-03,-6.048953579048474243e-03,-2.815287637143054113e-05,4.668599577432394274e-03,-3.863144409790750591e-03,3.920366484648329063e-03,1.709073188552647643e-03,4.025898141122434895e-03,-2.750183482302123381e-03,7.683684941520249221e-04,2.086495156286398162e-03,-2.692039552956773923e-03,-1.860768731035280111e-03,-3.755510117812231423e-03,-6.632051953218071668e-04,9.467766860269586054e-05,-9.065461656059535826e-04,-3.786620162512073767e-03,1.496148946979817358e-03,3.147545610658882258e-03,-1.523481553049101881e-03,5.080048515706393590e-03,1.925079632303594796e-04,1.085401354350893624e-03,3.249986530325569330e-03,-3.068307351263617738e-03,2.460029496312786737e-03,6.194819380263816809e-03,-4.481855644766623639e-03,2.753073129738299943e-03,-6.307660777639935391e-04,-1.015986837534816820e-02,-5.952696945216524350e-03,-3.037569027607777555e-03,-1.458705156441195746e-03,4.048630879725556402e-03,-9.071512628009295934e-03,-3.228722352681619428e-03,1.739150213039649848e-04,4.621407690739215944e-03,-5.832199101376632500e-03,6.138150382976338129e-04,3.138040466776577386e-03,1.437035939228206412e-03,-3.943824825296863255e-03,-5.367813821745066882e-03,-4.463784663830482925e-03,-4.261560310427983840e-03,1.397968257445010706e-03,2.056317028076303649e-03,6.687935116997597044e-04,8.1942887721466305
81e-03,-4.602562644755485259e-03,3.299759884889266553e-04,3.675143893579732417e-03,-1.429165520463838296e-03,-5.683960321024747421e-04,-1.901916477155874514e-03,2.006141511800279588e-03,5.058433372870746415e-04,6.097551752283562367e-03,-1.729696418471093466e-04,-2.115766374886675783e-03,-5.570373722532678644e-03,-1.304880903031904911e-03,-3.066708199027342804e-03,3.377953139334588517e-03,1.515335172993362072e-03,-4.084794771050588059e-03,-2.977341974360121334e-03,-9.266994071393902478e-04,9.707812078133750029e-04,-4.179714304791603587e-03,-1.782086217136300259e-03,-3.427045350943182375e-04,-4.770620629853542308e-03,-1.464877366718344990e-03,-1.051863765041041211e-02,-4.972379023381918658e-03,8.140344590464069444e-04,-5.636288790560839597e-03,-4.445561213140176299e-03,6.393395668044146273e-05,3.778640684677965142e-03,3.889663068469741830e-03,-7.951348792640858774e-04,-3.926459113669166293e-03,1.473429561887363301e-03,5.311106873535993873e-03,-3.922719230776639844e-03,3.236262220676792110e-03,8.353077697560743961e-03,-3.842814051155134768e-03,-5.136435320321209620e-03,-6.940592970462714298e-03,-5.121566226745827982e-03,2.031815583794806517e-03,5.245180307093181941e-03,3.820804693137550088e-03,-3.524719527378296358e-03,-3.501082855994146203e-03,7.345178529389664868e-04,4.262769509309133407e-04,-8.211766163424928438e-04,-2.345871264656022599e-03,-6.640362250819885391e-04,4.122277321804831084e-04,-4.151056148752748590e-04,2.815523288705776631e-03,1.024501870955497754e-03,-9.054475429456166138e-03,-8.823349530549714484e-04,-1.683809392546332115e-03,5.258500427105636602e-03,-2.397894954065111876e-03,-6.486383269001439258e-03,2.571702459819547818e-03,5.804944009895504804e-03,4.399679065515649202e-04,-1.505985619117285483e-03,8.507544448919676303e-03,3.145899039467003194e-03,-4.817550369532904581e-03,3.309942750963944854e-03,1.082210747116531084e-03,9.676503663546612549e-04,-6.454251394082519318e-05,5.636527228393553753e-03,1.102356568671701738e-02,3.185286802953827406e-03,-
9.710036749241739845e-03,-1.999890254872992764e-03,-3.764897285498346007e-03,8.124033722980556621e-03,-8.749278627303687383e-03,3.267056046154209314e-03,-1.523099789786643401e-03,9.309087090823182919e-04,1.590102086705939442e-03,-1.890797066134092611e-03,-1.665239413000357003e-03,5.494539520996382645e-04,-9.503003335761794554e-03,-2.281130984359421802e-03,4.383599292811853065e-03,2.543379238627068318e-03,-3.918429175857327147e-03,-4.447849269638113300e-03,1.050935225207483975e-03,4.465978595366398548e-04,1.300441673341505002e-03,-2.483971376293981400e-04,4.861914698293399507e-04,-2.370352858275809468e-04,1.362228535832109419e-03,1.769739535568824306e-03,-1.115068618660436774e-03,-1.050410834611570279e-03,6.187504892558711919e-05,7.195995148576869943e-04,5.768385188621911498e-03,1.786113211696679481e-03,5.130435101807364652e-03,-1.770362452855528618e-03,1.762817156911375676e-03,-4.686580655088498290e-03,-3.635918881403608222e-03,5.582144875187249533e-04,-4.273995317960624894e-03,1.979649909965571988e-06,6.590959521418894790e-04,-5.066188024575813071e-04,-1.182590290031273160e-03,-3.292552349266479367e-04,1.587648612561726267e-03,-1.751783228967431199e-03,7.329582824956165513e-04,-5.574946930859723399e-03,-4.923143682201823077e-04,2.646464411391341413e-03,-2.451039888001482585e-03,1.419681093064993974e-03,5.553416240894956148e-03,-9.562677291513660763e-03,2.656688683268136879e-03,-2.041104198831559906e-03,1.612482921976768954e-03,6.931677523077578234e-04,-2.471442613018905946e-03,1.702430114497096268e-03,-2.979427324316170911e-03,-1.758067189287350861e-03,-2.430183297503889402e-03,6.520998612280197318e-04,3.198127697528613689e-03,4.709313627222789025e-03,2.142975987668421095e-03,-1.917668421034107295e-03,-1.344596625041372595e-03,5.668154707357733919e-03,-9.967649346507606763e-04,3.951905168902155631e-03,-1.198432922816578287e-03,-2.219519732737485309e-04,6.035693302685843798e-04,2.301302676963730594e-03,-6.511941475684163323e-03,2.568087024365138527e-03,3.99946620560
3327031e-03,1.876851439403669664e-04,1.540298286291482332e-03,-2.180710078044871113e-03,-6.593921605571882553e-03,-2.294465729851407915e-03,1.334678100502183131e-03,6.497616877833672401e-03,3.130635393972131496e-03,-1.768034350068220430e-03,1.926294151079666192e-03,1.013838112355984873e-04,5.752940177943552044e-03,7.273017065797676445e-04,-7.141606342827747957e-03,-4.000822488984359313e-03,-2.143192402409583881e-03,1.556780635609166224e-03,-3.918633335113092149e-03,-5.881259781598197414e-04,-3.174154216275393205e-03,-7.333686573046342867e-03,5.927351153824488711e-03,2.563270150385541565e-03,-4.369429294831410132e-04,-1.303511919107637473e-03,6.381214719644016757e-03,-3.415402902165595375e-04,-6.082288321461214349e-03,1.790300812134257074e-03,3.731473816010633972e-03,1.180988606679596781e-04,-5.044057386265727097e-03,-6.552858167015016347e-03,2.724679001092835708e-03,3.345022364743319328e-03,-1.321738913311315718e-03,-2.742549396852622754e-03,5.273180162278576352e-04,-2.463284261707194590e-03,2.619282871614015675e-04,-2.280127370470864338e-03,-1.308027123954593786e-03,7.855485501766071993e-03,1.633677482794614879e-03,5.919520706586079524e-04,4.610562281159584670e-03,-1.071225268686476799e-03,-3.827126616886834289e-03,2.683847222286119688e-04,-2.134748376927545713e-03,-2.683620060602805602e-03,3.858613906985411141e-03,-1.679018591163631505e-03,8.786821624756096681e-04,5.023037966246310555e-03,2.523827754246944603e-03,1.665053772010272656e-03,-3.573460670666753561e-03,-6.740351358978876467e-03,2.469692431082382358e-03,-5.124307123503932162e-04,-6.014983371153204313e-03,-3.228755488151885053e-03,5.419088958638559016e-03,6.281045259282001759e-04,1.128858072956646962e-03,-6.541121937381828844e-03,-7.542450289633913230e-04,1.993208960655790812e-03,7.775059574779664799e-03,5.191523833986528245e-04,5.788045599090868402e-03,1.504683114424371440e-04,2.273026688176346677e-03,2.311886655927378360e-03,-1.605063791633716585e-03,8.092432563362355127e-04,4.946474523103783563e-03,-4.
382072724397517542e-03,1.711087749958356509e-03,2.154971619924284532e-03,-4.663169804750825001e-03,5.792364444763060082e-03,3.714518120006095759e-03,1.562545354172042208e-03,-3.520493719891623283e-03,-2.089198509192534935e-03,-1.787460302020727485e-03,5.500893805031054594e-03,-5.043309288922136376e-03,-1.619013809689027073e-03,2.898892709784814392e-03,-7.396665424169113212e-03,5.676021652920284523e-03,1.773656804262012160e-04,-1.323156058619177032e-03,-4.448260282288076482e-03,-2.252184497375445908e-03,-9.788209559937978334e-04,5.304020544499445967e-03,6.579350345963710239e-04,-1.727770397140206571e-03,-3.585702258422089790e-03,4.064918754447948389e-03,-8.137400675572607767e-03,4.903010647184055659e-03,-4.757651285107512146e-03,8.294039929208817108e-03,6.993170985611535247e-06,-1.913302169322607122e-03,-3.094801285033816751e-03,4.724088280770147893e-04,6.090593681244093303e-03,-1.774025155233890233e-03,3.864378043456308340e-03,-4.583930688331527142e-03,2.227790336767949790e-03,2.008801686039625879e-03,-1.356464848083176753e-03,9.729896687424671687e-04,-3.166623381264999960e-03,9.765163965524000476e-04,1.163292225592441498e-04,1.142172467079216640e-04,-3.319371800277874673e-04,-7.328528685613088824e-03,-2.981035873745005102e-03,4.307806200699013793e-03,3.565573229328214671e-03,3.317400783570957981e-03,-4.457863050957358293e-04,-5.654177317031343700e-03,-3.356523209244073169e-03,6.767607685289731998e-04,4.108071621190495341e-03,1.294237405132560260e-03,-2.160939185469755181e-03,-3.084251393976996519e-03,-6.348767216181291570e-03,-2.150859288283425534e-03,-5.155275548684150159e-03,2.266140943494834165e-03,1.343812006801061018e-03,6.340007158506557640e-03,1.552268065038129707e-03,-1.928649666744434565e-03,7.777957139061376186e-04,-1.759163337215373900e-03,-2.578728121624658445e-03,-1.003750003612912702e-02,-3.019503795373206737e-03,-4.974424544974696306e-03,-1.635818307165284348e-03,1.883200903848856024e-03,-1.926512984224450220e-03,1.965686322015919481e-03,1.8005964313
20356612e-03,-1.748032135373002796e-03,1.485833260131168981e-03,-2.392251185301801000e-03,6.504473280425520677e-04,2.945235133098360499e-03,5.116078801648860606e-03,3.653849763603926105e-04,5.762227807380149690e-03,2.349345406938168138e-03,1.204181774165716257e-03,-5.195403672375113399e-03,-4.684341938852798308e-03,-3.764394082702674903e-03,-7.232511322805205804e-04,-3.969688598906466275e-04,7.799050719973095978e-04,4.230786203649214948e-03,-2.973408735651571211e-03,5.940888559708030862e-03,-2.479476279192832904e-03,-1.039593544138805959e-03,-2.245346907996213000e-03,-2.832761191169010356e-03,3.326426232349576594e-03,-2.216688280851098099e-03,-1.864347556552545075e-03,1.347261923293933050e-03,-4.892217897052136269e-04,-1.730047237076681168e-03,-3.936389336780009361e-04,-8.415791172967033431e-04,-1.944822342472235163e-03,-8.437404352342557493e-05,-1.360605305025745359e-04,-2.772101842688236262e-04,1.753717722378790806e-03,-1.726209444106497487e-03,-4.389871830186868086e-03,6.443991999994473120e-03,-5.628727669125516532e-05,1.186326885818438787e-02,-4.927551134284935234e-03,5.640781022203694706e-03,8.671938402955992325e-05,3.196922766360444767e-03,3.382038458791716395e-03,-3.536844107808095709e-03,4.463110460816480086e-04,-1.084292685672951515e-03,7.299587839853752258e-03,3.104228179010946800e-03,3.790143886992101020e-03,5.083879921051486298e-03,2.670137424298635667e-03,8.572587417933633450e-03,2.166834415575550395e-03,6.879920912191069514e-03,-3.513135372671444615e-03,3.876335063866274966e-03,1.278742158143278884e-03,6.975909792095715759e-05,4.395651838759651563e-03,-4.953594321021849260e-03,6.206992783572578068e-03,3.379167474852172968e-03,-5.072283812018695253e-04,1.004170640369571798e-03,-1.338977489244882546e-03,4.885949658782580245e-03,-2.257325448543266895e-03,1.341212882875720928e-03,5.361770328180122044e-03,4.465849007374254019e-03,2.320759524252525869e-03,4.746079497469889337e-03,8.630436480097321283e-03,-4.677198018333762058e-03,-5.498643158776129504e-03,-4
.981409989103734691e-04,3.213804849824733814e-03,1.668236680723350109e-03,6.679373504497413197e-03,4.087699028550899685e-03,-4.032071263189893409e-03,-3.640928674451714077e-04,2.906775815626509805e-03,-2.463008802066436742e-03,7.884782807565913348e-03,4.761152670609621052e-03,-6.318922170939656781e-03,4.777741439165105602e-03,-7.357767412023404410e-03,-1.266681308839424707e-03,-6.249748867617241355e-03,4.806448151941158931e-04,-3.135994518575374937e-03,4.782457913258358201e-03,1.466466245628294000e-03,-6.298211830924207652e-03,-1.946117968729037554e-03,3.398281526340951684e-03,5.417522296620562450e-03,-3.942814859287355894e-03,-2.258955454477965235e-03,-4.014689081715901511e-03,6.398994286891967510e-04,-3.864223992380299861e-03,-1.637980335682739453e-03,-1.336901302527089249e-03,-3.345119885946262230e-03,1.081424718531149177e-03,-6.284263872036049275e-03,1.300738572256849668e-04,1.760642258591387378e-04,3.847401599451156107e-03,-1.752750117529043973e-03,-6.303216985397135791e-03,1.635429866171467760e-03,-3.265162671698616705e-03,-2.937361901630912171e-03,-7.172113474188590229e-03,2.448890174936894188e-05,1.063646789578995625e-03,-2.388702553990708268e-03,-5.721032738068809222e-03,1.354655176036766509e-03,9.814656158871301259e-03,1.060191614409342672e-03,4.050194165351975184e-03,-1.323327179885597530e-03,3.007769523543653289e-03,-2.338033105253391284e-03,-7.315324892998978368e-03,-5.900611486391506115e-03,-8.486159162742330092e-03,3.815476777345529029e-03,-3.317190491514041516e-03,-1.754511744201300025e-03,-6.222224104574194947e-03,3.750723177643189733e-03,1.967024481350970744e-03,-4.401088365633230989e-04,-5.178988816765117546e-05,-2.333526264956280270e-03,3.909548422761480006e-03,-9.164148284276043459e-04,-1.816263761568012898e-03,-3.544848785175158581e-03,-7.655525116464237390e-03,8.661561910822439736e-04,4.856404063178292388e-03,5.981579082844528869e-03,2.444481217299759000e-03,2.684942986312791480e-03,-2.008759572025098881e-03,6.314811813473162998e-03,-4.4977458
69820748456e-03,-2.383472490144755273e-03,2.956809167293442033e-03,7.608084892883012519e-04,3.464715103171058658e-03,6.060456925774573844e-03,2.279208957250175843e-03,7.165810045228677462e-03,2.778660862499315962e-03,3.536238294817787946e-03,5.498124010768281299e-03,-1.280311844428193074e-03,-1.050849739131304956e-03,-4.646829878015117966e-03,2.492060864690679237e-03,6.745794541002295315e-03,-6.217487150718885996e-03,-5.933830538333512497e-03,-1.536603770369764392e-03,1.716618912886634879e-04,2.329366013167856377e-03,9.316076898178970176e-04,-8.645768024255622323e-04,5.333553331132390163e-03,-3.262049484852580223e-04,-6.472568042513449083e-03,1.739529045836621683e-03,6.788298370082938155e-03,-7.322380203766390774e-03,2.629858669062768405e-03,1.600286452569898684e-03,-2.997981300354379754e-04,6.637022049955884281e-03,-1.096745980767150182e-04,1.270644721098578228e-03,4.126596928155396027e-03,5.342304545371772025e-03,4.329260794925782890e-03,-7.067448634544573616e-03,-6.103262872218138783e-03,-2.681965452422610790e-03,2.832243089778423758e-03,3.679700484102399385e-03,5.298291360641295585e-03,4.771049888386066505e-04,1.070389133954142849e-03,1.540570865407182768e-03,5.648549344168697119e-03,-5.200255580846735275e-03,2.989884678221506678e-03,2.907316491999193189e-03,-3.130257167382434307e-03,1.321176565489239930e-03,5.962933994446411191e-04,-2.292232758257042608e-03,3.733677952927846790e-03,-2.344629595503533188e-03,-6.541576025888252814e-03,-2.470745635511124188e-04,3.134092764178858861e-03,-3.296652640738966834e-03,4.923785741896334264e-03,1.456000580278486549e-03,1.924316382111733231e-03,1.385458730756988777e-03,-2.349716870733464889e-03,-2.822165272724905770e-03,-1.774842557369734567e-03,2.125984851605376962e-03,-1.020751146630326017e-03,-2.301431726324407352e-03,-1.117087497683826692e-05,1.187829203681765573e-04,4.377155827691117662e-03,-1.153747066807049365e-03,2.572033385180045890e-03,4.794883797941776674e-03,-1.789081675493793804e-03,-1.670281722561250438e-03,-4
.862377224703077518e-03,4.783466439507862503e-03,-6.281823527350773907e-03,-2.673088655717211545e-03,4.801891596604499282e-03,1.811671663067686636e-03,3.618821764593790988e-03,3.111381583491141557e-03,3.638139102186061565e-03,-6.360512910186992639e-04,-1.619349710521248570e-04,-1.787764385728021635e-03,-2.692783132344500618e-03,-6.064039608640791647e-03,2.694961224406875518e-03,-6.360372607275305984e-03,3.718807633751299241e-04,-2.848882220239667713e-03,1.153550163162654759e-03,-7.369758983346412669e-03,9.909278899036585292e-04,3.940158999018839904e-03,-5.604563655795694564e-03,-4.453645640099786543e-03,-1.044379809368569355e-03,-3.399973182260328936e-03,4.767068052478189719e-04,3.164628255307861933e-04,-3.240502567962268501e-03,-3.794461649588973164e-04,7.355041864353253564e-03,1.719862459226652535e-03,-1.854535587965011351e-03,-8.854255657842573929e-03,-1.548822055506781292e-03,-6.163850230211918337e-03,2.185496840904460527e-03,7.922171204979903915e-04,-1.825288166501542369e-03,1.002982418627654942e-02,-6.329384469899833852e-03,2.071356008184854595e-03,1.034894721681407521e-03,2.692092194672838818e-04,-1.664386589322277296e-03,3.335223665833396238e-03,3.557801604700636111e-03,3.655329277689619801e-04,3.968163503974625299e-03,-2.015511229410584925e-04,1.334002548423235792e-03,4.376350502799493218e-03,-2.766053740003485660e-03,2.136526859945393514e-04,-8.151959260643333594e-03,2.190307984679634983e-03,-4.829532690254247952e-03,-2.291053927914742420e-04,-3.152166406751467660e-03,5.055177081581836880e-03,1.519555399544625591e-03,-3.439114957786355495e-03,-6.317911208375969964e-03,-6.348768263387999204e-03,-2.855864931087007900e-04,-2.682755558470736011e-03,-5.056501749471625957e-03,6.614681145461809991e-03,6.580036523821388653e-04,1.929342121040290327e-03,-2.121645786201135572e-03,-3.295243056557521037e-03,1.106224263101300781e-03,3.703586697047729030e-03,4.084619854237905509e-03,2.733241900114976013e-03,4.213991969458458220e-03,-3.115428462035872616e-03,1.44997294502
1286812e-03,-2.448140219601857223e-04,4.155278942621825310e-03,2.520275397920251968e-04,2.297203750133191939e-03,-1.022944940623956786e-03,2.113405125970993677e-03,-2.543525006181854113e-03,9.931263831677960945e-04,4.949846680868438386e-04,2.268309488632146663e-04,3.333550680575074390e-03,-8.285769355212942514e-04,-5.474176522662779244e-03,7.722778912528572258e-03,-1.322472815642970799e-03,4.566732727830341131e-04,-1.005792133680738498e-03,3.340957215044676378e-03,-2.503386911234475268e-03,-6.045354782132209004e-03,6.364948264029997999e-03,5.107266661038779333e-03,-1.639228742486054796e-03,-7.067606384621500813e-04,4.177023377467733480e-03,3.034248293716160141e-03,-3.868095363278442264e-03,-5.560956403570761590e-03,5.030381720666277090e-03,-2.083533684153443359e-03,-1.386358106306878526e-03,6.023935841861155016e-04,1.319878189320076349e-03,2.058191835778181485e-03,2.659408759934872232e-03,5.328762928622803778e-04,-5.652664640239139435e-04,6.700959991602917221e-04,-7.795570849718573211e-03,3.460124566430473891e-03,-1.644717829991255379e-03,-1.707192850630548592e-03,-6.708347540486065547e-03,-6.065404821121199615e-03,3.014008219404918157e-03,-4.458599191380073837e-03,-4.759230539114330145e-03,1.382621399296375588e-03,-3.662600139641388960e-03,4.089094043699528953e-03,-3.785758536868489947e-03,7.034289877336163347e-04,-1.981202393950853326e-03,2.650829073582070070e-03,1.554603982368380974e-03,2.207326946440616756e-03,-2.021726400778876397e-03,-6.913853262310685078e-04,-2.405037444484591554e-03,-1.409873900493881791e-03,-3.031457631711633512e-03,1.111004349339322081e-03,2.794904389760290847e-03,1.458851053934109940e-03,2.579329117751885062e-03,-4.356973666740361802e-03,-4.489768679242393118e-03,7.112261859786057384e-03,3.995914176094277764e-03,2.824838638148125143e-03,4.331097366641834846e-03,-7.312426486407355274e-04,6.761135122140028708e-04,-5.569709010672524441e-03,1.906264500384635381e-03,-3.726829954571758523e-03,-1.282942603029763300e-03,-1.086404677523248241e-03,
2.311303774140379200e-03,4.768849721698558905e-03,3.180005693469786007e-03,-4.306171936709778450e-03,6.188006362234959278e-03,1.976840119573855699e-03,7.252522633831520217e-03,4.452340468773338666e-03,1.124614288334634072e-03,4.210422141240316007e-04,8.468492481130384872e-03,2.994284452477932561e-03,-4.409921980080720380e-03,7.327004125724243325e-03,6.637764898006420083e-03,-2.383617664496128028e-03,-2.007312260059366648e-03,-5.119206788096843840e-05,5.413469327089776018e-03,2.500628809338626031e-03,4.228993926902408461e-03,-3.684139634221844650e-03,6.255248514146427401e-03,-1.198646788749574942e-02,1.262808497533713639e-04,1.983414581531913953e-03,3.492473381631337921e-03,-9.223120368273676761e-03,2.486613855233087839e-03,-3.795020094564476552e-03,4.239907757230676824e-03,-1.475834423476334815e-03,-4.086019491923865987e-03,-2.117595310977426863e-03,8.599865027483052232e-03,9.558645386920636752e-05,-3.068450538501813797e-03,-1.063483385079367748e-03,-4.051696555979981268e-04,-1.156608223859764163e-04,-3.173629709789399424e-03,2.795284643565331111e-03,-7.232388063917335151e-03,-1.807150180592619190e-03,-2.856416978163188545e-03,-2.639475529127509670e-03,6.153036209745816817e-03,8.422199292986268931e-03,8.067413200774842990e-04,9.743770832755668798e-03,7.399517729823550756e-04,6.268396433238117987e-03,-4.726214697114986679e-03,-2.011624507066358137e-04,3.839796846512776246e-04,-2.545200975128525736e-03,-5.497307888617460753e-03,-5.209923206478175688e-03,-6.788105689700816431e-03,1.156205735534774574e-03,4.673516161349418156e-03,2.354685867381633123e-03,1.049478830294845941e-02,9.291289297565405794e-03,-5.104000700525233337e-03,3.815070741445170718e-03,2.368709157816302219e-03,1.566229609934429931e-03,-9.693955469642453375e-04,-5.459374589261597327e-03,-3.046946421193819001e-03,1.431054367166543877e-03,4.966812870464812346e-03,-1.220147025674894270e-03,7.796382459321262587e-04,7.454943749561414139e-03,3.834847900983840655e-03,6.396424962405174814e-04,-4.104759995653331
842e-03,-7.759714661385797496e-04,4.404876019389190184e-03,1.369788176396269375e-03,2.292972502681558597e-03,-2.682288180980477686e-03,6.681862049819373671e-03,2.473243955956249025e-03,-8.739180157752230588e-03,2.127836507478950059e-03,2.500784586404848110e-03,-2.753616729598949795e-03,-1.842587065522199126e-03,-9.353678170458764110e-03,-1.593468511056576626e-03,4.046088615545669867e-04,-5.596870784524569757e-03,1.224209596086234817e-03,-8.179104638410407413e-03,-7.176751664973813304e-03,4.615978095345179850e-04,-1.779303928585667435e-03,4.779044667914745734e-03,-5.642923458106939745e-03,-3.475093127974396550e-03,-3.874401905005009166e-05,-5.153615799057022218e-03,4.673010286513117062e-03,5.358512724165884547e-03,-2.176754677553799453e-03,5.463713280347496336e-03,2.099346308342539633e-03,5.918462913680780542e-03,-3.230481862616408750e-03,2.254401338055119983e-03,-8.878840988220174785e-03,-7.057146683604279555e-04,2.716259243941733081e-03,2.869413407143047269e-04,3.049278081624187762e-03,2.774100522798403357e-03,-5.194355804214147859e-03,5.544047778334698640e-03,-3.776734168967006743e-04,-1.114822201745784263e-03,9.104831637585296487e-04,1.961482145006316052e-03,-1.722222011451728920e-03,2.687446934908130532e-03,1.271055331955123864e-03,6.542593085634494481e-03,6.559867489840488959e-03,3.775064640701593192e-03,1.229998862042134015e-03,-3.216653239794817990e-03,-5.009324253063047842e-03,-6.238716170600492987e-03,4.774770743652277140e-03,-1.546027812505645178e-03 
+4.751112427892467162e-03,3.598232589289064847e-03,9.794055407459878416e-05,-4.311677383805452568e-03,-6.069646572223588547e-03,1.700006872205484919e-04,-2.683924966989529540e-03,2.726485379570426151e-03,1.204339197307428442e-03,-8.008361037679528199e-03,-4.931701266330589884e-05,-3.860830978319484009e-03,2.947844522490427038e-03,4.847780035457674441e-03,3.088293701058521766e-03,4.001032744047576875e-03,3.742271291951105357e-03,-7.708572203758401985e-04,2.277260791482417881e-04,-1.782340146482208404e-03,-4.432077100071595520e-03,7.649921932901693894e-04,5.362957021730329352e-04,-1.819896738266709422e-04,-9.693123112482396331e-03,5.088443216996132654e-03,-3.096185798789494181e-03,-1.327702024370553367e-03,7.144851145693665543e-05,-4.296687422014252092e-03,-3.226807003285811666e-03,-5.490456708095717211e-05,-6.719552593947813839e-04,-2.321841506949465748e-03,-5.769178819702429654e-04,-1.394403854049465702e-03,-5.544631634609474731e-03,-8.917324187052913867e-03,-7.656833426561658596e-05,-2.871767064521395501e-03,-4.339971027751781535e-03,4.488313301408518655e-03,2.904845731473358096e-03,1.432399842908982845e-03,4.969076957736658431e-03,2.916567949455938034e-03,-1.517695331711829123e-03,-3.528772270057508806e-03,4.198203622430641060e-04,3.377057127671890570e-03,-4.100969831975339631e-03,2.080014754339649594e-03,7.396535841608609281e-03,2.353984603956922172e-03,6.215738728180666972e-03,-2.631278414131971253e-03,-3.611912402223945232e-03,-3.310677578342828640e-03,-5.372180775331567015e-03,-2.275472989581096925e-03,-3.013370939712520929e-03,2.611285546414787197e-03,5.439678892851988398e-03,1.034835891422428156e-02,-1.192896700106387290e-03,2.601120427349489731e-03,2.598609808726827139e-04,-4.093206611996316746e-03,1.783985653421440738e-03,3.982020741486520798e-03,-3.293778413573643891e-03,-1.589464585514752419e-03,-2.975473399364152172e-03,-1.899698567262982259e-03,1.229809835506709284e-03,4.502876150941585026e-03,5.397289558974044561e-03,-1.177817235770882258e-03,-3.94517
5865265296462e-03,2.760615542192645499e-03,7.141383233800923562e-04,-4.887618966960173214e-03,2.284259973982408470e-03,8.366453484927089193e-03,7.377615667422053299e-04,5.530095971008563982e-03,-2.875047831994934685e-03,2.258295071050656685e-03,-4.690029031712394068e-03,7.036623581234906222e-03,1.419212253548128664e-03,-1.619778653378387955e-03,3.488141338982750448e-03,1.565739863000268923e-03,2.811175037191080903e-03,-3.756213368078062820e-03,4.204337695583065804e-03,-1.826422963391528191e-04,-3.815247987393810205e-05,-2.581835900612219990e-03,-1.477843144350362344e-03,-2.029762196440689295e-03,-7.964893437219091460e-03,4.218531570296037148e-03,1.523983800090666918e-03,-3.567107900425241663e-03,3.432048835258643110e-03,9.070143601454026903e-04,-1.016538084754329933e-02,-2.991067262080778118e-03,-1.759288515687217449e-03,-4.743374174003064415e-03,-6.338125817298717428e-03,3.341500076009308075e-03,-2.095588936277073024e-04,6.317271815914584331e-03,-1.820629418547130805e-03,5.697510359778910037e-03,-1.440568874836346738e-03,1.102947435847957073e-03,-1.320093455041741314e-03,-4.142645163668056893e-03,-6.913376988387914639e-03,-6.845875978157296878e-03,-7.314596171455099716e-03,1.835055550989881474e-03,-6.730296742165187140e-03,-1.766716513027768044e-03,3.570383547588810991e-03,3.727465512863477933e-03,4.539556379957438055e-03,4.612032110016286296e-03,-5.581529153599816925e-05,8.589235415717228622e-03,2.411462066358868988e-03,-1.562226949688242435e-03,4.103716535278355802e-03,2.280448367272880101e-03,5.476980287945992787e-03,8.768893935152221979e-04,1.755933646428625575e-03,7.927756189039961848e-03,5.271679186773656905e-03,3.211827465127205412e-03,7.089049722967491907e-04,1.020792838978193569e-02,3.266769200796097294e-03,1.110547579067372182e-04,-2.665148812275858876e-04,-5.568476132855578256e-03,-3.086721959689193735e-03,-9.154700135932381227e-04,7.133529392529190899e-05,3.230229018439725267e-03,1.433739583392413868e-03,3.602221391652639208e-03,-1.028497755536817966e-0
3,-1.814109696730121677e-03,2.416551532944177360e-03,-3.579211836510048407e-03,-1.078591391591877659e-04,1.152718832008353114e-03,-1.165911899341168104e-03,2.591976781535125854e-03,-3.281146858235864296e-03,-5.708616009989036198e-03,-1.903245497176509621e-03,5.714554551594777827e-03,7.281491013935387745e-03,-3.246545417884667394e-03,3.656297258968087185e-03,-3.843666627665601708e-03,3.783134598089897115e-03,3.895849919436533259e-03,4.832403020197330768e-04,-6.148105139051103160e-03,4.910431979512591520e-03,-7.311900107240482203e-05,-1.748882792229711084e-03,5.783113382504092952e-03,4.404749838894073469e-03,9.058935675332420792e-03,1.813706708029725776e-03,1.299987642244934336e-02,-4.769044435355784470e-03,-5.586252072213872787e-03,2.880531281087001308e-03,-6.892157280086515930e-03,-3.541451562285494788e-03,-1.574313284382613089e-03,-3.537922304155931980e-03,-7.725828167760379278e-04,-5.997886933219413871e-04,-8.715852950368531546e-04,-1.996920060155536053e-03,4.488911386298337879e-03,-4.386039401698109234e-03,-1.712580809542601796e-03,4.915137489894115704e-03,-4.088432028077722789e-03,1.914018949820286411e-03,6.011584082194996841e-03,1.154192768174430814e-03,9.566052657747872896e-04,3.693925310152990776e-03,2.501912302997262062e-03,5.400761440841256311e-03,-3.240588014774124608e-03,-1.119988598982403103e-03,2.490779794567538757e-03,-1.458347661112163090e-03,-2.768890207800368906e-03,4.469991874051917956e-04,3.079206985534855678e-03,2.564900499607560241e-04,4.919103037617301825e-03,4.103254913170658637e-03,-1.828700947424058637e-03,1.720363639626954193e-03,-3.279230710274231523e-03,5.701987691240804554e-03,-8.277323403528788134e-03,-1.643791127248539312e-03,6.849521940525167421e-04,-3.568923200317348967e-03,-4.649900930295474549e-03,-2.506958671530442034e-03,-3.757044208080899590e-04,-1.962961572365218201e-03,-1.384264028873668931e-03,-1.208669921720957044e-03,2.087274398592452954e-03,5.002614469677960432e-03,-6.694731732156348342e-03,-2.501951544338850450e-04,-4.832
982835714080610e-04,9.743483028258384179e-03,4.622636400062545871e-03,-1.883719454214295578e-03,-2.669782487560197395e-03,-3.804197429782694723e-03,4.484653800311718926e-04,1.434318319545963005e-03,1.031966749572373618e-03,7.705375629242765950e-04,3.318863398931122533e-03,2.358707055641781581e-04,3.111643382429137711e-03,3.028555266298476809e-04,-2.696885635196003429e-03,1.803070080207254346e-04,4.608457861470259542e-03,1.793827696738704066e-03,5.594480225663145054e-03,1.494106774420638330e-03,1.914215832407435117e-04,-9.878891150409791280e-03,-1.774334402442475419e-03,-8.399805136990033021e-04,3.871711106715227269e-03,-1.041630272906520068e-03,-2.834889951267230836e-04,4.106874936215859986e-03,-8.777904967998211571e-07,-4.735035712630949328e-03,-2.405742889900766243e-04,-5.074189557736369892e-03,-2.873656719081226895e-03,-3.267667845795819374e-03,-3.829830641262670782e-03,-1.337629409688065275e-03,4.757406862918128927e-03,3.833786605188761569e-03,2.810231785993847105e-03,-1.385282236695593513e-03,4.070523683113263375e-03,1.417645288278920630e-03,2.689443383697935966e-03,1.426368139232484390e-03,-2.392675665004865334e-03,-2.772351592534081418e-03,-2.074684593019271526e-03,3.460150481603524800e-03,-6.183174587505897712e-04,-1.272228982306910974e-03,5.689923664057519419e-03,-2.017658000217812178e-03,-3.911015348685338740e-03,2.014105074243958179e-03,-8.123024855175273867e-03,-1.901113444724840569e-03,-1.955802004838275154e-03,1.011757299034290233e-03,-8.125329533876190363e-03,3.195822046850677815e-04,-5.418759253517751485e-03,-8.264174066299145264e-05,6.187779205828482752e-03,-5.758975653811773329e-04,-2.867911340189416031e-03,5.176686458991207676e-03,-2.349004873897419295e-03,2.633153404353545928e-03,3.826557401242537421e-03,-2.331688506428087129e-03,-4.475583378157468005e-03,-7.720127879082797247e-03,-7.188411237299653471e-03,1.543766568331849916e-03,7.830168663255244327e-04,1.522025256574366771e-03,-1.316563268295094929e-02,-7.887677311696877349e-04,-2.571295848546
007248e-03,9.135707851269997057e-04,2.441730826095525348e-03,8.706317246201709936e-03,-4.364614006604841841e-03,5.752485631926571724e-03,-5.144875858239237633e-03,-2.937848931655372489e-03,-1.413268799201748260e-03,1.253318055705798925e-03,-2.193029132980328557e-03,3.464618155300983049e-03,-1.466922123966007225e-03,-7.818531434462849036e-04,-7.188596343205163296e-03,-8.863950014540986685e-04,2.307449384710457980e-03,-1.099163664510496706e-06,-2.779290118249864994e-03,8.580775251340141963e-03,-1.997533413101668662e-04,4.450357147399840001e-03,-1.848469596772093213e-04,-7.181565263396944877e-03,1.178353837457256251e-04,1.039177362005461050e-03,-5.153851542786451294e-03,5.763034243757578570e-04,-3.777643053745036758e-04,2.084905316375686678e-03,4.647526417134976648e-03,4.615344432806515952e-04,2.019755734859851170e-03,-3.165610666588044003e-03,-1.244781439209713229e-03,-1.846219645012776335e-03,2.532248997998019259e-03,5.106337230182365211e-03,1.420198101378595743e-03,-2.023086802999893307e-04,-4.014224771373559042e-03,-1.531063990565432889e-03,-1.565497649488292687e-03,-7.938316242868122263e-03,8.576497585881902438e-04,-2.071648328030385559e-03,-1.097171979203500486e-02,-3.894750770780609685e-03,2.760000546417316208e-03,1.453921691580402917e-03,-2.767831204432912549e-04,1.256110833094983489e-02,9.460581349531815842e-03,4.436172974794994235e-03,6.666701476038137866e-04,-4.036867082045603357e-03,5.461298423198921934e-03,-2.912096642926034762e-03,6.229398400487049479e-03,5.745850197060201753e-03,-3.344469313610214708e-03,3.620081046155145497e-03,4.061739778293655193e-03,4.339739313995588654e-03,2.443247381518564958e-03,3.889822082876485433e-03,-6.167795561387938094e-03,1.603850375056499022e-04,-3.759536787823148252e-03,-5.616228212418992571e-04,-6.425586983716950404e-03,-5.077778528716984496e-04,4.488406876432052273e-03,4.278151488921508316e-03,-4.357952442340203535e-03,-4.909292339236077593e-03,1.354179642067882428e-03,-4.882241675616651871e-03,3.611376141387019623e-03,
-1.003071147028167302e-04,-4.116326967176141272e-03,-5.997511364487274638e-04,-1.983332737669708714e-04,-8.367529058004355945e-04,-5.148501948356253542e-03,8.097524097585321809e-04,1.569243803410608120e-03,1.218450997119581783e-02,3.260338244797645908e-03,-4.745530846568861427e-03,3.067025207577692147e-03,-4.020513211348905255e-03,2.383215980077594762e-03,3.636516111800297112e-03,5.976063234252406743e-03,2.645057466917428469e-03,6.171429605416832963e-03,4.030509540197082206e-03,-3.590559805323803415e-03,6.772334742561442970e-04,3.384817062222499648e-04,5.575552565118227499e-03,-9.774628614793595422e-04,-1.762025049054459025e-03,1.581369395940973588e-03,2.646865850744322193e-03,-5.900949530651226904e-03,2.646374355838043008e-03,-1.485958855668142791e-03,3.395228956491116046e-03,-1.574509764637774733e-03,-1.087489695280846212e-03,-6.766633220538289216e-03,1.572449568187522708e-03,1.557023052500946126e-03,2.841268892777637934e-03,1.295175359472168150e-02,-1.972014334918988608e-03,-5.144642815367544801e-03,5.908129189175739121e-03,-6.295980323417176353e-03,-1.882406846066576054e-03,-5.836733888273157702e-04,-4.068017105776406024e-03,4.681410063364341984e-03,2.674225003026122448e-03,-1.777062720897896542e-03,3.009320877705466690e-03,-1.296979194528004786e-03,1.462454344649575556e-03,4.359802807929784784e-04,-7.343822623906283258e-03,-3.465545589820053039e-03,-4.843589929524398999e-03,9.936956927012853893e-04,-2.295591193683914431e-03,-7.210589400708205220e-03,-1.882807970164883230e-04,-3.299191839038448024e-03,-3.155877973573188288e-03,-3.230655355471525724e-03,3.711633193169724346e-03,2.060544535564891571e-03,4.144652302198530494e-03,2.738343539807979853e-03,-4.018231885453742158e-03,-1.628364859033761642e-03,1.308350093435883821e-03,-2.444251667756084517e-03,3.555926152170376612e-03,-7.161475886028894830e-03,2.088979321091222110e-03,2.055447720016645007e-03,2.370165246718790395e-03,1.920965750256538492e-03,-2.813241152936024701e-03,3.170884496082167214e-03,-2.209970724
204387334e-03,3.232748312546345273e-03,7.977398133626506689e-04,-1.464136622802916440e-03,3.049734192476167617e-04,4.206753260908350188e-03,-4.646577949816595968e-04,5.323686546173050579e-03,-6.566456868695300618e-03,4.198310135654527778e-03,4.801601629437195033e-03,-4.886674832405915603e-04,3.358138277569465337e-03,2.010669753088312715e-03,2.009879276211852441e-03,4.737380546893341939e-04,-1.626017296667654202e-03,-4.079286275227935604e-03,8.120817229463946768e-04,-4.540083238981619529e-03,-6.529593091548052111e-04,1.011327353497905659e-03,1.885662663392196617e-03,1.127763495411718102e-02,1.342962828607702263e-03,-8.017287855256419182e-03,1.539845265402115838e-03,7.269222136952979653e-04,1.557453291338762669e-03,-1.296268048657572922e-03,-3.066420607258646131e-05,6.274713552944872520e-03,-6.943289330631629988e-04,-4.428517569021453090e-03,4.471167539007219784e-03,-7.604583821036302142e-03,8.788720358140769077e-04,2.514633799212381664e-03,2.380368012367925154e-03,3.409052476468150614e-03,2.073474862294808971e-03,-5.032137011234534898e-03,1.535189688723916177e-03,-3.513088318214838742e-03,9.331031532344911906e-04,2.854129518208823896e-03,5.947730263750751305e-04,-2.783302440030941891e-03,1.234659395754945238e-03,1.017527635716695272e-03,3.162902406757011307e-03,-1.334507408776766744e-03,-9.510994019967366347e-04,4.260009721846605424e-03,-3.210635858944519121e-03,-3.264526551524770531e-03,-2.446007312477111069e-03,3.046030051395212223e-03,4.299232095188510773e-03,-4.668835494575376832e-03,-1.567716501566915202e-03,-5.410013154585566761e-04,-3.946411438568496119e-03,-2.887526313323839598e-03,-5.493168886018728533e-03,-1.154838203187133179e-03,-2.194854580592302956e-03,-5.001363336310201899e-03,-5.265949645967855398e-03,-8.769755406754149198e-04,-6.810757738343563988e-03,-9.219430520789603095e-03,-7.446254886681104648e-03,-2.387132420010726862e-03,-8.007340569408583813e-04,3.044567399123697393e-03,-3.194256637482372797e-03,-3.531301660241670942e-03,1.552878129810169065e
-03,-7.972539715337276234e-03,-9.256780408476276718e-04,1.159755540115268426e-03,4.156231075143780769e-03,-1.337181810857181966e-03,2.426130877016236843e-03,1.142579860641108197e-02,3.563622150635041130e-03,-1.037042443429907115e-03,-1.881214105142467626e-03,-4.028500752795211547e-03,-3.480305884255548501e-04,-2.580127425518893326e-03,1.142357298099872789e-03,5.548914907104709332e-03,-8.865942750669232207e-04,2.686923827638904375e-03,1.087463653314211398e-04,-1.017949647389168599e-02,4.114981932917190742e-03,3.644014903667998611e-03,1.077352256158874231e-03,7.196761956956329153e-04,-6.765243172598378292e-04,-2.389862706110729325e-03,4.811825008335057947e-03,4.011236701770482492e-03,-1.983038659744969061e-03,4.964906103037434321e-03,-1.314460159829346428e-05,-7.260528753144077953e-03,-1.118887371285759566e-03,-2.030651356136065153e-03,5.229424347027869860e-03,8.581783467714791756e-04,-2.285337384065555966e-03,-4.291008780473164197e-03,-5.765593320866401437e-03,-4.593436578244689114e-04,5.955229870589043599e-04,1.379549715939392936e-03,4.311581644784506263e-03,7.534621790064061075e-03,3.115831658978458936e-03,-6.051229841385605051e-04,1.176658236103291129e-04,-1.400287446177951199e-03,-3.701452887682181708e-03,1.794533447584954842e-03,-1.677459540794707911e-03,-4.005304453281397122e-03,2.452398647750269386e-03,-6.999971127960778923e-03,-9.698243288170167904e-04,5.960301338506949565e-04,-1.721732606844128674e-03,8.482730840411586634e-04,-2.007330878481313179e-03,1.095108297392178610e-03,-1.231889873959121336e-03,3.956657004268331382e-04,4.543961838866820632e-03,-7.165230543791548112e-03,6.319707667146787396e-03,-4.496621692259972344e-04,2.526346241828932374e-03,2.173744310851195502e-03,-2.545617215802563772e-03,1.510617969282336158e-03,1.616134374610241238e-03,-6.047202148297876369e-04,-8.079447137923965018e-04,-3.754291960267035037e-03,1.383358922135391163e-03,1.027770447307654574e-03,2.321555576390791319e-03,-1.484168739904939737e-04,-2.048796765358495628e-03,6.09883
8553491736257e-04,-2.596970764473410884e-03,5.724799175273972453e-03,3.239648668288916313e-03,-7.040655100214237780e-03,1.050069581664714920e-03,-1.502950608292353846e-03,-9.547617930805256652e-04,4.314510057047127572e-04,-1.868150343230029096e-03,-3.995973901610490146e-03,-4.996017447937150212e-03,-2.807730312516961996e-03,1.871327658045369166e-03,-2.044898184804357436e-03,-5.366207249754920312e-03,4.377423784145217586e-03,1.287212328671424530e-03,-3.269716759785002910e-03,-1.432982496297582420e-03,-2.236073522098374830e-03,2.701948450389750692e-03,-1.434048775530241421e-03,-2.047464316734545851e-03,1.605038210633806658e-03,2.531690951784758281e-03,6.631831903082437965e-04,-3.563646704062522434e-03,7.405367582396420525e-03,1.057935470122848718e-03,-5.060579627089121894e-03,-6.412993464851403976e-04,-4.871758905672072398e-03,-5.513630137101934239e-03,4.238170402817937298e-03,-1.096264254005418050e-03,1.901058018129636856e-03,-1.238537138306512761e-06,1.930791920275286592e-03,3.487352840206861104e-03,-4.991199854017628083e-04,-2.266886459021990631e-03,-6.528072204606947919e-03,-6.111312702298672163e-03,6.540847968021216811e-03,5.187295330601850336e-03,-2.002361162788652161e-03,9.189855302357739189e-05,4.002583750269361924e-03,-2.331131106388713986e-03,-1.616067803815970060e-03,-2.031154264791040949e-03,-3.270668714146117255e-03,5.605250320944961569e-03,-7.160303748149623657e-04,-2.361420570887925834e-03,1.017876426548137113e-03,-2.650055951567346739e-04,6.653326841949171178e-03,-9.006909549232754338e-04,6.174146948747873814e-03,-8.054297600036687196e-03,-1.796032241703867782e-03,2.886350766756236828e-03,-3.967777414232185468e-03,6.774934932390671433e-03,1.770901068839377196e-03,-6.231345188704826184e-04,-9.564201881207235693e-04,2.585215246288141439e-03,-8.947883746911656014e-03,6.412735693205581561e-05,-3.245390292533721112e-03,-7.609608399784983240e-03,2.730881817065658294e-03,2.242341407706735143e-04,-4.941056959448536338e-03,-6.929455264660078738e-03,-5.481507291
456094548e-03,5.039603680174473306e-03,-3.687652881831340618e-03,3.005101950708309860e-03,2.672926614835181413e-03,-1.343359956378081680e-03,-4.939582245007387079e-04,4.016972309272254888e-04,-3.428099701709607511e-03,-4.678641038029422369e-03,-3.370186144524319870e-03,8.218311787608918034e-04,4.790566063618556178e-03,5.935137293145825374e-03,-4.330223720609795482e-04,-1.045726359887029055e-04,4.276468466254991496e-03,3.521037669843529033e-03,-5.222270620271440324e-03,-4.934318747290623516e-03,4.986866586842406655e-03,5.046585524088140207e-03,-3.505065776671744196e-03,-2.044524938741874094e-04,2.276006060528980388e-03,-3.588455157031594268e-03,-3.556968212002972881e-04,-3.993901142075094966e-03,8.123377262530694001e-04,9.554126126860520594e-04,-5.853137236535741091e-03,-5.088651509732769473e-03,-3.476923332402504646e-03,2.594524146278635434e-03,4.502597694826369510e-04,-3.025469708010116929e-03,-2.629874010311601760e-03,-5.051693066630834859e-03,-2.302224236508752971e-03,1.059767290864913839e-03,-5.006690293204859102e-03,4.662824866813388737e-03,1.605502213115322235e-03,3.403188359813787839e-03,2.233542876009323168e-04,-3.824906738621161074e-03,5.064974879025604775e-03,-2.830843647162452955e-03,-8.412165962377787039e-05,-3.749338402712621993e-03,9.702291398101591663e-03,-8.621525126744890299e-03,4.079827043371212283e-03,-7.433948026440317068e-03,-3.721307152996478312e-04,-2.485877912350225802e-04,6.812063714478897632e-04,3.907049075009677401e-03,9.372507754032526062e-03,-8.999413420430886447e-04,7.910691169350325506e-04,4.275897039382986313e-03,-1.468592857881299537e-03,-3.061262766901179204e-03,-6.689897213540837127e-03,-6.729987200311481586e-03,2.536422562242492129e-03,3.740381392675814695e-03,-4.803091286421680428e-03,1.626283589178245106e-03,-1.794947450084344222e-03,5.939145481389971996e-03,-4.266372819517444512e-04,-6.037819104306969137e-03,3.348868539307542271e-03,-5.447479935926996357e-03,-7.937583570217616677e-05,1.157511476076373809e-03,-6.5841421641693717
49e-04,-2.897960967183493657e-03,-6.863683334172770815e-03,6.074161262977982286e-04,7.028037168190004241e-04,4.684922323806692615e-03,-1.768568371211536212e-03,3.328770041522815994e-03,-4.920653820004120688e-03,-1.029983211487006667e-03,-7.156260592449951085e-04,1.222500339341608695e-03,-2.924214762452508314e-03,2.448469998229624929e-04,-6.125968696906724251e-04,-4.450696140583576543e-03,-6.394903338946505804e-03,8.495781985577585682e-03,-6.179001009166634837e-04,4.363046649936195773e-03,2.772115194104337980e-03,4.976210888783303039e-03,-3.777740356697054623e-03,-1.602766665721805199e-03,-1.290464794960491379e-03,1.631860861996411701e-03,-5.209581002061682009e-03,-3.442865600275215633e-03,-2.539015241196546178e-03,1.960260864564050687e-03,-4.942574332666216548e-03,3.467653241985839581e-03,-9.137510131474440439e-03,2.829736715200712252e-03,9.700106938124623129e-03,6.222301995268797025e-05,3.178892126066510965e-03,-3.151034679341945151e-03,1.387767154950812635e-03,7.034150917184323146e-04,-2.300370942650275044e-03,-3.468632690312592384e-03,1.007864806956379998e-03,-2.124063641288511696e-03,3.396247249888766171e-04,5.648649990999559702e-03,-5.345876289540496019e-03,-1.340740454532797618e-04,5.622628907586368213e-03,-2.795566061664661204e-04,4.181299001029972306e-04,-9.006106465972395485e-04,3.999462584572062267e-03,3.461275259540547676e-03,-6.680957893197607411e-03,5.476291306734493809e-04,-2.002948355493841071e-04,2.447646974774549643e-03,9.914973340390497141e-04,3.845237685245735202e-03,7.808084344882364278e-04,4.764157322419552629e-03,1.285591738079346329e-03,3.561037544516548574e-03,-3.359864809656486553e-03,2.271541689162566609e-03,-5.175805183953040636e-03,-2.212440606339399513e-04,-2.787891918085265604e-03,1.416675754225617126e-03,4.322151331249723602e-03,1.974373922628598708e-03,-3.932741059151222129e-03,4.375376193486411186e-03,1.226619490396479040e-03,3.414430779149061658e-03,1.462293550285664618e-03,1.787019736212461154e-03,-4.077300042032760800e-03,1.795125
235207359945e-03,-4.628898813250517490e-03,2.121579549372614414e-03,-2.028765470242215475e-03,1.118805043711563509e-03,4.310171666352410544e-03,-2.923138157835027638e-03,-1.980305737983169279e-03,-3.246038984948019547e-03,-7.582998749114291948e-03,2.984441895680060505e-03,-2.571814783070192414e-03,-5.367480292037356965e-03,4.154993283425494933e-04,3.440106379355834573e-03,-6.368666049142703185e-03,-2.889243292066595841e-03,-4.552380881961968703e-03,5.906235366111033981e-03,-3.931634196465823468e-03,-9.016698517905008138e-06,4.802800843190712862e-03,-8.271348376023896680e-04,2.757983897728925144e-04,-1.053722538927860013e-03,-2.043886839061162530e-03,-3.003458147624496043e-03,-1.910077355016416480e-03,4.896107777631832428e-03,-4.051911343913978947e-03,-4.894936579678134392e-03,-2.957281094174368796e-03,1.114289084363345814e-03,-6.254233659137063174e-03,-2.478142424779338458e-03,-1.949392314596915454e-03,1.010038806271445350e-02,2.210569714792072222e-03,2.499005240119775949e-03,-1.694571030328656872e-03,8.355835350708357145e-03,-3.161598390579907231e-03,5.106429066493475659e-03,2.056300550351295407e-03,-3.608712549414550344e-03,-4.288578818014737899e-03,4.735436347504596265e-03,-1.134795219833583269e-03,-1.244300955500438803e-03,3.978758313868503010e-03,-2.604202415873648792e-03,-7.214013078170075470e-03,-5.281573015081881400e-03,-2.874984402228129396e-05,1.862833770778490232e-03,-1.691171011624666718e-03,5.274857370515546179e-03,-4.221061019977168267e-03,2.768400211397524773e-03,4.457400349303693859e-03,4.647455753258921481e-04,-1.745684633216789284e-04,3.908342000994421105e-03,3.615151648533918884e-03,1.319018687887574520e-03,1.314953668887453020e-02,-4.397296404072330474e-03,-1.673406632952708370e-03,1.223290018603720268e-03,1.177300314500259867e-03,-4.204300117896560008e-05,4.284009208978337320e-03,-2.959348435796815618e-04,3.907750222444522636e-03,-7.420021964832524536e-05,-5.057494441107702500e-03,-2.556160063601692518e-03,2.206878491979379336e-03,1.729458996577
468025e-03,2.713897095620614562e-03,-2.342129928036099388e-04,-2.861099176481242574e-03,-2.121554320951775027e-03,-5.259464771001594319e-03,3.183169735226940109e-03,-6.306886501707894242e-03,7.708763803936110812e-03,5.400460782050891216e-03,-2.410202957862492192e-03,-6.527952574787346252e-03,5.278700311260221692e-04,-4.118689207154643296e-03,1.185564764476278133e-03,8.519197515293802040e-04,8.498208577415280237e-04,5.885909989918986125e-03,1.013206709458168257e-02,9.760367018331977257e-03,4.728745628097850984e-03,4.032666339408711928e-03,8.930814313871654822e-03,-9.770277522848031318e-03,9.501115821856645971e-04,-1.837823302952522320e-04,-2.547647287475516046e-04,-3.397116259324008385e-03,-4.802710007783247102e-03,-1.946435745854449431e-03,1.247891778041675462e-03,-5.252463421268743597e-03,7.486794027855189250e-03,3.393081198441694982e-03,1.047701210404152709e-03,-2.412694181860836236e-03,3.846342799512262808e-03,5.683615253840045706e-03,-1.967205689313473137e-03,-6.519742305465518577e-03,-5.282071692048613239e-04,-1.994617015006392574e-03,7.526397762900759701e-03,-2.990573828458643543e-03,-9.975291698262375283e-04,-2.228928386441927749e-03,-2.868656963875749425e-03,1.728173521741053955e-03,4.088471273824374405e-03,-6.545994602868695578e-04,1.638074239495886444e-03,-8.189296034742422903e-03,-2.037857016885272383e-03,3.668595365550832849e-03,-1.943665742930687258e-03,-2.133667790256513555e-03,-2.033554545502615652e-03,8.504163018201278072e-04,-5.857283633386198111e-03,-7.116207231018030922e-04 
+-4.244754636845790401e-03,1.814370183751027460e-03,-5.637415291172300467e-03,2.518962139139174598e-03,-5.660020744147198487e-04,-2.613782439886408807e-03,-5.659303821339869457e-03,-9.573795527756090093e-03,-3.578949863604211686e-03,2.954395766287181129e-03,1.426565442928849278e-03,2.883168470728276426e-03,-8.239493439285545950e-03,-3.394196275823122737e-04,3.523039459920337436e-03,-2.577783174330478596e-03,-1.774595730292542924e-04,2.440545095816881846e-05,-1.002248930589685026e-03,-3.329306441566379140e-03,3.019801472329404989e-04,2.628867257420627792e-03,1.164408095223703945e-03,4.037233763349597389e-03,-1.734574732593697089e-03,1.402612306885885106e-03,-5.585415137393799287e-03,-1.296067322134838128e-02,7.004793564211259788e-03,9.182367745657792817e-05,-5.595163298012266126e-04,3.186188144742806237e-03,-1.655025619499868143e-03,1.764490622973579692e-03,-6.412847474522495247e-03,-1.707809184254360270e-03,7.461290802065646951e-04,5.704781402064580328e-03,-2.933674876976547202e-04,9.135804978922847921e-04,3.042668760708262983e-03,-9.051957246048664748e-04,2.498631354434860331e-04,-9.063064335871322957e-04,-1.859670344625082060e-03,-6.480978236169744654e-04,-2.621452733134941020e-03,-5.484171645716907385e-03,-1.454725375745135338e-03,-5.670713805416126123e-03,5.484536391851066277e-03,-2.933535071661688438e-03,6.947176300378498769e-04,-7.222713937846361731e-03,-2.417930692265838830e-03,1.243980268679455636e-03,7.468524397231893902e-03,-7.822192745322614120e-04,6.581162488155814158e-03,-6.856081639279519059e-03,8.937090752647243105e-03,-6.603428442725910348e-04,-4.356595992315570424e-03,9.218895504541735168e-04,1.630420044919521774e-06,-1.606356526146892954e-04,-2.198069381282586861e-04,3.609962043008964859e-03,2.280428267043484256e-03,-8.491431774775323564e-04,-2.366837505508930829e-03,2.202370312203537697e-03,1.322806122772153710e-03,6.383674031235828800e-03,-9.159410669778782557e-03,-4.510177912002862749e-03,-3.169704719233267312e-03,1.675544187439757891e-03,7.4436
59536768880251e-03,-7.120128068012554663e-03,-5.393398734325999082e-03,-3.124329008574995657e-03,-5.824164916986821275e-03,3.710494984376385471e-03,-7.056853489262398475e-04,2.271424213634538685e-04,4.608012676808020090e-03,8.446591317648057635e-04,3.674328465982742986e-03,-4.529171887773542418e-03,6.609236212523467678e-04,1.714023660559661939e-03,2.274690814583878000e-03,3.184499890937975879e-04,-7.197273297424928995e-03,5.077690463494871978e-03,5.080444297239748372e-03,-1.981743116585405612e-03,-1.633832454757576423e-03,4.482697482143292202e-03,-1.743092123430035195e-03,8.507247868685596445e-04,2.339219424991358098e-03,-3.788308033536441199e-03,5.443341437888615970e-03,7.083592345723477507e-04,-1.986763757403330489e-03,7.716319018699028132e-04,-5.452016257003513807e-04,3.143546541277836455e-03,2.830326880758131120e-03,2.639613924766654168e-03,1.904293829460491872e-03,-7.096024499427331343e-03,-3.446400760998640476e-03,-2.530793117633555606e-03,3.488787229561868784e-03,-4.420447328708699240e-03,-2.469754578207764440e-03,8.815151315527689282e-04,-3.401939939923218775e-03,-3.074275986605284135e-04,6.661036123508478138e-03,2.238862606080540673e-03,1.247164873275220407e-03,2.749638655447579454e-03,6.865004589603693197e-04,5.845576591259202086e-03,-4.584831769723329495e-03,1.318724500203500866e-03,-1.653569478971522082e-03,1.334677055102736001e-03,6.719523861042693558e-03,-3.850379452937370897e-03,-8.976266228228073699e-03,-5.173165777601141865e-03,3.521876470977838066e-03,6.558201194301676792e-04,1.352547710143261203e-03,3.901638119150463283e-04,-2.055676532146103429e-03,1.010270785606737394e-03,4.937022905465415215e-04,2.459448887651183822e-03,-6.518644630793115312e-03,9.499978912052935414e-04,3.221819374411090571e-03,-6.911505083965231033e-06,4.861994634488394675e-03,2.953849842056295496e-03,1.278033887028515343e-03,5.113002795948501074e-04,1.518726448412385972e-03,-9.843548406939362991e-03,-6.240497327307756650e-03,-7.045329888696614043e-03,-3.036839890704331416e-03
,1.649090890460934130e-03,-4.061573109438333141e-03,-2.673305845351033917e-03,-1.632141964931617395e-03,-1.884073718559090482e-03,1.185906118598448362e-03,4.560572344758125735e-04,2.421038400121373023e-03,1.587430776939157675e-03,5.112727314297422360e-03,3.588154167094196201e-03,-1.378513574980961979e-03,-1.318680722410606621e-03,-2.335360190566052822e-03,-6.383573646125173355e-03,-3.754458735504152822e-03,-4.617768328806099928e-03,6.329721536355166836e-03,2.108882529161749432e-03,-1.009214340686519391e-02,-6.464968740404289718e-03,-3.526912557225053962e-03,-1.816757746949501282e-04,-2.727573807631912366e-03,-3.405491391554118195e-03,1.012581096289242910e-04,-3.062874250446138806e-03,6.062210722061994565e-03,2.379744139303612959e-03,3.959266740663912494e-03,1.097627816213342898e-03,7.368180257140283240e-04,1.589386793076647464e-03,-8.802963730224641842e-04,1.337312275592788993e-04,4.961837868888285298e-03,-3.879860160203241898e-03,-2.314533832867299995e-04,2.625417738687549585e-05,-7.948607916959950573e-03,-1.215096266565917054e-03,-4.654987733530135258e-03,4.775604586463002427e-03,2.516303329909611898e-03,-4.361821356206667562e-03,4.114499201523773124e-03,-1.397694336433073998e-03,1.617046820535890557e-03,1.127440970450127762e-02,2.352869029694571942e-03,3.740484101635760221e-03,1.012795648415173479e-03,-7.730341387797932387e-04,3.044938846073915427e-03,-3.658212892283898340e-03,-3.097489694885488271e-03,-2.131509768874217947e-03,-1.686446266888107833e-03,-2.145524520382364152e-03,-6.799104851127491661e-03,1.203655288114038359e-03,-5.291585444731064167e-03,3.891385647279208340e-03,-3.474329693573786904e-03,4.179355437968338728e-03,7.740625556430412923e-04,3.635465726926599694e-03,1.487302690316999769e-03,7.761095544418155712e-04,2.487597492928518561e-03,1.048765447126769534e-03,8.785246961261627616e-03,1.110720526721824947e-03,-7.113930575641920712e-04,2.891090963505302258e-03,-2.640299234993034178e-03,2.106772451610798482e-03,2.050001008859205984e-03,3.97603141400
5258557e-03,4.237787153330252812e-03,4.506265782442071821e-03,-7.479116135541741955e-03,-8.811926906302439003e-04,-8.519408389867469053e-03,9.683427057063149655e-04,-4.010461581922565577e-03,-1.807949888443072251e-03,-1.819828466624868850e-04,-4.070015065230943654e-03,-1.905211090943825358e-04,4.583822939633539796e-03,-2.880819877063860515e-03,-5.893233082910015984e-03,3.509708586024264977e-03,-6.478707319200856718e-04,1.694443984359785644e-03,-9.794827975879161208e-04,-3.587367930779485764e-03,5.650926494068627716e-03,-1.895051192654042161e-04,-5.718040389700558036e-03,-1.128317716362130414e-03,-6.488483918294076815e-03,-3.911783286956895369e-04,4.269281430320344232e-03,-4.297279507213317556e-03,2.264465593879519574e-03,9.519840877215517139e-04,-5.127943676330778099e-03,3.938653599737086675e-03,-6.135959833536634538e-03,-5.534599939608492430e-04,9.657327060898923329e-03,1.415533086879016203e-04,4.595794709013847792e-04,-4.544986649135758158e-03,-3.147922747556240481e-03,3.812808237511650100e-03,7.491239368726154008e-03,-2.046394721514569869e-03,-2.508178545258286235e-03,-1.285881442596082183e-03,5.034103536916058330e-03,-1.409184854994009577e-03,1.595410245256499368e-03,1.677658280011285049e-03,-3.765008089907644941e-03,-7.968639609079418429e-03,-1.439663388472016588e-03,3.531942441798399668e-03,-5.204148067372491287e-04,-1.353620420424870088e-03,-3.790292538358627870e-03,3.581002681122320234e-03,6.883239922739890770e-03,2.017738734980654720e-03,6.697667063568535015e-03,5.202948319890203239e-05,-2.829557867701548105e-03,1.676511459957018106e-03,-3.269404901541118728e-03,-5.969198718164594845e-03,4.390320734093848608e-04,-1.478775826293524803e-03,-8.417292845491661755e-04,3.393433957450245776e-03,-1.433443096224105680e-03,-6.566460142380135118e-03,-1.138446091393417563e-03,2.272231769355184585e-03,1.687931602306033110e-03,-2.875573690741301588e-03,-4.492301337589760702e-04,-4.620370157176393157e-04,6.230629956251549847e-03,-4.442554258026813166e-04,-4.21254711238272
6902e-03,-4.826185451971305197e-03,-4.717418052079692253e-03,1.953683118182486262e-03,2.345342412783765328e-03,1.395292300003392804e-03,4.478624565689225232e-04,-6.242680770061177602e-03,1.286943219531827857e-03,-1.163734988640732995e-05,8.901877897624126601e-04,6.038064963404401282e-04,-2.220613447935033870e-03,1.380379437837550008e-03,9.957679768926541025e-03,3.724169004710967557e-03,5.137387889440613378e-03,-1.118095091432286486e-04,-2.976486449024609571e-03,-5.017784531016030586e-03,-2.588262623178743771e-04,-1.576684252291555340e-03,3.445279784773663898e-03,1.074603004555385946e-03,2.276709810720884981e-03,-2.745417054274102244e-04,3.848971773501536769e-03,-1.123003539973448789e-03,-2.181528371982937361e-03,-3.236495885567504567e-03,-3.063106195066683194e-03,3.450366787753824261e-04,3.684180743601786056e-03,1.787421339032253852e-03,4.680074089331832075e-03,-3.745489264934488693e-03,-2.971063272987845786e-03,-3.529187995064902907e-03,4.439581998322775745e-03,7.107965508115178348e-03,3.239111593179098102e-03,5.264306173710532784e-03,-2.681258045983615518e-03,-2.939089333983392789e-03,-4.567633380227510698e-03,1.331604732817569551e-03,-7.940235583949330862e-04,2.326454598339320733e-03,5.994344230184573200e-04,-8.516256561652798748e-03,-1.655829438541934692e-03,1.343639500300875369e-03,-1.038736026934850877e-02,-3.863772704855131853e-03,1.843615115084611816e-03,3.213625716879331623e-04,-7.343783891777343371e-03,-9.646300840505521749e-05,-4.967905255453750057e-03,1.292530022564431945e-03,-4.577115897822192983e-03,-8.494856236142513594e-04,-6.952065925699742015e-03,5.084244630056388922e-03,-1.112200099526224235e-02,8.551526839438562764e-04,1.119675692847430784e-03,2.851365966346561601e-03,7.981759664365898612e-04,-8.875853267488815123e-04,4.554522880180861440e-03,9.467291986811492458e-05,-4.522056296171516124e-03,-9.126483744422439828e-03,6.876297500609400469e-04,-5.220532851770016665e-04,-5.999780228466865782e-03,-2.115249239841289226e-03,-6.109568199723637051e-03,2
.259620867418073078e-03,-2.382633865174031695e-03,1.794614697915818602e-03,-9.465552708851451185e-04,-5.022526788050903860e-04,2.202913712131120919e-03,1.205933407284825185e-03,1.720913491600641401e-03,-8.557255776236715253e-03,2.034294400954423548e-04,-1.976471332347798577e-04,2.795941024814010793e-03,-9.381799558206004334e-04,-3.345453636489031587e-03,2.012646471654304739e-04,-2.886636588795888497e-03,3.308042384150211828e-03,3.250103011705598251e-04,-3.621730623808304579e-03,-2.597203062746101469e-03,-4.237271489853790263e-03,1.378697787968264114e-03,3.114626142429233451e-03,2.054267052917050761e-03,-5.928023788717243789e-04,-5.479773465041561098e-03,5.915098961252475765e-03,2.155011224777039075e-03,1.452680068180233841e-03,-7.548176598565566227e-03,-5.069762073501607805e-03,6.214203713788079599e-03,2.261377637266010106e-03,4.077624891722404429e-03,-2.748435324045660558e-03,1.396855707079140696e-03,-8.092960446264924274e-04,-4.658816756548530506e-04,3.858899245568958233e-03,1.434674218702493981e-03,-1.056777796567715151e-02,4.457333988410092443e-03,1.707433339431083238e-03,2.329379158041793973e-03,-5.898496784661787836e-04,-1.182881667478894117e-04,-2.846618165456277997e-03,3.736384549594297302e-03,-4.432830681668714114e-03,-2.652381594105811084e-03,-2.584628536725199714e-03,-2.534620633236120239e-03,-2.555361750893203002e-03,9.189675974820762103e-04,-2.921554142630039107e-03,9.102728070453155063e-04,-4.302491186633291588e-03,2.010113027740487428e-04,1.608532987116106493e-03,-5.818410310789319181e-03,4.512944505984278418e-03,8.193874117777536934e-04,-3.353308164697003588e-03,-6.854757612096551780e-03,-7.534910116045989356e-03,5.542017267180256862e-03,-6.041805690889336979e-03,-2.257709129688225975e-03,3.352269434892337589e-03,2.730366935802285618e-03,-8.327217521191576698e-04,7.174172519130612123e-03,3.558400575959208469e-03,6.790484142339736628e-03,-4.833266570717511569e-03,-7.782094015736644877e-04,3.035005712753666434e-04,-2.260517316942962842e-03,3.0635022943
15079318e-03,5.180738174034379842e-04,3.383547774791845035e-03,6.961977100131254035e-04,4.583949113827725380e-03,-1.229492994138885013e-02,-7.681285609639719719e-04,-2.705809011094017515e-04,-2.920253956454477075e-03,-3.983551039942427943e-03,1.653695543666799462e-03,6.653955930877110229e-03,-8.604122178180722419e-04,3.240830439649074624e-03,-1.083937178258845713e-03,8.028777827690361578e-05,-2.204653076023843899e-03,-2.465593830775833408e-03,2.273616682087680613e-03,1.595514679772958543e-03,4.902422506485631243e-03,-5.516665529967833964e-03,-9.531323872300165659e-03,3.552290394417774964e-03,-3.149992368294374061e-03,8.950290155275476139e-03,6.100361465556037623e-03,-2.172247494634956238e-03,-2.920098402491195632e-03,6.214875298513470800e-03,7.935743754114795226e-03,-1.066097415271968947e-02,1.854134243163844866e-03,-2.337129947293329138e-03,4.283389053842032689e-03,5.432014648986885699e-03,1.670355018186051510e-03,-5.005802834182692627e-06,-3.704812185075554744e-03,-2.822063094415773465e-03,2.315075861778822987e-03,-8.917503395308098990e-03,3.660760414252223118e-03,-1.619971651641653179e-03,2.873094588646866576e-03,1.176375667929291857e-03,-1.970504507455598034e-03,-3.125449444021876863e-04,-5.175277493500501577e-03,-4.255263986610496006e-03,-8.074420324545278360e-03,-1.965731364776901469e-03,-6.099571444572723065e-03,-4.591199249361141779e-03,-4.282406948999470907e-03,2.334503066772129851e-03,2.966292432885021054e-04,-2.085493619491684435e-06,6.170344083017998560e-06,5.273274539082607217e-03,-1.485485816182248670e-03,3.219971260843404784e-04,3.155005370344929425e-03,-3.442209117606566740e-04,2.301490437392107128e-03,8.513720271686788313e-03,2.219147876922917834e-03,5.828831713340228496e-03,-2.740346850895732621e-03,1.820075694915027206e-03,-4.943679204836663044e-05,-2.534993929684746863e-03,-7.742466314644380577e-04,1.548276690589025324e-03,3.231211428501678910e-03,-8.100779312450419930e-04,-4.529075702241261263e-03,-4.237403109372689662e-04,2.059755221806847982e-
03,4.855009744439397687e-03,-1.378099488753637206e-03,-3.888235977054031733e-03,-9.577772812397095797e-03,4.239338101593425993e-03,-2.584659655582048719e-03,-5.841303266692569017e-03,-8.171782319091488306e-03,2.469505770487249834e-03,-9.408581671198977536e-03,-3.131598131802845741e-03,6.155497876249545371e-03,-2.816855555403076448e-03,-1.302035978222782669e-04,-4.782680952854649115e-04,-2.531094958350105662e-03,4.238234011865144729e-03,-2.172337012798125286e-03,1.530604196094832075e-03,-4.162351259905035000e-04,-3.709437185525612125e-03,-9.599994919037813882e-04,-1.889743232734609080e-03,-1.523264397855183562e-03,3.095767693144639045e-03,-1.998216483316376645e-03,3.742533252091992065e-04,-4.468044333750353751e-03,1.135038899728862266e-02,-2.019969073103434424e-04,4.191996222510640038e-03,5.424917144377693415e-03,-7.306189442867149463e-04,7.689833875259860117e-04,-2.752465016221874051e-03,3.656166714903777860e-03,5.617653996944458640e-03,1.271630996310602101e-03,5.255079884481314752e-03,5.539302319981859657e-04,-2.534213470270364754e-03,-9.927436336733350825e-04,-5.976102738734999885e-04,-4.422035786667413010e-03,-2.139994831158055673e-03,3.586360657655993708e-03,9.390687240233557432e-04,-1.703798600886888065e-03,-1.196016296693982800e-03,-5.573236251325863408e-03,3.825779660193256906e-03,-2.542895179310836022e-03,3.281753779316488179e-03,8.433707859765537766e-04,9.603205676270180284e-06,5.079490250399515232e-03,7.334245504587508365e-04,-2.232717948024904621e-03,2.148762516670864754e-03,-5.481816662867788036e-03,-3.633238238143748650e-03,-5.420401259987663381e-03,-2.216729082728597592e-03,1.844054658999105421e-03,-1.186906145910762635e-04,-4.206303974615425469e-03,-2.381577719084212053e-03,-9.037155161902071451e-03,-2.224703738890389162e-03,3.710671689950170802e-04,1.982118345957882295e-03,-2.077357799853931080e-03,4.325133528089252985e-03,-8.445235561883805264e-04,6.399308112472594859e-03,-5.120189467939433112e-04,3.340502995104872169e-03,-7.992475932422388185e-03,2
.859319900141338994e-03,-1.165356218598390895e-03,4.075122396390985606e-03,-5.866328860166468949e-03,-1.516532437012441331e-03,1.497725847549151402e-03,-4.272855588216637607e-03,-1.804214337454057701e-03,-2.851611262605188019e-03,4.356697989103446370e-03,-7.748395040588862501e-04,3.290900395171836908e-03,-1.153429243791052102e-03,-4.147766916458631853e-03,2.896738485041851999e-03,2.944497406202927544e-03,-6.930866881284700534e-03,-3.298440150470613105e-03,4.165135027550735496e-03,-2.689615863519014341e-03,-1.850246922231349499e-03,-6.092278883318539857e-04,7.474548780262640219e-03,-2.675539697226096084e-03,4.834366949855674681e-03,-4.413883049831269709e-03,2.856799617888627126e-03,-6.692442957765629662e-03,-3.084399820840166275e-04,2.635406572757549148e-03,-6.326434537155622637e-03,3.976199264780513193e-03,1.254353730285029649e-03,4.316755255744552516e-03,-4.485487290033450189e-03,-2.615669292562704251e-03,2.612451897284952945e-03,2.511453395347783887e-03,-6.549112343281764288e-03,3.116674140880384448e-03,-7.494964152112635865e-04,5.219399989254626566e-04,3.479058556650442471e-03,-3.128826442384471528e-04,-2.139102468625161610e-03,4.228534302690690702e-04,-9.121936991447920267e-04,-1.295468762892087317e-03,6.900820830457192814e-03,6.459689517866361134e-03,2.413007775922884103e-03,9.932161745433779054e-03,-4.124925248361910671e-03,-1.844645298970130996e-03,-1.641737745823712569e-03,-1.382339143345938425e-03,-6.012985767077035776e-03,6.871483124579946464e-03,-9.042425808219702738e-03,-5.625167357837656135e-04,-5.097703111048164024e-03,4.952767591691792197e-03,-3.700288068084494134e-03,2.898289473835833057e-03,-7.170719520874139054e-04,6.740189572555652575e-05,2.199548880928798179e-03,-3.747553209639650079e-03,-1.364834109392858586e-03,-1.532474455154667460e-03,2.854564902850819687e-04,1.575165283313826074e-03,4.952393136422782172e-03,-1.923923891834944345e-03,5.156618372822266080e-03,5.718067407223247599e-03,2.191691131855766275e-03,6.135477314733151291e-03,-2.8201922
89670789970e-03,6.825545927100707322e-04,-1.524498160394408854e-04,-1.155999575908847089e-03,-8.169372471878374314e-03,-2.187827328171589990e-04,6.521220586267094214e-03,-2.748454202563354546e-03,-2.625860730211004770e-03,1.444807471031894254e-03,-1.139519996573236339e-03,-5.328908628138122501e-03,-5.153123298114130509e-03,-2.839370127191163766e-03,-3.948024685810011709e-03,-7.183270287097805878e-04,7.424820866364994013e-03,-7.954552747485847675e-03,1.906762597097350076e-03,-3.751295511437238727e-07,6.872210927815568422e-03,-8.808759603885503008e-04,-3.816465102327710329e-03,4.466701201611343944e-03,9.729408762963444529e-04,3.740727231192068208e-03,-1.782008183277775191e-03,1.012529372899172988e-03,5.590968903926091943e-03,-2.980153082649404390e-03,2.597099847249394623e-03,-7.645665893504294264e-03,-6.126102265559454224e-04,-3.888985156882859536e-04,4.953555107714596902e-03,6.870051819150207149e-03,5.499985619050509408e-03,-1.122975244465720902e-03,8.161539865712927767e-04,-1.718736750429630551e-03,1.164149943976047676e-03,-6.348164280612233860e-03,5.645900613973796854e-03,-3.172922744041212217e-03,-3.894960025600783149e-04,1.332512876798002798e-03,-7.617208247761144378e-03,-1.766968334423036410e-03,3.579899507211116959e-03,-2.232029592858509313e-04,-5.555680081830434262e-03,5.203465895850656678e-03,2.897241182243144673e-03,7.107735050839671456e-03,-2.304048484829815727e-05,-9.382304564638587569e-04,8.304226309079619159e-03,9.314697724575045183e-04,-5.009822425355675742e-03,4.669020629797967538e-03,-2.651164329787659979e-03,-8.628327936046537028e-03,7.022292974973747860e-03,4.858507517776218358e-03,7.653751137026152014e-03,2.261466216680022619e-03,3.780185458355357284e-03,2.197544816152360164e-03,-2.339982363314907672e-03,-3.265026229645346875e-03,2.893202853632785543e-04,-6.710766374867360197e-03,-2.955217525281235609e-03,2.159114923571404422e-03,-1.934960448070760942e-03,-1.250246660892617755e-03,-2.333304716269516809e-03,2.149093563602303206e-03,-3.24667862385376
6626e-03,3.145183598060183588e-03,3.833944127293973259e-03,-1.717038196950675328e-03,-4.628809465473440396e-03,1.358501507053567993e-03,2.423820345600850162e-03,6.496782959701874818e-04,7.188378260008603665e-03,1.842254366368146958e-03,-1.265780320568194589e-04,2.812391824149712040e-03,3.779172060686786239e-03,2.942672503974840068e-03,5.706858847639317660e-03,-3.361372334989268153e-03,4.995860337090964054e-03,-1.809319286458130208e-03,6.563778632913677344e-03,2.280102338790634486e-04,-6.532399728514925183e-04,5.533737334082615195e-05,-4.893537521419993874e-04,-6.206425679104771338e-03,-3.082552985593557410e-03,-1.040889418253210241e-02,-2.596606779398892757e-04,-1.523417674124831044e-03,2.004684826446812094e-03,4.401932272349130880e-04,-6.809664891090057640e-03,-9.290146325660981683e-03,6.587974226464579641e-03,3.542184341938255359e-04,-7.926624766679777126e-03,4.357345989276133549e-03,-4.058628217658838758e-03,-7.591478233987535322e-04,-5.689401668964759635e-03,-5.691719886558894471e-03,7.159715543339186666e-03,-3.847576982376860102e-03,4.026059385523750311e-03,9.938894627161303560e-04,6.487443464914696874e-03,5.293384245937162932e-03,1.717383030602911758e-03,3.687835901630109100e-03,-2.151707657966321716e-03,4.210007480127325253e-04,3.626570934939678595e-03,2.755391616809081505e-03,2.008093077259121267e-03,-4.280734992999328178e-03,-2.327108406070232602e-03,5.348409545130079898e-04,-2.849258675936757433e-03,2.319045787534839472e-03,-4.876565790049573745e-03,1.526858018754909813e-03,7.088869878346933627e-03,-3.275820483812621510e-03,-1.483683413073295368e-03,-4.704297265692611654e-03,-6.076381715655821553e-03,-5.713529563897378621e-03,5.849117357204720695e-03,4.500240368140256964e-03,1.993763398720991946e-03,-1.779602058413933766e-03,-2.964998114944750272e-03,-1.537517205118511766e-03,5.278365285142162626e-03,7.344824903521426310e-04,-1.420949020924460859e-03,4.314484284004992641e-03,-2.671016617741790462e-03,3.710417453202264054e-03,-2.059868599393084934e-03,1.799
809751126999445e-03,-6.955739933943511566e-04,-5.058118980588110179e-03,-2.828203275083891050e-03,-1.833497875957318521e-03,1.580229501260031581e-03,4.318559468746876301e-04,-4.632691243974161599e-03,-4.918063977251676579e-03,7.978922624830681390e-03,2.807365792802196322e-03,-1.465819122763622985e-03,5.009677038861908190e-03,-1.560953997467050920e-03,-6.817932030239854529e-03,3.392855509209601744e-03,3.672832993982274100e-03,1.714015676763900719e-03,-4.119693655960729971e-03,6.661683555867444796e-04,-6.808098929793483366e-04,-1.159404517334696426e-02,-6.564927415567827244e-04,-2.392182098228518616e-05,2.801249158571122392e-03,3.937122773417793715e-03,-3.574602680120276144e-03,-6.263551880829628001e-03,8.310348428649474593e-04,-1.103329647950243055e-03,6.873840383040773347e-04,2.486536772026439975e-04,4.311596072798187622e-03,4.199447755999776051e-03,5.659456379898526691e-04,-8.595108984091211473e-04,-4.405073340182160020e-05,-4.110129896088019451e-03,1.763482277395306838e-03,5.704102983392167134e-03,7.896066512382496220e-04,2.808163141484907796e-03,-1.777407433686517524e-04,4.641288632544630748e-03,6.081219571004724093e-03,5.739878229124543031e-04,6.161734592156127175e-04,-5.405207370977530359e-03,-2.825617231067930300e-03,-4.110026355584336195e-04,-2.020776634571422488e-03,9.747688415299245093e-04,-3.741347161297979215e-03,3.901869262021254604e-03,4.577837913300749988e-04,2.892192116253322100e-03,5.188071992539493131e-04,5.499227567768563479e-04,5.762246461051639089e-03,4.806217307476295074e-03,-4.692679804448891996e-03,1.504495137773180459e-03,-7.421759034110748943e-03,-1.184661566542924427e-02,-9.766763197824634632e-03,-6.303668555832427946e-03,-2.190071676450840686e-03,-2.682119133177485702e-03,7.810372233759436608e-03,5.620739839851181686e-04,5.236901962710221623e-04,-1.393217990660232906e-02,-7.595506383072811668e-04,3.277216577513350623e-04,-2.722900031775676594e-03,3.781229425105929725e-03,-1.316407211157503932e-03,-2.817991928313649634e-03,2.195291024428970
388e-03,-2.207693270361796474e-03,-1.274571512320531001e-03,-9.330777322558126644e-05,-4.215291462706643316e-03,-4.768956542200548866e-04,2.692165692465541341e-03,-9.226680866096285093e-04,1.116395963897117349e-03,-1.616674939783183584e-03,5.870811778109667279e-04,5.606414427242883101e-03,-4.120451688162444365e-03,7.212639105637552381e-04,4.416054895528136875e-03,9.750371417007689528e-03,4.868180854476694859e-03,-1.413671671184706363e-03,-1.379948083720419739e-04,9.298006508006585302e-04,-7.106499182783087990e-05,-2.388298792793646774e-03,2.478534065214612909e-03,1.502576863239386915e-03,1.188773893589955893e-03,-1.415781426195430797e-03,6.498410665993984391e-03,-1.532927701289221473e-03,5.479220080568964618e-03,5.338966395195592291e-04,1.815096839260890748e-04,-1.907963591346218670e-03,8.906305764279446799e-03,-6.964762163644943726e-04,-1.982892765054320664e-03,-2.846855614605070094e-03,-5.289143588611394491e-03,-7.250630801151364963e-03,-8.277067621467210090e-03,5.744980654405829230e-03,-8.834199475025936166e-04,-6.356168670025273693e-03,8.740538039150293420e-04,-2.545950633176502931e-03,5.792515194702579613e-04,5.143679798986748467e-04,3.637242898665698880e-03,-3.962817080427952915e-03,4.675721855139764968e-04,-4.483586516733729932e-03,-1.945566514850028631e-03,-5.546157383154598237e-04,4.325314985367676585e-03,5.751359172100629479e-04,1.589941252852946740e-04,2.311964533053196321e-03,-6.191373121589886612e-03,4.380774000253994348e-03,-7.195743098005339192e-04,-1.254141693930187467e-03 
+-4.011190479730338335e-03,-4.248652605526495704e-04,5.602158023433947802e-03,-5.841452699828287784e-03,1.141937976703647634e-03,1.702938669412549133e-04,1.132855429771144340e-03,1.516251632389390780e-03,6.049638143041972775e-03,6.089523546137489730e-03,-9.330605714195147962e-03,-2.024150517270188602e-04,-7.418033974929328549e-04,1.417864571307929800e-03,-1.838318216469466836e-04,1.485399393053556377e-03,-7.369660068975906683e-03,-1.983311177084323088e-03,-6.727438459125915753e-03,-4.312767251233539084e-04,2.779834927393850207e-03,-2.500428377048239705e-03,1.261375993987135565e-03,-1.432662918821457848e-03,6.999734881434321876e-03,-5.182906056096945810e-03,1.474658844402427539e-03,1.680656724414484552e-03,1.267521443965554610e-03,-3.645575997325135095e-03,2.750413408205234968e-03,-7.557106660393349447e-03,2.160098203272438865e-03,-3.130355438471219432e-03,-9.187545143203314744e-04,-5.906966083195458158e-04,1.566876587841232321e-03,1.321585418800523245e-03,-6.896150116913773130e-03,-2.122926296107457804e-03,-3.244286320999500723e-03,1.836258883479035707e-03,5.383068302880245374e-04,1.459919716655463549e-03,3.199115255038940692e-03,4.917204178524592351e-03,5.828790693963342880e-03,-2.459650482203809725e-03,5.879025798518430594e-03,-2.827544374166316397e-04,1.629675930863566267e-03,5.067673907057916594e-03,3.328615245496643026e-03,1.232044957830977358e-03,-9.626270929844005318e-04,4.926064547223873817e-03,-3.617453530943783805e-03,-1.638263965815238378e-03,2.013545705612643667e-03,1.819473509101763404e-03,-3.101586855279137803e-03,3.288964962699392116e-03,8.058715996405615284e-03,-8.382424200092906952e-03,2.239940096034008149e-05,-5.702822739156770048e-04,-1.703355190632996731e-03,-6.865305201525303972e-04,-7.638052155461795556e-03,3.231537643431854710e-03,1.653643108948566094e-03,-7.048777428316656442e-03,1.009173901826121086e-03,2.243167523207967361e-04,5.114826474570908013e-03,4.566291167927604214e-03,7.405449183036358135e-04,3.901620190143342652e-03,-1.309385405975
861151e-03,6.733710360650030195e-03,1.359721628658257546e-03,5.531212446178360512e-03,2.386200801929502183e-03,5.263564257180395126e-03,1.988329401069360066e-03,-4.727926551587820908e-03,-5.423037166680437157e-03,6.100394715333113576e-04,8.348066840577414438e-04,2.017952576486557216e-03,-5.060030677812397439e-03,-1.254091169195371185e-03,-5.439898588139919014e-03,-1.600026998501424391e-03,2.501511995844280835e-03,1.656909187060960284e-03,1.726232137024159667e-03,-2.806407437205629685e-04,5.118117103457379101e-03,-5.456289302423095481e-04,-2.090364158990374189e-03,1.186345688698104512e-03,5.292553321128326375e-04,-1.481845408468408266e-03,-2.496555901040298057e-03,-2.410055949686453532e-03,4.403361466140480115e-03,-1.965552971314441389e-03,4.447106826468433770e-03,4.717120286941565988e-03,-2.968846511384547800e-03,2.350801917080679645e-03,3.939718658934943107e-03,9.226618419157480225e-03,1.419322276107737573e-04,1.198458292169052091e-03,2.274667434771528423e-03,3.909397025108742561e-03,1.836111560570106039e-03,-7.265649224061112438e-03,-6.753879783286333323e-04,1.802459105134822373e-05,1.029719102039919169e-03,4.339230694690609201e-03,-3.577964224637291617e-04,-2.184565711661893769e-03,-4.716128145871975944e-03,-9.638100894826043263e-03,2.082556556625658490e-03,-5.450026900100175126e-03,1.582009825321585485e-03,-5.659022231450516402e-03,-1.071562044639295941e-02,2.545001538803219783e-03,-2.665390507551236837e-03,4.668438104454185333e-03,-1.307344388840842985e-03,2.743171477387548740e-03,-3.029128332317037459e-03,-7.329363987785762725e-05,7.672793994912610729e-06,1.530030700050040654e-03,-6.720242189412464415e-04,-8.459641581524398679e-05,-1.182367768107723616e-03,2.796120111479737363e-03,-1.081169134833307767e-02,3.104186656094210742e-03,-3.752766288261704265e-03,3.007631654090637315e-03,-4.232368823553746490e-04,2.982455760255871453e-03,1.330017405675314700e-03,4.821685858094456635e-03,2.425341871363497978e-03,1.555351454258059726e-03,-2.642537654617571879e-03,-8.78
2871972262718289e-04,-2.094595996622683563e-03,2.281788442842345131e-03,2.493699718007813190e-03,8.139485279653494929e-03,3.445438891898577015e-04,9.152725094072837854e-04,2.739661741194065006e-04,2.043179113218400369e-03,1.003030921741127242e-03,8.478408553336170715e-04,-1.947831940549732410e-03,5.496445271069702251e-03,4.960965184009466858e-03,4.216957201347236103e-03,1.365926128746594928e-04,3.160168015606743853e-03,-6.156788160854963741e-03,8.169237030631957674e-04,2.724182746958846569e-04,8.081631862086935403e-03,8.891068617342370167e-03,-1.450360398980247425e-03,4.137582526186429389e-03,4.866442215509968715e-03,-5.278516044032934441e-04,3.705256599807932576e-03,-1.141342296962768229e-03,8.584434890472458085e-03,2.506969358903150748e-04,5.906403699255344773e-03,2.384336456481616750e-03,-4.042784260111253504e-03,-1.814162777304065537e-03,1.506760214944458933e-03,-4.977996874177227291e-03,1.942705432226992778e-03,-1.277436801322032450e-03,-5.474651951074932760e-05,-2.031594713027741299e-03,6.147179702260376598e-03,9.242969802292336267e-04,3.283080221725259967e-03,1.528776487626948939e-03,-2.154543719229401379e-03,-9.174431506264993721e-03,-5.578140448724790493e-03,1.396902129495656165e-03,-2.319237599890884073e-04,-2.043079213934606737e-03,1.277624731754988191e-03,-3.263623953265190093e-03,-5.305239099116044153e-03,-1.302010149355159029e-03,6.240149974644371676e-03,8.693796359502342988e-04,-2.365230091087532766e-03,-1.463269216058638760e-03,2.805825374588336116e-03,-4.931064510560688972e-03,7.132446414818666798e-04,-1.649487130775445206e-03,1.952409233744623702e-03,9.920439359357130071e-04,-2.625209359167373236e-03,-3.123115595613532112e-03,4.854657926550910718e-03,3.347935042298321546e-04,-4.706993332861907769e-03,-6.027400719037857065e-04,4.178861664490436494e-03,-1.033109058582178455e-02,-1.597551113386401619e-03,8.296838939027009968e-04,-3.469875421317004591e-03,-6.114564696122243466e-03,-3.959177763220028332e-03,-8.151330126378780738e-03,4.550878513730416235
e-03,-6.558368242290601828e-03,-6.312031994393184180e-03,3.656736753301334346e-03,4.172870893242092670e-03,7.033552844982305687e-03,6.605530928269433191e-04,4.851884746528720778e-04,-2.853533987161383564e-03,-4.055945279277023419e-03,3.379743928729715685e-03,-1.616381418857593642e-03,-7.040754807619261160e-04,3.125545579676370896e-04,3.567263733501776016e-03,-4.745092320879139465e-03,-8.290156134774601332e-04,-3.134976024341306136e-03,1.645851089497897199e-03,3.801178001630524817e-03,-8.074395504943921581e-03,5.536227181507813455e-03,5.166394752018389769e-03,-2.981103510044639691e-04,1.459368416739448672e-03,-5.938930675821506647e-03,-4.974257031775634365e-03,-6.222110046516107036e-04,2.135255868416684015e-03,3.924824500963314602e-03,6.584457094970290686e-04,5.911450745066536196e-04,-4.265217420698289540e-03,2.175308870659360076e-03,-2.182283121227110087e-03,3.313103839017065427e-03,-1.219835686018344622e-03,-2.627993772926975804e-03,-2.661820540894927877e-03,4.215778429875876102e-03,-8.747058311402565456e-03,1.071236612451191474e-03,2.062040792155440862e-03,6.172818353323309033e-03,-6.616170508451312514e-03,4.059727091219547494e-03,2.415431847100682190e-04,-1.016373090899783029e-03,7.870601695118651514e-03,3.821938846635188590e-03,-1.601488512188733136e-03,1.293855149865633146e-03,2.110462352239445092e-03,8.183164583075660387e-03,3.789246362582490094e-03,-3.716736902834547355e-04,5.983900432160594966e-04,-3.555041858693299729e-03,2.759005080939648351e-04,-1.290543706305375413e-03,8.847373418493018407e-03,-1.265152116307862069e-03,-2.253241192473513267e-03,4.809225029046165951e-03,-4.076538965550882514e-03,2.171071525224312288e-04,-4.544524109184253959e-03,-1.739487115389213114e-04,1.168725119459880588e-03,3.232806240471217119e-03,4.083742048335158373e-03,-1.656898982794817498e-03,3.848855007820111297e-03,2.770662557429140081e-04,4.859962064665759515e-03,1.147144856721514619e-04,1.047600499933481512e-02,-2.171840502074272750e-03,-9.760104911611887618e-05,1.682307249
148705212e-03,-9.992705783433131466e-04,-7.846494867373009219e-03,2.701762932563284755e-03,-6.732299019707296307e-03,1.895192570627962821e-03,8.975536665254720122e-04,-4.119837419856158846e-03,3.241149141752914555e-03,-2.618717382614355136e-03,-1.009981730407723786e-02,4.304796189083700576e-03,2.567032197851718857e-03,2.696618779414448365e-03,1.068697486608083435e-03,2.249931493231085034e-03,-2.672331106044389427e-03,5.621545205063075389e-03,-5.646214452431273963e-03,3.368655802519515598e-03,6.630998797903319732e-03,-4.565872310514936146e-03,5.881434800345137224e-04,-3.643122137870996678e-03,4.665309729469672463e-03,-2.327746443433824711e-03,4.567065546938138369e-03,6.758653204338455761e-03,3.393577505210450349e-03,-6.260968615678440090e-03,-4.234175578882430776e-04,-6.459575902240342069e-04,-3.659903040832376483e-03,-7.989399098055488535e-03,-2.473297606522551102e-03,2.031889910539964872e-03,2.925248405581846838e-03,-5.271031216574236650e-03,1.599073174865751825e-04,3.389539702935015324e-03,4.454557538257263651e-05,3.956755579593179174e-03,-1.694609236545927509e-03,3.931202520482774895e-03,-3.196800167365673045e-03,-2.531674763571568963e-04,2.051478502189801271e-03,4.140779082613186071e-03,3.888736580256075236e-03,4.297311014301548500e-03,-3.522853943070474846e-03,1.196305553043532999e-03,1.037268521666688710e-02,-1.076295727522941746e-03,9.512642897623477473e-04,5.276445076049096974e-03,2.339844947974448284e-03,1.221455073195905697e-03,-4.506955829397235776e-03,3.978048945437873476e-03,-2.001526238107689815e-03,4.430030223606359722e-03,-5.676948155900377697e-04,5.120081222141200684e-03,-2.457405703260669497e-03,3.344074206731250439e-03,-1.523950954447099741e-03,-8.914442728444101766e-04,1.917225438936899246e-03,-1.752184275325468533e-03,-1.114740719924371118e-03,-3.423667421362678719e-03,2.815788232385221367e-03,7.961031192635702886e-03,4.909298247374811915e-03,5.044303029144904353e-03,-5.692301784480254499e-03,-1.867658075347716469e-03,2.136413410490530223e-04,7.
248024079733282089e-03,-6.272362198670464271e-04,5.547115954416490533e-04,1.637201309012476869e-03,-1.254657657637176221e-03,4.929593870819037014e-03,2.918910752294686653e-04,-2.570564172388285838e-04,-3.419880551435839841e-03,-4.342215401404101761e-03,-3.143078880543331063e-03,2.319552073869632965e-03,3.553088313014081162e-04,4.125812052258787978e-03,-6.589581144705875491e-03,-4.681758088716910828e-03,-7.000438924113645519e-04,-3.287757475257907882e-04,3.340876608347629652e-03,5.163556934336484619e-03,-5.036212204244499765e-03,-6.230477703279998364e-03,4.966423080303815222e-04,5.705287876986904830e-03,2.119016378646446704e-03,-1.100188040303121725e-03,2.405544410988978872e-03,-3.683103469098485388e-03,-3.380703322056769692e-04,1.569185430476361005e-03,8.130167550421757644e-04,-7.919685115142560455e-05,7.729976817998693685e-04,-3.099917832282071597e-03,-3.704303878787111592e-03,-3.315455599596365318e-03,1.381901101641853837e-03,1.305270514620720471e-03,2.627813317188914134e-04,-1.028664269908023546e-03,1.690799993392213532e-03,8.209302890691046878e-03,1.673809217539302171e-03,2.755753510279028776e-03,-2.478798958970745633e-03,3.756293575378483532e-03,7.003658918462024596e-04,4.354362869659687656e-03,-2.534174101080855206e-03,9.913293984939713227e-05,-5.165204816646347397e-03,8.650253524366690802e-03,9.346401032655253180e-05,9.834914343477279761e-03,-8.824515985767025089e-03,-1.026745327840543473e-03,2.077413595053882213e-03,-1.502368170169725119e-03,-1.008256487710301741e-03,2.371991432319463158e-03,2.460873490177587029e-03,-4.091887308662217637e-04,5.902650421160878479e-03,-1.543636365956943925e-03,-3.396007497662624584e-03,5.255512360324319371e-03,1.266894515203464099e-02,-6.964871771782962274e-03,2.133208810432553684e-03,-6.806278222076688660e-04,4.002032887599890273e-03,-4.665512148712158171e-04,7.666894158783731387e-05,5.991241898354392192e-03,-2.082731123693007651e-03,1.521930818647873130e-04,1.684194277822194039e-03,-6.549852574806939130e-03,-3.33542871932832
7788e-04,-6.952799861671926478e-03,2.671940585099135518e-04,-1.609249399092841553e-03,-7.748736379372222864e-04,-7.144523656448264516e-03,-4.175304831188978230e-03,-3.187770505687647499e-03,-4.624049420772386419e-03,-5.642018167766009798e-03,8.291901467719123807e-04,-3.476688415147859146e-03,1.292211867315475656e-03,-1.804741978631599049e-03,-1.177585739721705204e-03,7.249811513352303477e-04,-1.800793840583304211e-04,-1.412235198824952995e-03,-3.746294101370192461e-03,-1.070483286817600679e-04,2.736772500627159682e-03,1.140257829677850283e-02,-5.818709737841033965e-03,2.405389365787348292e-03,9.400407835998777530e-04,-6.605367692404231182e-03,1.556820209802314619e-03,5.760999053703363742e-04,-1.066766009941689749e-04,-2.722833302560456314e-03,1.053051985941918373e-02,4.673010646879620193e-03,3.782325909247854535e-03,-2.603953080945978216e-03,-2.514287263192891125e-03,-5.270144730804901881e-03,1.163291274770745544e-03,-2.700452142662899610e-03,-1.008715240769702572e-02,-3.554777481355693219e-05,3.629507423775147051e-03,-4.142956545223807664e-03,2.440795053796303727e-03,2.381510578029619272e-03,-4.779496041504788587e-03,-3.165174361501765354e-03,5.186761378692663835e-03,9.900415269032125654e-04,1.813636222712087251e-03,2.168287178194197270e-04,3.229577925191017408e-04,6.259562848766568946e-03,-5.603092575120274810e-03,4.588154381989224644e-03,-3.547003692916546151e-03,-1.204840681460988324e-03,-1.167895660280216311e-02,-2.023691752837009084e-03,3.271083561407649902e-04,3.364943565583215692e-03,-4.388115313204264331e-03,-2.911733300712124588e-03,1.415544989914122812e-03,-4.101223732794001889e-03,-5.493963603621395200e-03,-7.960424592536160995e-04,-1.275274359770723048e-03,6.728354015858099314e-03,-1.553725555123776315e-03,4.028940755716690567e-03,4.626563812135927288e-03,4.677757512613086796e-03,-6.243347387805711591e-04,-5.333172016981182692e-03,-2.886886135953613784e-03,3.332203485022147606e-03,5.247068384179361632e-03,-2.851961697949820847e-03,4.558840791712885122e-
03,2.006845828331917971e-03,5.550357951273063913e-04,-5.218704628638587591e-03,-5.775086134794262145e-03,4.443716168880913926e-03,-3.561704161081934855e-03,-2.700358397057477100e-03,-2.735552688923616974e-03,8.514838583762349492e-03,5.271786304677605043e-05,-7.622572956176410603e-04,1.800350275391210961e-03,5.707123550405441420e-03,-5.685673547218224322e-03,5.368031669446008064e-04,-5.616740092666157140e-03,7.792190692170969032e-03,2.590907168694650195e-03,-3.064163968272514207e-03,1.167043981561949810e-05,-2.838267558972058097e-03,-1.011727500172407098e-04,-1.878640611791924844e-03,4.993195153398275604e-03,4.723934626416736028e-04,5.433083340182506306e-03,6.523172301228876736e-04,-1.105998427445231171e-02,1.112315996364224855e-03,5.861160277388959700e-03,7.194923884469908975e-04,2.647740840521363003e-03,-1.372703742116896872e-03,6.343903104078647522e-04,-2.017814784833395474e-03,-2.266058273940988431e-03,5.967784583744451497e-03,2.986592990230109001e-03,2.446100408748830019e-03,3.438327751226513858e-03,-1.977970336511456189e-03,-2.558658305946557698e-03,2.872609871270016997e-03,-5.530562358960707556e-05,3.026335778417754523e-03,3.730021821393427698e-03,2.331058470073851807e-03,-7.610982808152835961e-03,1.131855766614858153e-03,-2.720865630696044698e-03,1.195226585593783071e-03,1.379235397868210956e-03,5.587197072923714286e-03,-1.369833274328561808e-03,8.343889797783173023e-03,8.428103528921114529e-03,1.938805657481893624e-03,2.271764963955038522e-03,5.537983679753070466e-03,3.105359369432939779e-03,-5.676147707380467013e-03,1.927361942668791165e-03,-3.371219099807626959e-03,3.023253773302146327e-03,2.520633505081534744e-03,-2.381164177299006490e-03,7.446535210477426940e-03,4.067087147267106325e-04,1.718842774647826510e-03,6.280089366256946630e-03,1.161992858709211089e-03,-7.970651040290207406e-03,-4.093120668845090862e-03,7.095440556806226510e-04,-1.381408698320159712e-03,-4.161864587144025601e-03,6.026853517734102536e-04,3.908306412954013279e-03,3.1734377281925174
42e-03,-3.190640828534052793e-03,5.977790881736992634e-03,2.659343069312803103e-03,-1.513570778035099778e-03,5.745590465971350987e-04,-4.255357988153623573e-03,-3.544988659424513477e-03,-6.907649448569455099e-03,-7.292871325460131317e-04,3.063867223956404269e-03,6.432190121239644148e-03,7.055386192602822876e-04,-2.545140740744582849e-03,-2.498585165016741372e-03,1.958272040996597661e-03,3.844701073306846118e-03,-1.922550603175532391e-03,4.210520200158109961e-04,-1.089742760647258385e-03,-4.043979210953050664e-03,-4.413852016080804450e-03,-6.901250442177055437e-04,-6.962550504247731970e-04,5.894907822275056650e-03,-3.049835157996115195e-03,-1.796966718226461479e-03,4.965440976449115792e-03,1.258134557241670720e-04,3.293337464322199093e-03,-1.051167835330841594e-03,-1.899005060413708971e-04,1.680884868997708104e-03,1.900293016605564527e-03,-4.126347490820256769e-04,-9.929259655020573547e-03,-5.144270056952322465e-03,6.633427509404654643e-03,-4.535876980112376017e-04,-8.270620156046123734e-04,-2.567072611518382910e-03,3.465281118041202858e-03,-4.131489548464966494e-03,-1.763944246190877075e-03,-4.573207256021354430e-03,1.348463364446330585e-03,-4.486831434521827526e-03,-1.312329566823170531e-03,-3.350532924311595395e-03,-9.597199961738868065e-03,-2.360143688489834031e-03,2.076360170395510386e-03,1.375629265548295333e-03,-2.965792887458898117e-03,-1.984884448875444320e-03,7.918985328111739080e-03,5.375560888841829713e-03,-4.320329310373675034e-03,-1.187352522791461406e-03,1.295648775307678673e-03,-1.399993268096544081e-03,5.464890269855969193e-03,-6.203858735927407150e-03,3.445770939041419186e-03,-5.243134483077237055e-03,6.947754360261022832e-04,5.223304894381979729e-03,-1.659484828915411270e-03,-1.442463000117675566e-03,4.981775224427458307e-03,-1.525374701519646324e-03,-3.898586888465242241e-03,4.685785786835865163e-03,-3.072465961711463855e-03,-6.177976279149853360e-03,6.771740331552826694e-04,4.228349201685138378e-03,7.707262589536063510e-03,1.779550496186463196e-0
3,-2.027743532083247074e-03,3.423145870885788665e-03,2.054731288905643945e-03,-2.259232268777949511e-04,-4.360066557633129891e-03,-3.318871264829520440e-03,-1.049415405329412552e-03,4.492392268299187121e-03,5.717929310000266843e-03,6.026785608005330109e-04,1.924311803683618300e-03,-1.820115725533363177e-03,4.869916263712259670e-03,-5.180038566569491066e-04,-2.193943849258500449e-04,2.346639000092711619e-03,3.416421036547153105e-03,2.172627498501738478e-03,7.505446998320105005e-04,-3.795980375164569130e-03,7.389044687833629457e-03,3.471494321585976914e-03,7.175198392235235863e-04,2.158269012640386553e-03,7.064037939902087157e-03,-5.032160898296798423e-04,-6.472732115911229538e-03,-2.560599035300682304e-05,2.751193402323340743e-03,6.100336927144966309e-03,-1.629882081236959683e-03,-7.006744597726602235e-03,4.579818768184296812e-03,2.733296628355985376e-03,4.275658027385265819e-03,-1.363116533165255155e-03,-2.515455207735139628e-03,2.147155130036139392e-03,-4.541852670346804843e-03,3.925013963251977947e-03,-9.763795330720262400e-03,7.119731118463243358e-03,-2.395897978965053684e-03,6.983433402857658519e-03,-3.276258152419200299e-03,9.976794062415516098e-03,-1.170101719256779133e-03,3.321012313051926318e-03,-2.710909114789357768e-03,3.347018450736454018e-03,-6.163689015947495901e-03,1.778242416009357578e-03,2.917025186264692103e-04,5.147960727589441857e-03,-7.494082737321198021e-03,2.711551677943081860e-03,-6.391607231688700881e-03,-5.037593321387537175e-03,7.289118786330219717e-04,-1.983163044996493537e-03,2.898658428878722170e-03,1.630009684159847453e-03,8.765910359890061784e-04,4.630645667515756935e-04,1.684741944049331129e-03,-1.555756720167865599e-03,6.091456374473863894e-04,1.710717444791505981e-04,2.346619091946408695e-03,4.162213132487005418e-03,4.254288827079057318e-03,-2.671287275825136656e-03,-4.717242996210621771e-03,3.053565426586648656e-03,-9.548459925032802553e-03,-8.302778949491827418e-03,3.715923715410764583e-04,4.564675341819080662e-03,-3.6840387101205
79249e-03,-1.991876989329366323e-03,4.435251486007329297e-03,-5.573498673671638086e-03,4.424616611579686339e-03,3.884015621170827659e-03,-1.079306204102412200e-03,-3.914848817243140634e-03,5.404721422391891780e-03,-4.301254983646789036e-03,-3.079591977293923751e-03,-3.108948334976033290e-04,-3.517420041939922290e-03,-1.395952383496244092e-03,2.079699375737811481e-03,-2.392355500387547892e-03,-1.990116991876611876e-04,-4.576737870074846365e-03,4.502319664498779204e-03,2.859422206061919183e-04,-4.790133663122566644e-03,1.160869532274377055e-03,3.807382715472081534e-03,6.098742754441193424e-03,5.756747377981966465e-04,-1.338620895941492655e-04,7.361533578256851764e-05,4.758546156780394873e-03,3.871152277405051721e-03,1.850811990053507225e-03,-2.340964839327356814e-03,-6.704421063130459532e-03,-3.170488955538917474e-03,-3.847442104680150361e-03,2.552037449546623073e-03,4.344812946874861484e-03,-8.892144291254954879e-04,4.046526857192096868e-04,2.346054227571039465e-03,-1.985701503596482963e-03,-4.538292495932576449e-03,-2.838392623424656428e-03,-1.597160994298468567e-04,2.504745135553679454e-03,3.109397444436725125e-03,6.044292288580389340e-03,4.057190695528130331e-03,2.400857753033363694e-03,-2.385952136443194628e-03,-4.255069321255518060e-03,-2.093710913034829203e-04,-2.387772562765372308e-03,4.817862423344114985e-03,-4.635953249958752430e-03,-7.044786712359707610e-03,-2.996105523035256374e-03,4.844437080173620395e-03,-5.650865997752729103e-03,-8.640638734516732991e-03,-2.439409458730768237e-03,-1.279930475716820840e-03,-1.196623043789618807e-03,7.202142079660260729e-04,5.916569703151672299e-03,-1.131951264571159649e-03,1.798075032506122250e-04,2.065294962035300452e-03,-6.884807694125681145e-03,-6.521171645144283567e-03,1.540284732937410687e-03,-6.559735001629158312e-03,-4.330916813713163449e-03,-2.633977264464252600e-03,-7.377026848267577127e-04,1.359045262336724307e-03,-9.640203485657329206e-04,3.555785876530288998e-03,6.953647967072682651e-04,-1.721987081795989163e
-03,1.849677705173236519e-03,1.569980979481641094e-03,3.143440351166178551e-03,-7.949174962508812059e-03,-9.054120114420088775e-04,1.175617418623953322e-03,2.835086016741923502e-03,-8.106997806191967903e-04,-5.801392396846102688e-03,-1.083828725072949820e-03,-4.410750621509299495e-03,4.170026867892774429e-03,-6.147840507207767875e-03,-4.933940755455817201e-03,3.269006631608752361e-03,-6.925218337157126348e-03,-1.217214886225495351e-04,4.844037785743257299e-07,1.668857174597864531e-03,2.792368155273754291e-03,5.431182900724808868e-04,-2.896237916299650595e-03,4.253430098126708900e-03,-4.590220609941469618e-03,4.626284028528276568e-03,-1.136830371892791217e-03,4.074019103039667848e-03,5.315958903766597118e-03,6.571383001741690211e-03,1.408269352567433552e-03,-5.870773999962497491e-03,2.492713668141666693e-03,-3.348241202656083819e-03,-9.135459867814291202e-03,-2.792372756517069824e-05,3.315583885037084837e-03,1.705755997356572101e-03,-1.075486269015650097e-03,-5.892972621265118137e-04,-5.063348045512244648e-03,-3.440174083639188293e-03,2.014601349751028898e-03,-4.262081211497003100e-03,4.835204521335892923e-03,1.028859037011444632e-04,1.392533557380971291e-03,-2.791822007958917173e-03,-4.768349626747662301e-03,-4.404222367100755348e-03,2.731265724930352374e-03,5.800083639386747689e-03,-1.766395867185562810e-04,6.037178751338970238e-04,-3.385777606193263092e-03,8.191549368637397960e-04,7.364420637483808790e-03,-8.335845982477188944e-04,-7.763893763687725914e-03,-2.258857809473834716e-03,-2.929523649524411535e-03,1.952919956244985765e-03,2.130973234313135337e-04,-7.562744371104314313e-04,-2.268496866325132345e-03,7.090125186099895550e-04,-2.222330549332428296e-03,-7.781058827267197005e-03,2.283273649917159746e-03,-9.044409487738655235e-04,2.158813391591423662e-03,5.009961380755844663e-04,-3.418376159838685136e-03,3.207845219486765102e-03,-3.027965058934700086e-03,-3.547773247820426232e-03,-1.771855105151607605e-03,-5.039131353729257336e-03,-4.031246307326724447e-03,4.45
3911580313005616e-03,5.264516217083121778e-03,1.863896756779722524e-03,3.206991612998111546e-04,6.838394613947155077e-04,-1.411572221393703761e-03,4.464424323769073549e-04,4.909159440889752403e-03,2.121382785888396242e-03,7.225019389569849529e-03,4.741851475476489722e-04,3.323890082705597930e-03,2.097791314753683078e-03,-7.855223769496832292e-03,-5.870698376916819519e-03,-2.438547086228323637e-03,-6.100559002434436523e-03,3.346676029088356541e-03,1.772442292645681500e-03,-2.176117358950326042e-03,-5.739410590932996457e-03,2.241353852244628349e-03,5.120289830152086920e-03,2.747390310127212701e-03,2.480748376401659800e-04,1.394911183405985692e-03,-6.536590815162769859e-03,-1.391371496948848796e-03,2.289563506719137021e-04,-1.118765545708656946e-03,-1.100907157939505325e-02,-2.082165644427244611e-03,-2.513928621628171156e-03,5.251315395672986208e-05,-1.523797217415897264e-03,5.605788387480912377e-03,2.072407123425808705e-03,2.562243165412047413e-03,-2.129061018139320575e-03,-7.612721712797321900e-05,-1.207954989805848627e-03,-5.891203202370504707e-04,-2.742282652986368936e-03,1.586108588541714631e-03,5.344805243805039741e-03,-3.909068018259122430e-03,-2.565175377078656328e-03,4.347718165818898958e-03,-2.570868734048391974e-03,2.369200641418532996e-03,-5.976696885663826224e-03,-5.066027781299806826e-03,1.137348067074662496e-03,1.578097454986257097e-03,3.725296912916078139e-03,-2.347271202545977552e-03,-1.457376048139028815e-03,5.171124840335755007e-04 diff --git a/projects/resources/python/other/data/ridge_intercept.csv b/projects/resources/python/other/data/ridge_intercept.csv new file mode 100644 index 00000000..c27e430a --- /dev/null +++ b/projects/resources/python/other/data/ridge_intercept.csv @@ -0,0 +1,5 @@ +-1.270395982723125705e+00 +-8.357880016779333232e-01 +-3.836522329792435571e-01 +3.789999466034074116e-01 +-8.891637292231081569e-01 diff --git a/projects/resources/python/other/images.py b/projects/resources/python/other/images.py new file mode 100755 index 
00000000..901a3813 --- /dev/null +++ b/projects/resources/python/other/images.py @@ -0,0 +1,197 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Fri Jun 5 08:55:18 2020 + +@author: alberto.parravicini +""" + +import numpy as np +import scipy.misc +import matplotlib.pyplot as plt + +sobel_filter_diameter = 3 +sobel_filter_x = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]]) +sobel_filter_y = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) + + +def show_image(image): + plt.imshow(image, cmap="gray") + plt.show() + + +def sobel_filter(image): + out = np.zeros(image.shape) + rows, cols = image.shape + radius = sobel_filter_diameter // 2 + + for i in range(rows): + for j in range(cols): + sum_gradient_x = 0 + sum_gradient_y = 0 + for x in range(-radius, radius + 1): + for y in range(-radius, radius + 1): + nx = x + i + ny = y + j + if (nx >= 0 and ny >= 0 and nx < rows and ny < cols): + gray_value_neigh = image[nx, ny] + gradient_x = sobel_filter_x[x + radius][y + radius] + gradient_y = sobel_filter_y[x + radius][y + radius] + sum_gradient_x += gray_value_neigh * gradient_x + sum_gradient_y += gray_value_neigh * gradient_y + out[i, j] = np.sqrt(sum_gradient_x**2 + sum_gradient_y**2) + return out + + +def gaussian_kernel(diameter, sigma): + kernel = np.zeros((diameter, diameter)) + + mean = diameter / 2 + sum_tmp = 0 + for x in range(diameter): + for y in range(diameter): + kernel[x, y] = np.exp(-0.5 * ((x - mean)**2 + (y - mean)**2) / sigma**2) + sum_tmp += kernel[x, y] + for x in range(diameter): + for y in range(diameter): + kernel[x, y] /= sum_tmp + return kernel + + +def gaussian_blur(image, kernel): + out = np.zeros(image.shape) + rows, cols = image.shape + + # Blur radius; + diameter = kernel.shape[0] + radius = diameter // 2 + + # Flatten image and kernel; + image_1d = image.reshape(-1) + kernel_1d = kernel.reshape(-1) + + for i in range(rows): + for j in range(cols): + sum_tmp = 0 + for x in range(-radius, radius + 1): + for y in range(-radius, radius + 1): + nx = x + i + ny = y + j + if (nx >= 0 and ny >= 0 and nx < rows and ny < cols): + 
sum_tmp += kernel_1d[(x + radius) * diameter + (y + radius)] * image_1d[nx * cols + ny] + out[i, j] = sum_tmp + return out + + +def linear_blur(image, diameter=5): + out = np.zeros(image.shape) + rows, cols = image.shape + + # Blur radius; + radius = diameter // 2 + filter_area = diameter**2 + + # Flatten image and kernel; + image_1d = image.reshape(-1) + + for i in range(rows): + for j in range(cols): + sum_tmp = 0 + for x in range(-radius, radius + 1): + for y in range(-radius, radius + 1): + nx = x + i + ny = y + j + if (nx >= 0 and ny >= 0 and nx < rows and ny < cols): + sum_tmp += image_1d[nx * cols + ny] + out[i, j] = sum_tmp / filter_area + return out + + +def normalize(image): + return (image - np.min(image)) / (np.max(image) - np.min(image)) + + +def truncate(image, minimum=0, maximum=1): + out = image.copy() + out[out < minimum] = minimum + out[out > maximum] = maximum + return out + +#%% +if __name__ == "__main__": + face = scipy.misc.face(gray=True) + # Normalize to [0, 1]; + face = np.array(face, dtype=float) / 255 + show_image(face) + + #%% Part 1: Small blur on medium frequencies; + + # Compute blur filter; + blurred_small = linear_blur(face, 3) + show_image(blurred_small) + # Find edges; + edges_small = normalize(sobel_filter(blurred_small)) + show_image(edges_small) + + #%% Part 2: High blur on low frequencies; + + kernel = gaussian_kernel(3, 10) + blurred_large = gaussian_blur(face, kernel) + show_image(blurred_large) + # Find edges; + edges_large = sobel_filter(blurred_large) + # Extend mask; + edges_large = normalize(edges_large) * 5 + edges_large[edges_large > 1] = 1 + show_image(edges_large) + + #%% Part 3: Sharpen image; + kernel2 = gaussian_kernel(3, 10) + unsharpen = gaussian_blur(face, kernel2) + amount = 0.5 + sharpened = truncate(face * (1 + amount) - unsharpen * amount) + show_image(sharpened) + + #%% Part 4: Merge sharpened image and low frequencies; + image2 = normalize(sharpened * edges_large + blurred_large * (1 - edges_large)) + 
show_image(image2) + + #%% Part 5: Merge image and medium frequencies; + image3 = image2 * edges_small + blurred_small * (1 - edges_small) + show_image(image3) + + + + + + + \ No newline at end of file diff --git a/projects/resources/python/other/train_ensemble.py b/projects/resources/python/other/train_ensemble.py new file mode 100755 index 00000000..11b0603e --- /dev/null +++ b/projects/resources/python/other/train_ensemble.py @@ -0,0 +1,145 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + +import numpy as np +from sklearn.naive_bayes import MultinomialNB +from sklearn.linear_model import RidgeClassifier + + +def softmax(X): + return np.exp(X) / np.sum(np.exp(X), axis=1).reshape(X.shape[0], 1) + + +def logsumexp(X): + # Row-wise log-sum-exp; axis=1 is needed because the caller adds back the per-row maximum; + return np.log(np.sum(np.exp(X), axis=1)) + + +def naive_bayes_predict(X, feature_log_prob, log_class_prior): + num_features = X.shape[1] + num_classes = len(log_class_prior) + jll = np.zeros((X.shape[0], num_classes)) + jll += log_class_prior + + # K1 + for i in range(num_features): + for j in range(num_classes): # col Y.T, i.e.
feature_log_prob + for q in range(X.shape[0]): # row X + jll[q, j] += X[q, i] * feature_log_prob[j, i] + # Same as + # jll = X.dot(feature_log_prob.T) + log_class_prior + + # K2 + amax = np.amax(jll, axis=1) + + # K3 + # l = np.zeros(X.shape[0]) + # for i in range(X.shape[0]): + # logsum = 0 + # for j in range(num_classes): + # logsum += np.exp(jll[i, j] - amax[i]) + # l[i] = np.log(logsum) + amax[i] + l = logsumexp(jll - np.atleast_2d(amax).T) + amax + + # K4 + for q in range(X.shape[0]): + for j in range(num_classes): + jll[q, j] = np.exp(jll[q, j] - l[q]) + + return jll + +def normalize(X): + return (X - np.mean(X, axis=0)) / np.std(X, axis=0) + + +def ridge_pred(X, coef, intercept): + return np.dot(X, coef.T) + intercept + + +#%% + +if __name__ == "__main__": + + + rng = np.random.RandomState(1) + + num_features = 1000 + max_occurrence_of_ngram = 10 + num_classes = 5 + + X = rng.randint(max_occurrence_of_ngram, size=(6000, num_features)) + + y = rng.randint(num_classes, size=6000) + + + #%% Training; + + naive_bayes = MultinomialNB() + naive_bayes.fit(X, y) + + ridge = RidgeClassifier(random_state=rng) + ridge.fit(X, y) + + + #%% Testing; + + # Create some random inputs; + + num_test_docs = 100 + X_test = rng.randint(max_occurrence_of_ngram, size=(num_test_docs, num_features)) + + nb_scores = naive_bayes.predict_proba(X_test) + print(naive_bayes.predict(X_test)) + ridge_scores = ridge.decision_function(X_test) + print(ridge.predict(X_test)) + + print(np.argmax(softmax(nb_scores) + softmax(ridge_scores), axis=1)) + + + #%% Testing, using hand-made functions; + + nb_res_2 = naive_bayes_predict(X_test, naive_bayes.feature_log_prob_, naive_bayes.class_log_prior_) + print(np.argmax(nb_res_2, axis=1)) + + ridge_pred_2 = ridge_pred(X_test, ridge.coef_, ridge.intercept_) + print(np.argmax(ridge_pred_2, axis=1)) + + print(np.argmax(softmax(nb_res_2) + softmax(ridge_pred_2), axis=1)) + + + #%% Store matrices used in predicting results; + 
np.savetxt("data/nb_feat_log_prob.csv", naive_bayes.feature_log_prob_, delimiter=",") + np.savetxt("data/nb_class_log_prior.csv", naive_bayes.class_log_prior_, delimiter=",") + np.savetxt("data/ridge_coeff.csv", ridge.coef_, delimiter=",") + np.savetxt("data/ridge_intercept.csv", ridge.intercept_, delimiter=",") + + \ No newline at end of file diff --git a/projects/resources/python/plotting/compute_transfer_computation_overlap.py b/projects/resources/python/plotting/compute_transfer_computation_overlap.py new file mode 100755 index 00000000..0d9bc9eb --- /dev/null +++ b/projects/resources/python/plotting/compute_transfer_computation_overlap.py @@ -0,0 +1,625 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +For each benchmark, we can measure how much overlap is present in the computation. We measure 4 different types of overlap: + +CT, computation w.r.t. transfer: percentage of GPU kernel computation that overlaps with any data transfer (host-to-device or vice versa) +TC, transfer w.r.t. computation: percentage of data transfer that overlaps with (one or more) GPU kernel computation(s) +CC, computation w.r.t. computation: percentage of GPU kernel computation that overlaps with any other GPU kernel computation +TOT, any type of overlap: here we consider any type of overlap between data transfers and/or computations. Note that if a computation/data-transfer overlaps more than one computation/data-transfer, the overlap is counted only once (we consider the union of the overlap intervals) + +Measures are taken for the largest data size in the evaluation (for each benchmark), +for the block size that results in the highest speedup, to obtain a clearer understanding of what type of overlap is providing the speedup. +In general, the TOT overlap is a good proxy of the achieved speedup, although it is sometimes inflated by high CC overlap: +in fact, overlapping computations does not always translate to faster execution, +especially if kernels are large enough (in terms of threads/blocks) to fill the GPU processors on their own.
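The union-of-intervals rule described above (each overlapped instant counted once, even when several operations cover the same span) can be sketched in a few lines; `union_length` and `overlap_with` are illustrative helper names, not functions from this script:

```python
def union_length(intervals):
    # Total length covered by a list of (start, end) intervals,
    # with overlapping stretches counted only once.
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: flush it and start a new one;
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping/adjacent: extend the running interval;
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def overlap_with(span, others):
    # Overlap of `span` with the union of the `others` intervals;
    a, b = span
    clipped = [(max(a, c), min(b, d)) for c, d in others if min(b, d) > max(a, c)]
    return union_length(clipped)
```

For instance, a kernel spanning (0, 10) overlapped by transfers (2, 5) and (4, 8) contributes 6 ms of CT overlap rather than 7, because the shared stretch (4, 5) is counted once.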
+ +Created on Mon Jul 20 10:36:56 2020 + +@author: alberto.parravicini +""" + +import pandas as pd +import json +import os +import numpy as np +import time + +DEFAULT_RES_DIR = "../../../../grcuda-data/results/scheduling_nvprof_log" + +# 960 +# INPUT_DATE = "2020_10_07_960" # "2020_09_23_960" +# P100 +# INPUT_DATE = "2020_10_10_P100" # "2020_09_19_P100" +# 1660 +INPUT_DATE = "2020_10_10_1660" # "2020_10_06_1660" + +OUTPUT_DATE = "2020_10_10" +PLOT_DIR = "../../../../grcuda-data/plots" + +BENCHMARK_NAMES = {"b1": "Vector Squares", "b5": "B&S", "b6": "ML Ensemble", "b7": "HITS", "b8": "Images"} + +# 960 - Unused +# DATA_DICT = { +# "b1": "b1_31343.csv", +# "b5": "b5_808.csv", +# "b6": "b6_1989.csv", +# "b7": "b7_2663.csv", +# "b8": "b8_default_nometric_20821.csv", +# "b10": "b10_7753.csv", +# } + +# P100 - Unused +# DATA_DICT = { +# "b1": "b1_default_nometric_13259.csv", +# "b5": "b5_default_nometric_14072.csv", +# "b6": "b6_default_nometric_14724.csv", +# "b7": "b7_default_nometric_15270.csv", +# "b8": "b8_default_nometric_17585.csv", +# "b10": "b10_default_nometric_17786.csv", +# } + +# 1660 - Unused +# DATA_DICT = { +# "b1": "b1_default_nometric_True_6931.csv", +# "b5": "b5_default_nometric_True_10628.csv", +# "b6": "b6_default_nometric_True_14244.csv", +# "b7": "b7_default_nometric_True_18466.csv", +# "b8": "b8_default_nometric_True_22913.csv", +# "b10": "b10_default_nometric_True_25151.csv", +# } + +# SKIP_SUMMARY_ROWS = { +# "b1": 7, +# "b5": 22, +# "b6": 0, +# "b7": 0, +# "b8": 0, +# "b10": 0, +# } + +NVPROF_HEADER = ["start_ms", "duration_ms", "Grid X", "Grid Y", "Grid Z", "Block X", "Block Y", "Block Z", + "Registers Per Thread"," Static SMem", "Dynamic SMem", "Device", "Context", "Stream", + "transferred_data_byte", "Virtual Address", "name", "Correlation_ID"] +NVPROF_HEADER_FILTERED = NVPROF_HEADER[:2] + [NVPROF_HEADER[-7]] + [NVPROF_HEADER[-4]] + [NVPROF_HEADER[-2]] + +OPERATIONS_TO_MERGE = set(["htod", "dtoh"]) + +NUM_ITER = 29 + +def time_phase(func: 
str): + def func_call(*args, **kwargs) -> object: + start = time.time() + result = func(*args, **kwargs) + end = time.time() + print(f"{func.__name__} took {end - start:.2f} seconds") + return result + return func_call + +def get_overlap(a, b, c, d): + """ + Given 2 segments (a, b) and (c, d), get the overlap length of the 2 segments; + """ + s = max(a, c) + e = min(b, d) + return max(0, e - s), s, e + + +@time_phase +def get_overlap_ct_fast(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. + By definition, a computation has 0 overlap with itself. + Keep only overlaps of computations with transfers; + """ + data["overlap_ct_sec"] = 0.0 + + segments = [(r["start_ms"], r["end_ms"], r["name"] in OPERATIONS_TO_MERGE) for i, r in data.iterrows()] + overlap_list = np.zeros(len(segments)) + + # Initial collection of overlaps; + for i, row_i in enumerate(segments): + overlaps = [] + for j, row_j in enumerate(segments): + if row_j[0] > row_i[1]: + break + if i != j and not row_i[2] and row_j[2]: + overlap, start, end = get_overlap(row_i[0], row_i[1], row_j[0], row_j[1]) + + if overlap > 0: + overlaps += [(start, end)] + overlap_list[i] = get_total_segment_set_length(overlaps) + return sum(overlap_list), None + + +@time_phase +def get_overlap_ct(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. + By definition, a computation has 0 overlap with itself. 
+ Keep only overlaps of computations with transfers; + """ + data["overlap_ct_sec"] = 0.0 + overlap_matrix = np.zeros((len(data), len(data))) + overlap_collection = set() + # Initial collection of overlaps; + for i, row_i in data.iterrows(): + overlaps = [] + for j, row_j in data.iterrows(): + if row_j["start_ms"] > row_i["end_ms"]: + break + if i != j and row_i["name"] not in OPERATIONS_TO_MERGE and row_j["name"] in OPERATIONS_TO_MERGE: + overlap, start, end = get_overlap(row_i["start_ms"], row_i["end_ms"], row_j["start_ms"], row_j["end_ms"]) + if overlap > 0: + # overlap_collection.add((start, end)) + # overlap_matrix[j, i] = overlap + overlaps += [(start, end)] + data.at[i, "overlap_ct_sec"] = get_total_segment_set_length(overlaps) + return data["overlap_ct_sec"].sum(), overlap_matrix + # return get_total_segment_set_length(overlap_collection), overlap_matrix + + +@time_phase +def get_overlap_tc_fast(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. + By definition, a computation has 0 overlap with itself. + Keep only overlaps of transfers with computations; + """ + data["overlap_tc_sec"] = 0.0 + segments = [(r["start_ms"], r["end_ms"], r["name"] in OPERATIONS_TO_MERGE) for i, r in data.iterrows()] + overlap_list = np.zeros(len(segments)) + + # Initial collection of overlaps; + for i, row_i in enumerate(segments): + overlaps = [] + for j, row_j in enumerate(segments): + if row_j[0] > row_i[1]: + break + if i != j and row_i[2] and not row_j[2]: + overlap, start, end = get_overlap(row_i[0], row_i[1], row_j[0], row_j[1]) + if overlap > 0: + overlaps += [(start, end)] + overlap_list[i] = get_total_segment_set_length(overlaps) + return sum(overlap_list), None + + +@time_phase +def get_overlap_tc(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. + By definition, a computation has 0 overlap with itself.
+ Keep only overlaps of transfers with computations; + """ + data["overlap_tc_sec"] = 0.0 + overlap_matrix = np.zeros((len(data), len(data))) + overlap_collection = set() + # Initial collection of overlaps; + for i, row_i in data.iterrows(): + overlaps = [] + for j, row_j in data.iterrows(): + if row_j["start_ms"] > row_i["end_ms"]: + break + if i != j and row_i["name"] in OPERATIONS_TO_MERGE and row_j["name"] not in OPERATIONS_TO_MERGE: + overlap, start, end = get_overlap(row_i["start_ms"], row_i["end_ms"], row_j["start_ms"], row_j["end_ms"]) + if overlap > 0: + # overlap_collection.add((start, end)) + # overlap_matrix[j, i] = overlap + overlaps += [(start, end)] + data.at[i, "overlap_tc_sec"] = get_total_segment_set_length(overlaps) + return data["overlap_tc_sec"].sum(), overlap_matrix + # return get_total_segment_set_length(overlap_collection), overlap_matrix + + + +@time_phase +def get_overlap_cc_fast(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. + By definition, a computation has 0 overlap with itself. + Keep only overlaps of computations with computations; + """ + data["overlap_cc_sec"] = 0.0 + segments = [(r["start_ms"], r["end_ms"]) for i, r in data.iterrows() if r["name"] not in OPERATIONS_TO_MERGE] + overlap_tot = 0 + + # Initial collection of overlaps; + for i, row_i in enumerate(segments): + overlaps = [] + for j, row_j in enumerate(segments): + if j >= i: + break + if i != j: + overlap, start, end = get_overlap(row_i[0], row_i[1], row_j[0], row_j[1]) + if overlap > 0: + overlaps += [(start, end)] + overlap_tot += get_total_segment_set_length(overlaps) + return overlap_tot, None + + + +@time_phase +def get_overlap_cc(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. + By definition, a computation has 0 overlap with itself. 
+ Keep only overlaps of computations with computations; + """ + data["overlap_cc_sec"] = 0.0 + overlap_matrix = np.zeros((len(data), len(data))) + overlap_collection = set() + # Initial collection of overlaps; + for i, row_i in data.iterrows(): + overlaps = [] + for j, row_j in data.iterrows(): + # if row_j["start_ms"] > row_i["end_ms"]: + # break + if j >= i: + break + if i != j and row_i["name"] not in OPERATIONS_TO_MERGE and row_j["name"] not in OPERATIONS_TO_MERGE: + overlap, start, end = get_overlap(row_i["start_ms"], row_i["end_ms"], row_j["start_ms"], row_j["end_ms"]) + if overlap > 0: + # overlap_collection.add((start, end)) + # overlap_matrix[j, i] = overlap + overlaps += [(start, end)] + data.at[i, "overlap_cc_sec"] = get_total_segment_set_length(overlaps) + return data["overlap_cc_sec"].sum(), overlap_matrix + # return get_total_segment_set_length(overlap_collection), overlap_matrix + + +@time_phase +def get_overlap_total_fast(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. + By definition, a computation has 0 overlap with itself; + """ + data["overlap_tot_sec"] = 0.0 + segments = [(r["start_ms"], r["end_ms"]) for i, r in data.iterrows()] + overlap_tot = 0 + + # Initial collection of overlaps; + for i, row_i in enumerate(segments): + overlaps = [] + for j, row_j in enumerate(segments): + if j >= i: + break + if i != j : + overlap, start, end = get_overlap(row_i[0], row_i[1], row_j[0], row_j[1]) + if overlap > 0: + overlaps += [(start, end)] + overlap_tot += get_total_segment_set_length(overlaps) + return overlap_tot, None + + +@time_phase +def get_overlap_total(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. 
+ By definition, a computation has 0 overlap with itself; + """ + data["overlap_total_sec"] = 0.0 + overlap_matrix = np.zeros((len(data), len(data))) + overlap_collection = set() + # Initial collection of overlaps; + for i, row_i in data.iterrows(): + overlaps = [] + for j, row_j in data.iterrows(): + # if row_j["start_ms"] > row_i["end_ms"]: + # break + if j >= i: + break + if i != j: + overlap, start, end = get_overlap(row_i["start_ms"], row_i["end_ms"], row_j["start_ms"], row_j["end_ms"]) + if overlap > 0: + # overlap_collection.add((start, end)) + # overlap_matrix[j, i] = overlap + overlaps += [(start, end)] + data.at[i, "overlap_total_sec"] = get_total_segment_set_length(overlaps) + return data["overlap_total_sec"].sum(), overlap_matrix + # return get_total_segment_set_length(overlap_collection), overlap_matrix + + +def get_total_segment_set_length(segments): + + def merge_overlaps(a, b, c, d): + start = max(a, c) + end = min(b, d) + if start < end: + return (min(a, c), max(b, d)) + else: + return None + + overlap_collection = set(segments) + # Join overlaps until a fixed point is reached; + while True: + new_overlap_collection = set() + for i, s_i in enumerate(overlap_collection): + skip_set = set() + merge_done = False + for j, s_j in enumerate(overlap_collection): + if j >= i or j in skip_set: + break + overlap = merge_overlaps(*s_i, *s_j) + # If a new merged overlap is created, add it to the collection, else add both segments; + if overlap: + merge_done = True + skip_set.add(j) + new_overlap_collection.add(overlap) + if not merge_done: + new_overlap_collection.add(s_i) + + if (len(new_overlap_collection) == len(overlap_collection)) or len(overlap_collection) <= 1: + break + else: + # print(len(new_overlap_collection)) + overlap_collection = new_overlap_collection + + return sum([s[1] - s[0] for s in overlap_collection]) + + +@time_phase +def read_nvprof_log(input_path: str) -> pd.DataFrame: + + skip_rows = 5 + try: + header = pd.read_csv(input_path, 
skiprows=3, nrows=1) + start_unit = header.iloc[0, 0] + duration_unit = header.iloc[0, 1] + except: + header = pd.read_csv(input_path, skiprows=5, nrows=1) + start_unit = header.iloc[0, 0] + duration_unit = header.iloc[0, 1] + skip_rows = 7 + data = pd.read_csv(input_path, skiprows=skip_rows, names=NVPROF_HEADER, dtype={"Unified Memory": "str"}) + + # Keep only a subset of columns; + data = data[NVPROF_HEADER_FILTERED] + + # Remove page faults; + data = data[data["name"] != "[Unified Memory GPU page faults]"].reset_index(drop=True) + data = data[data["name"] != "[Unified Memory page throttle]"].reset_index(drop=True) + + # Remove rows with NaN Duration; + data = data.dropna(subset=["duration_ms"]).reset_index(drop=True) + + # Fix data transfer column; + data["transferred_data_byte"] = data["transferred_data_byte"].apply(lambda x: int(str(x).split(".")[0]) if not pd.isnull(x) else 0) + + # Rename "Device" column; + data = data.rename(columns={"Device": "device"}) + try: + data[["device_start", "device_end"]] = data["device"].str.split(",", expand=True) + except ValueError: + data["device_start"] = data["device"] + data["device_end"] = None + data["device_start"] = data["device_start"].str.split("(").str[-1].str.replace(")", "") + data["device_end"] = data["device_end"].replace({None: ""}).str.split("(").str[-1].str.replace(")", "") + + # Convert start and duration from seconds to milliseconds; + if start_unit == "s": + data["start_ms"] *= 1000 + elif start_unit == "us": + data["start_ms"] /= 1000 + if duration_unit == "s": + data["duration_ms"] *= 1000 + elif duration_unit == "us": + data["duration_ms"] /= 1000 + + # Set the start of the computation equal to 0; + data["start_ms"] -= data["start_ms"].iloc[0] + + # Set the end of the computation; + data["end_ms"] = data["duration_ms"] + data["start_ms"] + + # Clean names of operations; + data["name"] = data["name"].replace({ + "[Unified Memory Memcpy HtoD]": "htod", + "[Unified Memory Memcpy DtoH]": "dtoh", + 
"[Unified Memory Memcpy DtoD]": "dtod", + }) + + # Keep just the name of kernels; + data["name"] = data["name"].apply(lambda x: x.split("(")[0]) + + # Fix names of transfer; + data.loc[data["name"] != "dtod", "device_end"] = data[data["name"] != "dtod"]["device_start"] + data.loc[data["name"] == "htod", "device_start"] = "CPU" + data.loc[data["name"] == "dtoh", "device_end"] = "CPU" + + return data + + +@time_phase +def read_nvprof_log_a100(input_path: str) -> pd.DataFrame: + data = pd.read_csv(input_path) + + # Keep only a subset of columns; + data = data.iloc[:, [0, 1, 12, 13, 16, 19]] + + # Convert start and duration to milliseconds; + start_unit = data.columns[0].split("(")[-1].replace(")", "") + duration_unit = data.columns[1].split("(")[-1].replace(")", "") + if start_unit == "s": + data[data.columns[0]] *= 1000 + elif start_unit == "us": + data[data.columns[0]] /= 1000 + elif start_unit == "ns": + data[data.columns[0]] /= 1000000 + if duration_unit == "s": + data[data.columns[1]] *= 1000 + elif duration_unit == "us": + data[data.columns[1]] /= 1000 + elif duration_unit == "ns": + data[data.columns[1]] /= 1000000 + + # Rename columns; + data = data.rename(columns={data.columns[0]: "start_ms", + data.columns[1]: "duration_ms", + "Device": "device", + "Bytes (MB)": "transferred_data_byte", + "Thruput (MBps)": "transfer_throughput_bytesec", + "Name": "name", + }) + + # Remove page faults; + data = data[data["name"] != "[Unified Memory GPU page faults]"].reset_index(drop=True) + data = data[data["name"] != "[Unified Memory page throttle]"].reset_index(drop=True) + + # Remove rows with NaN Duration; + data = data.dropna(subset=["duration_ms"]).reset_index(drop=True) + + # Turn MB into bytes (1 MB = 1e6 bytes); + data["transferred_data_byte"] *= 1e6 + data["transfer_throughput_bytesec"] *= 1e6 + + # Set the start of the computation equal to 0; + data["start_ms"] -= data["start_ms"].iloc[0] + + # Set the end of the computation; + data["end_ms"] = data["duration_ms"] + 
data["start_ms"] + + # Clean names of operations; + data["name"] = data["name"].replace({ + "[CUDA Unified Memory memcpy HtoD]": "htod", + "[CUDA Unified Memory memcpy DtoH]": "dtoh", + "[CUDA Unified Memory memcpy DtoD]": "dtod", + }) + + # Keep just the name of kernels; + data["name"] = data["name"].apply(lambda x: x.split("(")[0]) + + # Add start-end devices; + data["device_end"] = data["device"].str.split("(").str[-1].str.replace(")", "") + data["device_start"] = None + + # Fix start-end devices; + data.loc[data["name"] == "htod", "device_start"] = "CPU" + data.loc[data["name"] == "dtoh", "device_start"] = data.loc[data["name"] == "dtoh", "device_end"] + data.loc[data["name"] == "dtoh", "device_end"] = "CPU" + data.loc[data["name"] == "dtod", "device_start"] = "NVSwitch" + + return data + +#%% + +if __name__ == "__main__": + + files = os.listdir(os.path.join(DEFAULT_RES_DIR, INPUT_DATE)) + files_dict = {tuple(file.split(".")[0].split("_")[:4]): file for file in files if file != "summary.csv"} + + # Filter files; + filtered_files = [v for k, v in files_dict.items() if k[1] == "default" and k[2] == "nometric" and k[3] == "True"] + + output_res = [] + for b in filtered_files: + + data = read_nvprof_log(os.path.join(DEFAULT_RES_DIR, INPUT_DATE, b)) + + # 960 + # merge_dict = {"b1": 0.1, "b5": 0.1, "b6": 0.1, "b7": 0.1, "b8": 0.1, "b10": 0.1} + # P100 + # merge_dict = {"b1": 0.1, "b5": 0.1, "b6": 0.1, "b7": 0.1, "b8": 0.1, "b10": 0.1} + + # Create a summary data-set where contiguous operations are merged; + data["group"] = -1 + current_group = -1 + current_operation = "" + for i, row in data.iterrows(): + tmp_operation = row["name"] + # Keep adding to the current operation, if the time difference between the 2 operations is small enough to consider them as contiguous; + if tmp_operation == current_operation and tmp_operation in OPERATIONS_TO_MERGE and i > 0: # and (row["start_ms"] - data.at[i - 1, "end_ms"] < merge_dict[b]): + data.at[i, "group"] = current_group + 
else: + # New group of operations; + current_operation = tmp_operation + current_group += 1 + data.at[i, "group"] = current_group + + summary = data.groupby(["group"]).agg({ + "start_ms": np.min, + "duration_ms": np.sum, + "transferred_data_byte": np.sum, + "name": lambda x: x.iloc[0], + "end_ms": np.max, + "group": lambda x: x.iloc[0], + }) + + # Ignore the first iteration; + # summary = summary.iloc[SKIP_SUMMARY_ROWS[b]:, :].reset_index(drop=True) + # # Set the start of the computation equal to 0; + # summary["end_ms"] -= summary["start_ms"].iloc[0] + # summary["start_ms"] -= summary["start_ms"].iloc[0] + + summary["end_ms"] = summary["duration_ms"] + summary["start_ms"] + # Filter operations that last too little to be useful; + # summary = summary[(summary["end_ms"] - summary["start_ms"]) > 0.001].reset_index(drop=True) + print(b, len(summary)) + #%% + + # Compute 3 types of overlap: + # 1. Percentage of computation overlapped with transfer; + # summary["overlap_ct_sec"] = get_overlap_matrix_ct(summary).sum(axis=0) + # summary["overlap_ct_perc"] = summary["overlap_ct_sec"] / summary["duration_ms"] + # ct_overlap_perc = summary[~summary["name"].isin(OPERATIONS_TO_MERGE)]["overlap_ct_perc"].mean() + ct_overlap, ct_overlap_matrix = get_overlap_ct_fast(summary) + ct_overlap_perc = ct_overlap / summary[~summary["name"].isin(OPERATIONS_TO_MERGE)]["duration_ms"].sum() + + # # 2. Percentage of transfer overlapped with computation; + # summary["overlap_tc_sec"] = get_overlap_matrix_tc(summary).sum(axis=0) + # summary["overlap_tc_perc"] = summary["overlap_tc_sec"] / summary["duration_ms"] + # tc_overlap_perc = summary[summary["name"].isin(OPERATIONS_TO_MERGE)]["overlap_tc_perc"].mean() + tc_overlap, tc_overlap_matrix = get_overlap_tc_fast(summary) + tc_overlap_perc = tc_overlap / summary[summary["name"].isin(OPERATIONS_TO_MERGE)]["duration_ms"].sum() + # tc_overlap_perc = 0 + + # # 3. 
Percentage of computation overlapped with other computations; + # summary["overlap_cc_sec"] = get_overlap_matrix_cc(summary).sum(axis=0) + # summary["overlap_cc_perc"] = summary["overlap_cc_sec"] / summary["duration_ms"] + # cc_overlap_perc = summary[~summary["name"].isin(OPERATIONS_TO_MERGE)]["overlap_cc_perc"].mean() + cc_overlap, cc_overlap_matrix = get_overlap_cc_fast(summary) + cc_overlap_perc = cc_overlap / summary[~summary["name"].isin(OPERATIONS_TO_MERGE)]["duration_ms"].sum() + # cc_overlap_perc = 0 + + total_overlap, total_overlap_matrix = get_overlap_total_fast(summary) + total_overlap_perc = total_overlap / summary["duration_ms"].sum() + # total_overlap_perc = 0 + + print(f"Benchmark={b}; CT={100 *ct_overlap_perc:.2f}%; TC={100 * tc_overlap_perc:.2f}%; CC={100 * cc_overlap_perc:.2f}%; TOTAL={100 * total_overlap_perc:.2f}%") + output_res += [[b.split("_")[0], ct_overlap_perc, tc_overlap_perc, cc_overlap_perc, total_overlap_perc]] + + # Store the DataFrame; + out_df = pd.DataFrame(output_res) + out_df.to_csv(os.path.join(DEFAULT_RES_DIR, INPUT_DATE, "summary.csv"), index=False, header=["benchmark", "ct_overlap_perc", "tc_overlap_perc", "cc_overlap_perc", "total_overlap_perc"]) + + + + diff --git a/projects/resources/python/plotting/load_data.py b/projects/resources/python/plotting/load_data.py new file mode 100755 index 00000000..6c7d5f1e --- /dev/null +++ b/projects/resources/python/plotting/load_data.py @@ -0,0 +1,514 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Sat Jun 20 09:43:46 2020 + +@author: alberto.parravicini +""" + +import pandas as pd +import json +import os +import numpy as np +import functools +from scipy.stats.mstats import gmean +import segretini_matplottini.src.plot_utils as pu + +DEFAULT_RES_DIR = "../../../../grcuda-data/results/scheduling_multi_gpu" +DEFAULT_RES_CUDA_DIR = "../../../../grcuda-data/results/scheduling_multi_gpu" +PLOT_DIR = "../../../../grcuda-data/plots/multi_gpu" + +ASYNC_POLICY_NAME = "async" # If parsing new results; +# ASYNC_POLICY_NAME = "default" # If parsing older results; + +BENCHMARK_NAMES = { + "b1m": "VEC", + "b5m": "B&S", + "b6m": "ML", + "b9m": "CG", + "b11m": "MUL", + } + +POLICY_NAMES = { + "sync": "SYNC", + ASYNC_POLICY_NAME: "ASYNC", + } + + +def _load_dictionaries(input_folders: list, benchmark=""): + dictionaries = [] + for folder in input_folders: + input_path = os.path.join(DEFAULT_RES_DIR, folder) + + # Load results as JSON; + data_dict = {} + for res in os.listdir(input_path): + with open(os.path.abspath(os.path.join(input_path, res)), "r") as f: + if not benchmark or res.split("_")[6] == benchmark: + data_dict[res] = json.load(f) + dictionaries += [data_dict] + return dictionaries + + +def _basic_filename_cleaning(filename, dictionary): + # Parse filename; + benchmark = filename.split("_")[6] + + # Retrieve other information; + total_iterations = int(dictionary["num_iterations"]) + try: + cpu_validation = dictionary["cpu_validation"].lower() == "true" + except AttributeError: # It's already bool; + cpu_validation = dictionary["cpu_validation"] + try: + random_init = dictionary["random_init"].lower() == "true" + except AttributeError: # It's already bool; + random_init = dictionary["random_init"] + size_dict = dictionary["benchmarks"][benchmark] + return [benchmark, total_iterations, cpu_validation, random_init], size_dict + + +def load_data(input_date: str, skip_iter=0, remove_inf=True, 
remove_time_zero=True, benchmark="", phases=None) -> pd.DataFrame: + """ + Load the benchmark results located in the input sub-folder + :param input_date: name of the folder where results are located, as a subfolder of DEFAULT_RES_DIR + :param skip_iter: skip the first iterations for each benchmark, as they are considered warmup + :param remove_inf: remove rows with infinite speedup value, as they are not useful + :param remove_time_zero: if True, remove rows with 0 computation time; + :param benchmark: load data only for the specified benchmark + :param phases: list of benchmark phases to add as columns + :return: a DataFrame containing the results + """ + data_dict = _load_dictionaries([input_date], benchmark)[0] + + phases_names = [] + + # Turn results into a pd.DataFrame; + rows = [] + for k, v in data_dict.items(): + row, size_dict = _basic_filename_cleaning(k, v) + + # Parse data for each input data size, and other settings; + for size, val_size in size_dict.items(): + row += [int(size)] + for num_gpu, val_num_gpu in val_size.items(): + row += [int(num_gpu)] + for num_blocks, val_num_blocks in val_num_gpu.items(): + for exec_policy, val_exec_policy in val_num_blocks.items(): + row += [exec_policy] + for dependency_policy, val_dependency_policy in val_exec_policy.items(): + row += [dependency_policy] + for new_stream_policy, val_new_stream_policy in val_dependency_policy.items(): + row += [new_stream_policy] + for parent_stream_policy, val_parent_stream_policy in val_new_stream_policy.items(): + row += [parent_stream_policy] + for device_selection_policy, val_device_selection_policy in val_parent_stream_policy.items(): + row += [device_selection_policy] + for prefetch, val_prefetch in val_device_selection_policy.items(): + row += [prefetch] + for stream_attach, val_stream_attach in val_prefetch.items(): + row += [stream_attach.lower() == "true" or stream_attach == "True"] + for kernel_timer_enabled, val_kernel_timer_enabled in val_stream_attach.items(): + row += 
[kernel_timer_enabled == "true" or kernel_timer_enabled == "True"] + for realloc, val_realloc in val_kernel_timer_enabled.items(): + row += [realloc == "true" or realloc == "True"] + for reinit, val_reinit in val_realloc.items(): + row += [reinit == "true" or reinit == "True"] + for block_size, val_block_size in val_reinit.items(): + # Process each iteration; + block_size_1d = int(block_size.split(",")[0]) + block_size_2d = int(block_size.split(",")[1]) + block_size_str = str(block_size_1d) + "," + str(block_size_2d) + row += [int(num_blocks), block_size_1d, block_size_2d, block_size_str] + + for curr_iteration in val_block_size: + num_iter = curr_iteration["iteration"] + gpu_result = curr_iteration["gpu_result"] + total_time_sec = curr_iteration["total_time_sec"] + overhead_sec = curr_iteration["overhead_sec"] + computation_sec = curr_iteration["computation_sec"] + + # Process phases; + phases_time = [] + if phases: + phases_time = [p["time_sec"] for p in curr_iteration["phases"] if p["name"] in phases] + if not phases_names: + phases_names = [p["name"] for p in curr_iteration["phases"] if p["name"] in phases] + + # Add a new row; + if (num_iter >= skip_iter): + rows += [row + [num_iter - skip_iter, gpu_result, total_time_sec, overhead_sec, computation_sec] + phases_time] + + columns = ["benchmark", "total_iterations", "cpu_validation", "random_init", "size", "gpus", "exec_policy", "dependency_policy", "new_stream_policy", "parent_stream_policy", + "device_selection_policy", "prefetcher", "force_stream_attach", "kernel_timing", "realloc", "reinit", + "num_blocks", "block_size_1d", "block_size_2d", "block_size_str", + "num_iter", "gpu_result", "total_time_sec", "overhead_sec", "computation_sec"] + (phases_names if phases else []) + + data = pd.DataFrame(rows, columns=columns).sort_values(by=columns[:20], ignore_index=True) + + # Clean columns with 0 computation time; + if remove_time_zero: + data = data[data["computation_sec"] > 0].reset_index(drop=True) + + # 
Compute speedups; + compute_speedup(data, ["benchmark", "total_iterations", "cpu_validation", "random_init", "size", "exec_policy", "dependency_policy", "new_stream_policy", + "device_selection_policy", "prefetcher", "force_stream_attach", "kernel_timing", "realloc", "reinit"]) + + # # Clean columns with infinite speedup; + # if remove_inf: + # data = data[data["computation_speedup"] != np.inf].reset_index(drop=True) + + return data + + +def load_data_grcuda_multigpu(input_folders: list, skip_iter=0, remove_inf=True, remove_time_zero=True, benchmark="", phases=None) -> pd.DataFrame: + """ + Load the benchmark results located in the input sub-folder + :param input_folders: list of the folders where results are located, as a subfolder of DEFAULT_RES_DIR + :param skip_iter: skip the first iterations for each benchmark, as they are considered warmup + :param remove_inf: remove rows with infinite speedup value, as they are not useful + :param remove_time_zero: if True, remove rows with 0 computation time; + :param benchmark: load data only for the specified benchmark + :param phases: list of benchmark phases to add as columns + :return: a DataFrame containing the results + """ + dictionaries = _load_dictionaries(input_folders, benchmark) + + data_tmp = [] + for dictionary in dictionaries: + # Turn results into a pd.DataFrame; + rows = [] + phases_names = [] + for k, v in dictionary.items(): + row, d = _basic_filename_cleaning(k, v) + # Parse data for each input data size, and other settings; + for size, d in d.items(): + row += [int(size)] + for num_gpu, d in d.items(): + row += [int(num_gpu)] + for num_blocks, d in d.items(): + for exec_policy, d in d.items(): + row += [exec_policy] + for dependency_policy, d in d.items(): + row += [dependency_policy] + for new_stream_policy, d in d.items(): + row += [new_stream_policy] + for parent_stream_policy, d in d.items(): + row += [parent_stream_policy] + for device_selection_policy, d in d.items(): + row += 
[device_selection_policy] + for mem_advise, d in d.items(): + row += [mem_advise] + for prefetch, d in d.items(): + row += [prefetch] + for stream_attach, d in d.items(): + row += [stream_attach.lower() == "true"] + for kernel_timer_enabled, d in d.items(): + row += [kernel_timer_enabled.lower() == "true"] + for realloc, d in d.items(): + row += [realloc.lower() == "true"] + for reinit, d in d.items(): + row += [reinit.lower() == "true"] + for block_size, d in d.items(): + # Process each iteration; + try: + block_size_1d = int(block_size.split(",")[0]) + except: + print(k) + block_size_2d = int(block_size.split(",")[1]) + block_size_str = str(block_size_1d) + "," + str(block_size_2d) + row += [int(num_blocks), block_size_1d, block_size_2d, block_size_str] + + for curr_iteration in d: + num_iter = curr_iteration["iteration"] + gpu_result = curr_iteration["gpu_result"] + total_time_sec = curr_iteration["total_time_sec"] + overhead_sec = curr_iteration["overhead_sec"] + computation_sec = curr_iteration["computation_sec"] + + # Process phases; + phases_time = [] + if phases: + phases_time = [p["time_sec"] for p in curr_iteration["phases"] if p["name"] in phases] + if not phases_names: + phases_names = [p["name"] for p in curr_iteration["phases"] if p["name"] in phases] + + # Add a new row; + if (num_iter >= skip_iter): + rows += [row + [num_iter - skip_iter, gpu_result, total_time_sec, overhead_sec, computation_sec] + phases_time] + + columns = ["benchmark", "total_iterations", "cpu_validation", "random_init", "size", "gpus", + "exec_policy", "dependency_policy", "new_stream_policy", "parent_stream_policy", + "device_selection_policy", "mem_advise", "prefetch", "force_stream_attach", "kernel_timing", "realloc", "reinit", + "num_blocks", "block_size_1d", "block_size_2d", "block_size_str", + "num_iter", "gpu_result", "total_time_sec", "overhead_sec", "computation_sec"] + (phases_names if phases else []) + + data_tmp += [pd.DataFrame(rows, 
columns=columns).sort_values(by=columns[:21], ignore_index=True)] + + # Concatenate results; + data = pd.concat(data_tmp, ignore_index=True) + + # Clean columns with 0 computation time; + if remove_time_zero: + data = data[data["computation_sec"] > 0].reset_index(drop=True) + + # FIXME: Execution time in CG ASYNC, 1 GPU explodes when using the largest size; + data = data.query("~(benchmark == 'b9m' & exec_policy == 'async' & gpus == 1 & num_iter > 11)") + + # Compute speedups; + pu.compute_speedup_df(data, key=["benchmark", "total_iterations", "cpu_validation", "random_init", "size", "dependency_policy", + "mem_advise", "prefetch", "force_stream_attach", "kernel_timing", "realloc", "reinit"], + baseline_filter_col=["exec_policy", "new_stream_policy", "parent_stream_policy", "device_selection_policy", "gpus"], + baseline_filter_val=[ASYNC_POLICY_NAME, "always-new", "disjoint", "round-robin", 1], + time_column="computation_sec", aggregation=np.mean) + + # Clean columns with infinite speedup; + if remove_inf: + data = data[data["speedup"] != np.inf].reset_index(drop=True) + + data["benchmark"] = data["benchmark"].replace(BENCHMARK_NAMES) + data["exec_policy"] = data["exec_policy"].replace(POLICY_NAMES) + data["benchmark"] = pd.Categorical(data["benchmark"], list(BENCHMARK_NAMES.values())) + data["exec_policy"] = pd.Categorical(data["exec_policy"], list(POLICY_NAMES.values())) + + data = data.sort_values(["benchmark", "exec_policy", "size", "num_iter"]).reset_index(drop=True) + + return data + + +def load_data_cuda(input_date: str, skip_iter=0, remove_inf=True, remove_time_zero=True, add_prefetch_as_policy=True) -> pd.DataFrame: + """ + Load the benchmark results located in the input sub-folder + :param input_date: name of the folder where results are located, as a subfolder of DEFAULT_RES_DIR + :param skip_iter: skip the first iterations for each benchmark, as they are considered warmup + :param remove_inf: if True, remove rows with infinite speedup + :param 
remove_time_zero: if True, remove rows with 0 computation time; + :param add_prefetch_as_policy: if True, consider prefetching as part of the policy, to compute speedups w.r.t. sync with no prefetching + :return: a DataFrame containing the results + """ + input_path = os.path.join(DEFAULT_RES_DIR, input_date) + + # Load results as pd.DataFrames; + data_tmp = [] + for f in os.listdir(input_path): + # Parse filename; + try: + benchmark, exec_policy, size, block_size_1d, block_size_2d, force_prefetch, total_iterations, num_blocks = os.path.splitext(f)[0].split("_")[7:] + force_prefetch = force_prefetch == "True" + except ValueError: + benchmark, exec_policy, size, block_size_1d, block_size_2d, total_iterations, num_blocks, force_prefetch = os.path.splitext(f)[0].split("_")[7:] + [False] + tmp_data = pd.read_csv(os.path.join(input_path, f)) + + # Skip first lines; + tmp_data = tmp_data.iloc[skip_iter:, :] + + # Add other information; + tmp_data["benchmark"] = benchmark + tmp_data["exec_policy"] = exec_policy + tmp_data["force_prefetch"] = bool(force_prefetch) + tmp_data["size"] = int(size) + tmp_data["block_size_1d"] = int(block_size_1d) + tmp_data["block_size_2d"] = int(block_size_2d) + tmp_data["block_size_str"] = block_size_1d + ",8" # block_size_1d + "," + block_size_2d + tmp_data["total_iterations"] = int(total_iterations) + data_tmp += [tmp_data] + + data = pd.concat(data_tmp).reset_index(drop=True) + data["num_iter"] -= skip_iter + + # Reorder columns; + columns = ["benchmark", "exec_policy", "force_prefetch", "block_size_1d", "block_size_2d", "block_size_str", + "total_iterations", "size", "num_iter", "gpu_result", "total_time_sec", "overhead_sec", "computation_sec"] + data = data[columns] + + # Clean columns with 0 computation time; + if remove_time_zero: + data = data[data["computation_sec"] > 0].reset_index(drop=True) + + # Compute speedups; + if add_prefetch_as_policy: + data["exec_policy_full"] = data["exec_policy"] + np.where(data["force_prefetch"], "_f", 
"") + compute_speedup(data, ["benchmark", "block_size_1d", "block_size_2d", "size"], baseline_filter_col="exec_policy_full", baseline_filter_val="sync") + else: + compute_speedup(data, ["benchmark", "force_prefetch", "block_size_1d", "block_size_2d", "size"]) + + # Clean columns with infinite speedup; + if remove_inf: + data = data[data["computation_speedup"] != np.inf].reset_index(drop=True) + + return data + + +def load_data_cuda_multigpu(input_folders: list, skip_iter=0, remove_inf=True, remove_time_zero=True) -> pd.DataFrame: + """ + Load the benchmark results located in the input sub-folders + :param input_folders: names of the folders where results are located, as subfolders of DEFAULT_RES_CUDA_DIR + :param skip_iter: skip the first iterations for each benchmark, as they are considered warmup + :param remove_inf: if True, remove rows with infinite speedup + :param remove_time_zero: if True, remove rows with 0 computation time; + :return: a DataFrame containing the results + """ + + # Load results as pd.DataFrames; + data_tmp = [] + for folder in input_folders: + input_path = os.path.join(DEFAULT_RES_CUDA_DIR, folder) + for f in os.listdir(input_path): + # Parse filename; + benchmark, exec_policy, size, num_gpu, block_size_1d, block_size_2d, prefetch, total_iterations, num_blocks = os.path.splitext(f)[0].split("_")[7:] + tmp_data = pd.read_csv(os.path.join(input_path, f)) + + # Skip first lines; + tmp_data = tmp_data.iloc[skip_iter:, :] + + # Add other information; + tmp_data["benchmark"] = benchmark + tmp_data["exec_policy"] = exec_policy + tmp_data["prefetch"] = prefetch + tmp_data["size"] = int(size) + tmp_data["gpus"] = int(num_gpu.replace("gpu", "")) + tmp_data["block_size_1d"] = int(block_size_1d) + tmp_data["block_size_2d"] = int(block_size_2d) + tmp_data["num_blocks"] = int(num_blocks) + tmp_data["block_size_str"] = block_size_1d + ",8" + tmp_data["total_iterations"] = int(total_iterations) + data_tmp += [tmp_data] + + data = pd.concat(data_tmp, 
ignore_index=True) + data["num_iter"] -= skip_iter + + # Clean names; + data["exec_policy"].replace({"default": ASYNC_POLICY_NAME}, inplace=True) + data["prefetch"].replace({"none": "false"}, inplace=True) + + # Reorder columns; + columns = ["benchmark", "exec_policy", "prefetch", "block_size_1d", "block_size_2d", "num_blocks", "block_size_str", + "total_iterations", "size", "gpus", "num_iter", "gpu_result", "total_time_sec", "overhead_sec", "computation_sec"] + data = data[columns] + + # Clean columns with 0 computation time; + if remove_time_zero: + data = data[data["computation_sec"] > 0].reset_index(drop=True) + + # Compute speedups; + pu.compute_speedup_df(data, ["benchmark", "prefetch", "block_size_1d", "block_size_2d", "size"], + baseline_filter_col=["exec_policy", "gpus"], baseline_filter_val=[ASYNC_POLICY_NAME, 1], + time_column="computation_sec") + + # Clean columns with infinite speedup; + if remove_inf: + data = data[data["speedup"] != np.inf].reset_index(drop=True) + + data["benchmark"] = data["benchmark"].replace(BENCHMARK_NAMES) + data["exec_policy"] = data["exec_policy"].replace(POLICY_NAMES) + data["benchmark"] = pd.Categorical(data["benchmark"], list(BENCHMARK_NAMES.values())) + data["exec_policy"] = pd.Categorical(data["exec_policy"], list(POLICY_NAMES.values())) + + data = data.sort_values(["benchmark", "exec_policy", "size", "num_iter"]).reset_index(drop=True) + + return data + + +def compute_speedup(data, key, speedup_col_name="computation_speedup", time_column="computation_sec", + baseline_filter_col="gpus", baseline_filter_val=1, baseline_col_name="baseline_time_sec", + correction=True, aggregation=np.median): + + # Initialize speedup values; + data[speedup_col_name] = 1 + data[baseline_col_name] = 0 + + grouped_data = data.groupby(key, as_index=False) + for group_key, group in grouped_data: + # Compute the median baseline computation time; + median_baseline = aggregation(group.loc[group[baseline_filter_col] == baseline_filter_val, 
time_column]) + # Compute the speedup for this group; + group.loc[:, speedup_col_name] = median_baseline / group[time_column] + group.loc[:, baseline_col_name] = median_baseline + data.loc[group.index, :] = group + + # Guarantee that the geometric mean of speedup referred to the baseline is 1, and adjust speedups accordingly; + if correction: + gmean_speedup = gmean(group.loc[group[baseline_filter_col] == baseline_filter_val, speedup_col_name]) + group.loc[:, speedup_col_name] /= gmean_speedup + data.loc[group.index, :] = group + + +def join_tables(t1, t2, key=["benchmark", "exec_policy", "block_size_1d", "block_size_2d", "block_size_str", + "size", "num_iter"], keep_common_columns=True): + t1_tmp = t1.copy() + t2_tmp = t2.copy() + t1_tmp = t1_tmp.set_index(key) + t2_tmp = t2_tmp.set_index(key) + if keep_common_columns: + common_columns = [x for x in t1_tmp.columns if x in t2_tmp.columns] + t1_tmp = t1_tmp[common_columns] + t2_tmp = t2_tmp[common_columns] + + merged = t1_tmp.merge(t2_tmp, suffixes=("_grcuda", "_cuda"), left_index=True, right_index=True, sort=True).reset_index() + # merged = merged.merge(t2_tmp, suffixes=("_cuda2", ""), left_index=True, right_index=True, sort=True).reset_index() + merged["grcuda_cuda_speedup"] = merged["computation_sec_cuda"] / merged["computation_sec_grcuda"] + return merged + + +def join_tables_baseline(data_cuda_in, data_grcuda_in): + data_cuda = data_cuda_in.copy() + data_grcuda = data_grcuda_in.copy() + baseline_policies = data_cuda["exec_policy"].unique() + for b in baseline_policies: + data_grcuda["speedup_" + b] = 1 + + filter_df = ["benchmark", "block_size_str", "size", "exec_policy"] + for k, g in data_grcuda.groupby(filter_df): + curr_data = data_cuda[functools.reduce(np.logical_and, [data_cuda[k_b] == k_a for k_a, k_b in zip(k[:-1], filter_df[:-1])])] + for k1, g1 in curr_data.groupby(["exec_policy"]): + mean_exec_time = np.mean(g1["computation_sec"]) + data_grcuda.at[g.index, "speedup_" + k1] = mean_exec_time / 
g["computation_sec"] + return data_grcuda + + +if __name__ == "__main__": + # input_date = "2021_10_03_12_30_18_grcuda_b5(new)_2GPU_noPrefetch_noStrAttach_allParents_dataLocality" + # data = load_data(input_date, skip_iter=5) + # data.to_csv("2GPU_allParents_vs_1GPU_Async.csv", sep = ';') + + res_list = [ + "2021_10_04_15_13_11_cuda_1gpu_v100", + "2021_10_04_15_15_29_cuda_2gpu_v100", + "2021_10_04_15_15_49_cuda_4gpu_v100", + "2021_10_04_15_33_23_cuda_8gpu_v100", + ] + res_cuda = load_data_cuda_multigpu(res_list, skip_iter=3) + res_cuda_grouped = res_cuda.groupby(["benchmark", "exec_policy", "num_gpu"]).mean().reset_index() + res_cuda.to_csv(os.path.join(DEFAULT_RES_CUDA_DIR, "res_cuda.csv"), index=False) + res_cuda_grouped.to_csv(os.path.join(DEFAULT_RES_CUDA_DIR, "res_cuda_grouped.csv"), index=False) + + # data3 = join_tables(data[data["benchmark"] == "b1"], data2) \ No newline at end of file diff --git a/projects/resources/python/plotting/multi_gpu_parse_nvprof_log.py b/projects/resources/python/plotting/multi_gpu_parse_nvprof_log.py new file mode 100644 index 00000000..d7ad937d --- /dev/null +++ b/projects/resources/python/plotting/multi_gpu_parse_nvprof_log.py @@ -0,0 +1,65 @@ +# -*- coding: utf-8 -*- +""" +Created on Thu Oct 7 20:46:10 2021 + +@author: albyr +""" + +import pandas as pd +import numpy as np +import os + +from compute_transfer_computation_overlap import read_nvprof_log, read_nvprof_log_a100 +from load_data import DEFAULT_RES_CUDA_DIR + +# V100; +GPU = "V100" +INPUT_FOLDER = "V100/nvprof_cuda/2021_10_07" + +# A100; +GPU = "A100" +INPUT_FOLDER = "A100/nvprof_cuda/2021_10_18" + +TRANSFERS = ["htod", "dtod", "dtoh"] + +def create_nondirectional_transfer_matrix(matrix: pd.DataFrame) -> pd.DataFrame: + devices = matrix.columns + transfer_matrix_nondirectional = matrix + matrix.transpose() + for i in range(len(devices)): + for j in range(i, len(devices)): + transfer_matrix_nondirectional.iloc[j, i] = 0 + return transfer_matrix_nondirectional + +if 
__name__ == "__main__": + + res = {} + res_summary = {} + + for f in os.listdir(os.path.join(DEFAULT_RES_CUDA_DIR, INPUT_FOLDER)): + if "transfer_matrix" in f: + continue + print(f"reading {f}") + if GPU == "V100": + data = read_nvprof_log(os.path.join(DEFAULT_RES_CUDA_DIR, INPUT_FOLDER, f)) + elif GPU == "A100": + data = read_nvprof_log_a100(os.path.join(DEFAULT_RES_CUDA_DIR, INPUT_FOLDER, f)) + else: + raise ValueError(f"Unknown GPU={GPU}") + + # Keep only memory transfer; + data = data[data["name"].isin(TRANSFERS)] + + data_grouped = data.groupby(["device_start", "device_end"])["transferred_data_byte"].sum().reset_index() + devices = sorted(list(set(data_grouped["device_start"].unique()).union(set(data_grouped["device_end"].unique())))) + transfer_matrix = np.zeros((len(devices), len(devices))) + transfer_matrix = pd.DataFrame(transfer_matrix, index=devices, columns=devices) + for i, r in data_grouped.iterrows(): + transfer_matrix.loc[r["device_start"]][r["device_end"]] += r["transferred_data_byte"] + transfer_matrix_nondirectional = create_nondirectional_transfer_matrix(transfer_matrix) + res[f] = data + res_summary[f] = transfer_matrix_nondirectional + + basename = os.path.splitext(f)[0] + basename = basename.replace("_gputrace", "").replace("m", "") + transfer_matrix.to_csv(os.path.join(DEFAULT_RES_CUDA_DIR, INPUT_FOLDER, f"{basename}_transfer_matrix.csv")) + \ No newline at end of file diff --git a/projects/resources/python/plotting/plot_memory_throughput.py b/projects/resources/python/plotting/plot_memory_throughput.py new file mode 100755 index 00000000..a23379e7 --- /dev/null +++ b/projects/resources/python/plotting/plot_memory_throughput.py @@ -0,0 +1,361 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Using nvprof we measure the total amount of bytes read/written by each kernel, +and analyse how the GPU memory throughput is affected by space-sharing. 
+Note that nvprof affects the kernel execution and limits the execution of concurrent kernels due to the +high overhead introduced by collecting memory access metrics for each kernel. +Instead, we measure the execution times obtained without metric collection +(so that nvprof's influence on the execution times is minimal) and combine them with memory access metrics collected in a separate run. +The assumption here is that the total amount of memory accesses is not significantly impacted by nvprof profiling, +and this evaluation is still useful to obtain performance insights. + +Created on Tue Jul 28 09:10:07 2020 + +@author: alberto.parravicini +""" + + +import pandas as pd +import json +import os +import numpy as np +from compute_transfer_computation_overlap import get_overlap, get_total_segment_set_length + +import seaborn as sns +import matplotlib.pyplot as plt +import matplotlib.gridspec as gridspec +from scipy.stats.mstats import gmean +from matplotlib.patches import Patch, Rectangle +from matplotlib.collections import PatchCollection, LineCollection +import matplotlib.lines as lines +from segretini_matplottini.src.plot_utils import COLORS, get_exp_label, get_ci_size, save_plot + + +DEFAULT_RES_DIR = "../../../../grcuda-data/results/scheduling_nvprof_log" + +INPUT_DATE = "2020_09_23_960" +OUTPUT_DATE = "2020_09_17" +PLOT_DIR = "../../../../grcuda-data/plots" + +BENCHMARK_NAMES = { + "b1": "VEC", + "b5": "B&S", + "b8": "IMG", + "b6": "ML", + "b7": "HITS", + "b10": "DL" + } +# ASYNC_POLICY_NAME = "async" # If parsing new results; +ASYNC_POLICY_NAME = "default" # If parsing older results; +POLICIES = ["sync", ASYNC_POLICY_NAME] +POLICIES_DICT = {ASYNC_POLICY_NAME: "Parallel Scheduler", "sync": "Serial Scheduler"} + +NVPROF_HEADER_NOMETRIC = ["start_ms", "duration_ms", "Grid X", "Grid Y", "Grid Z", "Block X", "Block Y", "Block Z", + "Registers Per Thread"," Static SMem", "Dynamic SMem", "Device", "Context", "Stream", + "transferred_data_byte", "Virtual Address", 
"name", "Correlation_ID"] +NVPROF_HEADER_NOMETRIC_FILTERED = NVPROF_HEADER_NOMETRIC[:2] + [NVPROF_HEADER_NOMETRIC[-2]] + +NVPROF_HEADER_METRIC = ["Device", "Context", "Stream", "name", "Correlation_ID", + "dram_read_throughput", "dram_write_throughput", "dram_read_bytes", "dram_write_bytes", + "l2_global_atomic_store_bytes", "l2_global_load_bytes", "l2_global_reduction_bytes", "l2_local_global_store_bytes", "l2_local_load_bytes", "l2_read_throughput", "l2_write_throughput", + "inst_executed", "ipc", "flop_count_dp", "flop_count_sp"] +NVPROF_HEADER_METRIC_FILTERED = [NVPROF_HEADER_METRIC[3]] + NVPROF_HEADER_METRIC[5:] + +OPERATIONS_TO_MERGE = set(["htod", "dtoh"]) + +NUM_ITER = 30 + +# Maximum memory bandiwth, in GB/s. of the GPU (currently: GTX 960); +MAX_GPU_BANDWIDTH = 112 +MAX_L2_GPU_BANDWIDTH = 450 # Not publicly known, estimated using nvvp; +GPU_CLOCK_HZ = 1_177_000_000 +GPU_NUM_SM = 8 + + +def load_data(b, p, files): + ############################## + # Process file with execution time; + ############################## + + input_file = os.path.join(DEFAULT_RES_DIR, INPUT_DATE, files_dict[(b, p, "nometric")]) + data_nometric = pd.read_csv(input_file, skiprows=5, names=NVPROF_HEADER_NOMETRIC) + + # Keep only a subset of columns; + data_nometric = data_nometric[NVPROF_HEADER_NOMETRIC_FILTERED] + + # Remove rows with NaN Duration; + data_nometric = data_nometric.dropna(subset=["duration_ms"]).reset_index(drop=True) + + # Convert start from seconds to milliseconds; + data_nometric["start_ms"] *= 1000 + + # Set the start of the computation equal to 0; + data_nometric["start_ms"] -= data_nometric["start_ms"].iloc[0] + + # Set the end of the computation; + data_nometric["end_ms"] = data_nometric["duration_ms"] + data_nometric["start_ms"] + + # Clean names of operations; + data_nometric["name"] = data_nometric["name"].replace({ + "[Unified Memory Memcpy HtoD]": "htod", + "[Unified Memory Memcpy DtoH]": "dtoh" + }) + + # Keep only kernel computations; + data_nometric = 
data_nometric[~data_nometric["name"].isin(["htod", "dtoh"])].reset_index(drop=True) + + # Keep just the name of kernels; + data_nometric["name"] = data_nometric["name"].apply(lambda x: x.split("(")[0]) + + ############################## + # Process file with memory access information; + ############################## + + input_file = os.path.join(DEFAULT_RES_DIR, INPUT_DATE, files_dict[(b, p, "metric")]) + data_metric = pd.read_csv(input_file, skiprows=6, names=NVPROF_HEADER_METRIC) + # Keep only a subset of columns; + data_metric = data_metric[NVPROF_HEADER_METRIC_FILTERED] + + # Keep only kernel computations; + data_metric["name"] = data_metric["name"].apply(lambda x: x.split("(")[0]) + + # Rename the "name" column to allow debugging after merging; + data_metric = data_metric.rename(columns={"name": "name_metric"}) + + # Turn bytes into GB; + data_metric["dram_read_bytes"] /= 2**30 + data_metric["dram_write_bytes"] /= 2**30 + data_metric["l2_global_atomic_store_bytes"] /= 2**30 + data_metric["l2_global_load_bytes"] /= 2**30 + data_metric["l2_global_reduction_bytes"] /= 2**30 + data_metric["l2_local_global_store_bytes"] /= 2**30 + data_metric["l2_local_load_bytes"] /= 2**30 + data_metric["total_flop"] = data_metric["flop_count_dp"] + data_metric["flop_count_sp"] + + data_metric["total_l2_read_bytes"] = data_metric["l2_global_load_bytes"] + data_metric["l2_local_load_bytes"] + data_metric["total_l2_write_bytes"] = data_metric["l2_global_atomic_store_bytes"] + data_metric["l2_global_reduction_bytes"] + data_metric["l2_local_global_store_bytes"] + + # Concatenate the 2 tables; + data = pd.concat([data_nometric, data_metric], axis=1) + + # Look for inconsistencies; + assert(len(data_metric) == len(data_nometric)) + # Note: this check can fail, as kernels with dependencies can be scheduled in different order from the sync kernels. 
+ # It doesn't matter for the memory throughput computation, as we consider the total execution time; + # assert((data["name"] == data["name_metric"]).all()) + + # Check if throughput is close to the one computed by nvprof, for debugging. + # This is relevant only for "sync" policies, as the execution times for the 2 tables are consistent; + data["estimated_read_througput"] = data["dram_read_bytes"] / (data["duration_ms"] / 1000) + data["estimated_l2_read_througput"] = data["total_l2_read_bytes"] / (data["duration_ms"] / 1000) + data["estimated_l2_write_througput"] = data["total_l2_write_bytes"] / (data["duration_ms"] / 1000) + data["gigaflops"] = (data["total_flop"] / 10**9) / (data["duration_ms"] / 1000) + + data["estimated_ipc"] = data["inst_executed"] / (GPU_CLOCK_HZ * (data["duration_ms"] / 1000)) / GPU_NUM_SM + + # Add index columns; + data["benchmark"] = b + data["policy"] = p + return data + + +def get_computation_time_with_overlap(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. 
+ By definition, a computation has 0 overlap with itself; + """ + curr_start = 0 + curr_end = 0 + total_duration = 0 + for i, r in data.iterrows(): + if r["start_ms"] < curr_end: + curr_end = r["end_ms"] + else: + # Found the end of a contiguous computation segment; + total_duration += curr_end - curr_start + curr_start = r["start_ms"] + curr_end = r["end_ms"] + + # Add the last computation; + total_duration += curr_end - curr_start + + return total_duration + + +def autolabel(ax, rects1, rects2): + """Attach a text label above each bar in *rects*, displaying its height.""" + for i, rect in enumerate(rects2): + height1 = rects1[i].get_height() + height2 = rect.get_height() + ax.annotate('{:.2f}x'.format(height2 / height1), + xy=(rect.get_x(), height2), + xytext=(0, 2), # 3 points vertical offset + textcoords="offset points", + ha='center', va='bottom', + fontsize=7) + +def barplot(data, ax, title, y_column, y_limit, annotation_title, y_ticks=6, y_tick_format=lambda l: f"{l:.2f}", baseline_annotation_format=lambda l: f"{l:.2f}"): + + # Obtain x values for the plot; + x = np.arange(len(data["benchmark"].unique())) + + # Obtain labels; + x_labels = [BENCHMARK_NAMES[l] for l in data["benchmark"].unique()] + + peach = "#fab086" + green = "#6cb77c" + palette = [peach, green] + edgecolor = "#2f2f2f" + + bar_width = 0.35 + + # Obtain y; + y_sync = data[data["policy"] == "sync"][y_column] + y_async = data[data["policy"] == ASYNC_POLICY_NAME][y_column] + + rects1 = ax.bar(x - bar_width / 2, y_sync, bar_width, label="sync", color=palette[0], edgecolor=edgecolor) + rects2 = ax.bar(x + bar_width / 2, y_async, bar_width, label=ASYNC_POLICY_NAME, color=palette[1], edgecolor=edgecolor) + + ax.set_xticks(x) + ax.set_xticklabels(x_labels, fontsize=8, va="center") + + # ax.set_ylim((0, 1.1 * summary["memory_throughput"].max())) + ax.set_ylim(y_limit) + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(y_ticks)) + ax.set_yticklabels(labels=[y_tick_format(l) for l in 
ax.get_yticks()], ha="right", fontsize=8) + ax.grid(True, axis="y") + + # ax.annotate(title, fontsize=9, x=.02, y=0.95, ha="left") + plt.suptitle("Hardware metrics for each\nbenchmark and execution policy,\nGTX 960", fontsize=14, x=.01, y=0.97, ha="left") + ax.annotate(title, xy=(0, 1.08), fontsize=10, ha="left", xycoords="axes fraction")#, xycoords="data", xytext=(0, 100), textcoords="offset points") + autolabel(ax, rects1, rects2) + + # Add baseline annotations; + for i, b in enumerate(BENCHMARK_NAMES): + position = x[i] + serial_throughput = summary[(summary["benchmark"] == b) & (summary["policy"] == "sync")][y_column].iloc[0] + if i == 0: + ax.annotate(annotation_title, xy=(0, 0), fontsize=9, ha="left", va="center", xycoords="data", xytext=(-32, -20), textcoords="offset points") + print((position - bar_width, -0.1)) + ax.annotate(baseline_annotation_format(serial_throughput), xy=(position - bar_width, 0), fontsize=9, ha="center", va="center", xycoords="data", color=palette[0], xytext=(7, -30), textcoords="offset points") + + # Legend; + labels = [POLICIES_DICT[p] for p in POLICIES] + custom_lines = [Patch(facecolor=palette[i], edgecolor="#2f2f2f", label=l) + for i, l in enumerate(labels)] + leg = fig.legend(custom_lines, labels, bbox_to_anchor=(1, 1), fontsize=10, ncol=1) + leg._legend_box.align = "left" + leg.get_frame().set_facecolor('white') + + +if __name__ == "__main__": + + files = os.listdir(os.path.join(DEFAULT_RES_DIR, INPUT_DATE)) + + # Associate each file to a key that represents its content; + files_dict = {tuple(file.split("_")[:3]): file for file in files} + + output_res = [] + for b in BENCHMARK_NAMES.keys(): + for p in POLICIES: + output_res += [load_data(b, p, files)] + + # Create a single table; + res = pd.concat(output_res, ignore_index=True) + # Sort columns; + res = res[list(res.columns[-2:]) + [res.columns[2]] + [res.columns[0]] + [res.columns[3]] + [res.columns[1]] + list(res.columns[5:-2])] + + # For each benchmark and policy, compute 
the total computation time; + summary_list = [] + for (b, p), group in res.groupby(by=["benchmark", "policy"], sort=False): + overlap_computation_time = get_computation_time_with_overlap(group) + + # Device memory; + total_memory_accessed = group["dram_read_bytes"].sum() + group["dram_write_bytes"].sum() + memory_throughput = total_memory_accessed / (overlap_computation_time / 1000) + + # L2 cache; + total_l2_accessed = group["total_l2_read_bytes"].sum() + group["total_l2_write_bytes"].sum() + l2_throughput = total_l2_accessed / (overlap_computation_time / 1000) + + # IPC; + total_instructions = group["inst_executed"].sum() + ipc = total_instructions / (GPU_CLOCK_HZ * (overlap_computation_time / 1000)) / GPU_NUM_SM + + # GigaFLOPS; + total_flop = group["total_flop"].sum() + gigaflops = (total_flop / 10**9) / (overlap_computation_time / 1000) + + summary_list += [[b, p, overlap_computation_time, total_memory_accessed, memory_throughput, memory_throughput / MAX_GPU_BANDWIDTH, l2_throughput, l2_throughput / MAX_L2_GPU_BANDWIDTH, ipc, gigaflops]] + + summary = pd.DataFrame(summary_list, columns=["benchmark", "policy", "duration_ms", "dram_accessed_GB", "memory_throughput", "max_memory_throughput_perc", "l2_throughput", "max_l2_throughput_perc", "ipc", "gigaflops"]) + + #%% Create barplot with memory throughput; + + sns.set_style("white", {"ytick.left": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['axes.titlepad'] = 25 + plt.rcParams['axes.labelpad'] = 9 + plt.rcParams['axes.titlesize'] = 22 + plt.rcParams['axes.labelsize'] = 14 + plt.rcParams['xtick.major.pad'] = 5 + + num_col = 2 + num_rows = 2 + + fig, axes = plt.subplots(num_rows, num_col, figsize=(2.4 * num_col, 2.4 * num_rows)) + plt.subplots_adjust(top=0.80, + bottom=0.10, + left=0.13, + right=.99, + hspace=0.6, + wspace=0.4) + + barplot(summary, axes[0, 0], "Device memory throughput", + "memory_throughput", (0, 50), "Serial throughput (GB/s):", y_ticks=6, y_tick_format=lambda 
l: f"{int(l)} GB/s", baseline_annotation_format=lambda l: f"{int(l)}") + barplot(summary, axes[0, 1], "L2 cache throughput", + "l2_throughput", (0, 250), "Serial throughput (GB/s):", y_ticks=6, y_tick_format=lambda l: f"{int(l)} GB/s", baseline_annotation_format=lambda l: f"{int(l)}") + barplot(summary, axes[1, 0], "IPC", + "ipc", (0, 1.75), "Serial IPC:", y_ticks=8, y_tick_format=lambda l: f"{l:.2f}", baseline_annotation_format=lambda l: f"{l:.2f}") + barplot(summary, axes[1, 1], "GFLOPS32/64", + "gigaflops", (0, 90), "GFLOPS32/64:", y_ticks=6, y_tick_format=lambda l: f"{int(l)}", baseline_annotation_format=lambda l: f"{int(l)}") + + save_plot(PLOT_DIR, "memory_throughput_{}.{}", OUTPUT_DATE) + + #%% + tmp = res[res["policy"] == "sync"].groupby(by=["benchmark", "policy", "name"]).mean() + tmp["ipc_fix"] = tmp["estimated_ipc"] / 8 + tmp["ipc_perc"] = ( tmp["ipc_fix"] - tmp["ipc"]) / tmp["ipc"] + + print(np.median(tmp["ipc_perc"])) diff --git a/projects/resources/python/plotting/plot_memory_throughput_turing.py b/projects/resources/python/plotting/plot_memory_throughput_turing.py new file mode 100644 index 00000000..f5b6f49c --- /dev/null +++ b/projects/resources/python/plotting/plot_memory_throughput_turing.py @@ -0,0 +1,402 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
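The `get_computation_time_with_overlap` helper defined above (and reused unchanged in the Turing script that follows) merges possibly overlapping kernel intervals into contiguous segments and sums their lengths. Below is a minimal standalone sketch of that interval-merging logic, with a hypothetical function name and timings; the `max()` guard is a small hardening over the version above, so that kernels fully nested inside the current segment do not shrink it. Rows are assumed sorted by `start_ms`:

```python
import pandas as pd

def total_busy_time_ms(data: pd.DataFrame) -> float:
    # Merge overlapping [start_ms, end_ms] kernel intervals into
    # contiguous segments and sum the segment lengths;
    curr_start = 0.0
    curr_end = 0.0
    total_duration = 0.0
    for _, r in data.iterrows():
        if r["start_ms"] < curr_end:
            # Kernel overlaps the current segment: extend it.
            # max() also covers kernels fully nested in the segment;
            curr_end = max(curr_end, r["end_ms"])
        else:
            # Gap found: close the current segment and open a new one;
            total_duration += curr_end - curr_start
            curr_start = r["start_ms"]
            curr_end = r["end_ms"]
    # Add the last open segment;
    total_duration += curr_end - curr_start
    return total_duration

# Hypothetical timings: two overlapping kernels, then one after a gap;
kernels = pd.DataFrame({"start_ms": [0.0, 5.0, 20.0],
                        "end_ms": [10.0, 15.0, 25.0]})
print(total_busy_time_ms(kernels))  # → 20.0
```

This overlap-aware duration is what both scripts divide the total DRAM/L2 traffic and FLOP counts by, so that aggregate throughput stays comparable between the serial scheduler (no overlap) and the parallel scheduler (overlapping kernels).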
+ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Tue Jul 28 09:10:07 2020 + +@author: alberto.parravicini +""" + + +import pandas as pd +import json +import os +import numpy as np +from compute_transfer_computation_overlap import get_overlap, get_total_segment_set_length + +import seaborn as sns +import matplotlib.pyplot as plt +import matplotlib.gridspec as gridspec +from scipy.stats.mstats import gmean +from matplotlib.patches import Patch, Rectangle +from matplotlib.collections import PatchCollection, LineCollection +import matplotlib.lines as lines +from segretini_matplottini.src.plot_utils import COLORS, get_exp_label, get_ci_size, save_plot + + +DEFAULT_RES_DIR = "../../../../grcuda-data/results/scheduling_nvprof_log" + +INPUT_DATE = "2020_10_11_1660_2" +OUTPUT_DATE = "2020_10_11" +PLOT_DIR = "../../../../grcuda-data/plots" + +BENCHMARK_NAMES = { + "b1": "VEC", + "b5": "B&S", + "b8": "IMG", + "b6": "ML", + "b7": "HITS", + "b10": "DL" + } +# ASYNC_POLICY_NAME = "async" # If parsing new results; +ASYNC_POLICY_NAME = "default" # If parsing older results; +POLICIES = ["sync", ASYNC_POLICY_NAME] +POLICIES_DICT = {ASYNC_POLICY_NAME: "Parallel Scheduler", "sync": "Serial Scheduler"} + +NVPROF_HEADER_NOMETRIC = ["start_ms", "duration_ms", "Grid X", "Grid Y", "Grid Z", "Block X", "Block Y", "Block Z", + "Registers Per Thread"," Static SMem", "Dynamic SMem", "Device", "Context", "Stream", + "transferred_data_byte", "Virtual Address", "name", "Correlation_ID"] +NVPROF_HEADER_NOMETRIC_FILTERED = NVPROF_HEADER_NOMETRIC[:2] + [NVPROF_HEADER_NOMETRIC[-2]] + +# NVPROF_HEADER_METRIC = ["Device", "Context", "Stream", "name", "Correlation_ID", +# "dram_read_throughput", "dram_write_throughput", "dram_read_bytes", "dram_write_bytes", +# "l2_global_atomic_store_bytes", "l2_global_load_bytes", "l2_global_reduction_bytes", "l2_local_global_store_bytes", "l2_local_load_bytes", "l2_read_throughput", "l2_write_throughput", +# "inst_executed", "ipc", "flop_count_dp", 
"flop_count_sp"] +# NVPROF_HEADER_METRIC_FILTERED = [NVPROF_HEADER_METRIC[3]] + NVPROF_HEADER_METRIC[5:] + +NVPROF_HEADER_METRIC = ["ID", "Process ID", "Process Name", "Host Name", "Kernel Name", "Kernel Time", "Context", "Stream", "Section Name", "Metric Name", "Metric Unit", "Metric Value"] +NVPROF_HEADER_METRIC_FILTERED = [NVPROF_HEADER_METRIC[0], NVPROF_HEADER_METRIC[4], NVPROF_HEADER_METRIC[-3], NVPROF_HEADER_METRIC[-1]] + +OPERATIONS_TO_MERGE = set(["htod", "dtoh"]) + +NUM_ITER = 30 + +# Maximum memory bandiwth, in GB/s. of the GPU (currently: GTX 1660); +MAX_GPU_BANDWIDTH = 336 +MAX_L2_GPU_BANDWIDTH = 450 # Not publicly known, estimated using nvvp; +GPU_CLOCK_HZ = 1_785_000_000 +GPU_NUM_SM = 22 + +def load_data(b, p, files): + + # Associate each file to a key that represents its content; + files_dict = {tuple(file.split(".")[0].split("_")[:4]): file for file in files} + + ############################## + # Process file with execution time; + ############################## + + input_file = os.path.join(DEFAULT_RES_DIR, INPUT_DATE, files_dict[(b, p, "nometric", "True")]) + data_nometric = pd.read_csv(input_file, skiprows=5, names=NVPROF_HEADER_NOMETRIC) + header = pd.read_csv(input_file, skiprows=3, nrows=1) + start_unit = header.iloc[0, 0] + duration_unit = header.iloc[0, 1] + + # Keep only a subset of columns; + data_nometric = data_nometric[NVPROF_HEADER_NOMETRIC_FILTERED] + + # Remove rows with NaN Duration; + data_nometric = data_nometric.dropna(subset=["duration_ms"]).reset_index(drop=True) + + # Convert start and duration from seconds to milliseconds; + if start_unit == "s": + data_nometric["start_ms"] *= 1000 + elif start_unit == "us": + data_nometric["start_ms"] /= 1000 + if duration_unit == "s": + data_nometric["duration_ms"] *= 1000 + elif duration_unit == "us": + data_nometric["duration_ms"] /= 1000 + + # Set the start of the computation equal to 0; + data_nometric["start_ms"] -= data_nometric["start_ms"].iloc[0] + + # Set the end of the 
computation; + data_nometric["end_ms"] = data_nometric["duration_ms"] + data_nometric["start_ms"] + + # Clean names of operations; + data_nometric["name"] = data_nometric["name"].replace({ + "[Unified Memory Memcpy HtoD]": "htod", + "[Unified Memory Memcpy DtoH]": "dtoh", + "[Unified Memory GPU page faults]": "pagefault", + "[Unified Memory page throttle]": "throttle" + }) + + # Keep only kernel computations; + data_nometric = data_nometric[~data_nometric["name"].isin(["htod", "dtoh", "pagefault", "throttle"])].reset_index(drop=True) + + # Keep just the name of kernels; + data_nometric["name"] = data_nometric["name"].apply(lambda x: x.split("(")[0]) + + ############################## + # Process file with memory access information; + ############################## + + input_file = os.path.join(DEFAULT_RES_DIR, INPUT_DATE, files_dict[(b, p, "metric", "True" if p == ASYNC_POLICY_NAME else "False")]) + print(b, p) + data_metric = pd.read_csv(input_file, skiprows=3, names=NVPROF_HEADER_METRIC) + # Keep only a subset of columns; + data_metric = data_metric[NVPROF_HEADER_METRIC_FILTERED] + data_metric = data_metric.fillna(0) + + # Keep only kernel computations; + data_metric["Kernel Name"] = data_metric["Kernel Name"].apply(lambda x: x.split("(")[0]) + # Rename the "name" column to allow debugging after merging; + data_metric = data_metric.rename(columns={"Kernel Name": "name_metric"}) + data_metric["Metric Value"] = data_metric["Metric Value"].str.replace(",", "").astype(float) + + # Pivot the table to obtain metrics for each kernel; + data_metric = pd.pivot_table(data_metric, values="Metric Value", index=["ID", "name_metric"], columns="Metric Name").reset_index() + + # Create a new table with derived metrics; + data_metric_2 = data_metric[["name_metric"]].copy() + data_metric_2["dram_read_bytes"] = data_metric["dram__bytes_read.sum"] + data_metric_2["dram_write_bytes"] = data_metric["dram__bytes_write.sum"] + data_metric_2["l2_global_atomic_store_bytes"] = 
data_metric["lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_atom.sum"] + data_metric_2["l2_global_load_bytes"] = data_metric["lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_ld.sum"] + data_metric_2["l2_global_reduction_bytes"] = 0 + data_metric_2["l2_local_global_store_bytes"] = data_metric["lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_st.sum"] + \ + data_metric["lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_st.sum"] + data_metric_2["l2_local_load_bytes"] = data_metric["lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_local_op_ld.sum"] + data_metric_2["flop_count_dp"] = data_metric["smsp__sass_thread_inst_executed_op_dadd_pred_on.sum"] + \ + data_metric["smsp__sass_thread_inst_executed_op_dmul_pred_on.sum"] + \ + data_metric["smsp__sass_thread_inst_executed_op_dfma_pred_on.sum"] * 2 + data_metric_2["flop_count_sp"] = data_metric["smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"] + \ + data_metric["smsp__sass_thread_inst_executed_op_fmul_pred_on.sum"] + \ + data_metric["smsp__sass_thread_inst_executed_op_ffma_pred_on.sum"] * 2 + data_metric_2["inst_executed"] = data_metric["smsp__inst_executed.sum"] + data_metric_2["ipc"] = data_metric["smsp__inst_executed.avg.per_cycle_active"] + + # Turn bytes into GB; + data_metric_2["dram_read_bytes"] /= 2**30 + data_metric_2["dram_write_bytes"] /= 2**30 + + data_metric_2["l2_global_atomic_store_bytes"] /= 2**30 + data_metric_2["l2_global_load_bytes"] /= 2**30 + data_metric_2["l2_global_reduction_bytes"] /= 2**30 + + data_metric_2["l2_local_global_store_bytes"] /= 2**30 + data_metric_2["l2_local_load_bytes"] /= 2**30 + + data_metric_2["total_flop"] = data_metric_2["flop_count_dp"] + data_metric_2["flop_count_sp"] + + data_metric_2["total_l2_read_bytes"] = data_metric_2["l2_global_load_bytes"] + data_metric_2["l2_local_load_bytes"] + data_metric_2["total_l2_write_bytes"] = data_metric_2["l2_global_atomic_store_bytes"] + data_metric_2["l2_global_reduction_bytes"] + 
data_metric_2["l2_local_global_store_bytes"] + + # Concatenate the 2 tables; + data = pd.concat([data_nometric, data_metric_2], axis=1) + + # Look for inconsistencies; + assert(len(data_metric_2) == len(data_nometric)) + # Note: this check can fail, as kernels with dependencies can be scheduled in different order from the sync kernels. + # It doesn't matter for the memory throughput computation, as we consider the total execution time; + # assert((data["name"] == data["name_metric"]).all()) + + # Check if throughput is close to the one computed by nvprof, for debugging. + # This is relevant only for "sync" policies, as the execution times for the 2 tables are consistent; + data["estimated_read_througput"] = data["dram_read_bytes"] / (data["duration_ms"] / 1000) + data["estimated_write_througput"] = data["dram_write_bytes"] / (data["duration_ms"] / 1000) + data["estimated_memory_througput"] = data["estimated_read_througput"] + data["estimated_write_througput"] + data["estimated_l2_read_througput"] = data["total_l2_read_bytes"] / (data["duration_ms"] / 1000) + data["estimated_l2_write_througput"] = data["total_l2_write_bytes"] / (data["duration_ms"] / 1000) + data["estimated_l2_througput"] = data["estimated_l2_read_througput"] + data["estimated_l2_write_througput"] + data["gigaflops"] = (data["total_flop"] / 10**9) / (data["duration_ms"] / 1000) + + data["estimated_ipc"] = data["inst_executed"] / (GPU_CLOCK_HZ * (data["duration_ms"] / 1000)) / GPU_NUM_SM + + # Add index columns; + data["benchmark"] = b + data["policy"] = p + return data + + +def get_computation_time_with_overlap(data): + """ + For each computation, look at the computations before it and compute the length of the overlap with them, in seconds. 
+ By definition, a computation has 0 overlap with itself; + """ + curr_start = 0 + curr_end = 0 + total_duration = 0 + for i, r in data.iterrows(): + if r["start_ms"] < curr_end: + curr_end = r["end_ms"] + else: + # Found the end of a contiguous computation segment; + total_duration += curr_end - curr_start + curr_start = r["start_ms"] + curr_end = r["end_ms"] + + # Add the last computation; + total_duration += curr_end - curr_start + + return total_duration + + +def autolabel(ax, rects1, rects2): + """Attach a text label above each bar in *rects*, displaying its height.""" + for i, rect in enumerate(rects2): + height1 = rects1[i].get_height() + height2 = rect.get_height() + # ax.annotate('{:.2f}x'.format(height2 / height1), + ax.annotate('{:.2f}x'.format(max(height2 / height1, 1)), + xy=(rect.get_x(), height2), + xytext=(0, 2), # 3 points vertical offset + textcoords="offset points", + ha='center', va='bottom', + fontsize=7) + +def barplot(data, ax, title, y_column, y_limit, annotation_title, y_ticks=6, y_tick_format=lambda l: f"{l:.2f}", baseline_annotation_format=lambda l: f"{l:.2f}"): + + # Obtain x values for the plot; + x = np.arange(len(data["benchmark"].unique())) + + # Obtain labels; + x_labels = [BENCHMARK_NAMES[l] for l in data["benchmark"].unique()] + + peach = "#fab086" + green = "#6cb77c" + palette = [peach, green] + edgecolor = "#2f2f2f" + + bar_width = 0.35 + + # Obtain y; + y_sync = data[data["policy"] == "sync"][y_column] + y_async = data[data["policy"] == ASYNC_POLICY_NAME][y_column] + + rects1 = ax.bar(x - bar_width / 2, y_sync, bar_width, label="sync", color=palette[0], edgecolor=edgecolor) + rects2 = ax.bar(x + bar_width / 2, y_async, bar_width, label=ASYNC_POLICY_NAME, color=palette[1], edgecolor=edgecolor) + + ax.set_xticks(x) + ax.set_xticklabels(x_labels, fontsize=8, va="center") + + # ax.set_ylim((0, 1.1 * summary["memory_throughput"].max())) + ax.set_ylim(y_limit) + # Set the y ticks; + 
ax.yaxis.set_major_locator(plt.LinearLocator(y_ticks)) + ax.set_yticklabels(labels=[y_tick_format(l) for l in ax.get_yticks()], ha="right", fontsize=8) + ax.grid(True, axis="y") + + # ax.annotate(title, fontsize=9, x=.02, y=0.95, ha="left") + plt.suptitle("Hardware metrics for each\nbenchmark and execution policy,\nGTX 1660 Super", fontsize=14, x=.01, y=0.99, ha="left") + ax.annotate(title, xy=(0, 1.08), fontsize=10, ha="left", xycoords="axes fraction")#, xycoords="data", xytext=(0, 100), textcoords="offset points") + autolabel(ax, rects1, rects2) + + # Add baseline annotations; + for i, b in enumerate(BENCHMARK_NAMES): + position = x[i] + serial_throughput = data[(data["benchmark"] == b) & (data["policy"] == "sync")][y_column].iloc[0] + if i == 0: + ax.annotate(annotation_title, xy=(0, 0), fontsize=9, ha="left", va="center", xycoords="data", xytext=(-32, -20), textcoords="offset points") + ax.annotate(baseline_annotation_format(serial_throughput), xy=(position - bar_width, 0), fontsize=9, ha="center", va="center", xycoords="data", color=palette[0], xytext=(7, -30), textcoords="offset points") + + # Legend; + labels = [POLICIES_DICT[p] for p in POLICIES] + custom_lines = [Patch(facecolor=palette[i], edgecolor="#2f2f2f", label=l) + for i, l in enumerate(labels)] + leg = ax.get_figure().legend(custom_lines, labels, bbox_to_anchor=(1, 1), fontsize=10, ncol=1) + leg._legend_box.align = "left" + leg.get_frame().set_facecolor('white') + + +if __name__ == "__main__": + + files = os.listdir(os.path.join(DEFAULT_RES_DIR, INPUT_DATE)) + + output_res = [] + for b in BENCHMARK_NAMES.keys(): + for p in POLICIES: + output_res += [load_data(b, p, files)] + + # Create a single table; + res = pd.concat(output_res, ignore_index=True) + # Sort columns; + res = res[list(res.columns[-2:]) + [res.columns[2]] + [res.columns[0]] + [res.columns[3]] + [res.columns[1]] + list(res.columns[5:-2])] + + # For each benchmark and policy, compute the total computation time; + total = [] + summary_list 
= [] + for (b, p), group in res.groupby(by=["benchmark", "policy"], sort=False): + total += [group] + overlap_computation_time = get_computation_time_with_overlap(group) + print(b, p, f"{overlap_computation_time:.2f}") + + # Device memory; + total_memory_accessed = group["dram_read_bytes"].sum() + group["dram_write_bytes"].sum() + memory_throughput = total_memory_accessed / (overlap_computation_time / 1000) + + # L2 cache; + total_l2_accessed = group["total_l2_read_bytes"].sum() + group["total_l2_write_bytes"].sum() + l2_throughput = total_l2_accessed / (overlap_computation_time / 1000) + + # IPC; + total_instructions = group["inst_executed"].sum() + ipc = total_instructions / (GPU_CLOCK_HZ * (overlap_computation_time / 1000)) / GPU_NUM_SM + + # GigaFLOPS; + total_flop = group["total_flop"].sum() + gigaflops = (total_flop / 10**9) / (overlap_computation_time / 1000) + + print(total_memory_accessed, total_l2_accessed) + + summary_list += [[b, p, overlap_computation_time, total_memory_accessed, memory_throughput, memory_throughput / MAX_GPU_BANDWIDTH, l2_throughput, l2_throughput / MAX_L2_GPU_BANDWIDTH, ipc, gigaflops]] + data = pd.concat(total) + summary = pd.DataFrame(summary_list, columns=["benchmark", "policy", "duration_ms", "dram_accessed_GB", "memory_throughput", "max_memory_throughput_perc", "l2_throughput", "max_l2_throughput_perc", "ipc", "gigaflops"]) + + #%% Create barplot with memory throughput; + + sns.set_style("white", {"ytick.left": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['axes.titlepad'] = 25 + plt.rcParams['axes.labelpad'] = 9 + plt.rcParams['axes.titlesize'] = 22 + plt.rcParams['axes.labelsize'] = 14 + plt.rcParams['xtick.major.pad'] = 5 + + num_col = 2 + num_rows = 2 + + fig, axes = plt.subplots(num_rows, num_col, figsize=(2.4 * num_col, 2.4 * num_rows)) + plt.subplots_adjust(top=0.80, + bottom=0.10, + left=0.13, + right=.99, + hspace=0.6, + wspace=0.4) + + barplot(summary, axes[0, 0], "Device memory 
throughput", + "memory_throughput", (0, 120), "Serial throughput (GB/s):", y_ticks=7, y_tick_format=lambda l: f"{int(l)} GB/s", baseline_annotation_format=lambda l: f"{int(l)}") + barplot(summary, axes[0, 1], "L2 cache throughput", + "l2_throughput", (0, 150), "Serial throughput (GB/s):", y_ticks=6, y_tick_format=lambda l: f"{int(l)} GB/s", baseline_annotation_format=lambda l: f"{int(l)}") + barplot(summary, axes[1, 0], "IPC", + "ipc", (0, 1.0), "Serial IPC:", y_ticks=6, y_tick_format=lambda l: f"{l:.2f}", baseline_annotation_format=lambda l: f"{l:.2f}") + barplot(summary, axes[1, 1], "GFLOPS32/64", + "gigaflops", (0, 120), "GFLOPS32/64:", y_ticks=7, y_tick_format=lambda l: f"{int(l)}", baseline_annotation_format=lambda l: f"{int(l)}") + + save_plot(PLOT_DIR, "memory_throughput_{}.{}", OUTPUT_DATE) + + #%% + tmp = res[res["policy"] == "sync"].groupby(by=["benchmark", "policy", "name"]).mean() + tmp["ipc_fix"] = tmp["estimated_ipc"] / 22 + tmp["ipc_perc"] = ( tmp["ipc_fix"] - tmp["ipc"]) / tmp["ipc"] + + print(np.median(tmp["ipc_perc"])) diff --git a/projects/resources/python/plotting/plot_multi_gpu_bandwidth_heatmap.py b/projects/resources/python/plotting/plot_multi_gpu_bandwidth_heatmap.py new file mode 100644 index 00000000..0ddc7bde --- /dev/null +++ b/projects/resources/python/plotting/plot_multi_gpu_bandwidth_heatmap.py @@ -0,0 +1,205 @@ +# -*- coding: utf-8 -*- +""" +Created on Wed Jan 19 16:00:58 2022 + +@author: albyr +""" + +import pandas as pd +import numpy as np +import os +import seaborn as sns +import matplotlib.pyplot as plt +import matplotlib.gridspec as gridspec +from segretini_matplottini.src.plot_utils import COLORS, get_exp_label, get_ci_size, save_plot +from matplotlib.colors import LinearSegmentedColormap, ListedColormap, to_rgba +from load_data import PLOT_DIR + +OUTPUT_DATE = "2022_01_19" + +DEFAULT_RES_DIR = "../../connection_graph/datasets" +NUM_GPU = 8 +DATASET = "connection_graph_{}_{}.csv" +V100 = "V100" +V100_DATA = 
DATASET.format(NUM_GPU, V100.lower()) + A100 = "A100" + A100_DATA = DATASET.format(NUM_GPU, A100.lower()) + + ############################## + ############################## + + def plot_heatmap(data_dict: dict) -> plt.Figure: + + def plot_gpu(data: pd.DataFrame, ax_gpu: plt.Axes, column: int, ax_cbar: plt.Axes=None): + # Do not plot CPU, we plot it separately; + data_gpu = data[data.index != "CPU"] + # Mask the lower triangular part of the matrix, excluding the main diagonal, so it is not shown; + mask = np.zeros_like(data_gpu) + mask[np.tril_indices_from(mask)] = True + mask ^= np.eye(NUM_GPU).astype(bool) + + # Main heatmap plot; + ax_gpu = sns.heatmap(data_gpu, square=True, mask=mask, vmin=0, vmax=max_bandwidth, linewidth=LINEWIDTH, + linecolor=linecolors, cmap=custom_cm, ax=ax_gpu, cbar_ax=ax_cbar, cbar=ax_cbar is not None, + cbar_kws={"ticks": [0] + sorted_steps_colorbar}) + # Add hatches to the main diagonal (https://stackoverflow.com/questions/55285013/adding-hatches-to-seaborn-heatmap-plot); + x = np.arange(len(data_gpu.columns) + 1) + y = np.arange(len(data_gpu.index) + 1) + zm = np.ma.masked_less(data_gpu.values, 200) + ax_gpu.pcolor(x, y, zm, hatch="//" * 3, alpha=0.0) + + # Add borders to the plot; + sns.despine(ax=ax_gpu, top=False, right=False) + # Hide axis labels; + ax_gpu.set(xlabel=None, ylabel=None) + # Hide tick labels; + ax_gpu.tick_params(labelbottom=False, top=False, labelsize=FONTSIZE, pad=2) + ax_gpu.set_yticks([i + 0.5 for i in range(NUM_GPU)]) + ax_gpu.set_yticklabels([f"GPU{i}" for i in range(NUM_GPU)]) + + # Dotted lines from left to main diagonal; + for i in range(1, NUM_GPU): + ax_gpu.axhline(i + 0.5, xmin=0, xmax=i / NUM_GPU, color="#2f2f2f", linewidth=1, linestyle=":") + + # Customize colorbar; + if ax_cbar is not None: + # Add border around colorbar; + cbar = ax_gpu.collections[0].colorbar + for spine in cbar.ax.spines.values(): + spine.set(visible=True, linewidth=LINEWIDTH, edgecolor="black") + # Customize labels of colorbar; + 
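The tril/eye mask built in `plot_gpu` above (hide the strictly lower triangle, keep the main diagonal visible) can be checked in isolation. A small sketch, with an `np.arange` matrix standing in for the bandwidth matrix and an explicit boolean dtype:

```python
import numpy as np

NUM_GPU = 8  # same value as in the script above

# Stand-in for the pivoted bandwidth matrix (integer, like the real data);
data = np.arange(NUM_GPU * NUM_GPU).reshape(NUM_GPU, NUM_GPU)

# Mark the lower triangle (including the diagonal) as hidden,
# then flip the diagonal back to visible, as done before sns.heatmap;
mask = np.zeros_like(data, dtype=bool)
mask[np.tril_indices_from(mask)] = True
mask ^= np.eye(NUM_GPU, dtype=bool)

assert not mask.diagonal().any()  # main diagonal stays visible
assert mask[NUM_GPU - 1, 0]       # strictly lower triangle is hidden
assert not mask[0, NUM_GPU - 1]   # upper triangle is shown
```

Note that this relies on the pivoted matrix having an integer dtype: `np.zeros_like` copies the input dtype, and the in-place `^=` with a boolean array works for integers and booleans but would raise a `TypeError` on a float matrix.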
cbar.ax.set_yticklabels([f"{x}" for x in [0] + sorted_steps_colorbar]) + cbar.ax.tick_params(labelsize=FONTSIZE, pad=1, size=2) + cbar.ax.annotate("GB/s", xy=(0, 1.02), fontsize=FONTSIZE, ha="left", xycoords="axes fraction", color="#2f2f2f") + return ax_gpu, ax_cbar + + def plot_cpu(data: pd.DataFrame, ax: plt.Axes, column: int): + # Draw the heatmap for the CPU; + data_cpu = data[data.index == "CPU"] + ax_cpu = sns.heatmap(data_cpu, square=True, vmin=0, vmax=max_bandwidth, linewidth=LINEWIDTH, + linecolor=linecolors, cmap=custom_cm, ax=ax, cbar=False) + # Put x-tick labels on top; + ax_cpu.xaxis.tick_top() + ax_cpu.xaxis.set_label_position("top") + # Hide axis labels; + ax_cpu.set(xlabel=None, ylabel=None) + # Show x-tick labels; + ax_cpu.set_yticks([0.5]) + ax_cpu.set_yticklabels(["CPU"]) + ax_cpu.set_xticks([i + 0.5 for i in range(NUM_GPU)]) + ax_cpu.set_xticklabels([f"G{i}" for i in range(NUM_GPU)]) + ax_cpu.tick_params(labeltop=True, top=True, pad=0.1, labelsize=FONTSIZE) + + # Draw lines between CPU heatmap and GPU heatmap; + for i in range(0, NUM_GPU): + ax_cpu.axvline(i + 0.5, ymin=0, ymax=-2, color="#2f2f2f", linewidth=1, linestyle=":", clip_on=False) + + # Add tree above GPUs to show CPU interconnection; + base = 1.7 + for i in range(0, NUM_GPU): + ax_cpu.axvline(i + 0.5, ymin=base, ymax=base + 0.3, color="#2f2f2f", linewidth=0.5, linestyle="-", clip_on=False) + for i in range(0, NUM_GPU, 2): + ax_cpu.axhline(-1, xmin=(i + 0.5) / NUM_GPU, xmax=(i + 1.5) / NUM_GPU, color="#2f2f2f", linewidth=0.5, linestyle="-", clip_on=False, zorder=89) + for i in range(0, NUM_GPU, 2): + ax_cpu.axvline(i + 1, ymin=base + 0.3, ymax=base + 0.6, color="#2f2f2f", linewidth=0.5, linestyle="-", clip_on=False) + for i in range(0, NUM_GPU, 4): + ax_cpu.axhline(-1.3, xmin=(i + 1) / NUM_GPU, xmax=(i + 3) / NUM_GPU, color="#2f2f2f", linewidth=0.5, linestyle="-", clip_on=False, zorder=89) + ax.annotate(f"PCIe Tree {i // 4}", xy=((i + 2) / NUM_GPU, 2.4), fontsize=FONTSIZE, ha="center", 
color="#2f2f2f", clip_on=False, xycoords="axes fraction") + return ax_cpu + + ################# + # Preprocessing # + ################# + + # Obtain maximum of the matrix, excluding the main diagonal; + max_bandwidth = 0 + for d in data_dict.values(): + data_tmp = d[d.index != "CPU"] + max_bandwidth = max(max_bandwidth, (data_tmp.to_numpy() - np.eye(NUM_GPU) * data_tmp.to_numpy().diagonal()).max()) + + # Black outline for non-empty cells, else white; + linecolors = ["#2f2f2f" if i <= j else (0, 0, 0, 0) for i in range(NUM_GPU) for j in range(NUM_GPU)] + # Black and white colormap, from black to white (https://stackoverflow.com/questions/58597226/how-to-customize-the-colorbar-of-a-heatmap); + num_colors = 200 + cm = LinearSegmentedColormap.from_list("gray-custom", ["0.2", "white"], N=num_colors) + custom_colors = np.array([list(cm(i)) for i in np.linspace(0, 1, num_colors)]) + # Add discrete steps, including the CPU; + values = set() + for d in data_dict.values(): + values = values.union(set(d.to_numpy().reshape(-1))) + sorted_steps_colorbar = sorted([c for c in values if c <= max_bandwidth], reverse=True) + for c in sorted_steps_colorbar: + custom_colors[:int(num_colors * c / max_bandwidth) + 1, :] = cm(c / max_bandwidth) + custom_cm = ListedColormap(custom_colors) + + ############## + # Setup plot # + ############## + FONTSIZE = 4 + LINEWIDTH = 0.5 + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.top": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams["hatch.linewidth"] = 0.2 + plt.rcParams["axes.linewidth"] = LINEWIDTH + + # 2 x 2 as we draw the CPU heatmap in the top row, https://matplotlib.org/stable/gallery/subplots_axes_and_figures/gridspec_and_subplots.html#sphx-glr-gallery-subplots-axes-and-figures-gridspec-and-subplots-py; + fig, axes = plt.subplots(2, 3, sharex="col", figsize=(3.34, 1.6), dpi=200, + gridspec_kw={"width_ratios": [100, 100, 8], "height_ratios": [12.5, 100]}) + gs = axes[0, 
2].get_gridspec() + # Remove the existing axes in the right column; + for ax in axes[0:, -1]: + ax.remove() + plt.subplots_adjust(top=0.86, + bottom=0.08, + left=0.05, + right=0.95, + hspace=0.2, + wspace=0.0) + # Create a large axis; + ax_cbar = fig.add_subplot(gs[0:, 2]) + + for i, (gpu, data) in enumerate(data_dict.items()): + ax_gpu = axes[1, i] + ax_cpu = axes[0, i] + # GPU; + ax_gpu, ax_cbar = plot_gpu(data, ax_gpu, i, ax_cbar if i == 0 else None) + # CPU; + ax_cpu = plot_cpu(data, ax_cpu, i) + # GPU Label; + ax_gpu.annotate(f"{gpu}", xy=(0.5, -0.1), fontsize=FONTSIZE + 2, ha="center", color="#2f2f2f", clip_on=False, xycoords="axes fraction") + + return fig + +############################## +############################## + +if __name__ == "__main__": + + d = {} + for g in [V100, A100]: + + # Read data (TODO: use A100 data); + data = pd.read_csv(os.path.join(DEFAULT_RES_DIR, DATASET.format(NUM_GPU, V100.lower())), names=["from", "to", "bandwidth"], skiprows=1) + + # Mock data for the A100 (non-CPU interconnection is 2x faster); + if g == A100: + data.loc[data["from"] != -1, "bandwidth"] *= 2 + + # Round to integer; + data["bandwidth"] = data["bandwidth"].astype(int) + for c in ["from", "to"]: + # Replace "-1" with CPU and other numbers with the GPU name; + data[c].replace({-1: "CPU", **{i: f"GPU{i}" for i in range(NUM_GPU)}}, inplace=True) + # Use categorical labels for devices; + data[c] = pd.Categorical(data[c], categories=["CPU"] + [f"GPU{i}" for i in range(NUM_GPU)], ordered=True) + # Sort values; + data.sort_values(["from", "to"], inplace=True) + # Turn the dataframe into a matrix; + data_matrix = data.pivot(index="from", columns="to", values="bandwidth") + d[g] = data_matrix + + # Plot heatmap; + fig = plot_heatmap(d) + save_plot(PLOT_DIR, "bandwidth_gpus" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + diff --git a/projects/resources/python/plotting/plot_multi_gpu_cuda_exec_time.py b/projects/resources/python/plotting/plot_multi_gpu_cuda_exec_time.py new file 
mode 100644 index 00000000..268f9f7b --- /dev/null +++ b/projects/resources/python/plotting/plot_multi_gpu_cuda_exec_time.py @@ -0,0 +1,455 @@ +# -*- coding: utf-8 -*- +""" +Created on Wed Oct 6 20:51:56 2021 + +@author: albyr +""" +#%% +import math +import os + +import matplotlib.gridspec as gridspec +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import seaborn as sns +from matplotlib.patches import Patch +from matplotlib.transforms import blended_transform_factory + +from load_data import PLOT_DIR, load_data_cuda_multigpu +from segretini_matplottini.src.plot_utils import (add_labels, PALETTE_OG, PALETTE_G3, + get_exp_label, save_plot, get_upper_ci_size) + +############################## +############################## + +OUTPUT_DATE = "2022_01_20" + +# V100; +V100 = "V100" +V100_RES_FOLDERS = [ + # "2021_10_04_15_13_11_cuda_1gpu_v100", + # "2021_10_04_15_15_29_cuda_2gpu_v100", + # "2021_10_04_15_15_49_cuda_4gpu_v100", + "2022_01_16_18_09_04_cuda_1-2gpu_v100", + "2022_01_16_18_17_05_cuda_4gpu_v100", + "2021_10_04_15_33_23_cuda_8gpu_v100", + ] + +# A100; +A100 = "A100" +A100_RES_FOLDERS = [ + "2021_10_18_11_50_56_cuda_1gpu_a100", + "2021_10_18_12_57_50_cuda_2gpu_a100", + "2021_10_18_13_21_05_cuda_4gpu_a100", + "2021_10_18_13_44_18_cuda_8gpu_a100", + ] + +############################## +############################## + +def plot_speedup_bars(data_in, + gpu, + speedup_column="speedup", + baseline_is_async: bool=True, + keep_only_max_size: bool=False, + ylabel: str="Speedup", + legend_title: str="Baseline: ASYNC, 1 GPU", + legend_baseline_label: str=None, + ymax: float=6, + yticks: int=7): + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.bottom": False}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['hatch.linewidth'] = 0.3 + plt.rcParams['axes.labelpad'] = 5 + plt.rcParams['xtick.major.pad'] = 4.2 + plt.rcParams['ytick.major.pad'] = 1 + plt.rcParams['axes.linewidth'] = 0.5 + + fig = 
plt.figure(figsize=(2, 0.95), dpi=600) + gs = gridspec.GridSpec(1, 1) + plt.subplots_adjust(top=0.78, + bottom=0.17, + left=0.12, + right=0.99, + hspace=0.15, + wspace=0.8) + ax = fig.add_subplot(gs[0, 0]) + + effective_benchmarks = ["MEAN"] + list(data_in["benchmark"].unique()) + fontsize = 4 + + # Remove async with 1 GPU, it is the baseline; + data = data_in[~((data_in["gpus"] == 1) & (data_in["exec_policy"] == ("ASYNC" if baseline_is_async else "SYNC")))] + + # Keep only the experiments on the largest dataset; + if keep_only_max_size: + max_sizes = data.groupby(["benchmark"])["size"].max().reset_index() + data = data.merge(max_sizes, how="inner", on=["benchmark", "size"]) + + # Compute mean of all benchmarks, grouped by number of GPUs; + data_mean = data.groupby("gpus").mean().reset_index() + data_mean["benchmark"] = "MEAN" + new_data = [data_mean] + new_data += [data] + data = pd.concat(new_data, ignore_index=True) + + num_gpus = len(data["gpus"].unique()) + palette = PALETTE_OG[:num_gpus] + + ############## + # Main plot # + ############## + + ax = sns.barplot(x="benchmark", y=speedup_column, order=effective_benchmarks, + hue="gpus", + palette=palette, + data=data, + ci=95, capsize=.05, errwidth=0.3, linewidth=0.3, + ax=ax, edgecolor="#2f2f2f", estimator=np.mean, saturation=1, zorder=2) + ax.legend_.remove() # Hack to remove legend; + + ################ + # Refine style # + ################ + + # Grid and axis limits; + ax.yaxis.grid(True, linewidth=0.4) + ax.xaxis.grid(False) + ax.set_xlim((-0.5, len(data["benchmark"].unique()) - 0.5)) + + # Axis limits; + ax.set_ylim((0, ymax)) + # Color background to represent linear scaling of performance; + ax.fill_between(ax.get_xlim(), 0, 1, facecolor="0.9", alpha=0.5, zorder=0.4, edgecolor="0.9", linewidth=0.1) + for i in [1, 2, 4, 8][:num_gpus]: + ax.axhline(y=i, color="0.6", linestyle="--", zorder=1, linewidth=0.4) + + # Ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(yticks)) + 
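The bars in `plot_speedup_bars` plot a precomputed `speedup` column (filled in by `load_data_cuda_multigpu`, which is not shown in this diff). As a hedged sketch of how such a per-benchmark speedup against the 1-GPU baseline can be derived with pandas, with hypothetical column names mirroring the script (`benchmark`, `gpus`, `computation_sec`):

```python
import pandas as pd

# Toy timings: two benchmarks, each run on 1 and 2 GPUs;
df = pd.DataFrame({
    "benchmark": ["b1", "b1", "b6", "b6"],
    "gpus": [1, 2, 1, 2],
    "computation_sec": [10.0, 6.0, 8.0, 4.0],
})

# The mean 1-GPU time of each benchmark is its baseline;
baseline = (df[df["gpus"] == 1]
            .groupby("benchmark", as_index=False)["computation_sec"]
            .mean()
            .rename(columns={"computation_sec": "baseline_sec"}))

# Attach the baseline to every row of the same benchmark, then normalize;
df = df.merge(baseline, on="benchmark")
df["speedup"] = df["baseline_sec"] / df["computation_sec"]
```

With this convention the 1-GPU rows have speedup 1 by construction, which is why the plotting code drops the baseline configuration before drawing the bars.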
ax.set_yticklabels(labels=[f"{l:.1f}x" for l in ax.get_yticks()], ha="right", fontsize=fontsize) + + ax.set_xticks([i for i, l in enumerate(data["benchmark"].unique()) if "empty" not in l]) + ax.tick_params(length=2, width=0.5) + ax.set_xticklabels(labels=effective_benchmarks, ha="center", va="top", rotation=0, fontsize=fontsize - 0.2) + + # Set "MEAN" labels to a different color; + for i, l in enumerate(ax.get_xticklabels()): + if "MEAN" in l._text: + l.set_color(PALETTE_OG[3]) + # Change color of mean patches, start by filtering the ones with height > 0 and then look for labels with "MEAN"; + try: + patches_to_color = [a for a in ax.patches if a.get_height() > 0] + for i, p in enumerate(patches_to_color): + if i % len(effective_benchmarks) == 0: + p.set_facecolor(sns.desaturate(p._facecolor, 0.6)) + except IndexError as e: + print(e) + # Separate mean with a vertical line; + ax.axvline(x=0.5, color="0.6", linestyle="--", zorder=1, linewidth=0.3) + + plt.ylabel(ylabel, fontsize=fontsize, labelpad=1) + plt.xlabel(None) + + # Add speedup labels over bars; + offsets = [] + for j, g_tmp in data.groupby(["benchmark", "gpus"]): + offsets += [get_upper_ci_size(g_tmp[speedup_column], ci=0.95)] + offsets = [o if not np.isnan(o) else 0.05 for o in offsets] + add_labels(ax, vertical_offsets=offsets, rotation=0, format_str="{:.2f}", fontsize=2.2, skip_zero=False) + + # Add label with GPU name; + ax.annotate(gpu, xy=(0.99, 0.9), xycoords="axes fraction", ha="right", color="#2f2f2f", fontsize=fontsize, alpha=1) + + # Create hierarchical x ticks; + y_min = -0.08 + y_max = -0.13 + group_labels = effective_benchmarks + bar_width = ax.patches[0].get_width() # Get width of a bar + labels = ["S" if baseline_is_async else "A", 2, 4, 8][:num_gpus] + for i in range(len(group_labels)): + x_start = i - bar_width * (num_gpus / 2) + x_end = i + bar_width * (num_gpus / 2) + x_middle = (x_start + x_end) / 2 + ax.hlines(y_min, x_start, x_end, color="#2f2f2f", linewidth=0.3, clip_on=False, 
transform=blended_transform_factory(ax.transData, ax.transAxes)) + ax.vlines(x_middle, y_min, y_max, color="#2f2f2f", linewidth=0.3, clip_on=False, transform=blended_transform_factory(ax.transData, ax.transAxes)) + for l_i, l in enumerate(labels): + start = bar_width * (num_gpus / 2) + ax.annotate(l, xy=(i - start + bar_width / 2 + bar_width * l_i, y_min + 0.02), + xycoords=blended_transform_factory(ax.transData, ax.transAxes), clip_on=False, + fontsize=3.2, ha="center") + + # Add legend; + if legend_baseline_label is None: + legend_baseline_label = f"{'' if baseline_is_async else 'A'}SYNC, 1 GPU" + labels = [legend_baseline_label, "2 GPU", "4 GPU", "8 GPU"][:num_gpus] + patches = [Patch(facecolor=palette[i], edgecolor="#2f2f2f", label=l, linewidth=0.5) for i, l in enumerate(labels)] + leg = fig.legend(patches, labels, bbox_to_anchor=(0.55, 1.01), fontsize=fontsize, + ncol=num_gpus, loc="upper center", handlelength=1.2, + handletextpad=0.2, columnspacing=0.5, title=legend_title, title_fontsize=fontsize) + leg._legend_box.align = "left" + leg.get_frame().set_linewidth(0.5) + leg.get_frame().set_facecolor('white') + + return ax + + +def plot_speedup_line(data_in, gpu, + speedup_column: str="speedup", + baseline_time_column: str="baseline_time", + baseline_is_async: bool=True, + kind: str="CUDA"): + + # Remove async with 1 GPU, it is the baseline; + data = data_in[~((data_in["gpus"] == 1) & (data_in["exec_policy"] == ("ASYNC" if baseline_is_async else "SYNC")))].copy() + data["size_str"] = data["size"].astype(str) + num_gpus = len(data["gpus"].unique()) + + ############## + # Plot setup # + ############## + + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['hatch.linewidth'] = 0.3 + plt.rcParams['axes.labelpad'] = 5 + plt.rcParams['xtick.major.pad'] = 4.2 + plt.rcParams['ytick.major.pad'] = 0.5 + plt.rcParams['axes.linewidth'] = 1 + + palette = 
PALETTE_OG[:num_gpus] + markers = ["P", "X", "o", "D"][:num_gpus] + fontsize = 8 + cols = 3 + rows = (len(data_in["benchmark"].unique()) + 1) // cols + + fig = plt.figure(figsize=(1.8 * cols + 0.1, 1.5 * rows), dpi=600) + gs = gridspec.GridSpec(rows, cols) + plt.subplots_adjust(top=0.9, + bottom=0.17, + left=0.07, + right=0.97, + hspace=1, + wspace=0.3) + + ############## + # Main plot # + ############## + + for b_i, (b, d) in enumerate(data.groupby("benchmark")): + col = b_i % cols + row = b_i // cols + ax = fig.add_subplot(gs[row, col]) + ax = sns.lineplot(x="size_str", y=speedup_column, hue="gpus", data=d, palette=palette, ax=ax, estimator=np.mean, + legend=None, ci=99, zorder=2) + data_averaged = d.groupby(["size_str", "gpus"]).mean()[speedup_column].reset_index() + ax = sns.scatterplot(x="size_str", y=speedup_column, hue="gpus", style="gpus", data=data_averaged, palette=palette, + ax=ax, estimator=np.mean, legend=None, markers=markers, edgecolor="#2f2f2f", zorder=3, size=1, linewidth=0.5) + plt.xlabel(None) + plt.ylabel(None) + + # Set axis limits; + x_lim = list(ax.get_xlim()) + ax.set_xlim(x_lim) + ax.set_ylim((0, 6) if row == 0 else (0, 4)) + + # Add benchmark name; + ax.annotate(b, xy=(0.5, 1.05), xycoords="axes fraction", ha="center", color="#2f2f2f", fontsize=fontsize, alpha=1) + # Color background to represent linear scaling of performance; + ax.fill_between(ax.get_xlim(), 0, 1, facecolor="#dddddd", alpha=0.4, zorder=0.4) + # Grid and axis limits; + ax.yaxis.grid(True, linewidth=0.5) + ax.xaxis.grid(False) + # Line for speedup = 1; + ax.axhline(y=1, color="0.6", linestyle="--", zorder=1, linewidth=0.5) + # Ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(7 if row == 0 else 5)) + ax.set_yticklabels(labels=[f"{l:.1f}x" for l in ax.get_yticks()], ha="right", fontsize=fontsize) + x_ticks = list(d["size"].unique()) + ax.set_xticks([str(l) for l in x_ticks]) + ax.tick_params(length=2, width=0.5) + ax.set_xticklabels(labels=[get_exp_label(l, 
decimal_places=1) for l in x_ticks], ha="center", + va="top", rotation=0, fontsize=fontsize - 0.5) + + # Add baseline times; + ax.annotate(f"Baseline {kind} exec. time (ms):", xy=(0, -0.4), fontsize=fontsize - 1, ha="left", xycoords="axes fraction", color="#949494") + if col == 0: + ax.annotate(f"{gpu}:", xy=(-0.4, -0.56), fontsize=fontsize - 1, color="#949494", ha="right", xycoords=("data", "axes fraction")) + for l_i, l in enumerate(x_ticks): + vals = d[(d["size"] == int(l))][baseline_time_column] + baseline_mean = np.mean(vals) if len(vals) > 0 else np.nan + if not math.isnan(baseline_mean): + ax.annotate(f"{int(1000 * baseline_mean)}", xy=(l_i, -0.56), fontsize=fontsize - 1, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + labels = ["SYNC, 1 GPU", "2 GPU", "4 GPU", "8 GPU"][:num_gpus] + patches = [Patch(facecolor=palette[i], edgecolor="#2f2f2f", label=l, linewidth=0.5) for i, l in enumerate(labels)] + leg = fig.legend(patches, labels, bbox_to_anchor=(0.95, 0.14), fontsize=fontsize, + ncol=1, loc="lower right", handlelength=1.2, title=f"{kind}, {gpu},\nvs. 
ASYNC, 1 GPU", + handletextpad=0.2, columnspacing=0.5, title_fontsize=fontsize) + leg._legend_box.align = "left" + leg.get_frame().set_linewidth(0.5) + leg.get_frame().set_facecolor('white') + + +def plot_ablation_bars(data_in: pd.DataFrame, + gpu: str, + gpus: int, + speedup_column="speedup", + ymax: float=1.6, + yticks: int=9, + fig: plt.Figure=None, + ax: plt.Axes=None, + plot_speedup_labels: bool=True): + + if fig is None or ax is None: + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.bottom": False}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['hatch.linewidth'] = 0.3 + plt.rcParams['axes.labelpad'] = 5 + plt.rcParams['xtick.major.pad'] = 4.2 + plt.rcParams['ytick.major.pad'] = 1 + plt.rcParams['axes.linewidth'] = 0.5 + + fig = plt.figure(figsize=(2, 0.95), dpi=600) + gs = gridspec.GridSpec(1, 1) + plt.subplots_adjust(top=0.78, + bottom=0.17, + left=0.12, + right=0.99, + hspace=0.15, + wspace=0.8) + ax = fig.add_subplot(gs[0, 0]) + + effective_benchmarks = ["MEAN"] + list(data_in["benchmark"].unique()) + fontsize = 4 + + data = data_in.copy() + + # Compute mean of all benchmarks, grouped by number of GPUs; + data_mean = data.groupby("policy").mean().reset_index() + data_mean["benchmark"] = "MEAN" + new_data = [data_mean] + new_data += [data] + data = pd.concat(new_data, ignore_index=True) + + num_policies = len(data["policy"].unique()) + palette = PALETTE_G3[:num_policies] + + ############## + # Main plot # + ############## + + ax = sns.barplot(x="benchmark", y=speedup_column, order=effective_benchmarks, + hue="policy", + palette=palette, + data=data, + ci=95, capsize=.05, errwidth=0.3, linewidth=0.3, + ax=ax, edgecolor="#2f2f2f", estimator=np.mean, saturation=1, zorder=2) + ax.legend_.remove() # Hack to remove legend; + + ################ + # Refine style # + ################ + + # Grid and axis limits; + ax.yaxis.grid(True, linewidth=0.4) + ax.xaxis.grid(False) + ax.set_xlim((-0.5, 
len(data["benchmark"].unique()) - 0.5)) + + # Axis limits; + ax.set_ylim((0, ymax)) + # Color background to represent linear scaling of performance; + ax.fill_between(ax.get_xlim(), 0, 1, facecolor="0.9", alpha=0.5, zorder=0.4, edgecolor="0.9", linewidth=0.1) + + # Ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(yticks)) + ax.set_yticklabels(labels=[f"{l:.1f}x" for l in ax.get_yticks()], ha="right", fontsize=fontsize) + + ax.set_xticks([i for i, l in enumerate(data["benchmark"].unique()) if "empty" not in l]) + ax.tick_params(length=2, width=0.5) + ax.tick_params(axis="x", pad=2) + ax.set_xticklabels(labels=effective_benchmarks, ha="center", va="top", rotation=0, fontsize=fontsize - 0.2) + + # Set "MEAN" labels to a different color; + for i, l in enumerate(ax.get_xticklabels()): + if "MEAN" in l._text: + l.set_color(PALETTE_OG[3]) + # Change color of mean patches, start by filtering the ones with height > 0 and then look for labels with "MEAN"; + try: + patches_to_color = [a for a in ax.patches if a.get_height() > 0] + for i, p in enumerate(patches_to_color): + if i % len(effective_benchmarks) == 0: + p.set_facecolor(sns.desaturate(p._facecolor, 0.6)) + except IndexError as e: + print(e) + # Separate mean with a vertical line; + ax.axvline(x=0.5, color="0.6", linestyle="--", zorder=1, linewidth=0.3) + + plt.ylabel(None) + plt.xlabel(None) + + # Add speedup labels over bars; + if plot_speedup_labels: + offsets = [] + for j, g_tmp in data.groupby(["benchmark", "policy"]): + offsets += [get_upper_ci_size(g_tmp[speedup_column], ci=0.95)] + offsets = [o if not np.isnan(o) else 0.05 for o in offsets] + add_labels(ax, vertical_offsets=offsets, rotation=0, format_str="{:.2f}", fontsize=2.2, skip_zero=False) + + # Add label with GPU name; + ax.annotate(f"{gpus} {gpu}s", xy=(0.99, 0.9), xycoords="axes fraction", ha="right", color="#2f2f2f", fontsize=fontsize, alpha=1) + + # Create hierarchical x ticks; + y_min = -0.03 + y_max = -0.08 + group_labels = 
effective_benchmarks + bar_width = ax.patches[0].get_width() # Get width of a bar + # labels = ["S" if baseline_is_async else "A", 2, 4, 8][:num_gpus] + for i in range(len(group_labels)): + x_start = i - bar_width * (num_policies / 2) + x_end = i + bar_width * (num_policies / 2) + x_middle = (x_start + x_end) / 2 + ax.hlines(y_min, x_start, x_end, color="#2f2f2f", linewidth=0.3, clip_on=False, transform=blended_transform_factory(ax.transData, ax.transAxes)) + ax.vlines(x_middle, y_min, y_max, color="#2f2f2f", linewidth=0.3, clip_on=False, transform=blended_transform_factory(ax.transData, ax.transAxes)) + # for l_i, l in enumerate(labels): + # start = bar_width * (num_gpus / 2) + # ax.annotate(l, xy=(i - start + bar_width / 2 + bar_width * l_i, y_min + 0.02), + # xycoords=blended_transform_factory(ax.transData, ax.transAxes), clip_on=False, + # fontsize=3.2, ha="center") + + # Add legend; + legend_title = "Speedup vs. best policy (MD-Min-Transfer-Time)" + labels = ["D-Round-Robin", "D-Stream-Aware", "D-Min-Transfer-Time", "MD-Min-Transfer-Time"] + patches = [Patch(facecolor=palette[i], edgecolor="#2f2f2f", label=l, linewidth=0.5) for i, l in enumerate(labels)] + leg = fig.legend(patches, labels, bbox_to_anchor=(0.55, 1), fontsize=fontsize, + ncol=num_policies, loc="upper center", handlelength=1.2, + handletextpad=0.2, columnspacing=0.5, title=legend_title, title_fontsize=fontsize) + leg._legend_box.align = "left" + leg.get_frame().set_linewidth(0.5) + leg.get_frame().set_facecolor('white') + + return ax + + +#%%########################### +############################## + +if __name__ == "__main__": + + for g, folder in zip([V100, A100], [V100_RES_FOLDERS, A100_RES_FOLDERS]): + res_cuda = load_data_cuda_multigpu([os.path.join(g, x) for x in folder], skip_iter=3) + res_cuda_grouped = res_cuda.groupby(["benchmark", "exec_policy", "gpus"]).mean().dropna().reset_index() + + #%% Plot speedup divided by benchmark and number of GPUs; + plot_speedup_bars(res_cuda_grouped, 
g) + save_plot(PLOT_DIR, f"cuda_bars_{g}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + #%% Plot speedup divided by size, benchmark and number of GPUs; + plot_speedup_line(res_cuda, g) + save_plot(PLOT_DIR, f"cuda_lines_{g}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + diff --git a/projects/resources/python/plotting/plot_multi_gpu_cuda_partition_scaling.py b/projects/resources/python/plotting/plot_multi_gpu_cuda_partition_scaling.py new file mode 100644 index 00000000..37dea083 --- /dev/null +++ b/projects/resources/python/plotting/plot_multi_gpu_cuda_partition_scaling.py @@ -0,0 +1,473 @@ +# -*- coding: utf-8 -*- +""" +Created on Wed Nov 3 09:34:37 2021 + +@author: albyr +""" + +import os +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +from matplotlib.ticker import NullFormatter +import seaborn as sns +from matplotlib.patches import Patch +import matplotlib.gridspec as gridspec +from segretini_matplottini.src.plot_utils import remove_outliers_df_iqr_grouped, compute_speedup_df, save_plot, PALETTE_G3, PALETTE_O, transpose_legend_labels +from load_data import PLOT_DIR + +############################## +############################## + +INPUT_DATE = "2021_11_02" +OUTPUT_DATE = "2021_11_02" + +# V100; +GPU = "V100" +# A100; +GPU = "A100" +SIZE = 20000 # 2048 + +RES_FOLDER = f"../../../../grcuda-data/results/scheduling_multi_gpu/{GPU}" +SUBFOLDERS = { + f"{INPUT_DATE}_partition_scaling_b11_low": 4, + f"{INPUT_DATE}_partition_scaling_b11_high": 12, + f"{INPUT_DATE}_partition_scaling_b11_veryhigh": 16, + } + +############################## +############################## + +def load_data() -> (pd.DataFrame, pd.DataFrame): + res_folder = os.path.join(RES_FOLDER, f"{INPUT_DATE}_partition_scaling") + data = [] + for res in os.listdir(res_folder): + size, gpus, partitions = [int(x) for x in os.path.splitext(res)[0].split("_")] + try: + res_data = pd.read_csv(os.path.join(res_folder, res)) + res_data["size"] = size + res_data["gpus"] = gpus + 
res_data["partitions"] = partitions + data += [res_data] + except pd._libs.parsers.ParserError as e: + print(f"error parsing {res}, error={e}") + data = pd.concat(data, ignore_index=True) + # Filter first few iterations; + data = data[data["num_iter"] > 1] + # Use only some data size; + data = data[data["size"] == SIZE] + # Remove outliers; + remove_outliers_df_iqr_grouped(data, column="computation_sec", group=["size", "gpus", "partitions"], + reset_index=True, quantile=0.75, drop_index=True, debug=True) + # Sort data; + data = data.sort_values(by=["size", "gpus", "partitions", "num_iter"]).reset_index(drop=True) + # Compute speedups; + compute_speedup_df(data, key=["size"], + baseline_filter_col=["gpus", "partitions"], baseline_filter_val=[1, 1], + speedup_col_name="speedup", time_column="computation_sec", + baseline_col_name="baseline_sec", aggregation=np.mean, correction=False) + # Obtain mean of computation times, grouped; + data_agg = data.groupby(["size", "gpus", "partitions"]).mean()[["computation_sec", "speedup"]].reset_index() + return data, data_agg + + +def load_data_multiconfig(global_speedup: bool=True, remove_outliers: bool=True, main_res_folder: str=RES_FOLDER) -> (pd.DataFrame, pd.DataFrame): + data = [] + for folder in SUBFOLDERS.keys(): + res_folder = os.path.join(main_res_folder, folder) + for res in os.listdir(res_folder): + size, gpus, partitions = [int(x) for x in os.path.splitext(res)[0].split("_")] + try: + res_data = pd.read_csv(os.path.join(res_folder, res)) + res_data["size"] = size + res_data["gpus"] = gpus + res_data["partitions"] = partitions + res_data["config"] = SUBFOLDERS[folder] + data += [res_data] + except pd._libs.parsers.ParserError as e: + print(f"error parsing {res}, error={e}") + + data = pd.concat(data, ignore_index=True) + # Filter first few iterations; + data = data[data["num_iter"] > 1] + # Use only some data size; + data = data[data["size"] == SIZE] + # Remove outliers; + if remove_outliers: + 
remove_outliers_df_iqr_grouped(data, column="computation_sec", group=["size", "config", "gpus", "partitions" ], + reset_index=True, quantile=0.75, drop_index=True, debug=True) + # Sort data; + data = data.sort_values(by=["size", "config", "gpus", "partitions", "num_iter"]).reset_index(drop=True) + # Compute speedups; + if global_speedup: + compute_speedup_df(data, key=["size"], + baseline_filter_col=["gpus", "partitions", "config"], baseline_filter_val=[1, 1, min(SUBFOLDERS.values())], + speedup_col_name="speedup", time_column="computation_sec", + baseline_col_name="baseline_sec", aggregation=np.mean, correction=False) + else: + compute_speedup_df(data, key=["size", "config"], + baseline_filter_col=["gpus", "partitions"], baseline_filter_val=[1, 1], + speedup_col_name="speedup", time_column="computation_sec", + baseline_col_name="baseline_sec", aggregation=np.mean, correction=False) + # Obtain mean of computation times, grouped; + data_agg = data.groupby(["size", "config", "gpus", "partitions"]).mean()[["computation_sec", "speedup"]].reset_index() + return data, data_agg + + +def load_data_a100(global_speedup: bool=True, remove_outliers: bool=True, main_res_folder: str=RES_FOLDER) -> (pd.DataFrame, pd.DataFrame): + data = [] + res_folder = os.path.join(main_res_folder, f"{INPUT_DATE}_partition_scaling") + for res in os.listdir(res_folder): + try: + size, gpus, partitions, config, prefetch = os.path.splitext(res)[0].split("_") + prefetch = True + except ValueError: + size, gpus, partitions, config = os.path.splitext(res)[0].split("_") + prefetch = False + size, gpus, partitions, config = [int(x) for x in [size, gpus, partitions, config]] + try: + res_data = pd.read_csv(os.path.join(res_folder, res)) + res_data["size"] = size + res_data["gpus"] = gpus + res_data["partitions"] = partitions + res_data["config"] = config + res_data["prefetch"] = prefetch + data += [res_data] + except pd._libs.parsers.ParserError as e: + print(f"error parsing {res}, error={e}") + + data 
= pd.concat(data, ignore_index=True) + # Filter first few iterations; + data = data[data["num_iter"] > 1] + # Use only some data size; + data = data[data["size"] == SIZE] + # Remove outliers; + if remove_outliers: + remove_outliers_df_iqr_grouped(data, column="computation_sec", group=["size", "config", "prefetch", "gpus", "partitions" ], + reset_index=True, quantile=0.75, drop_index=True, debug=True) + # Sort data; + data = data.sort_values(by=["size", "prefetch", "config", "gpus", "partitions", "num_iter"]).reset_index(drop=True) + # Compute speedups; + if global_speedup: + compute_speedup_df(data, key=["size", "prefetch"], + baseline_filter_col=["gpus", "partitions", "config"], baseline_filter_val=[1, 1, min(data["config"])], + speedup_col_name="speedup", time_column="computation_sec", + baseline_col_name="baseline_sec", aggregation=np.mean, correction=False) + else: + compute_speedup_df(data, key=["size", "prefetch", "config"], + baseline_filter_col=["gpus", "partitions"], baseline_filter_val=[1, 1], + speedup_col_name="speedup", time_column="computation_sec", + baseline_col_name="baseline_sec", aggregation=np.mean, correction=False) + # Obtain mean of computation times, grouped; + data_agg = data.groupby(["size", "prefetch", "config", "gpus", "partitions"]).mean()[["computation_sec", "speedup"]].reset_index() + return data, data_agg + + +def plot_scaling(data_in, skip_low_partition: bool=False, ax=None, fig=None, speedup: bool=False): + + # Remove values where the number of partitions is < than the number of GPUs; + if skip_low_partition: + data = data_in[data_in["partitions"] >= data_in["gpus"]] + else: + data = data_in.copy() + + FONTSIZE = 8 + PALETTE = [PALETTE_G3[i] for i in [1, 2, 3, 4]] + PALETTE[0] = "#C5E8C5" + + new_figure = False + if fig == None and ax == None: + new_figure = True # If true, we are plotting on a new figure; + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + plt.rcParams["font.family"] = ["Latin 
Modern Roman Demi"] + plt.rcParams['hatch.linewidth'] = 0.3 + plt.rcParams['axes.labelpad'] = 3 + plt.rcParams['xtick.major.pad'] = 4.2 + plt.rcParams['ytick.major.pad'] = 1 + plt.rcParams['axes.linewidth'] = 1 + fig = plt.figure(figsize=(3.5, 2.5), dpi=600) + plt.subplots_adjust(top=0.95, + bottom=0.2, + left=0.15, + right=0.95) + + ax = sns.lineplot(data=data, x="partitions", y="speedup" if speedup else "computation_sec", hue="gpus", ax=ax, legend=False, + palette=PALETTE, ci=95, linewidth=0.8) + # Axes labels; + plt.xlabel("Number of partitions" if new_figure else None) + plt.ylabel("Speedup" if speedup else "Exec. time [s]", fontsize=FONTSIZE - 1) + # Grid and axis limits; + # ax.set_yscale("log") + ax.yaxis.grid(True, linewidth=0.3) + ax.xaxis.grid(False) + # Axes limits; + ax.set_xlim((data["partitions"].min(), data["partitions"].max())) + ax.set_ylim((0.05, 0.10)) + # Ticks; + ax.tick_params(length=2, width=0.8) + ax.yaxis.set_major_locator(plt.LinearLocator(6)) + ax.set_yticklabels(labels=[f"{l:.2f}x" if speedup else f"{l:.2f}" for l in ax.get_yticks()], ha="right", fontsize=FONTSIZE - 2) + x_ticks = [x for x in sorted(list(data["partitions"].unique())) if x % 4 == 0 or x == 1] + ax.set_xticks(x_ticks) + ax.set_xticklabels(labels=x_ticks, ha="center", va="top", rotation=0, fontsize=FONTSIZE - 2) + + # Title label, if necessary; + if len(data["gpus"].unique()) > 1 and speedup: + ax.annotate("vs. 
1 A100, P=1", xy=(0.5, 1.03), xycoords="axes fraction", ha="center", color="#2f2f2f", fontsize=FONTSIZE - 1, alpha=1) + + # Legend; + labels = ["1 A100", "2 A100s", "4 A100s", "8 A100s"] + patches = [Patch(facecolor=PALETTE[i], edgecolor="#2f2f2f", label=l, linewidth=0.5) for i, l in enumerate(labels)] + labels, patches = transpose_legend_labels(labels, patches) + leg = ax.legend(patches, labels, bbox_to_anchor=(1.02, 1.03), fontsize=FONTSIZE - 1, borderpad=0.3, + ncol=2, loc="upper right", handlelength=1.2, title=None, + handletextpad=0.2, columnspacing=0.5) + leg._legend_box.align = "left" + leg.get_frame().set_linewidth(0.5) + leg.get_frame().set_facecolor('white') + + # Add arrow to mark performance gap; + x_coord = data["partitions"].max() - 0.7 + agg_data = data[data["partitions"] == data["partitions"].max()].groupby("gpus").mean()["computation_sec"] + plt.annotate(text="", xy=(x_coord, agg_data.min()), xytext=(x_coord, agg_data.max()), annotation_clip=False, + arrowprops=dict(arrowstyle="<|-|>", linewidth=0.4, mutation_scale=FONTSIZE / 2, shrinkA=0, capstyle="butt", shrinkB=0, color="#2f2f2f")) + plt.annotate(text=f"{int(100 * (agg_data.max() - agg_data.min()) / agg_data.min())}%", + xy=(x_coord - 0.1, (agg_data.min() + agg_data.max()) / 2), xytext=(0, 0), ha="right", va="center", + textcoords="offset points", annotation_clip=False, fontsize=FONTSIZE - 2) + return fig, ax + + +def plot_scaling_minmax(data_in_1, data_in_2): + + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['hatch.linewidth'] = 0.3 + plt.rcParams['axes.labelpad'] = 3 + plt.rcParams['xtick.major.pad'] = 2 + plt.rcParams['ytick.major.pad'] = 1 + plt.rcParams['axes.linewidth'] = 1 + FONTSIZE = 8 + PALETTE = [PALETTE_G3[i] for i in [1, 2, 3, 4]] + PALETTE[0] = "#C5E8C5" + fig = plt.figure(figsize=(3.5, 1.5), dpi=600) + gs = gridspec.GridSpec(1, 2) + plt.subplots_adjust(top=0.92, + 
bottom=0.2, + left=0.15, + right=0.95, + wspace=0.12) + + for i, data_in in enumerate([data_in_1, data_in_2]): + # Keep just 1 GPU; + data = data_in[data_in["gpus"] == 1] + # Remove values where the number of partitions is < than the number of GPUs; + data = data[data["partitions"] >= data["gpus"]] + # Remove the "18" datapoint for consistent scaling; + data = data[data["partitions"] != 18] + + ax = fig.add_subplot(gs[0, i]) + ax = sns.lineplot(data=data[data["config"] == min(data["config"])], x="partitions", y="speedup", legend=False, color=PALETTE[1], linestyle="--", linewidth=1) + ax = sns.lineplot(data=data[data["config"] == max(data["config"])], x="partitions", y="speedup", legend=False, color=PALETTE[1], linestyle="--", linewidth=1) + plt.fill_between(sorted(data["partitions"].unique()), + data[data["config"] == min(data["config"])]["speedup"], + data[data["config"] == max(data["config"])]["speedup"], + color=PALETTE[0], alpha=0.2) + # Axes labels; + plt.xlabel(None) + if i == 0: + plt.ylabel("Speedup", fontsize=FONTSIZE) + else: + plt.ylabel(None) + # Grid and axis limits; + ax.yaxis.grid(True, linewidth=0.5) + ax.xaxis.grid(False) + # Axes limits; + ax.set_xlim((data["partitions"].min(), data["partitions"].max())) + ax.set_ylim((1.0, 2.5)) + # Ticks; + ax.tick_params(length=4, width=1) + ax.yaxis.set_major_locator(plt.LinearLocator(7)) + if i == 0: + ax.set_yticklabels(labels=[f"{l:.2f}x" for l in ax.get_yticks()], ha="right", fontsize=FONTSIZE - 2) + else: + ax.set_yticklabels(labels=[]) + x_ticks = [x for x in sorted(list(data["partitions"].unique())) if x % 4 == 0 or x == 1] + ax.set_xticks(x_ticks) + ax.set_xticklabels(labels=x_ticks, ha="center", va="top", rotation=0, fontsize=FONTSIZE - 2) + + # Occupancy labels; + x = sorted(data["partitions"].unique())[-3] + y1 = data[(data["partitions"] == x) & (data["config"] == min(data["config"]))]["speedup"] + y2 = data[(data["partitions"] == x) & (data["config"] == max(data["config"]))]["speedup"] + 
ax.annotate("Low occupancy", xy=(x, y1), xytext=(-5, -10 if i == 0 else 3), textcoords="offset points", ha="center", color="#2f2f2f", fontsize=FONTSIZE - 1, alpha=1) + ax.annotate("High occupancy", xy=(x, y2), xytext=(-5, 3), textcoords="offset points", ha="center", color="#2f2f2f", fontsize=FONTSIZE - 1, alpha=1) + # Other labels; + if i == 0: + ax.annotate("vs. worst config., P=1", xy=(0.5, 1.03), xycoords="axes fraction", ha="center", color="#2f2f2f", fontsize=FONTSIZE, alpha=1) + else: + ax.annotate("vs. itself, P=1", xy=(0.5, 1.03), xycoords="axes fraction", ha="center", color="#2f2f2f", fontsize=FONTSIZE, alpha=1) + ax.annotate("Number of partitions P", xy=(0.55, 0.02), xycoords="figure fraction", ha="center", color="#2f2f2f", fontsize=FONTSIZE, alpha=1) + + return fig, ax + + +def plot_scaling_minmax_2gpu(data_in, fig=None, ax=None): + + FONTSIZE = 8 + PALETTE = ["#5CCCA7", "#B3767F"] + + if fig == None and ax == None: + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['hatch.linewidth'] = 0.3 + plt.rcParams['axes.labelpad'] = 3 + plt.rcParams['xtick.major.pad'] = 2 + plt.rcParams['ytick.major.pad'] = 1 + plt.rcParams['axes.linewidth'] = 0.8 + fig = plt.figure(figsize=(3.5, 1.2), dpi=600) + gs = gridspec.GridSpec(1, 2) + plt.subplots_adjust(top=0.92, + bottom=0.2, + left=0.12, + right=0.95, + wspace=0.12) + # Plot gap between low and high occupancy on the 2 GPUs; + ax = fig.add_subplot(gs[0, 0]) + + for i, (gpu, d) in enumerate(data_in.groupby("gpu", sort=False)): + # Keep just 1 GPU; + d = d[d["gpus"] == 1] + # Remove values where the number of partitions is < than the number of GPUs; + d = d[d["partitions"] >= d["gpus"]] + + ax = sns.lineplot(data=d[d["config"] == min(d["config"])], x="partitions", y="computation_sec", legend=False, color=PALETTE[i], linestyle="--", linewidth=0.7) + ax = sns.lineplot(data=d[d["config"] == max(d["config"])], 
x="partitions", y="computation_sec", legend=False, color=PALETTE[i], linestyle="--", linewidth=0.7) + + plt.fill_between(sorted(d["partitions"].unique()), + d[d["config"] == min(d["config"])]["computation_sec"], + d[d["config"] == max(d["config"])]["computation_sec"], + color=PALETTE[i], alpha=0.2) + # Axes labels; + plt.xlabel(None) + plt.ylabel("Exec. time, log-scale [s]", fontsize=FONTSIZE - 1) + # Grid and axis limits; + ax.yaxis.grid(True, linewidth=0.3) + ax.xaxis.grid(False) + # Axes limits; + ax.set_xlim((d["partitions"].min(), d["partitions"].max())) + ax.set_ylim((0.05, 0.3)) + # Ticks; + ax.tick_params(length=2, width=0.8) + x_ticks = [x for x in sorted(list(d["partitions"].unique())) if x % 4 == 0 or x == 1] + ax.set_xticks(x_ticks) + ax.set_xticklabels(labels=x_ticks, ha="center", va="top", rotation=0, fontsize=FONTSIZE - 2) + + # Occupancy labels; + x = sorted(d["partitions"].unique())[1] + y1 = d[(d["partitions"] == x) & (d["config"] == min(d["config"]))]["computation_sec"] + y2 = d[(d["partitions"] == x) & (d["config"] == max(d["config"]))]["computation_sec"] + ax.annotate("Low occupancy", xy=(x, y1), xytext=(6, -6 if i == 0 else -2), textcoords="offset points", ha="left", color="#2f2f2f", fontsize=FONTSIZE - 3, alpha=0.8) + ax.annotate("High occupancy", xy=(x, y2), xytext=(6, -8 if i == 0 else -6), textcoords="offset points", ha="left", color="#2f2f2f", fontsize=FONTSIZE - 3, alpha=0.8) + # Other labels; + # ax.annotate("vs. 
worst config., P=1", xy=(0.5, 1.03), xycoords="axes fraction", ha="center", color="#2f2f2f", fontsize=FONTSIZE - 1, alpha=1) + ax.annotate("Number of partitions P", xy=(0.55, 0.02), xycoords="figure fraction", ha="center", color="#2f2f2f", fontsize=FONTSIZE - 1, alpha=1) + + # Set scale to logarithmic on the y axis, update tick labels; + ax.set_yscale("log") + ax.yaxis.set_major_locator(plt.LinearLocator(6)) + ax.set_yticklabels(labels=[f"{l:.2f}" for l in ax.get_yticks()], ha="right", fontsize=FONTSIZE - 2) + ax.yaxis.set_minor_formatter(NullFormatter()) # Hide any weird log tick label + + for i, (gpu, d) in enumerate(data_in.groupby("gpu", sort=False)): + # Keep just 1 GPU; + d = d[d["gpus"] == 1] + # Remove values where the number of partitions is < than the number of GPUs; + d = d[d["partitions"] >= d["gpus"]] + # Add arrow to mark performance gap; + x_coord = d["partitions"].max() - 0.7 + low = d[(d["config"] == min(d["config"])) & (d["partitions"] == d["partitions"].max())]["computation_sec"].iloc[0] + high = d[(d["config"] == max(d["config"])) & (d["partitions"] == d["partitions"].max())]["computation_sec"].iloc[0] + y_diff = np.abs(high - low) / min([high, low]) + y_coord = (high + low) / 2 + ax.annotate(text=f"{100 * y_diff:.2f}%", + xy=(x_coord, y_coord), xytext=(0, -8), textcoords="offset points", + ha="right", va="center", fontsize=FONTSIZE - 2, + arrowprops=dict(arrowstyle="-|>", linewidth=0.4, mutation_scale=FONTSIZE / 2, shrinkA=2, capstyle="butt", shrinkB=0, color="#2f2f2f"), + bbox=dict(boxstyle="square,pad=0.05", fc="w", alpha=0, ec=None)) + + # Legend; + labels = list(data_in["gpu"].unique()) + patches = [Patch(facecolor=PALETTE[i], edgecolor="#2f2f2f", label=l, linewidth=0.5) for i, l in enumerate(labels)] + leg = ax.legend(patches, labels, bbox_to_anchor=(1.02, 1.03), fontsize=FONTSIZE - 1, borderpad=0.3, + ncol=1, loc="upper right", handlelength=1.2, title=None, + handletextpad=0.2, columnspacing=0.5) + leg._legend_box.align = "left" + 
leg.get_frame().set_linewidth(0.5) + leg.get_frame().set_facecolor('white') + + return fig, ax + + +def v100_vs_a100_and_multigpu_plot(data_agg, data_a100): + # Setup plot; + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['hatch.linewidth'] = 0.3 + plt.rcParams['axes.labelpad'] = 1 + plt.rcParams['xtick.major.pad'] = 2 + plt.rcParams['ytick.major.pad'] = 1 + plt.rcParams['axes.linewidth'] = 0.8 + + fig = plt.figure(figsize=(3.5, 1.2), dpi=600) + gs = gridspec.GridSpec(1, 2) + plt.subplots_adjust(top=0.95, + bottom=0.2, + left=0.10, + right=0.97, + wspace=0.3) + + # Plot gap between low and high occupancy on the 2 GPUs; + fig, ax = plot_scaling_minmax_2gpu(data_agg, fig=fig, ax=fig.add_subplot(gs[0, 0])) + # Plot multi-GPU scaling of A100; + fig, ax = plot_scaling(data_a100[data_a100["prefetch"] == True], ax=fig.add_subplot(gs[0, 1]), fig=fig, speedup=False) + return fig, ax + +#%%########################### +############################## + +if __name__ == "__main__": + + # if GPU == "V100": + # # # Plot; + # # data, data_agg = load_data() + # # fig, ax = plot_scaling(data) + # # save_plot(PLOT_DIR, "cuda_partition_scaling" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + # # Second plot; + # _, data_agg_1 = load_data_multiconfig(True) + # _, data_agg_2 = load_data_multiconfig(False) + # fig, ax = plot_scaling_minmax(data_agg_1, data_agg_2) + # save_plot(PLOT_DIR, "cuda_partition_scaling_minmax" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + # elif GPU == "A100": + # data, data_agg = load_data_a100(global_speedup=True, remove_outliers=False) + # for i, g in data.groupby("config"): + # fig, ax = plot_scaling(g[g["prefetch"] == True]) + + #%% Compare V100 and A100; + _, data_agg_v100 = load_data_multiconfig(True, remove_outliers=False, main_res_folder="../../../../grcuda-data/results/scheduling_multi_gpu/V100") + data_agg_v100["prefetch"] = True + # Remove the "18" 
datapoint for consistent scaling; + data_agg_v100 = data_agg_v100[data_agg_v100["partitions"] != 18] + data_a100, data_agg_a100 = load_data_a100(global_speedup=True, remove_outliers=False, main_res_folder="../../../../grcuda-data/results/scheduling_multi_gpu/A100") + data_agg_a100 = data_agg_a100[data_agg_a100["prefetch"] == True] + # Concatenate results; + data_agg_v100["gpu"] = "V100" + data_agg_a100["gpu"] = "A100" + data_agg = pd.concat([data_agg_v100, data_agg_a100], ignore_index=True) + fig, ax = v100_vs_a100_and_multigpu_plot(data_agg, data_a100) + save_plot(PLOT_DIR, "cuda_partition_scaling_minmax_2gpu" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + \ No newline at end of file diff --git a/projects/resources/python/plotting/plot_multi_gpu_cuda_transfer.py b/projects/resources/python/plotting/plot_multi_gpu_cuda_transfer.py new file mode 100644 index 00000000..96dab57c --- /dev/null +++ b/projects/resources/python/plotting/plot_multi_gpu_cuda_transfer.py @@ -0,0 +1,672 @@ +# -*- coding: utf-8 -*- +""" +Created on Thu Oct 7 12:54:09 2021 + +@author: albyr +""" + +import matplotlib.pyplot as plt +import matplotlib.patches as patches +import seaborn as sns +import numpy as np +import pandas as pd +import os +import matplotlib.gridspec as gridspec +from abc import ABC, abstractmethod +from segretini_matplottini.src.plot_utils import * +from multi_gpu_parse_nvprof_log import create_nondirectional_transfer_matrix +from load_data import PLOT_DIR, DEFAULT_RES_CUDA_DIR + +############################## +# Setup ###################### +############################## + +# V100; +GPU = "V100" +INPUT_FOLDER = "V100/nvprof_cuda/2021_10_07" + +# # A100; +# GPU = "A100" +# INPUT_FOLDER = "A100/nvprof_cuda/2021_10_18" + +OUTPUT_DATE = "2021_10_18" + +BENCHMARKS = [ + "b1", + "b5", + "b6", + "b6_4", + "b9", + "b9_4", + "b11" + ] + +############################## +# Drawers #################### +############################## + +class GPUDrawer(ABC): + + @abstractmethod + def 
setup(self, fig=None, ax=None): + pass + + @abstractmethod + def draw_topology(self, fig=None, ax=None, **kwargs): + pass + + @abstractmethod + def draw_transfer(self, ax, transfer_matrix_nondirectional, max_transfer: float=None, min_transfer: float=None, + redraw_points: bool=True, **kwargs): + pass + + @abstractmethod + def add_benchmark_name(self, ax, b): + pass + + @abstractmethod + def setup_large_plot(self, num_benchmarks: int): + pass + + +class V100Drawer(GPUDrawer): + EDGE = 0.8 + ANGLE = 35 + FORSHORTENING = 1 / 3 # Proportional scaling of the axonometry; + X_STEP = EDGE * FORSHORTENING + Y_STEP = EDGE * FORSHORTENING * np.tan(np.deg2rad(ANGLE)) + CPU_VSTEP = EDGE / 3 + + POINTS = [ + # Front face; + [0, 0], # 0: Lower left; + [0, EDGE], # 1: Upper left; + [EDGE, EDGE], # 2: Upper right; + [EDGE, 0], # 3: Lower right; + # Right face; + [EDGE + X_STEP, Y_STEP], # 4: Lower + [EDGE + X_STEP, EDGE + Y_STEP], # 5: Upper + # 6: Lower left corner; + [X_STEP, Y_STEP], + # 7: Upper left corner; + [X_STEP, EDGE + Y_STEP], + ] + + # Associate GPUs to points; + GPU = { + 0: POINTS[2], + 1: POINTS[1], + 2: POINTS[7], + 3: POINTS[5], + 4: POINTS[6], + 5: POINTS[4], + 6: POINTS[3], + 7: POINTS[0], + } + + CPU_POINTS = [ + [EDGE / 2 + X_STEP / 2, Y_STEP * FORSHORTENING - CPU_VSTEP], + [EDGE / 2 + X_STEP / 2, EDGE + Y_STEP * FORSHORTENING + CPU_VSTEP], + ] + + CPU = { + 0: CPU_POINTS[0], + 1: CPU_POINTS[1], + } + + def setup(self, fig=None, ax=None): + if fig == None and ax == None: + plt.rcdefaults() + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # Create figure and axes + fig, ax = plt.subplots() + + # Have axes with the same scale; + plt.xlim(-0.6, self.EDGE * 3/2 + 0.4) + plt.ylim(-0.6, self.EDGE * 3/2 + 0.4) + ax.set_aspect("equal") + plt.axis("off") + return fig, ax + + # Draw corners of the cube; + def draw_points(self, ax, alpha_scale=1): + # Obtain list of coordinates; + x = [x[0] for x in self.POINTS] + y = [y[1] for y in self.POINTS] + 
ax.scatter(x, y, color="#2f2f2f", alpha=alpha_scale, zorder=10) + return ax + + # Draw names of GPUs; + def draw_gpu_names(self, ax): + x_offset = { + 2: -self.EDGE / 5, + 3: -self.EDGE / 10, + 6: -self.EDGE / 8, + 7: -self.EDGE / 5, + } + y_offset = { + 0: -self.EDGE / 10, + 1: -self.EDGE / 10, + 2: self.EDGE / 25, + 3: self.EDGE / 25, + 6: -self.EDGE / 8, + 7: -self.EDGE / 8, + } + for i, (g, p) in enumerate(self.GPU.items()): + x = p[0] + self.EDGE * 0.02 + (x_offset[g] if g in x_offset else 0) + y = p[1] + self.EDGE * 0.01 + (y_offset[g] if g in y_offset else self.EDGE / 70) + ax.annotate(f"GPU{g}", xy=(x, y), color="#2f2f2f", fontsize=10, ha="left") + return ax + + # Draw a single line between GPUs; + def draw_line_gpu(self, ax, x, y, style): + x = self.GPU[x] + y = self.GPU[y] + ax.plot((x[0], y[0]), (x[1], y[1]), **style) + + # Join corners; + def draw_edges(self, ax, alpha_scale=1): + # Double NVLink; + style_nv2 = dict( + linewidth=2, + linestyle="-", + color="#2f2f2f", + alpha=0.9 * alpha_scale, + solid_capstyle="round", + ) + # Single NVLink; + style_nv1 = dict( + linewidth=0.8, + linestyle="--", + color="#2f2f2f", + alpha=0.7 * alpha_scale, + solid_capstyle="round", + ) + # Missing edge is PCIe; + + # Connect GPUs; + self.draw_line_gpu(ax, 0, 1, style_nv1) + self.draw_line_gpu(ax, 0, 2, style_nv1) + self.draw_line_gpu(ax, 0, 3, style_nv2) + self.draw_line_gpu(ax, 0, 6, style_nv2) + self.draw_line_gpu(ax, 1, 2, style_nv2) + self.draw_line_gpu(ax, 1, 3, style_nv1) + self.draw_line_gpu(ax, 1, 7, style_nv2) + self.draw_line_gpu(ax, 2, 3, style_nv2) + self.draw_line_gpu(ax, 2, 4, style_nv1) + self.draw_line_gpu(ax, 3, 5, style_nv1) + self.draw_line_gpu(ax, 4, 5, style_nv2) + self.draw_line_gpu(ax, 4, 6, style_nv1) + self.draw_line_gpu(ax, 4, 7, style_nv2) + self.draw_line_gpu(ax, 5, 6, style_nv2) + self.draw_line_gpu(ax, 5, 7, style_nv1) + self.draw_line_gpu(ax, 6, 7, style_nv1) + + return ax + + # Draw faces of the cube; + def draw_faces(self, ax): + style 
= dict( + linewidth=1, + linestyle="--", + edgecolor="#2f2f2f", + facecolor="#2f2f2f", + alpha=0.1, + ) + points = self.POINTS + patches_list = [ + patches.Polygon(xy=[points[0], points[1], points[2], points[3]], **style), + patches.Polygon(xy=[points[2], points[5], points[4], points[3]], **style), + patches.Polygon(xy=[points[2], points[5], points[7], points[1]], **style), + patches.Polygon(xy=[points[0], points[3], points[4], points[6]], **style), + patches.Polygon(xy=[points[0], points[1], points[7], points[6]], **style), + patches.Polygon(xy=[points[6], points[4], points[5], points[7]], **style), + ] + for p in patches_list: + ax.add_patch(p) + return ax + + def draw_cpu_points(self, ax, alpha_scale=1): + # Obtain list of coordinates; + x = [x[0] for x in self.CPU_POINTS] + y = [y[1] for y in self.CPU_POINTS] + ax.scatter(x, y, color="#888888", alpha=alpha_scale, zorder=10) + return ax + + def draw_pci(self, ax, gpu0, gpu1, vertical_start, upper=True, style=None): + medium_point = [(gpu0[0] + gpu1[0]) / 2, gpu0[1]] + x_step = self.X_STEP / 2 - gpu0[0] + y_step = self.Y_STEP * self.FORSHORTENING + self.CPU_VSTEP * (1 if upper else -1) + cpu_point = [medium_point[0] + x_step, vertical_start + y_step] + + t = np.sqrt(y_step**2 + x_step**2) / 2 + alpha = np.arctan(y_step / np.abs(x_step)) + y_offset = np.sin(alpha) * t + + x_offset = (cpu_point[0] - medium_point[0]) / 2 + split_point = [medium_point[0] + x_offset, vertical_start + y_offset + (gpu0[1] - vertical_start) / 2] + + ax.plot((cpu_point[0], split_point[0]), (cpu_point[1], split_point[1]), **style) + ax.plot((split_point[0], gpu0[0]), (split_point[1], gpu0[1]), **style) + ax.plot((split_point[0], gpu1[0]), (split_point[1], gpu1[1]), **style) + + def draw_cpu_lines(self, ax, alpha_scale=1): + style = dict( + color="#888888", + alpha=0.8 * alpha_scale, + linestyle="-", + linewidth=1, + solid_capstyle="round", + ) + + self.draw_pci(ax, self.GPU[1], self.GPU[0], self.EDGE, style=style) + self.draw_pci(ax, 
self.GPU[2], self.GPU[3], self.EDGE, style=style) + self.draw_pci(ax, self.GPU[7], self.GPU[6], 0, False, style=style) + self.draw_pci(ax, self.GPU[4], self.GPU[5], 0, False, style=style) + return ax + + # Draw names of CPUs; + def draw_cpu_names(self, ax): + y_offset = { + 0: -self.EDGE / 10, + } + for c in [0, 1][::-1]: + p = self.CPU_POINTS[c] + x = p[0] + self.EDGE * 0.02 + y = p[1] + self.EDGE * 0.01 + (y_offset[c] if c in y_offset else self.EDGE / 70) + ax.annotate(f"CPU{c}", xy=(x, y), color="#888888", fontsize=10, ha="left") + return ax + + # Draw the GPU topology; + def draw_topology(self, fig=None, ax=None, **kwargs): + fig, ax = self.setup(fig, ax) + ax = self.draw_cpu_lines(ax, **kwargs) + ax = self.draw_edges(ax, **kwargs) + # ax = draw_faces(ax) + ax = self.draw_points(ax, **kwargs) + ax = self.draw_cpu_points(ax, **kwargs) + ax = self.draw_gpu_names(ax) + ax = self.draw_cpu_names(ax) + return fig, ax + + def draw_pci_transfer(self, ax, cpu, gpu, other_gpu, vertical_start, upper=True, style=None): + medium_point = [(gpu[0] + other_gpu[0]) / 2, gpu[1]] + x_step = self.X_STEP / 2 - min(gpu[0], other_gpu[0]) + y_step = self.Y_STEP * self.FORSHORTENING + self.CPU_VSTEP * (1 if upper else -1) + cpu_point = [medium_point[0] + x_step, vertical_start + y_step] + + t = np.sqrt(y_step**2 + x_step**2) / 2 + alpha = np.arctan(y_step / np.abs(x_step)) + y_offset = np.sin(alpha) * t + + x_offset = (cpu_point[0] - medium_point[0]) / 2 + split_point = [medium_point[0] + x_offset, vertical_start + y_offset + (gpu[1] - vertical_start) / 2] + ax.plot((split_point[0], gpu[0]), (split_point[1], gpu[1]), **style) + + def draw_pci_transfer_cpu(self, ax, cpu, gpu, other_gpu, vertical_start, upper=True, style=None, zorder=None): + medium_point = [(gpu[0] + other_gpu[0]) / 2, gpu[1]] + x_step = self.X_STEP / 2 - min(gpu[0], other_gpu[0]) + y_step = self.Y_STEP * self.FORSHORTENING + self.CPU_VSTEP * (1 if upper else -1) + cpu_point = [medium_point[0] + x_step, vertical_start + 
y_step] + + t = np.sqrt(y_step**2 + x_step**2) / 2 + alpha = np.arctan(y_step / np.abs(x_step)) + y_offset = np.sin(alpha) * t + + x_offset = (cpu_point[0] - medium_point[0]) / 2 + split_point = [medium_point[0] + x_offset, vertical_start + y_offset + (gpu[1] - vertical_start) / 2] + ax.plot((cpu_point[0], split_point[0]), (cpu_point[1], split_point[1]), zorder=zorder, **style) + + # Draw the transfer between devices; + def draw_transfer(self, ax, transfer_matrix_nondirectional, max_transfer: float=None, min_transfer: float=None, + redraw_points: bool=True, **kwargs): + + PALETTE = sns.color_palette("YlOrBr", as_cmap=True)# sns.color_palette("YlOrBr", as_cmap=True) + MIN_PAL = 0.2 + MAX_PAL = 0.5 + MAX_WIDTH = 4 + MIN_WIDTH = 0.5 + if max_transfer is None: + max_transfer = transfer_matrix_nondirectional.max().max() + if min_transfer is None: + min_transfer = transfer_matrix_nondirectional.min().min() + + def style_gpu(transfer): + return dict( + linewidth=transfer * (MAX_WIDTH - MIN_WIDTH) + MIN_WIDTH, + linestyle="-", + color=PALETTE(transfer * (MAX_PAL - MIN_PAL) + MIN_PAL), + alpha=0.7, + solid_capstyle="round", + ) + + # Shared PCI express channels; + total_pci_01 = transfer_matrix_nondirectional.loc[transfer_matrix_nondirectional.index.isin(["0", "1"])]["CPU"].sum() + total_pci_23 = transfer_matrix_nondirectional.loc[transfer_matrix_nondirectional.index.isin(["2", "3"])]["CPU"].sum() + total_pci_45 = transfer_matrix_nondirectional.loc[transfer_matrix_nondirectional.index.isin(["4", "5"])]["CPU"].sum() + total_pci_67 = transfer_matrix_nondirectional.loc[transfer_matrix_nondirectional.index.isin(["6", "7"])]["CPU"].sum() + self.draw_pci_transfer_cpu(ax, self.CPU[0], self.GPU[1], self.GPU[0], self.EDGE, style=style_gpu(total_pci_01), zorder=9) + self.draw_pci_transfer_cpu(ax, self.CPU[0], self.GPU[3], self.GPU[2], self.EDGE, style=style_gpu(total_pci_23), zorder=9) + self.draw_pci_transfer_cpu(ax, self.CPU[1], self.GPU[4], self.GPU[5], 0, False, 
style=style_gpu(total_pci_45)) + self.draw_pci_transfer_cpu(ax, self.CPU[1], self.GPU[7], self.GPU[6], 0, False, style=style_gpu(total_pci_67)) + + # All the other channels; + for ii, i in enumerate(transfer_matrix_nondirectional.index): + for jj, j in enumerate(transfer_matrix_nondirectional.columns): + # Symmetric matrix, the lower triangular part is skipped; + if ii > jj: + continue + transfer = transfer_matrix_nondirectional.loc[i, j] + if transfer > 0: + # Draw GPU-GPU transfer; + if i != "CPU" and j != "CPU": + self.draw_line_gpu(ax, int(i), int(j), style_gpu(transfer)) + # Draw CPU-GPU transfer; + else: + if j == "CPU": + gpu = int(i) + if gpu < 4: + self.draw_pci_transfer(ax, self.CPU[0], self.GPU[gpu], self.GPU[(gpu + 1) % 2 + (2 if gpu > 1 else 0)], self.EDGE, style=style_gpu(transfer)) + elif gpu >= 4: + self.draw_pci_transfer(ax, self.CPU[1], self.GPU[gpu], self.GPU[(gpu + 1) % 2 + (6 if gpu > 5 else 4)], 0, False, style=style_gpu(transfer)) + return ax + + def add_benchmark_name(self, ax, b): + ax.annotate(b.upper(), xy=(0.78, 0.85), xycoords="axes fraction", ha="left", color="#2f2f2f", fontsize=14, alpha=1) + return ax + + def setup_large_plot(self, num_benchmarks: int): + plt.rcdefaults() + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + + cols = num_benchmarks + rows = 1 + scale = 2.5 + fig, ax = plt.subplots(figsize=(cols * scale, scale)) + gs = gridspec.GridSpec(rows, cols) + plt.subplots_adjust( + top=1, + bottom=0, + left=0, + right=0.992, + hspace=0.01, + wspace=0.01) + plt.axis("off") + return fig, ax, gs + + +class A100Drawer(GPUDrawer): + + NUM_GPUS = 8 + NUM_CPUS = 2 + X_MIN = 0 + X_MAX = 3.8 + Y_MIN = 0 + Y_MAX = 1.2 + + X_RANGE = X_MAX - X_MIN + Y_RANGE = Y_MAX - Y_MIN + X_OFFSET = X_RANGE * 0.05 + GPU_RANGE = (X_RANGE - 2 * X_OFFSET) + X_SHIFT = GPU_RANGE / NUM_GPUS + Y_GPU = 0.5 * Y_RANGE + Y_CPU = 0.9 * Y_RANGE + GPU_GROUP_SIZE = NUM_GPUS // NUM_CPUS + STEP_SIZE_GPU = GPU_RANGE / (NUM_GPUS - 1) + X_OFFSET_CPU = X_OFFSET + 
STEP_SIZE_GPU * (GPU_GROUP_SIZE - 1) / 2
+ Y_NVSWITCH = 0.1 * Y_RANGE
+ Y_OFFSET_NVSWITCH = Y_GPU - Y_NVSWITCH
+
+ X_GPU_POINTS = np.linspace(X_OFFSET, X_RANGE - X_OFFSET, NUM_GPUS)
+ Y_GPU_POINTS = [Y_GPU] * NUM_GPUS
+ X_CPU_POINTS = np.linspace(X_OFFSET_CPU, X_RANGE - X_OFFSET_CPU, NUM_CPUS)
+ Y_CPU_POINTS = [Y_CPU] * NUM_CPUS
+ CPU = [[x, y] for x, y in zip(X_CPU_POINTS, Y_CPU_POINTS)]
+ GPU = [[x, y] for x, y in zip(X_GPU_POINTS, Y_GPU_POINTS)]
+
+ def setup(self, fig=None, ax=None):
+ if fig is None and ax is None:
+ plt.rcdefaults()
+ plt.rcParams["font.family"] = ["Latin Modern Roman Demi"]
+ # Create figure and axes
+ fig, ax = plt.subplots(figsize=(self.X_RANGE, self.Y_RANGE))
+ plt.subplots_adjust(
+ top=1,
+ bottom=0,
+ left=0,
+ right=0.94)
+
+ # Have axes with the same scale;
+ plt.xlim(self.X_MIN, self.X_MAX)
+ plt.ylim(self.Y_MIN, self.Y_MAX)
+ ax.set_aspect("equal")
+ plt.axis("off")
+ return fig, ax
+
+ def draw_points(self, ax, alpha_scale=1):
+ ax.scatter(self.X_GPU_POINTS, self.Y_GPU_POINTS, color="#2f2f2f", alpha=alpha_scale, zorder=100)
+ return ax
+
+ def draw_cpu_points(self, ax, alpha_scale=1):
+ ax.scatter(self.X_CPU_POINTS, self.Y_CPU_POINTS, edgecolor="#888888", color="w", alpha=alpha_scale, zorder=100)
+ return ax
+
+ def draw_cpu_lines(self, ax, alpha_scale=1):
+ style = dict(
+ color="#888888",
+ alpha=0.8 * alpha_scale,
+ linestyle="-",
+ linewidth=1,
+ solid_capstyle="round",
+ )
+ style_cpu_link = dict(
+ color="#888888",
+ alpha=0.8 * alpha_scale,
+ linestyle=":",
+ linewidth=1,
+ solid_capstyle="round",
+ )
+
+ for cpu in range(self.NUM_CPUS):
+ for gpu in range(self.GPU_GROUP_SIZE):
+ gpu_tot = gpu + cpu * self.GPU_GROUP_SIZE
+ ax.plot((self.CPU[cpu][0], self.GPU[gpu_tot][0]), (self.CPU[cpu][1], self.GPU[gpu_tot][1]), **style)
+ ax.plot((self.CPU[0][0], self.CPU[-1][0]), (self.CPU[-1][1], self.CPU[-1][1]), **style_cpu_link)
+
+ return ax
+
+ def draw_gpu_lines(self, ax, alpha_scale=1):
+ style_nv = dict(
+ linewidth=1,
+
linestyle="-", + color="#2f2f2f", + alpha=1 * alpha_scale, + solid_capstyle="round", + ) + style_switch = dict( + linewidth=5, + linestyle="-", + color="#2f2f2f", + alpha=1 * alpha_scale, + solid_capstyle="round", + ) + + for gpu in range(self.NUM_GPUS): + ax.plot((self.GPU[gpu][0], self.GPU[gpu][0]), (self.Y_NVSWITCH, self.GPU[gpu][1]), **style_nv) + ax.plot((self.X_OFFSET, self.X_OFFSET + self.GPU_RANGE), (self.Y_NVSWITCH, self.Y_NVSWITCH), **style_switch) + return ax + + # Draw names of CPUs; + def draw_cpu_names(self, ax): + for i, c in enumerate(self.CPU): + x = c[0] + self.X_RANGE * 0.01 + y = c[1] + self.Y_RANGE * 0.01 + ax.annotate(f"CPU{i}", xy=(x, y), color="#2f2f2f", fontsize=9, ha="left") + return ax + + # Draw names of GPUs; + def draw_gpu_names(self, ax): + for i, g in enumerate(self.GPU): + x = g[0] + self.X_RANGE * 0.005 + y = g[1] - self.Y_RANGE * 0.14 + ax.annotate(f"GPU{i}", xy=(x, y), color="#2f2f2f", fontsize=9, ha="left") + return ax + + def draw_other_names(self, ax): + x = self.GPU[-1][0] + self.X_RANGE * 0.005 + y = self.Y_NVSWITCH + self.Y_RANGE * 0.04 + ax.annotate("NV12", xy=(x, y), color="#888888", fontsize=7, ha="left") + + x = self.GPU[0][0] + self.X_RANGE * 0.005 + y = self.Y_NVSWITCH - self.Y_RANGE * 0.1 + ax.annotate("NVSwitch", xy=(x, y), color="#888888", fontsize=7, ha="left") + + x = np.mean([c[0] for c in self.CPU]) + y = self.CPU[0][1] - self.Y_RANGE * 0.08 + ax.annotate("Infinity Fabric", xy=(x, y), ha="center", color="#888888", fontsize=7) + + x = np.mean([self.CPU[0][0], self.GPU[0][0]]) - self.X_RANGE * 0.02 + y = np.mean([self.CPU[0][1], self.GPU[0][1]]) + self.Y_RANGE * 0.02 + angle = np.rad2deg(np.arctan((self.CPU[0][1] - self.GPU[0][1]) / (self.CPU[0][0] - self.GPU[0][0]))) + ax.annotate("PCIe 4.0 x16", xy=(x, y), ha="center", color="#888888", + fontsize=7, rotation=angle, rotation_mode="anchor") + + return ax + + def draw_topology(self, fig=None, ax=None, **kwargs): + fig, ax = self.setup(fig, ax) + ax = 
self.draw_cpu_lines(ax, **kwargs) + ax = self.draw_gpu_lines(ax, **kwargs) + ax = self.draw_points(ax, **kwargs) + ax = self.draw_cpu_points(ax, **kwargs) + ax = self.draw_gpu_names(ax) + ax = self.draw_cpu_names(ax) + ax = self.draw_other_names(ax) + return fig, ax + + def draw_transfer(self, ax, transfer_matrix_nondirectional, max_transfer: float=None, min_transfer: float=None, + redraw_points: bool=True, **kwargs): + PALETTE = sns.color_palette("YlOrBr", as_cmap=True) + MIN_PAL = 0.2 + MAX_PAL = 0.5 + MAX_WIDTH = 6 + MIN_WIDTH = 1.5 + if max_transfer is None: + max_transfer = transfer_matrix_nondirectional.max().max() + if min_transfer is None: + min_transfer = transfer_matrix_nondirectional.min().min() + + def style_gpu(transfer): + return dict( + linewidth=transfer * (MAX_WIDTH - MIN_WIDTH) + MIN_WIDTH, + linestyle="-", + color=PALETTE(transfer * (MAX_PAL - MIN_PAL) + MIN_PAL), + alpha=0.7, + solid_capstyle="round", + ) + + # PCI express channels; + for cpu in range(self.NUM_CPUS): + for gpu in range(self.GPU_GROUP_SIZE): + gpu_tot = gpu + cpu * self.GPU_GROUP_SIZE + if str(gpu_tot) in transfer_matrix_nondirectional.index: + cpu_gpu_transfer = transfer_matrix_nondirectional.loc[str(gpu_tot), :]["CPU"] + if cpu_gpu_transfer > 0: + ax.plot((self.CPU[cpu][0], self.GPU[gpu_tot][0]), (self.CPU[cpu][1], self.GPU[gpu_tot][1]), **style_gpu(cpu_gpu_transfer)) + + # All the other channels; + for gpu in range(self.NUM_GPUS): + if str(gpu) in transfer_matrix_nondirectional.index and "NVSwitch" in transfer_matrix_nondirectional.index: + gpu_switch_transfer = transfer_matrix_nondirectional.loc[str(gpu), :]["NVSwitch"] + if gpu_switch_transfer > 0: + ax.plot((self.GPU[gpu][0], self.GPU[gpu][0]), (self.Y_NVSWITCH, self.GPU[gpu][1]), **style_gpu(gpu_switch_transfer)) + if "NVSwitch" in transfer_matrix_nondirectional.index: + switch_transfer_tot = transfer_matrix_nondirectional["NVSwitch"].sum() + if switch_transfer_tot > 0: + ax.plot((self.X_OFFSET, self.X_OFFSET + 
self.GPU_RANGE), (self.Y_NVSWITCH, self.Y_NVSWITCH), **style_gpu(switch_transfer_tot)) + return ax + + def add_benchmark_name(self, ax, b): + ax.annotate(b.upper(), xy=(0.9, 0.88), xycoords="axes fraction", ha="left", color="#2f2f2f", fontsize=14, alpha=1) + return ax + + def setup_large_plot(self, num_benchmarks: int): + plt.rcdefaults() + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + + cols = num_benchmarks + rows = 1 + + fig, ax = plt.subplots(figsize=(self.X_RANGE * cols, self.Y_RANGE)) + gs = gridspec.GridSpec(rows, cols) + plt.subplots_adjust( + top=1, + bottom=0, + left=0, + right=0.992) + plt.axis("off") + return fig, ax, gs + + + +############################## +############################## + + +if __name__ == "__main__": + + if GPU == "V100": + gpu_drawer = V100Drawer() + elif GPU == "A100": + gpu_drawer = A100Drawer() + else: + raise ValueError(f"Unknown GPU={GPU}") + + fig, ax = gpu_drawer.draw_topology(alpha_scale=1) + save_plot(PLOT_DIR, f"{GPU.lower()}_topology" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + #%% Draw transfer of GPUs; + + # Obtain transfer max and min to normalize plots; + maximum_transfer = 0 + minimum_transfer = np.inf + num_benchmarks = 0 + for b in BENCHMARKS: + transfer_matrix = pd.read_csv(os.path.join(DEFAULT_RES_CUDA_DIR, INPUT_FOLDER, b + "_transfer_matrix.csv"), index_col=0) + maximum_transfer = max(transfer_matrix.max().max(), maximum_transfer) + minimum_transfer = min(transfer_matrix.min().min(), minimum_transfer) + num_benchmarks += 1 + + #%% A plot for every benchmark; + for b in BENCHMARKS: + fig, ax = gpu_drawer.draw_topology(alpha_scale=1) + transfer_matrix = pd.read_csv(os.path.join(DEFAULT_RES_CUDA_DIR, INPUT_FOLDER, b + "_transfer_matrix.csv"), index_col=0) + # Create non-directional matrix; + transfer_matrix_nondirectional = create_nondirectional_transfer_matrix(transfer_matrix) + # Normalize matrix; + transfer_matrix_nondirectional /= transfer_matrix_nondirectional.max().max() + # Draw colored edges; 
+ ax = gpu_drawer.draw_transfer(ax, transfer_matrix_nondirectional, max_transfer=maximum_transfer, min_transfer=minimum_transfer) + # Add benchmark name; + ax = gpu_drawer.add_benchmark_name(ax, b) + save_plot(PLOT_DIR, f"{GPU.lower()}_topology_{b}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + #%% A single plot with all benchmarks; + + fig, ax, gs = gpu_drawer.setup_large_plot(num_benchmarks) + + for bi, b in enumerate(BENCHMARKS): + ax = fig.add_subplot(gs[0, bi]) + fig, ax = gpu_drawer.draw_topology(alpha_scale=1, fig=fig, ax=ax) + transfer_matrix = pd.read_csv(os.path.join(DEFAULT_RES_CUDA_DIR, INPUT_FOLDER, b + "_transfer_matrix.csv"), index_col=0) + # Create non-directional matrix; + transfer_matrix_nondirectional = create_nondirectional_transfer_matrix(transfer_matrix) + # Normalize matrix; + transfer_matrix_nondirectional /= transfer_matrix_nondirectional.max().max() + # Draw colored edges; + ax = gpu_drawer.draw_transfer(ax, transfer_matrix_nondirectional, max_transfer=maximum_transfer, min_transfer=minimum_transfer) + # Add benchmark name; + ax = gpu_drawer.add_benchmark_name(ax, b) + save_plot(PLOT_DIR, f"{GPU.lower()}_topology" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + \ No newline at end of file diff --git a/projects/resources/python/plotting/plot_multi_gpu_grcuda_vs_cuda_exec_time.py b/projects/resources/python/plotting/plot_multi_gpu_grcuda_vs_cuda_exec_time.py new file mode 100644 index 00000000..a7c1e33b --- /dev/null +++ b/projects/resources/python/plotting/plot_multi_gpu_grcuda_vs_cuda_exec_time.py @@ -0,0 +1,236 @@ +# -*- coding: utf-8 -*- +""" +Created on Thu Jan 20 16:02:05 2022 + +@author: albyr +""" + +import os + +import matplotlib.gridspec as gridspec +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import seaborn as sns + +from load_data import (PLOT_DIR, BENCHMARK_NAMES, load_data_cuda_multigpu, + load_data_grcuda_multigpu) +import segretini_matplottini.src.plot_utils as pu +from plot_multi_gpu_cuda_exec_time 
import plot_speedup_bars, plot_speedup_line, plot_ablation_bars
+
+##############################
+##############################
+
+OUTPUT_DATE = "2022_01_20"
+
+# V100;
+V100 = "V100"
+V100_RES_FOLDERS_CUDA = [
+ "2022_01_16_18_09_04_cuda_1-2gpu_v100",
+ "2022_01_16_18_17_05_cuda_4gpu_v100",
+ ]
+V100_RES_FOLDERS_GRCUDA = [
+ "2022_01_18_16_10_41_grcuda_1-2gpu_v100",
+ "2022_01_18_10_01_23_grcuda_4gpu_v100",
+ ]
+
+
+class ResultGPU:
+ def __init__(self, gpu: str, results_grcuda: list, results_cuda: list):
+ self.gpu = gpu
+ self.results_grcuda_folders = results_grcuda
+ self.results_cuda_folders = results_cuda
+ self.results_grcuda = None
+ self.results_cuda = None
+ # Initialize the attribute read by join_grcuda_and_cuda_results() and group_merged_results();
+ self.res_merged = None
+
+ @staticmethod
+ def keep_useful_columns(data: pd.DataFrame, extra_columns: list=None, extra_index_columns: list=None) -> pd.DataFrame:
+ useful_columns = ["benchmark", "size", "gpus", "exec_policy", "device_selection_policy",
+ "num_iter", "computation_sec", "speedup", "baseline_time"]
+ index_columns = useful_columns[:6]
+ if extra_columns is not None:
+ useful_columns += [e for e in extra_columns if e not in useful_columns]
+ if extra_index_columns is not None:
+ index_columns += [e for e in extra_index_columns if (e not in index_columns) and (e in useful_columns)]
+ # Drop any missing column;
+ final_columns = [c for c in useful_columns if c in data.columns]
+ data = data[final_columns]
+ return data.sort_values(final_columns[:len(index_columns)]).reset_index(drop=True)
+
+ def load_cuda(self):
+ self.results_cuda = load_data_cuda_multigpu([os.path.join(self.gpu, f) for f in self.results_cuda_folders], skip_iter=3)
+ return self.results_cuda
+
+ def load_grcuda(self):
+ self.results_grcuda = load_data_grcuda_multigpu([os.path.join(self.gpu, f) for f in self.results_grcuda_folders], skip_iter=3)
+ return self.results_grcuda
+
+ def group_grcuda_results(self, group_sizes: bool=False, drop_sync: bool=False, drop_nan: bool=True):
+ if self.results_grcuda is None:
self.load_grcuda() + group_fields = ["benchmark", "exec_policy", "gpus"] + \ + (["size"] if not group_sizes else []) + \ + ["parent_stream_policy", "device_selection_policy"] + grouped = self.results_grcuda.groupby(group_fields).mean()[["computation_sec", "speedup"]].reset_index() + if drop_nan: + grouped = grouped.dropna().reset_index(drop=True) + if drop_sync: + grouped = grouped[grouped["exec_policy"] != "SYNC"] + return grouped + + def join_grcuda_and_cuda_results(self): + if self.results_grcuda is None: + self.load_grcuda() + if self.results_cuda is None: + self.load_cuda() + res_merged = self.results_grcuda.merge(self.results_cuda, how="left", + on=["benchmark", "size", "gpus", "exec_policy", + "prefetch", "num_blocks", "block_size_1d", + "block_size_2d", "num_iter", + "total_iterations", "block_size_str"], + suffixes=["_grcuda", "_cuda"]) + # Keep only the GrCUDA speedup vs. GrCUDA sync, and the raw execution time of GrCUDA and CUDA; + res_merged.rename(columns={"speedup_grcuda": "speedup_grcuda_vs_grcuda_async", "speedup_cuda": "speedup_cuda_vs_cuda_async"}, inplace=True) + columns_to_keep = [c for c in res_merged.columns if "cuda" not in c] + \ + ["computation_sec_grcuda", "computation_sec_cuda", "baseline_time_grcuda", "baseline_time_cuda"] + res_merged = res_merged[columns_to_keep + ["speedup_grcuda_vs_grcuda_async", "speedup_cuda_vs_cuda_async"]] + # Compute speedup of GrCUDA vs CUDA; + res_merged["speedup_grcuda_vs_cuda"] = res_merged["computation_sec_cuda"] / res_merged["computation_sec_grcuda"] + self.res_merged = res_merged + return self.res_merged + + def group_merged_results(self, group_sizes: bool=False, drop_sync: bool=False, drop_nan: bool=True): + if self.res_merged is None: + self.join_grcuda_and_cuda_results() + group_fields = ["benchmark", "exec_policy", "gpus"] + \ + (["size"] if not group_sizes else []) + \ + ["parent_stream_policy", "device_selection_policy"] + grouped = 
self.res_merged.groupby(group_fields).mean()[["computation_sec_grcuda", "computation_sec_cuda", + "speedup_grcuda_vs_grcuda_async", "speedup_cuda_vs_cuda_async", "speedup_grcuda_vs_cuda"]].reset_index() + if drop_nan: + grouped = grouped.dropna().reset_index(drop=True) + if drop_sync: + grouped = grouped[grouped["exec_policy"] != "SYNC"] + return grouped + + +V100_RESULTS = ResultGPU( + gpu=V100, + results_grcuda=V100_RES_FOLDERS_GRCUDA, + results_cuda=V100_RES_FOLDERS_CUDA + ) + +############################## +############################## + +def plot_grcuda_ablation(gpu_results: list[ResultGPU]): + + gpus = [2, 4] + + # Setup plot; + plt.rcdefaults() + sns.set_style("white", {"ytick.left": True, "xtick.bottom": False}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['hatch.linewidth'] = 0.3 + plt.rcParams['axes.labelpad'] = 5 + plt.rcParams['xtick.major.pad'] = 4.2 + plt.rcParams['ytick.major.pad'] = 1 + plt.rcParams['axes.linewidth'] = 0.5 + + fig = plt.figure(figsize=(3.4, 1.6), dpi=600) + gs = gridspec.GridSpec(len(gpus), len(gpu_results)) + plt.subplots_adjust(top=0.85, + bottom=0.08, + left=0.05, + right=0.99, + hspace=0.2, + wspace=0.15) + # Draw plot; + for g_i, g in enumerate(gpu_results): + data = g.join_grcuda_and_cuda_results() + for g_j, num_gpus in enumerate(gpus): + + # Create new axis; + ax = fig.add_subplot(gs[g_j, g_i]) + + res_for_ablation = data.query(f"gpus == {num_gpus}") + # FIXME: don't keep just these sizes; + chosen_sizes = pd.DataFrame.from_dict({k: [v] for k, v in {"VEC": 950000000, "B&S": 35000000, "ML": 1200000, "CG": 50000, "MUL": 60000}.items()}, orient="index").reset_index().rename(columns={"index": "benchmark", 0: "size"}) + res_for_ablation = res_for_ablation.merge(chosen_sizes, how="inner", on=["benchmark", "size"]) + res_for_ablation["benchmark"] = pd.Categorical(res_for_ablation["benchmark"], list(BENCHMARK_NAMES.values())) + # Clean data; + res_for_ablation = 
ResultGPU.keep_useful_columns(res_for_ablation, extra_columns=["parent_stream_policy", "device_selection_policy", "computation_sec_grcuda"]) + res_for_ablation = res_for_ablation.groupby(["benchmark", "size", "gpus", "exec_policy", "parent_stream_policy", "device_selection_policy"]).mean().dropna().reset_index() + # Compute speedup; + pu.compute_speedup_df(res_for_ablation, ["benchmark"], + baseline_filter_col=["parent_stream_policy", "device_selection_policy"], + baseline_filter_val=["multigpu-disjoint", "minmax-transfer-time"], + time_column="computation_sec_grcuda", aggregation=np.mean) + # Add a new column to identify the policy, and make it categorical; + res_for_ablation["policy"] = res_for_ablation["parent_stream_policy"] + "-" + res_for_ablation["device_selection_policy"] + res_for_ablation["policy"] = pd.Categorical(res_for_ablation["policy"], + ["disjoint-round-robin", "disjoint-stream-aware", + "disjoint-minmax-transfer-time", "multigpu-disjoint-minmax-transfer-time"]) + res_for_ablation = res_for_ablation.dropna().reset_index(drop=True) + # Plot; + plot_ablation_bars(res_for_ablation, g.gpu, num_gpus, fig=fig, ax=ax, ymax=1.2, yticks=7, plot_speedup_labels=False) + pu.save_plot(PLOT_DIR, f"grcuda_ablation_{g.gpu}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + +#%%########################### +############################## + +if __name__ == "__main__": + g = V100_RESULTS + res_cuda = g.load_cuda() + res_grcuda = g.load_grcuda() + res_grcuda_grouped = g.group_grcuda_results() + res_grcuda_grouped_small = g.group_grcuda_results(group_sizes=True, drop_sync=True) + res_merged = g.join_grcuda_and_cuda_results() + res_merged_grouped = g.group_merged_results(group_sizes=True, drop_sync=True) + + #%% 1: Barplot with average speedup of GrCUDA vs async, 1 GPU. 
Consider only the best policy; + res_for_barplot = res_merged[(res_merged["gpus"] == 1) | \ + ((res_merged["parent_stream_policy"] == "multigpu-disjoint") & \ + (res_merged["device_selection_policy"] == "minmax-transfer-time"))] + # FIXME: CG and ML are incomplete. Keep only the specified sizes; + chosen_sizes = pd.DataFrame.from_dict({k: [v] for k, v in {"VEC": 950000000, "B&S": 35000000, "ML": 1200000, "CG": 50000, "MUL": 60000}.items()}, orient="index").reset_index().rename(columns={"index": "benchmark", 0: "size"}) + res_for_barplot = res_for_barplot.merge(chosen_sizes, how="inner", on=["benchmark", "size"]) + res_for_barplot["benchmark"] = pd.Categorical(res_for_barplot["benchmark"], list(BENCHMARK_NAMES.values())) + # Simplify and aggregate; + res_for_barplot = ResultGPU.keep_useful_columns(res_for_barplot, extra_columns=list(res_for_barplot.columns[-5:])) + res_for_barplot = res_for_barplot.groupby(["benchmark", "size", "gpus", "exec_policy", "device_selection_policy"]).mean().dropna().reset_index() + # Plot; + plot_speedup_bars(res_for_barplot, f"GrCUDA, {g.gpu}", speedup_column="speedup_grcuda_vs_grcuda_async", baseline_is_async=True) + pu.save_plot(PLOT_DIR, f"grcuda_vs_grcuda_bars_{g.gpu}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + # Also plot results for CUDA; + # FIXME: don't filter sizes; + res_for_barplot_cuda = res_merged.merge(chosen_sizes, how="inner", on=["benchmark", "size"]) + res_for_barplot_cuda["benchmark"] = pd.Categorical(res_for_barplot_cuda["benchmark"], list(BENCHMARK_NAMES.values())) + # Clean data; + res_for_barplot_cuda = ResultGPU.keep_useful_columns(res_for_barplot_cuda, extra_columns=["speedup_cuda_vs_cuda_async", "baseline_time_cuda"]) + res_for_barplot_cuda = res_for_barplot_cuda.groupby(["benchmark", "size", "gpus", "exec_policy"]).mean().dropna().reset_index() + # Plot; + plot_speedup_bars(res_for_barplot_cuda, f"CUDA, {g.gpu}", speedup_column="speedup_cuda_vs_cuda_async", baseline_is_async=True) + pu.save_plot(PLOT_DIR, 
f"cuda_vs_cuda_bars_{g.gpu}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + #%% 2: Lineplot with speedup of GrCUDA, best policy, divided by size; + res_for_lineplot = res_merged[(res_merged["gpus"] == 1) | \ + ((res_merged["parent_stream_policy"] == "multigpu-disjoint") & \ + (res_merged["device_selection_policy"] == "minmax-transfer-time"))] + plot_speedup_line(res_for_lineplot, g.gpu, speedup_column="speedup_grcuda_vs_grcuda_async", baseline_time_column="baseline_time_grcuda", kind="GrCUDA") + pu.save_plot(PLOT_DIR, f"grcuda_vs_grcuda_lines_{g.gpu}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + plot_speedup_line(res_for_lineplot, g.gpu, speedup_column="speedup_cuda_vs_cuda_async", baseline_time_column="baseline_time_cuda") + pu.save_plot(PLOT_DIR, f"cuda_vs_cuda_lines_{g.gpu}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + #%% 3: Barplot of GrCUDA, best policy, versus CUDA; + plot_speedup_bars(res_for_barplot, f"GrCUDA vs. CUDA, {g.gpu}", speedup_column="speedup_grcuda_vs_cuda", + baseline_is_async=False, legend_title="Baseline: ASYNC, 1 GPU, CUDA", + legend_baseline_label="ASYNC, 1 GPU, GrCUDA", ymax=1.6, yticks=9) + pu.save_plot(PLOT_DIR, f"grcuda_vs_cuda_bars_{g.gpu}" + "_{}.{}", date=OUTPUT_DATE, dpi=600) + + #%% 4: Barplot of GrCUDA, best policy, versus other policies. Do it just for 4 GPUs; + plot_grcuda_ablation([g, g]) + + \ No newline at end of file diff --git a/projects/resources/python/plotting/plot_speedup_baseline.py b/projects/resources/python/plotting/plot_speedup_baseline.py new file mode 100755 index 00000000..2a237942 --- /dev/null +++ b/projects/resources/python/plotting/plot_speedup_baseline.py @@ -0,0 +1,681 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
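The script below aggregates per-iteration execution times into speedups and averages them with the geometric mean (`gmean` from `scipy.stats.mstats`, used as the seaborn estimator). The actual helper `compute_speedup` is imported from `load_data`; the following is only a minimal sketch of that aggregation pattern, with hypothetical column names (`policy`, `time_sec`) and `gmean` reimplemented via NumPy so it stands alone:

```python
import numpy as np
import pandas as pd

def gmean(x) -> float:
    # Geometric mean; matches scipy.stats.mstats.gmean for positive values.
    return float(np.exp(np.log(np.asarray(x, dtype=float)).mean()))

def compute_speedup_sketch(data: pd.DataFrame, baseline_policy: str = "sync") -> pd.DataFrame:
    # Average execution time per (benchmark, size, policy), then divide the
    # baseline's time by each policy's time to obtain a per-size speedup.
    mean_times = data.groupby(["benchmark", "size", "policy"], as_index=False)["time_sec"].mean()
    baseline = mean_times[mean_times["policy"] == baseline_policy] \
        .rename(columns={"time_sec": "baseline_sec"})[["benchmark", "size", "baseline_sec"]]
    merged = mean_times.merge(baseline, on=["benchmark", "size"])
    merged["speedup"] = merged["baseline_sec"] / merged["time_sec"]
    # Speedups are ratios, so they are averaged across sizes with the geometric mean.
    return merged.groupby(["benchmark", "policy"], as_index=False)["speedup"].agg(gmean)

runs = pd.DataFrame({
    "benchmark": ["b1"] * 4,
    "size": [100, 100, 200, 200],
    "policy": ["sync", "async", "sync", "async"],
    "time_sec": [2.0, 1.0, 4.0, 1.0],
})
speedups = compute_speedup_sketch(runs)
# async: gmean(2.0, 4.0) = sqrt(8) ≈ 2.83; sync: 1.0
```

The geometric mean is preferred here because the arithmetic mean of ratios over-weights large speedups; this mirrors how the plots below pass `estimator=gmean` to seaborn.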
+ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Sat Jun 20 14:14:30 2020 + +@author: alberto.parravicini +""" + +import numpy as np +import pandas as pd +import seaborn as sns +import matplotlib.pyplot as plt +import matplotlib.gridspec as gridspec +from scipy.stats.mstats import gmean +from matplotlib.patches import Patch, Rectangle +from matplotlib.collections import PatchCollection, LineCollection +import matplotlib.lines as lines +import math + +import os +from load_data import load_data, compute_speedup +from segretini_matplottini.src.plot_utils import COLORS, get_exp_label, get_ci_size, save_plot, remove_outliers_df_grouped + + +# INPUT_DATE = "2020_09_19_grcuda" +OUTPUT_DATE = "2020_10_14" +PLOT_DIR = "../../../../grcuda-data/plots" + +BENCHMARK_NAMES = {"b1": "Vector Squares", "b5": "B&S", "b8": "Images", "b6": "ML Ensemble", "b7": "HITS", "b10": "DL"} + +INPUT_DATE_960 = "960/2020_10_11_13_15_09_grcuda_baseline" +INPUT_DATE_P100 = "P100/2020_10_13_10_03_48_grcuda_baseline" # "2020_09_29_17_30_03_grcuda_forceprefetch" +# INPUT_DATE_P100_NP = "P100/2020_09_19_grcuda_no_prefetch" +INPUT_DATE_1660 = "1660/2020_10_13_18_21_04_grcuda_baseline" + + +def build_exec_time_plot(data, gridspec, x, y): + + data["size_str"] = data["size"].astype(str) + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="computation_speedup", data=data, color=COLORS["bb1"], ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=False, sort=False, ci=None, zorder=2) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Add rectangles to represent variance + rectangles = [] + for s_i, s in enumerate(labels): + curr_data = data[data["size"] == s] + upper_ci_size, lower_ci_size, center = get_ci_size(curr_data["computation_speedup"], estimator=gmean, ci=0.90) + bottom = center - lower_ci_size + width = 0.1 + lower_left 
= [s_i - width / 2, bottom] + # Add an offset to the x position, to avoid overlapping; + rectangles += [Rectangle(lower_left, width, upper_ci_size + lower_ci_size)] + + pc = PatchCollection(rectangles, facecolor="white", edgecolor="#2f2f2f", linewidth=0.5, zorder=3, clip_on=True, alpha=0.7) + ax.add_collection(pc) + + # Top y-lim is depends on the benchmark, and is multiple of 1.5; + max_y_val = np.max(data.groupby(["block_size_str", "size_str"])["computation_speedup"].median()) + fixed_max_y_val = np.ceil(max_y_val / 1.5) * 1.5 + + ax.set_ylim((0.8, fixed_max_y_val)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=45, ha="right", fontsize=9, rotation_mode="anchor") + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(8)) + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=9) + + # if y == 0: + # ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=12) + # else: + # ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # # Hide tick markers; + # for tic in ax.yaxis.get_major_ticks(): + # tic.tick1line.set_visible(False) + # tic.tick2line.set_visible(False) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=14, ha="center", xycoords="axes fraction") + ax.annotate(f"Baseline exec. 
time (ms):", xy=(0, -0.37), fontsize=9, ha="left", xycoords="axes fraction", color=COLORS["r4"]) + + for i, l in enumerate(labels): + baseline_median = np.median(data[data["size"] == int(l)]["baseline_time_sec"]) + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.47), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Add block size annotation; + if y == 0: + ax.annotate(f"Block size:\n1D={data['block_size_1d'].iloc[0]}, 2D={data['block_size_2d'].iloc[0]}x{data['block_size_2d'].iloc[0]}", xy=(-0.65, 1.25), fontsize=14, ha="left", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + return ax + + +def build_exec_time_plot_1_row(data, gridspec, y): + + data["size_str"] = data["size"].astype(str) + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b4"]] + markers = ["o", "X", "D", "P"] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[0, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + ax = sns.lineplot(x="size_str", y="computation_speedup", hue="block_size_str", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, zorder=2) + print(data.groupby(["size_str", "block_size_str"])["computation_speedup"].apply(gmean)) + data_averaged = data.groupby(["size_str", "block_size_str"], as_index=True)["computation_speedup"].apply(gmean).reset_index() + order = data["block_size_str"].unique() + ax = sns.scatterplot(x="size_str", y="computation_speedup", hue="block_size_str", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="block_size_str", hue_order=order, style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Top y-lim is depends on the benchmark, and is multiple of 1.5; + max_y_val = np.max(data.groupby(["block_size_str", 
"size_str"])["computation_speedup"].median()) + fixed_max_y_val = np.ceil(max_y_val / 1.5) * 1.5 + + ax.set_ylim((0.8, fixed_max_y_val)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=0, ha="center", fontsize=8) + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(8)) + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=9) + + # if y == 0: + # ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=12) + # else: + # ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # # Hide tick markers; + # for tic in ax.yaxis.get_major_ticks(): + # tic.tick1line.set_visible(False) + # tic.tick2line.set_visible(False) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=14, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + # Add baseline execution time annotations (median of execution time across blocks); + ax.annotate(f"Median baseline exec. 
time (ms):", xy=(0, -0.2), fontsize=9, ha="left", xycoords="axes fraction", color=COLORS["r4"]) + for i, l in enumerate(labels): + baseline_median = np.median(data[data["size"] == int(l)]["baseline_time_sec"]) + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.27), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if y == 0: + legend_labels = [f"1D={x.split(',')[0]}, 2D={x.split(',')[1]}" for x in data["block_size_str"].unique()] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.955, 0.94), fontsize=12, ncol=len(legend_labels), handletextpad=0.1) + leg.set_title("Block size:") + leg._legend_box.align = "left" + + return ax + + +def build_exec_time_plot_2_row(data, gridspec, fig, i, j): + + data["size_str"] = data["size"].astype(str) + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b4"]] + markers = ["o", "X", "D", "P"] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[i, j]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + ax = sns.lineplot(x="size_str", y="computation_speedup", hue="block_size_str", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, zorder=2) + print(data.groupby(["size_str", "block_size_str"])["computation_speedup"].apply(gmean)) + data_averaged = data.groupby(["size_str", "block_size_str"], as_index=True)["computation_speedup"].apply(gmean).reset_index() + order = data["block_size_str"].unique() + ax = sns.scatterplot(x="size_str", y="computation_speedup", hue="block_size_str", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="block_size_str", hue_order=order, 
style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Top y-lim depends on the benchmark, and is a multiple of 1.5; + max_y_val = np.max(data.groupby(["block_size_str", "size_str"])["computation_speedup"].median()) + fixed_max_y_val = np.ceil(max_y_val / 1.5) * 1.5 + + ax.set_ylim((0.9, fixed_max_y_val)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(7)) + if j == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=10) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.05), fontsize=12, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + # Add baseline execution time annotations (median of execution time across blocks); + ax.annotate(f"Median baseline exec. 
time (ms):", xy=(0, -0.27), fontsize=9, ha="left", xycoords="axes fraction", color=COLORS["peach1"]) + for l_i, l in enumerate(labels): + baseline_median = np.median(data[data["size"] == int(l)]["baseline_time_sec"]) + ax.annotate(f"{int(1000 * baseline_median)}", xy=(l_i, -0.37), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if i == 0 and j == 0: + legend_labels = [f"1D={x.split(',')[0]}" for x in data["block_size_str"].unique()] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + # Add fake entries to have a comment about 2d and 3d sizes; + # custom_lines += [Rectangle((0, 0), 1, 1, fc="w", fill=False, edgecolor='none', linewidth=0)] * 2 + # legend_labels += ["", ""] + # # Re-sort labels by transposing them; + # custom_lines = np.array(custom_lines).reshape((-1, 2)).T.reshape(-1) + # legend_labels = np.array(legend_labels).reshape((-1, 2)).T.reshape(-1) + + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.99, 1), fontsize=10, ncol=len(legend_labels), handletextpad=0.1, columnspacing=0.2) + leg.set_title("Block size:\n2D=8x8, 3D=4x4x4", prop={"size": 10}) + leg._legend_box.align = "left" + + return ax + + +def build_exec_time_plot_2_row_multigpu(data, gridspec, fig, i, j): + + data["size_str"] = data["size"].astype(str) + + # Add prefetching or not to GPU name; + data["gpu_original"] = data["gpu"].copy() + # data["gpu"] += np.where(data["exec_policy_full"] == "sync_f", ", sync with prefetch", "") + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b3"], COLORS["b5"]][:len(data["gpu"].unique())] + markers = ["o", "X", "D", "X", "D"][:len(data["gpu"].unique())] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[i, j]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + ax = sns.lineplot(x="size_str", 
y="computation_speedup", hue="gpu", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, zorder=2) + # print(data.groupby(["size_str", "gpu"])["computation_speedup"].apply(gmean)) + data_averaged = data.groupby(["size_str", "gpu"], as_index=True)["computation_speedup"].apply(gmean).reset_index() + order = data["gpu"].unique() + ax = sns.scatterplot(x="size_str", y="computation_speedup", hue="gpu", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="gpu", hue_order=order, style_order=order, linewidth=0.05) + + size_dict = {v: i for i, v in enumerate(sorted(data["size"].unique()))} + + # Top y-lim depends on the benchmark, and is a multiple of 1.5; + max_y_val = np.max(data.groupby(["gpu", "size_str"])["computation_speedup"].median()) + # fixed_max_y_val = np.ceil(max_y_val / 1.5) * 1.5 + fixed_max_y_val = 3 if i == 0 else 1.8 + + # Obtain max/min for each block size; + max_speedup = {} + min_speedup = {} + data_block_aggregated = data.groupby(["size_str", "gpu", "block_size_str"], as_index=True)["computation_speedup"].apply(gmean).reset_index() + for (size, gpu), g in data_block_aggregated.groupby(["size_str", "gpu"], as_index=True): + curr_min = np.inf + curr_min_b = 0 + curr_max = 0 + curr_max_b = 0 + for r_i, r in g.iterrows(): + if r["computation_speedup"] >= curr_max: + curr_max = r["computation_speedup"] + curr_max_b = r["block_size_str"] + if r["computation_speedup"] <= curr_min: + curr_min = r["computation_speedup"] + curr_min_b = r["block_size_str"] + if gpu not in max_speedup: + max_speedup[gpu] = [] + if gpu not in min_speedup: + min_speedup[gpu] = [] + max_speedup[gpu] += [(size, curr_max, curr_max_b)] + min_speedup[gpu] += [(size, curr_min, curr_min_b)] + for g in data["gpu"].unique(): + tmp_lines = [[(size_dict[int(e[0][0])], e[0][1]), (size_dict[int(e[0][0])], e[1][1])] for e in zip(min_speedup[g], 
max_speedup[g])] + lc = LineCollection(tmp_lines, color="#888888", alpha=0.8, linewidths=0.5) + ax.add_collection(lc) + for g in data["gpu"].unique(): + for e in zip(min_speedup[g], max_speedup[g]): + if (e[1][1] - e[0][1] > (0.3 if i == 0 else 0.1)) and not (b == "b6" and g in ["GTX960", "GTX1660 Super"]): + v_offset = 0.05 if i == 0 else 0.01 + ax.annotate(f"{e[0][2].split(',')[0]}", xy=(size_dict[int(e[0][0])] + 0.02, e[0][1] - v_offset), fontsize=6, ha="left", va="center", color="#2f2f2f", alpha=0.9,) + ax.annotate(f"{e[1][2].split(',')[0]}", xy=(size_dict[int(e[1][0])] + 0.02, min(fixed_max_y_val, e[1][1] + v_offset)), fontsize=6, ha="left", va="center", color="#2f2f2f", alpha=0.9,) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + ax.set_ylim((0.8, fixed_max_y_val)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + odd_ticks = 0 if (len(labels_str) % 2 == 1) else 1 + ax.set_xticks([l for i, l in enumerate(labels_str) if i % 2 == odd_ticks]) + + ax.set_xticklabels(labels=[get_exp_label(l) for i, l in enumerate(labels) if i % 2 == odd_ticks], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black", pad=3) + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(8 if i == 0 else 6)) + if j == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=10) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.05), fontsize=12, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.yaxis.grid(True) + 
ax.xaxis.grid(False) + + # Add baseline execution time annotations (median of execution time across blocks); + # curr_label_set = set([int(l) for l_i, l in enumerate(labels) if l_i % 2 == odd_ticks]) + # other_label_set = set([int(l) for l_i, l in enumerate(labels) if l_i % 2 != odd_ticks]) + gpus = ["960", "1660", "P100"] + ax.annotate("Median baseline exec. time (ms):", xy=(0, -0.27), fontsize=9, ha="left", xycoords="axes fraction", color="#949494") + for g_i, gpu in enumerate(data["gpu_original"].unique()): + if g_i < len(gpus): + if (j == 0): + ax.annotate(f"{gpus[g_i]}:", xy=(-0.75, -0.37 - g_i * 0.1), fontsize=9, color=palette[g_i], ha="right", xycoords=("data", "axes fraction")) + + # Always print the maximum number of ticks; + # curr_sizes = set(data[data["gpu_original"] == gpu]["size"].unique()) + # odd_ticks_2 = odd_ticks if len(curr_sizes.intersection(curr_label_set)) > len(curr_sizes.intersection(other_label_set)) else int(not odd_ticks) + + for l_i, l in enumerate(labels): + vals = data[(data["size"] == int(l)) & (data["gpu_original"] == gpu)]["baseline_time_sec"] + baseline_median = np.median(vals) if len(vals) > 0 else np.nan + # print(i, j, gpu, baseline_median) + if not math.isnan(baseline_median) and l_i % 2 == odd_ticks: + ax.annotate(f"{int(1000 * baseline_median)}", xy=(l_i, -0.37 - g_i * 0.1), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if i == 0 and j == 0: + legend_labels = data["gpu"].unique() # [f"1D={x.split(',')[0]}" for x in data["block_size_str"].unique()] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + # Add fake entries to have a comment about 2d and 3d sizes; + # custom_lines += [Rectangle((0, 0), 1, 1, fc="w", fill=False, edgecolor='none', linewidth=0)] * 2 + # legend_labels += ["", ""] + # # Re-sort labels by transposing 
them; + # custom_lines = np.array(custom_lines).reshape((-1, 2)).T.reshape(-1) + # legend_labels = np.array(legend_labels).reshape((-1, 2)).T.reshape(-1) + + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.99, 1), fontsize=10, ncol=len(legend_labels), handletextpad=0.1, columnspacing=0.3) + # leg.set_title("Block size:\n2D=8x8, 3D=4x4x4", prop={"size": 10}) + leg._legend_box.align = "left" + + return ax + + + +#%% +if __name__ == "__main__": + # data = load_data(INPUT_DATE, skip_iter=3) + + # # Ignore synchronous execution; + # data = data[data["exec_policy"] != "sync"] + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # block_size_list = sorted(data["block_size_str"].unique(), key=lambda x: [int(y) for y in x.split(",")]) + + # num_col = len(benchmark_list) + # num_row = len(block_size_list) + # fig = plt.figure(figsize=(2.5 * num_col, 4 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.8, + # bottom=0.15, + # left=0.2, + # right=0.90, + # hspace=1.1, + # wspace=0.3) + + # exec_time_axes = [] + # for b_i, b in enumerate(benchmark_list): + # for block_size_i, block_size in enumerate(block_size_list): + # curr_res = data[(data["benchmark"] == b) & (data["block_size_str"] == block_size)].reset_index(drop=True) + # exec_time_axes += [build_exec_time_plot(curr_res, gs, block_size_i, b_i)] + + # plt.annotate("Input number of elements", xy=(0.5, 0.03), fontsize=20, ha="center", va="center", xycoords="figure fraction") + # plt.annotate("Speedup over serial scheduling", xy=(0.02, 0.5), fontsize=20, 
ha="center", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Execution time speedup\nover serial kernel scheduling", fontsize=25, x=.05, y=0.99, ha="left") + + # save_plot(PLOT_DIR, "speedup_baseline_{}.{}", OUTPUT_DATE) + + #%% Similar plot, but all block sizes are on 1 row; + + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # num_col = len(benchmark_list) + # num_row = 1 + # fig = plt.figure(figsize=(2.6 * num_col, 4.1 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.65, + # bottom=0.21, + # left=0.1, + # right=0.95, + # hspace=1.1, + # wspace=0.3) + + # exec_time_axes = [] + # for b_i, b in enumerate(benchmark_list): + # curr_res = data[data["benchmark"] == b].reset_index(drop=True) + # exec_time_axes += [build_exec_time_plot_1_row(curr_res, gs, b_i)] + + # plt.annotate("Input number of elements", xy=(0.5, 0.03), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # plt.annotate("Speedup over\nserial scheduling", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Execution time speedup\nover serial kernel scheduling", fontsize=20, x=.05, y=0.92, ha="left") + + # save_plot(PLOT_DIR, "speedup_baseline_1_row_{}.{}", OUTPUT_DATE) + + + #%% Similar plot, but formatted for 1-column on a paper; + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 
20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + + # data = data[~((data["benchmark"] == "b5") & (data["size"] == 3000000))] + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # num_row = 2 + # num_col = len(benchmark_list) // num_row + # fig = plt.figure(figsize=(2.2 * num_col, 2.7 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.82, + # bottom=0.15, + # left=0.08, + # right=0.98, + # hspace=0.55, + # wspace=0.15) + + # exec_time_axes = [] + # speedups = [] + # for b_i, b in enumerate(benchmark_list): + # i = b_i // num_col + # j = b_i % num_col + # curr_res = data[data["benchmark"] == b].reset_index(drop=True) + # curr_res = remove_outliers_df_grouped(curr_res, column="computation_speedup", group=["block_size_str", "size"]) + # speedups += [curr_res.groupby(["size", "block_size_str"])["computation_speedup"].apply(gmean)] + # exec_time_axes += [build_exec_time_plot_2_row(curr_res, gs, fig, i, j)] + + # plt.annotate("Input number of elements", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # # plt.annotate("Speedup over\nserial scheduling", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Parallel scheduler speedup\nover serial scheduler", fontsize=16, x=.02, y=0.99, ha="left") + + # save_plot(PLOT_DIR, "speedup_baseline_2_row_{}.{}", OUTPUT_DATE) + + #%% Plot both P100 and GTX960 + + # data_960 = load_data(INPUT_DATE_960, skip_iter=3) + # data_p100 = load_data(INPUT_DATE_P100, skip_iter=3) + # data_1660 = load_data(INPUT_DATE_1660, skip_iter=3) + # # data_p100_np = load_data(INPUT_DATE_P100_NP, skip_iter=3) + # data_960["gpu"] = "GTX960" + # data_p100["gpu"] = "P100" + # data_1660["gpu"] = "GTX1660 Super" + # # data_p100_np["gpu"] = "P100, no prefetch" + # # data = 
pd.concat([data_960, data_p100, data_p100_np]) + # data = pd.concat([data_960, data_1660, data_p100]).reset_index(drop=True) + + # # data = data[data["force_prefetch"] == False] + + # # Ignore synchronous execution; + # # data = data[data["exec_policy"] != "sync"] + + # # Remove no prefetch data if required; + # # data = data[data["gpu"] != "P100, no prefetch"] + + # # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # block_size_list = sorted(data["block_size_str"].unique(), key=lambda x: [int(y) for y in x.split(",")]) + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # num_row = 2 + # num_col = len(benchmark_list) // num_row + # fig = plt.figure(figsize=(2.2 * num_col, 2.7 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.86, + # bottom=0.18, + # left=0.09, + # right=0.98, + # hspace=0.75, + # wspace=0.1) + + # exec_time_axes = [] + # speedups = [] + # for b_i, b in enumerate(benchmark_list): + # i = b_i // num_col + # j = b_i % num_col + # curr_res = data[data["benchmark"] == b].reset_index(drop=True) + # curr_res = remove_outliers_df_grouped(curr_res, column="computation_speedup", group=["block_size_str", "size", "gpu"]) + # speedups += [curr_res.groupby(["size", "block_size_str", "gpu"])["computation_speedup"].apply(gmean)] + # exec_time_axes += [build_exec_time_plot_2_row_multigpu(curr_res, gs, fig, i, j)] + + # plt.annotate("Input number of 
elements", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # # plt.annotate("Speedup over\nserial scheduling", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Parallel scheduler speedup\nover serial scheduler", fontsize=16, x=.02, y=0.99, ha="left") + + # save_plot(PLOT_DIR, "speedup_baseline_multigpu_{}.{}", OUTPUT_DATE) + + + #%% Plot speedup with prefetching of sync and async w.r.t. sync baseline; + + data_960 = load_data(INPUT_DATE_960, skip_iter=3) + data_p100 = load_data(INPUT_DATE_P100, skip_iter=3) + data_1660 = load_data(INPUT_DATE_1660, skip_iter=3) + data_960["gpu"] = "GTX960" + data_p100["gpu"] = "P100" + data_1660["gpu"] = "GTX1660 Super" + data = pd.concat([data_960, data_1660, data_p100]).reset_index(drop=True) + + data["exec_policy_full"] = data["exec_policy"] + np.where(data["force_prefetch"], "_f", "") + + # Recompute speedups w.r.t. sync-noprefetch policy; + compute_speedup(data, ["gpu", "benchmark", "new_stream_policy", "parent_stream_policy", + "dependency_policy", "block_size_1d", "block_size_2d", + "total_iterations", "cpu_validation", "random_init", "size", "realloc", "reinit"], baseline_filter_col="exec_policy_full", baseline_filter_val="sync") + + # Ignore synchronous execution; + data = data[data["exec_policy_full"] != "sync"] + # Skip no-prefetch; + data = data[(data["exec_policy_full"] != ASYNC_POLICY_NAME) | (data["gpu"] == "GTX960")] + data = data[(data["exec_policy_full"] != "sync_f") | (data["gpu"] == "GTX960")] + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['axes.titlepad'] = 20 + plt.rcParams['axes.labelpad'] = 10 + plt.rcParams['axes.titlesize'] = 22 + plt.rcParams['axes.labelsize'] = 14 + + # Lists of benchmarks and 
block sizes; + benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + block_size_list = sorted(data["block_size_str"].unique(), key=lambda x: [int(y) for y in x.split(",")]) + + # Lists of benchmarks and block sizes; + benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + num_row = 2 + num_col = len(benchmark_list) // num_row + fig = plt.figure(figsize=(2.2 * num_col, 2.8 * num_row)) + gs = gridspec.GridSpec(num_row, num_col) + plt.subplots_adjust(top=0.86, + bottom=0.18, + left=0.09, + right=0.98, + hspace=0.85, + wspace=0.1) + + exec_time_axes = [] + speedups = [] + for b_i, b in enumerate(benchmark_list): + i = b_i // num_col + j = b_i % num_col + curr_res = data[data["benchmark"] == b].reset_index(drop=True) + curr_res = remove_outliers_df_grouped(curr_res, column="computation_speedup", group=["block_size_str", "size", "gpu"]) + speedups += [curr_res.groupby(["size", "block_size_str", "gpu"])["computation_speedup"].apply(gmean)] + exec_time_axes += [build_exec_time_plot_2_row_multigpu(curr_res, gs, fig, i, j)] + + plt.annotate("Input number of elements (x-axis not to scale)", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # plt.annotate("Speedup over\nserial scheduling", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + plt.suptitle("Parallel scheduler speedup\nover serial scheduler", fontsize=16, x=.02, y=0.99, ha="left") + + save_plot(PLOT_DIR, "speedup_baseline_multigpu_prefetch_{}.{}", OUTPUT_DATE) + \ No newline at end of file diff --git a/projects/resources/python/plotting/plot_speedup_grcuda_cuda.py b/projects/resources/python/plotting/plot_speedup_grcuda_cuda.py new file mode 100755 index 00000000..a68e19ec --- /dev/null +++ b/projects/resources/python/plotting/plot_speedup_grcuda_cuda.py @@ -0,0 +1,1353 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. 
+ +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Sun Jun 21 15:26:36 2020 + +@author: alberto.parravicini +""" + +import numpy as np +import pandas as pd +import seaborn as sns +import matplotlib.pyplot as plt +import matplotlib.gridspec as gridspec +from scipy.stats.mstats import gmean +from matplotlib.patches import Patch, Rectangle +from matplotlib.collections import PatchCollection, LineCollection +import matplotlib.lines as lines +import math + +import os +from load_data import load_data, load_data_cuda, join_tables, join_tables_baseline +from segretini_matplottini.src.plot_utils import COLORS, get_exp_label, get_ci_size, save_plot, update_width, add_labels, get_upper_ci_size, remove_outliers_df_grouped +import matplotlib.ticker as ticker + +############################## +############################## + +# # P100 +# INPUT_DATE_GRCUDA = "2020_09_19_2_grcuda" +# INPUT_DATE_CUDA = "2020_09_25_09_29_10_cuda" +# # 960 +# INPUT_DATE_GRCUDA = "2020_09_22_17_44_41_grcuda_b8baseline" +# INPUT_DATE_CUDA = "2020_09_22_18_36_21_cuda" + +OUTPUT_DATE = "2020_10_14" +PLOT_DIR = "../../../../grcuda-data/plots" + +# INPUT_DATE_GRCUDA_P100 = "2020_09_19_2_grcuda" +# INPUT_DATE_GRCUDA_960 = "2020_09_22_17_44_41_grcuda_b8baseline" +INPUT_DATE_CUDA_P100 = "P100/2020_10_13_10_04_06_cuda" # "P100/2020_09_25_09_29_10_cuda" +INPUT_DATE_CUDA_960 = "960/2020_10_07_17_08_41_cuda" +INPUT_DATE_CUDA_1660 = "1660/2020_10_13_14_49_29_cuda" + +INPUT_DATE_GRCUDA_960 = "960/2020_10_11_13_15_09_grcuda_baseline" +INPUT_DATE_GRCUDA_P100 = "P100/2020_10_13_10_03_48_grcuda_baseline" # "P100/2020_10_06_grcuda_p100_baseline" # "2020_09_29_17_30_03_grcuda_forceprefetch" +INPUT_DATE_GRCUDA_1660 = "1660/2020_10_13_18_21_04_grcuda_baseline" + +BENCHMARK_NAMES = {"b1": "Vector Squares", "b5": "B&S", "b8": "Images", "b6": "ML Ensemble", "b7": "HITS", "b10": "DL"} + +# ASYNC_POLICY_NAME = "async" # If parsing new results; +ASYNC_POLICY_NAME = "default" # If parsing older results; + 
+############################## +############################## + +def build_exec_time_plot_grcuda_cuda(data, gridspec, x, y): + + palette = [COLORS["peach1"], COLORS["bb1"]] + markers = ["o"] * len(palette) + + data["size_str"] = data["size"].astype(str) + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="grcuda_cuda_speedup", hue="exec_policy", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, ci=None, sort=False, zorder=2) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Add rectangles to represent variance; + for p_i, p in enumerate(sorted(data["exec_policy"].unique())): + rectangles = [] + for s_i, s in enumerate(labels): + curr_data = data[(data["size"] == s) & (data["exec_policy"] == p)] + upper_ci_size, lower_ci_size, center = get_ci_size(curr_data["grcuda_cuda_speedup"], estimator=gmean, ci=0.90) + bottom = center - lower_ci_size + width = 0.1 + lower_left = [s_i - width / 2, bottom] + # Add an offset to the x position, to avoid overlapping; + lower_left[0] += (2 * p_i - 1) * (width / 3.5) + rectangles += [Rectangle(lower_left, width, upper_ci_size + lower_ci_size)] + + pc = PatchCollection(rectangles, facecolor=palette[p_i], edgecolor="#2f2f2f", linewidth=0.5, zorder=3, clip_on=True, alpha=0.7) + ax.add_collection(pc) + + # Set the same y limits in each plot; + ax.set_ylim((0, 2)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + odd_ticks = 0 if (len(labels_str) % 2 == 1) else 1 + ax.set_xticks([l for i, l in enumerate(labels_str) if i % 2 == odd_ticks]) + ax.set_xticklabels(labels=[get_exp_label(l) for i, l in enumerate(labels) if i % 2 == odd_ticks], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black", pad=3) + # Set the y 
ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(7)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=10) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Set the x ticks; + # ax.set_xticks(labels_str) + # ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=45, ha="right", fontsize=9, rotation_mode="anchor") + # ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(5)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=12) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=14, ha="center", xycoords="axes fraction") + ax.annotate(f"GrCUDA serial time (ms):", xy=(0, -0.37), fontsize=9, ha="left", xycoords="axes fraction", color=COLORS["r4"]) + + for i, l in enumerate(labels): + baseline_median = np.median(data[data["size"] == int(l)]["baseline_time_sec_grcuda"]) + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.47), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Add block size annotation; + if y == 0: + ax.annotate(f"Block size:\n1D={data['block_size_1d'].iloc[0]}, 2D={data['block_size_2d'].iloc[0]}x{data['block_size_2d'].iloc[0]}", xy=(-0.65, 1.25), fontsize=14, ha="left", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + # Legend; + if y == 0 and x == 0: + legend_labels = ["Parallel Scheduler", "Serial 
Scheduler"] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.91, 0.98), fontsize=12, ncol=1, handletextpad=0.1) + leg.set_title(None) + leg._legend_box.align = "left" + + + return ax + + +def build_exec_time_plot_grcuda_cuda_compact(data, gridspec, x, y): + + data["size_str"] = data["size"].astype(str) + + legend_labels = ["Parallel Scheduler", "Serial Scheduler"] + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b4"]][:len(data["block_size_str"].unique())] + markers = ["o", "X", "D", "P"][:len(data["block_size_str"].unique())] + order = data["block_size_str"].unique() + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="grcuda_cuda_speedup", hue="block_size_str", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, hue_order=order, zorder=2) + data_averaged = data.groupby(["size_str", "block_size_str"], as_index=True)["grcuda_cuda_speedup"].apply(gmean).reset_index() + + ax = sns.scatterplot(x="size_str", y="grcuda_cuda_speedup", hue="block_size_str", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="block_size_str", hue_order=order, style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Set the same y limits in each plot; + ax.set_ylim((0, 2)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in 
labels], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(5)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=12) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + if y == 0: + ax.annotate(f"{legend_labels[x % 2]}", xy=(-0.15, 1.25), fontsize=14, ha="left", xycoords="axes fraction") + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=14, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + # Add baseline execution time annotations (median of execution time across blocks); + ax.annotate(f"Median baseline exec. 
time (ms):", xy=(0, -0.22), fontsize=9, ha="left", xycoords="axes fraction", color=COLORS["r4"]) + for i, l in enumerate(labels): + baseline_median = np.median(data[data["size"] == int(l)]["baseline_time_sec_cuda"]) + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.29), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if x == 0 and y == 0: + legend_labels = [f"1D={x.split(',')[0]}, 2D={x.split(',')[1]}" for x in data["block_size_str"].unique()] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.95, 1), fontsize=12, ncol=len(legend_labels) // 2, handletextpad=0.1) + leg.set_title("Block size:") + leg._legend_box.align = "left" + + return ax + + +def build_exec_time_plot_grcuda_cuda_2rows(data, gridspec, x, y): + + data["size_str"] = data["size"].astype(str) + + legend_labels = ["Parallel Scheduler", "Serial Scheduler"] + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b4"]][:len(data["block_size_str"].unique())] + markers = ["o", "X", "D", "P"][:len(data["block_size_str"].unique())] + order = data["block_size_str"].unique() + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="grcuda_cuda_speedup", hue="block_size_str", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, hue_order=order, zorder=2) + data_averaged = data.groupby(["size_str", "block_size_str"], as_index=True)["grcuda_cuda_speedup"].apply(gmean).reset_index() + + ax = sns.scatterplot(x="size_str", y="grcuda_cuda_speedup", hue="block_size_str", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, 
legend=False, zorder=3, ci=None, markers=markers, style="block_size_str", hue_order=order, style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Set the same y limits in each plot; + ax.set_ylim((0.0, 1.5)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=0, ha="center", fontsize=8) + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(4)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=9) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + if y == 0 and x % 2 == 0: + ax.annotate(f"{legend_labels[x // 2]}", xy=(-0.3, -1.4), fontsize=14, ha="center", xycoords="axes fraction", rotation=90) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=10, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + # Add baseline execution time annotations (median of execution time across blocks); + ax.annotate(f"Median baseline exec. 
time (ms):", xy=(0, -0.42), fontsize=8, ha="left", xycoords="axes fraction", color=COLORS["peach1"]) + for i, l in enumerate(labels): + baseline_median = np.median(data[data["size"] == int(l)]["baseline_time_sec_cuda"]) + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.57), fontsize=8, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if x == 0 and y == 0: + legend_labels = [f"1D={x.split(',')[0]}" for x in data["block_size_str"].unique()] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.99, 1), fontsize=10, ncol=2, handletextpad=0.1, columnspacing=0.2) + leg.set_title("Block size:\n2D=8x8, 3D=4x4x4", prop={"size": 10}) + leg._legend_box.align = "left" + + return ax + + +def build_exec_time_plot_grcuda_cuda_2rows_multigpu(data, gridspec, x, y, exec_policy, palette_in, markers_in): + + # data = pd.melt(data, id_vars=["benchmark", "size", "block_size_str", "computation_sec"], value_vars=data.columns[-3:], + # var_name="versus", value_name="speedup") + # data["size_str"] = data["size"].astype(str) + + if exec_policy == ASYNC_POLICY_NAME: + data = data[~data["versus"].isin(["speedup_sync", "speedup_cudagraphsingle"])] + if len(data) == 0: + return + elif exec_policy == "sync": + data = data[data["versus"].isin(["speedup_cudagraphsingle"])] + if len(data) == 0: + return + else: + raise ValueError(exec_policy + " is not a valid execution policy") + # data = data[data["versus"] != "speedup_sync"] + # print(x,y,exec_policy,len(data)) + legend_labels = ["Serial Scheduler", "Parallel Scheduler"] + + order = data["versus"].unique() + palette = [palette_in[o] for o in order] + markers = [markers_in[o] for o in order] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', 
alpha=0.3) + + ax = sns.lineplot(x="size_str", y="speedup", hue="versus", data=data[data["gpu"] == "GTX960"], palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, hue_order=order, zorder=2) + ax = sns.lineplot(x="size_str", y="speedup", hue="versus", data=data[data["gpu"] == "P100"], palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, hue_order=order, zorder=2) + data_averaged = data.groupby(["size_str", "versus", "gpu"], as_index=True)["speedup"].apply(gmean).reset_index() + + ax = sns.scatterplot(x="size_str", y="speedup", hue="versus", data=data_averaged[data_averaged["gpu"] == "GTX960"], palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="versus", hue_order=order, style_order=order, linewidth=0.05) + ax = sns.scatterplot(x="size_str", y="speedup", hue="versus", data=data_averaged[data_averaged["gpu"] == "P100"], palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="versus", hue_order=order, style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(l) for l in labels] + + # Set the same y limits in each plot; + num_y_ticks = 6 + if exec_policy == "sync": + ax.set_ylim((0.5, 1.5)) + num_y_ticks = 5 + elif exec_policy == ASYNC_POLICY_NAME and x == 3: + ax.set_ylim((0.5, 2.0)) + num_y_ticks = 4 + else: + ax.set_ylim((0.5, 3)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Add a vertical line to split GPUs; + max_size_960 = str(data[data["gpu"] == "GTX960"]["size"].max()) + + # Set the x ticks; + odd_ticks = 0 if (len(labels_str) % 2 == 1) else 1 + xticks = [] + max_tick_960 = 0 + for i, l in enumerate(labels_str): + if i % 2 == odd_ticks: + xticks += [l] + if l == max_size_960: + max_tick_960 = 
i + ax.axvline(x=max_tick_960, color="#2f2f2f", linestyle="--", zorder=1, linewidth=0.5, alpha=0.5) + ax.annotate("GTX960", xy=(0.35, 0.85), fontsize=8, ha="center", xycoords="axes fraction", color="#2f2f2f", alpha=0.5) + ax.annotate("P100", xy=(0.6, 0.85), fontsize=8, ha="center", xycoords="axes fraction", color="#2f2f2f", alpha=0.5) + ax.set_xticks(xticks) + ax.set_xticklabels(labels=[get_exp_label(l) for i, l in enumerate(labels) if i % 2 == odd_ticks], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black", pad=3) + + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(num_y_ticks)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=10) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + if y == 0 and x % 2 == 0: + ax.annotate(f"{legend_labels[x // 2]}", xy=(-0.3, -1.4), fontsize=14, ha="center", xycoords="axes fraction", rotation=90) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.08), fontsize=10, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.yaxis.grid(True) + ax.xaxis.grid(False) + # ax.tick_params(axis="x", which="major",length=3) + + # Add baseline execution time annotations (median of execution time across blocks); + gpus = ["960", "P100"] + palette_gpu = [COLORS["peach1"], COLORS["b8"], COLORS["b2"]] + ax.annotate("Median GrCUDA exec. 
time (ms):", xy=(0, -0.42), fontsize=9, ha="left", xycoords="axes fraction", color="#949494") + for g_i, gpu in enumerate(data["gpu"].unique()): + if g_i < len(gpus): + if y == 0: + ax.annotate(f"{gpus[g_i]}:", xy=(-0.75, -0.57 - g_i * 0.15), fontsize=9, color=palette_gpu[g_i], ha="right", xycoords=("data", "axes fraction")) + for l_i, l in enumerate(labels): + vals = data[(data["size"] == int(l)) & (data["gpu"] == gpu)]["computation_sec"] + baseline_median = np.median(vals) if len(vals) > 0 else np.nan + if not math.isnan(baseline_median) and l_i % 2 == odd_ticks: + ax.annotate(f"{int(1000 * baseline_median)}", xy=(l_i, -0.57 - g_i * 0.15), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + + # # Add baseline execution time annotations (median of execution time across blocks); + # ax.annotate("Median GrCUDA exec. time (ms):", xy=(0, -0.42), fontsize=8, ha="left", xycoords="axes fraction", color=COLORS["peach1"]) + # for i, l in enumerate(labels): + # baseline_median = np.median(data[data["size"] == int(l)]["computation_sec"]) + # ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.57 - j * 0.1), fontsize=8, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + return ax + + +def build_exec_time_plot_grcuda_cuda_3rows_multigpu(data, gridspec, x, y, gpu, palette_in, markers_in, sizes=None): + + legend_label = "Parallel Scheduler" + + order = data["versus"].unique() + palette = [palette_in[o] for o in order] + markers = [markers_in[o] for o in order] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.3) + + ax = sns.lineplot(x="size_str", y="speedup", hue="versus", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, hue_order=order, zorder=2) + data_averaged = data.groupby(["size_str", "versus", "gpu"], as_index=True)["speedup"].apply(gmean).reset_index() + + ax = 
sns.scatterplot(x="size_str", y="speedup", hue="versus", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="versus", hue_order=order, style_order=order, linewidth=0.05) + + if sizes is None: + labels = sorted(data["size"].unique()) + else: + labels = sizes.copy() + labels_str = [str(l) for l in labels] + + # Set the same y limits in each plot; + num_y_ticks = 6 + ax.set_ylim((0.5, 3)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + xticks = [] + for i, l in enumerate(labels_str): + xticks += [l] + + # ax.set_xticks(xticks) + ax.set_xticks(range(0, len(xticks), 2)) + + ax.set_xticklabels(labels=[get_exp_label(l) for i, l in enumerate(labels) if i % 2 == 0], rotation=0, ha="center", fontsize=9) + # ax.set_xticklabels(labels=[get_exp_label(l) for i, l in enumerate(labels)], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black", pad=3) + + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(num_y_ticks)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=10) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + gpu_dict = {"GTX960": "GTX960", "GTX1660 Super": "GTX1660 Super", "P100": "Tesla P100"} + if y == 0 and x % 2 == 0: + ax.annotate(gpu_dict[gpu], xy=(-0.3, -1.0), fontsize=14, ha="center", xycoords="axes fraction", rotation=90) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.08), fontsize=10, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + 
ax.yaxis.grid(True) + ax.xaxis.grid(False) + # ax.tick_params(axis="x", which="major",length=3) + + # Add baseline execution time annotations (median of execution time across blocks); + gpu_dict = {"GTX960": "960", "GTX1660 Super": "1660", "P100": "P100"} + palette_gpu = [COLORS["peach1"], COLORS["b8"], COLORS["b2"]] + ax.annotate("Median GrCUDA exec. time (ms):", xy=(0, -0.45), fontsize=9, ha="left", xycoords="axes fraction", color="#949494") + if y == 0: + ax.annotate(f"{gpu_dict[gpu]}:", xy=(-0.75, -0.61), fontsize=9, color="#949494", ha="right", xycoords=("data", "axes fraction")) + for l_i, l in enumerate(labels): + vals = data[(data["size"] == int(l))]["computation_sec"] + baseline_median = np.median(vals) if len(vals) > 0 else np.nan + if not math.isnan(baseline_median) and l_i % 2 == 0: + ax.annotate(f"{int(1000 * baseline_median)}", xy=(l_i, -0.61), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + return ax + + +def build_exec_time_plot_grcuda_cuda_2rows_multigpu3(data, gridspec, x, y, exec_policy, palette_in, markers_in): + + # data = pd.melt(data, id_vars=["benchmark", "size", "block_size_str", "computation_sec"], value_vars=data.columns[-3:], + # var_name="versus", value_name="speedup") + # data["size_str"] = data["size"].astype(str) + + if exec_policy == ASYNC_POLICY_NAME: + data = data[~data["versus"].isin(["speedup_sync", "speedup_cudagraphsingle", "speedup_cudagraph"])] + if len(data) == 0: + return + elif exec_policy == "sync": + data = data[data["versus"].isin(["speedup_cudagraphsingle"])] + if len(data) == 0: + return + else: + raise ValueError(exec_policy + " is not a valid execution policy") + # data = data[data["versus"] != "speedup_sync"] + # print(x,y,exec_policy,len(data)) + legend_labels = ["Serial Scheduler", "Parallel Scheduler"] + + order_p = data["versus"].unique() + order_m = data["gpu"].unique() + palette = [palette_in[o] for o in order_p] + markers = [markers_in[o] for o in order_m] + # Add a lineplot with the exec 
times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.3) + + ax = sns.lineplot(x="size_str", y="speedup", hue="versus", data=data[data["gpu"] == "GTX960"], palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, hue_order=order_p, zorder=2) + ax = sns.lineplot(x="size_str", y="speedup", hue="versus", data=data[data["gpu"] == "GTX1660 Super"], palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, hue_order=order_p, zorder=2) + ax = sns.lineplot(x="size_str", y="speedup", hue="versus", data=data[data["gpu"] == "P100"], palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, hue_order=order_p, zorder=2) + data_averaged = data.groupby(["size_str", "versus", "gpu"], as_index=True)["speedup"].apply(gmean).reset_index() + + ax = sns.scatterplot(x="size_str", y="speedup", data=data_averaged[data_averaged["gpu"] == "GTX960"], color="#ffffff", ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers_in, linewidth=0.08, style="gpu") + ax = sns.scatterplot(x="size_str", y="speedup", data=data_averaged[data_averaged["gpu"] == "GTX1660 Super"], color="#ffffff", ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers_in, linewidth=0.08, style="gpu") + ax = sns.scatterplot(x="size_str", y="speedup", data=data_averaged[data_averaged["gpu"] == "P100"], color="#ffffff", ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers_in, linewidth=0.08, style="gpu") + + labels = sorted(data["size"].unique()) + labels_str = [str(l) for l in labels] + + # Set the same y limits in each plot; + num_y_ticks = 6 + if exec_policy == "sync": + ax.set_ylim((0.5, 1.5)) + num_y_ticks = 5 + elif exec_policy == ASYNC_POLICY_NAME and x == 3: + ax.set_ylim((0.5, 2.0)) + num_y_ticks = 4 + else: + 
ax.set_ylim((0.5, 3)) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Add a vertical line to split GPUs; + max_size_960 = str(data[data["gpu"] == "GTX960"]["size"].max()) + + # Set the x ticks; + odd_ticks = 0 if (len(labels_str) % 2 == 1) else 1 + xticks = [] + max_tick_960 = 0 + for i, l in enumerate(labels_str): + if i % 2 == odd_ticks: + xticks += [l] + if l == max_size_960: + max_tick_960 = i + # ax.axvline(x=max_tick_960, color="#2f2f2f", linestyle="--", zorder=1, linewidth=0.5, alpha=0.5) + # ax.annotate("GTX960", xy=(0.35, 0.85), fontsize=8, ha="center", xycoords="axes fraction", color="#2f2f2f", alpha=0.5) + # ax.annotate("P100", xy=(0.6, 0.85), fontsize=8, ha="center", xycoords="axes fraction", color="#2f2f2f", alpha=0.5) + ax.set_xticks(xticks) + ax.set_xticklabels(labels=[get_exp_label(l) for i, l in enumerate(labels) if i % 2 == odd_ticks], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black", pad=3) + + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(num_y_ticks)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=10) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + if y == 0 and x % 2 == 0: + ax.annotate(f"{legend_labels[x // 2]}", xy=(-0.3, -1.4), fontsize=14, ha="center", xycoords="axes fraction", rotation=90) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.08), fontsize=10, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.yaxis.grid(True) + ax.xaxis.grid(False) + # ax.tick_params(axis="x", which="major",length=3) + + # 
Add baseline execution time annotations (median of execution time across blocks); + gpus = ["960", "1660", "P100"] + palette_gpu = [COLORS["peach1"], COLORS["b8"], COLORS["b2"]] + ax.annotate("Median GrCUDA exec. time (ms):", xy=(0, -0.42), fontsize=9, ha="left", xycoords="axes fraction", color="#949494") + for g_i, gpu in enumerate(data["gpu"].unique()): + if g_i < len(gpus): + if y == 0: + ax.annotate(f"{gpus[g_i]}:", xy=(-0.75, -0.57 - g_i * 0.15), fontsize=9, color=palette_gpu[g_i], ha="right", xycoords=("data", "axes fraction")) + + for l_i, l in enumerate(labels): + vals = data[(data["size"] == int(l)) & (data["gpu"] == gpu)]["computation_sec"] + baseline_median = np.median(vals) if len(vals) > 0 else np.nan + # print(i, j, gpu, baseline_median) + if not math.isnan(baseline_median) and l_i % 2 == odd_ticks: + ax.annotate(f"{int(1000 * baseline_median)}", xy=(l_i, -0.37 - g_i * 0.1), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # # Add baseline execution time annotations (median of execution time across blocks); + # ax.annotate("Median GrCUDA exec. 
time (ms):", xy=(0, -0.42), fontsize=8, ha="left", xycoords="axes fraction", color=COLORS["peach1"]) + # for i, l in enumerate(labels): + # baseline_median = np.median(data[data["size"] == int(l)]["computation_sec"]) + # ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.57 - j * 0.1), fontsize=8, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + return ax + +def ridgeplot(data): + # Plotting setup; + sns.set(font_scale=1.4) + sns.set_style("whitegrid") + plt.rcParams["font.family"] = ["Latin Modern Roman"] + plt.rcParams['axes.titlepad'] = 20 + plt.rcParams['axes.labelpad'] = 10 + plt.rcParams['axes.titlesize'] = 22 + plt.rcParams['axes.labelsize'] = 14 + + sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + + # Plot only some data, for now; + data = data[(data["block_size_1d"] == 256) & (data["exec_policy"] == ASYNC_POLICY_NAME)].copy() + # data = data[(data["exec_policy"] ==ASYNC_POLICY_NAME)].copy() + + # For each benchmark, keep the data relative to the largest data size; + biggest_sizes = data.groupby(["benchmark"])["size"].max().to_dict() + data_filtered = [] + for k, v in biggest_sizes.items(): + data_filtered += [data[(data["benchmark"] == k) & (data["size"] == v)]] + data = pd.concat(data_filtered).reset_index(drop=True) + + # Normalize execution times so that the CUDA baseline has median 1; + data["normalized_time_cuda"] = 1 + data["normalized_time_grcuda"] = 1 + + # grouped_data = data.groupby(["benchmark", "size", "block_size_str"], as_index=False) + grouped_data = data.groupby(["benchmark"], as_index=False, sort=False) + for group_key, group in grouped_data: + # Compute the median baseline computation time; + median_baseline = np.median(group["computation_sec_cuda"]) + # Compute the speedup for this group; + data.loc[group.index, "normalized_time_cuda"] = group["computation_sec_cuda"].values / median_baseline + data.loc[group.index, "normalized_time_grcuda"] 
= group["computation_sec_grcuda"].values / median_baseline + + benchmarks = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + block_sizes = data["block_size_str"].unique() + sizes = data["size"].unique() + + # Initialize the plot; + g = sns.FacetGrid(data, row="benchmark", aspect=5, height=1.2, sharey=False, sharex=False,) + + # Plot a vertical line corresponding to speedup = 1; + g.map(plt.axvline, x=1, lw=0.75, clip_on=True, zorder=0, linestyle="--", ymax=0.5) + # Plot the densities. Plot them twice as the second time we plot just the white contour; + g.map(sns.kdeplot, "normalized_time_cuda", clip_on=False, shade=True, alpha=0.6, lw=1, color=COLORS["peach1"], zorder=2) + g.map(sns.kdeplot, "normalized_time_grcuda", clip_on=False, shade=True, alpha=0.6, lw=1, color=COLORS["b8"], zorder=2) + g.map(sns.kdeplot, "normalized_time_cuda", clip_on=False, color="w", lw=1.1, zorder=2) + g.map(sns.kdeplot, "normalized_time_grcuda", clip_on=False, color="w", lw=1.1, zorder=2) + # Plot the horizontal line below the densities; + g.map(plt.axhline, y=0, lw=0.75, clip_on=False, zorder=5, color="0.6") + + # Write the x-axis tick labels using percentages; + @ticker.FuncFormatter + def major_formatter(x, pos): + return f"{x:.2f}x" + # Fix the horizontal axes. 
+ # For each benchmark, find the smallest and largest values; + offsets = { + "b1": [0.95, 1.02], + "b5": [0.95, 1.05], + "b6": [0.95, 1.05], + "b7": [0.99, 0.99], + "b8": [0.87, 1.13], + "b10": [0.87, 1.13], + } + # offsets = { + # "b1": [0.85, 1.15], + # "b5": [0.95, 1.05], + # "b6": [0.95, 1.05], + # "b7": [0.98, 1.02], + # "b8": [0.87, 1.13]} + + for i, ax in enumerate(g.axes[:, 0]): + b = benchmarks[i] + d = data[data["benchmark"] == b] + max_v = offsets[b][1] * max(d["normalized_time_grcuda"].max(), d["normalized_time_cuda"].max()) + min_v = offsets[b][0] * min(d["normalized_time_grcuda"].min(), d["normalized_time_cuda"].min()) + print(min_v, max_v) + ax.set_xlim(left=min_v, right=max_v) + ax.xaxis.set_major_formatter(major_formatter) + + # Titles and labels; + g.set_titles("") + g.set(xlabel=None) + + # Add block size labels; + for i, ax in enumerate(g.axes[-1]): + ax.annotate("1D={}, 2D={}".format(*block_sizes[i].split(",")), xy=(0.5, -0.8), xycoords="axes fraction", ha="center", color="#2f2f2f", fontsize=14) + for i, ax in enumerate(g.axes[:, 0]): + # ax.annotate(f"{get_exp_label(sizes[i])}", xy=(-0.1, 0.05), xycoords="axes fraction", ha="center", color="#2f2f2f", fontsize=14) + ax.annotate(f"{BENCHMARK_NAMES[benchmarks[i]]}", xy=(0.0, 0.09), xycoords="axes fraction", ha="left", color="#2f2f2f", fontsize=12) + + # Fix the borders. 
This must be done here as the previous operations update the default values; + g.fig.subplots_adjust(top=0.83, + bottom=0.15, + right=0.95, + left=0.05, + hspace=0.4, + wspace=0.1) + + g.set(yticks=[]) + g.despine(bottom=True, left=True) + + # Add custom legend; + custom_lines = [Patch(facecolor=COLORS["peach1"], edgecolor="#2f2f2f", label="CUDA"), + Patch(facecolor=COLORS["b8"], edgecolor="#2f2f2f", label="GrCUDA"), + ] + leg = g.fig.legend(custom_lines, ["CUDA", "GrCUDA"], bbox_to_anchor=(0.97, 0.98), fontsize=12) + leg.set_title(None) + leg._legend_box.align = "left" + leg.get_frame().set_facecolor('white') + + # Main plot title; + g.fig.suptitle("Exec. Time Distribution,\nCUDA vs GrCUDA", ha="left", x=0.05, y=0.95, fontsize=18) + + return g + +#%% + +############################## +############################## + +if __name__ == "__main__": + # data_grcuda = load_data(INPUT_DATE_GRCUDA, skip_iter=3) + # data_cuda = load_data_cuda(INPUT_DATE_CUDA, skip_iter=3) + # data = join_tables(data_grcuda, data_cuda) + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # block_size_list = sorted(data["block_size_str"].unique(), key=lambda x: [int(y) for y in x.split(",")]) + # num_col = len(benchmark_list) + # num_row = len(block_size_list) + # fig = plt.figure(figsize=(2.5 * num_col, 4 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.8, + # bottom=0.15, + # left=0.2, + # right=0.90, + # hspace=1.1, + # wspace=0.15) + + # exec_time_axes = [] + # for b_i, b in enumerate(benchmark_list): + # for block_size_i, block_size in 
enumerate(block_size_list): + # curr_res = data[(data["benchmark"] == b) & (data["block_size_str"] == block_size)].reset_index(drop=True) + # exec_time_axes += [build_exec_time_plot_grcuda_cuda(curr_res, gs, block_size_i, b_i)] + + # plt.annotate("Input number of elements", xy=(0.5, 0.03), fontsize=20, ha="center", va="center", xycoords="figure fraction") + # plt.annotate("Speedup", xy=(0.02, 0.5), fontsize=20, ha="center", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Speedup of GrCUDA w.r.t. CUDA", fontsize=25, x=.05, y=0.99, ha="left") + + # save_plot(PLOT_DIR, "speedup_baseline_grcuda_cuda_{}.{}", OUTPUT_DATE) + + + #%% Similar plot, but all block sizes are on 1 row; + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # policy_list = sorted(data["exec_policy"].unique()) + # num_col = len(benchmark_list) + # num_row = len(policy_list) + # fig = plt.figure(figsize=(2.7 * num_col, 3.9 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.8, + # bottom=0.14, + # left=0.1, + # right=0.95, + # hspace=0.8, + # wspace=0.15) + + # exec_time_axes = [] + # for b_i, b in enumerate(benchmark_list): + # for p_i, p in enumerate(policy_list): + # curr_res = data[(data["benchmark"] == b) & (data["exec_policy"] == p)].reset_index(drop=True) + # exec_time_axes += [build_exec_time_plot_grcuda_cuda_compact(curr_res, gs, p_i, b_i)] + + # plt.annotate("Input number of elements", xy=(0.5, 0.03), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # plt.annotate("Speedup", xy=(0.022, 0.44), 
fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Speedup of GrCUDA w.r.t. CUDA", fontsize=25, x=.05, y=0.99, ha="left") + + # save_plot(PLOT_DIR, "speedup_baseline_grcuda_cuda_compact_{}.{}", OUTPUT_DATE) + + + # %% Similar plot, but the plot fits on 1 row of a paper; + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + # plt.rcParams['xtick.major.pad'] = 4 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # policy_list = list(reversed(sorted(data["exec_policy"].unique()))) + # num_col = len(benchmark_list) // 2 + # num_row = len(policy_list) * 2 + # fig = plt.figure(figsize=(2.2 * num_col, 1.8 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.84, + # bottom=0.12, + # left=0.10, + # right=0.98, + # hspace=1.1, + # wspace=0.15) + + # exec_time_axes = [] + # for p_i, p in enumerate(policy_list): + # for b_i, b in enumerate(benchmark_list): + # index_tot = (len(benchmark_list) * p_i + b_i) + # j = index_tot % num_col + # i = index_tot // num_col + # curr_res = data[(data["benchmark"] == b) & (data["exec_policy"] == p)].reset_index(drop=True) + # curr_res = remove_outliers_df_grouped(curr_res, column="grcuda_cuda_speedup", group=["block_size_str", "size"]) + # exec_time_axes += [build_exec_time_plot_grcuda_cuda_2rows(curr_res, gs, i, j)] + + # plt.annotate("Input number of elements", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # # plt.annotate("Speedup", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Speedup of 
GrCUDA scheduling w.r.t.\nhand-optimized C++ CUDA scheduling", fontsize=16, x=.05, y=0.99, ha="left") + + # l1 = lines.Line2D([0.01, 0.99], [0.465, 0.465], transform=fig.transFigure, figure=fig, color="#2f2f2f", linestyle="--", linewidth=1) + # fig.lines.extend([l1]) + + # save_plot(PLOT_DIR, "speedup_baseline_grcuda_cuda_2rows_{}.{}", OUTPUT_DATE) + + #%% Similar plot, but using multiple CUDA benchmarks types; + ############################### + ############################### + + # data_grcuda_p100 = load_data(INPUT_DATE_GRCUDA_P100, skip_iter=3) + # data_grcuda_960 = load_data(INPUT_DATE_GRCUDA_960, skip_iter=3) + # data_cuda_960 = load_data_cuda(INPUT_DATE_CUDA_960, skip_iter=3, add_prefetch_as_policy=False) + # data_cuda_p100 = load_data_cuda(INPUT_DATE_CUDA_P100, skip_iter=3, add_prefetch_as_policy=False) + # data_cuda_960["gpu"] = "GTX960" + # data_grcuda_960["gpu"] = "GTX960" + # data_cuda_p100["gpu"] = "P100" + # data_grcuda_p100["gpu"] = "P100" + + # data_grcuda_p100 = data_grcuda_p100[data_grcuda_p100["force_prefetch"] == False] + # data_grcuda_960 = data_grcuda_960[data_grcuda_960["force_prefetch"] == False] + # data_cuda_960 = data_cuda_960[data_cuda_960["force_prefetch"] == False] + # data_cuda_p100 = data_cuda_p100[data_cuda_p100["force_prefetch"] == False] + + # # Ignore sync policies; + # # data_cuda_960 = data_cuda_960[data_cuda_960["exec_policy"] != "sync"] + # # data_grcuda_960 = data_grcuda_960[data_grcuda_960["exec_policy"] != "sync"] + # # data_cuda_p100 = data_cuda_p100[data_cuda_p100["exec_policy"] != "sync"] + # # data_grcuda_p100 = data_grcuda_p100[data_grcuda_p100["exec_policy"] != "sync"] + + # data_960 = join_tables_baseline(data_cuda_960, data_grcuda_960) + # data_p100 = join_tables_baseline(data_cuda_p100, data_grcuda_p100) + + # data = pd.concat([data_960, data_p100]).reset_index(drop=True) + + # # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # 
sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + # plt.rcParams['xtick.major.pad'] = 4 + + # data = pd.melt(data, id_vars=["gpu", "benchmark", "exec_policy", "size", "block_size_str", "computation_sec"], value_vars=data.columns[-5:], + # var_name="versus", value_name="speedup") + # data["size_str"] = data["size"].astype(str) + + # palette = {f"speedup_{ASYNC_POLICY_NAME}": COLORS["peach1"], "speedup_cudagraph": COLORS["b2"], "speedup_sync": COLORS["b8"], "speedup_cudagraphmanual": COLORS["b4"], "speedup_cudagraphsingle": COLORS["b8"]} + # markers = {f"speedup_{ASYNC_POLICY_NAME}": "o", "speedup_cudagraph": "X", "speedup_sync": "D", "speedup_cudagraphmanual": "P", "speedup_cudagraphsingle": "D"} + + # #%% + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # policy_list = list(reversed(sorted(data["exec_policy"].unique()))) + # num_col = len(benchmark_list) // 2 + # num_row = len(policy_list) * 2 + # fig = plt.figure(figsize=(2.2 * num_col, 2.15 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.84, + # bottom=0.12, + # left=0.10, + # right=0.98, + # hspace=1.1, + # wspace=0.15) + + # # Keep only 1 versus; + # # data = data[data["versus"] == "speedup_cudagraph"] + + # exec_time_axes = [] + # for p_i, p in enumerate(policy_list): + # for b_i, b in enumerate(benchmark_list): + # index_tot = (len(benchmark_list) * p_i + b_i) + # j = index_tot % num_col + # i = index_tot // num_col + # curr_res = data[(data["benchmark"] == b) & (data["exec_policy"] == p)].reset_index(drop=True) + # curr_res = remove_outliers_df_grouped(curr_res, column="computation_sec", group=["block_size_str", "size"]) + # exec_time_axes += 
[build_exec_time_plot_grcuda_cuda_2rows_multigpu(curr_res, gs, i, j, p, palette, markers)] + + # # Legend; + # versus = [l for l in data["versus"].unique() if l != "speedup_sync"] + # names = {f"speedup_{ASYNC_POLICY_NAME}": "Hand-tuned CUDA events", "speedup_cudagraph": "CUDA Graphs + events", "speedup_sync": "CUDA synchronous", "speedup_cudagraphmanual": "CUDA Graphs, manual dep.", "speedup_cudagraphsingle": "CUDA Graphs, single stream"} + # legend_labels = [names[l] for l in versus] + # custom_lines = [ + # lines.Line2D([], [], color="white", marker=markers[l], markersize=10, label=names[l], markerfacecolor=palette[l], markeredgecolor="#2f2f2f") + # for l in versus] + # leg = fig.legend(custom_lines, legend_labels, + # bbox_to_anchor=(0.99, 1), fontsize=10, ncol=1, handletextpad=0.1, columnspacing=0.2) + # leg.set_title("CUDA baseline type", prop={"size": 10}) + # leg._legend_box.align = "left" + + # plt.annotate("Input number of elements (x-axis not to scale)", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # # plt.annotate("Speedup", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Speedup of our GrCUDA scheduling\nagainst hand-optimized CUDA Graphs\n(higher is better)", fontsize=16, x=.05, y=0.99, ha="left") + + # l1 = lines.Line2D([0.01, 0.99], [0.455, 0.455], transform=fig.transFigure, figure=fig, color="#2f2f2f", linestyle="--", linewidth=1) + # fig.lines.extend([l1]) + + # save_plot(PLOT_DIR, "speedup_baseline_grcuda_cuda_multicuda_{}.{}", OUTPUT_DATE) + + + #%% Ridge plot with distributions; + # g = ridgeplot(data) + # save_plot(PLOT_DIR, "speedup_baseline_grcuda_cuda_ridgeplot_{}.{}", OUTPUT_DATE) + + + # %% Summary plot with CUDA speedups; + ################################### + ################################### + + # BENCHMARK_NAMES = {"b1": "VEC", "b5": "B&S", "b8": "Images", "b6": "ML", "b7": "HITS", "b10": "DL", "mean": "", "mean2": ""} + + # 
sns.set_style("white", {"ytick.left": True}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 25 + # plt.rcParams['axes.labelpad'] = 5 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + # plt.rcParams['xtick.major.pad'] = 2 + + # data_cuda_960 = load_data_cuda(INPUT_DATE_CUDA_960, skip_iter=3) + # data_cuda_p100 = load_data_cuda(INPUT_DATE_CUDA_P100, skip_iter=3) + # data_cuda_1660 = load_data_cuda(INPUT_DATE_CUDA_1660, skip_iter=3) + # gpus = ["GTX960", "GTX1660 Super", "P100"][1:] + # # data_cuda_960["gpu"] = gpus[0] + # data_cuda_1660["gpu"] = gpus[0] + # data_cuda_p100["gpu"] = gpus[1] + + # data_list = [] + # gmean_horizontal_values = [] + # for data_c in [data_cuda_1660, data_cuda_p100]: + # data_cuda_2 = remove_outliers_df_grouped(data_c, column="computation_speedup", group=["benchmark", "exec_policy_full", "block_size_str", "size", "gpu"]) + # cuda_summary = data_cuda_2[data_cuda_2["exec_policy_full"] == ASYNC_POLICY_NAME].groupby(["benchmark", "block_size_str", "size", "gpu"], sort=False)["computation_speedup"].apply(gmean).reset_index(drop=False) + # cuda_summary = cuda_summary.sort_values(by=["benchmark"], key=lambda x: x.apply(lambda y: int(y[1:]))) + + # # Add geomean; + # gmean_res = pd.DataFrame(cuda_summary.groupby(["benchmark"], as_index=False).agg(gmean)) + # gmean_res["benchmark"] = "mean" + # gmean_horizontal_value = gmean(gmean_res["computation_speedup"]) + # gmean_horizontal_values += [gmean_horizontal_value] + # gmean_res["computation_speedup"] = 0 + # res_tmp = pd.concat([cuda_summary, gmean_res]) + + # # Do it again, workaround to have another fake column; + # gmean_res = pd.DataFrame(cuda_summary.groupby(["benchmark"], as_index=False).agg(gmean)) + # gmean_res["benchmark"] = "mean2" + # gmean_horizontal_value = gmean(gmean_res["computation_speedup"]) + # gmean_res["computation_speedup"] = 0 + # data_list += [res_tmp, gmean_res] + # res = 
pd.concat(data_list).reset_index(drop=True) + + # num_col = 1 + # fig = plt.figure(figsize=(3.8 * num_col, 2)) + # gs = gridspec.GridSpec(1, 1) + # plt.subplots_adjust(top=0.78, + # bottom=0.15, + # left=0.14, + # right=.99, + # hspace=0.9, + # wspace=0.05) + + # palettes = ["#A2F2B1", "#6CC982"]# * len(cuda_summary["benchmark"].unique()) + ["#96DE9B"] + + # ax = fig.add_subplot(gs[0, 0]) + # ax0 = ax + + # ax = sns.barplot(x="benchmark", y="computation_speedup", hue="gpu", data=res, order=list(BENCHMARK_NAMES.keys()), ci=95, + # palette=palettes, capsize=.05, errwidth=0.8, ax=ax, edgecolor="#2f2f2f", estimator=gmean, zorder=2, saturation=1) + # ax.legend_.remove() # Hack to remove legend; + + # gpu_dict = {"P100": "P100", "GTX1660 Super": "1660"} + # for i, g in enumerate(gpus): + # ax.axhline(y=float(f"{gmean_horizontal_values[i]:.2}"), color="#D98159" if i else COLORS["peach1"], linestyle="-", zorder=1, linewidth=1, ) + # color = "#D98159" if i else COLORS["peach1"] + # alpha = 1 + # color = "#2f2f2f" + # alpha = 0.75 + i * 0.25 + # ax.annotate(f"{gpu_dict[g]}, geomean\nspeedup: {gmean_horizontal_values[i]:.2f}x", xy=(0.75, 0.26 + i * 0.25), xycoords="axes fraction", ha="left", alpha=alpha, color=color, fontsize=6) + # ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + # ax.annotate(f"Serial execution", xy=(0.75, 0.12), xycoords="axes fraction", ha="left", color="#2f2f2f", fontsize=6, alpha=0.5) + + # ax.set_ylabel("Speedup", fontsize=11) + # ax.set_xlabel("") + # ax.set_ylim((0.5, 3)) + # labels = ax.get_xticklabels() + # for j, l in enumerate(labels): + # l.set_text(BENCHMARK_NAMES[l._text]) + # ax.set_xticklabels(labels, ha="center", va="top") + # ax.tick_params(axis='x', which='major', labelsize=8, rotation=0) + + # ax.yaxis.set_major_formatter(ticker.StrMethodFormatter("{x:.1f}x")) + # ax.yaxis.set_major_locator(plt.LinearLocator(6)) + # ax.tick_params(axis='y', which='major', labelsize=8) + # ax.grid(True, axis="y") + + # 
update_width(ax, 0.4) + + # # Speedup labels; + # offsets = [] + # for k, g in res.groupby(["benchmark", "gpu"]): + # offsets += [get_upper_ci_size(g["computation_speedup"], ci=0.5)] + # offsets = offsets[:(len(offsets)//2)] + ([0] * 2) + offsets[(len(offsets)//2):] + ([0] * 2) + # offsets = [o + 0.05 if not np.isnan(o) else 0.2 for o in offsets] + # offsets[0] = 0.15 + # # offsets[5] = 0.1 + # # offsets[6] = 0.1 + # # offsets[8] = 0.2 + # offsets[9] = 0.12 + # offsets[10] = 0.1 + # add_labels(ax, vertical_offsets=offsets, rotation=0, format_str="{:.2f}", fontsize=6, skip_zero=False) + + # plt.suptitle("Achievable speedup in C++ CUDA with hand-tuned\nGPU data transfer and execution overlap", fontsize=11, x=.01, y=0.99, ha="left") + + # gpu_dict = {"P100": "Tesla P100", "GTX1660 Super": "GTX1660 Super"} + # legend_labels = [gpu_dict[g] for g in gpus] + # custom_lines = [Patch(facecolor=palettes[i], edgecolor="#2f2f2f", label=l) + # for i, l in enumerate(legend_labels)] + # leg = fig.legend(custom_lines, legend_labels, bbox_to_anchor=(0.99, 0.78), fontsize=8, ncol=1) + # leg.set_title("") + # leg._legend_box.align = "left" + # leg.get_frame().set_facecolor('white') + + # save_plot(PLOT_DIR, "cuda_speedup_{}.{}", OUTPUT_DATE) + + #%% Using 3 GPUs + ############################ + ############################ + + # data_grcuda_p100 = load_data(INPUT_DATE_GRCUDA_P100, skip_iter=3) + # data_grcuda_960 = load_data(INPUT_DATE_GRCUDA_960, skip_iter=3) + # data_grcuda_1660 = load_data(INPUT_DATE_GRCUDA_1660, skip_iter=3) + # data_cuda_960 = load_data_cuda(INPUT_DATE_CUDA_960, skip_iter=3) + # data_cuda_p100 = load_data_cuda(INPUT_DATE_CUDA_P100, skip_iter=3) + # data_cuda_1660 = load_data_cuda(INPUT_DATE_CUDA_1660, skip_iter=3) + # data_cuda_960["gpu"] = "GTX960" + # data_grcuda_960["gpu"] = "GTX960" + # data_cuda_p100["gpu"] = "P100" + # data_grcuda_p100["gpu"] = "P100" + # data_cuda_1660["gpu"] = "GTX1660 Super" + # data_grcuda_1660["gpu"] = "GTX1660 Super" + + # # Ignore 
sync policies; + # # data_cuda_960 = data_cuda_960[data_cuda_960["exec_policy"] != "sync"] + # # data_grcuda_960 = data_grcuda_960[data_grcuda_960["exec_policy"] != "sync"] + # # data_cuda_p100 = data_cuda_p100[data_cuda_p100["exec_policy"] != "sync"] + # # data_grcuda_p100 = data_grcuda_p100[data_grcuda_p100["exec_policy"] != "sync"] + + # data_960 = join_tables_baseline(data_cuda_960, data_grcuda_960) + # data_p100 = join_tables_baseline(data_cuda_p100, data_grcuda_p100) + # data_1660 = join_tables_baseline(data_cuda_1660, data_grcuda_1660) + + # data = pd.concat([data_960, data_1660, data_p100]).reset_index(drop=True) + + # data = data[data["force_prefetch"] == False] + + # # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + # plt.rcParams['xtick.major.pad'] = 4 + + # data = pd.melt(data, id_vars=["gpu", "benchmark", "exec_policy", "size", "block_size_str", "computation_sec"], value_vars=data.columns[-5:], + # var_name="versus", value_name="speedup") + # data["size_str"] = data["size"].astype(str) + + # palette = {f"speedup_{ASYNC_POLICY_NAME}": COLORS["peach1"], "speedup_cudagraph": COLORS["b2"], "speedup_sync": COLORS["b8"], "speedup_cudagraphmanual": COLORS["b4"], "speedup_cudagraphsingle": COLORS["b8"]} + # # markers = {f"speedup_{ASYNC_POLICY_NAME}": "o", "speedup_cudagraph": "X", "speedup_sync": "D", "speedup_cudagraphmanual": "P", "speedup_cudagraphsingle": "D"} + # markers = {"GTX960": "o", "GTX1660 Super": "X", "P100": "D"} + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # policy_list = 
list(reversed(sorted(data["exec_policy"].unique()))) + # num_col = len(benchmark_list) // 2 + # num_row = len(policy_list) * 2 + # fig = plt.figure(figsize=(2.2 * num_col, 2.15 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.84, + # bottom=0.12, + # left=0.10, + # right=0.98, + # hspace=1.1, + # wspace=0.15) + + # # Keep only 1 versus; + # # data = data[data["versus"] == "speedup_cudagraph"] + + # exec_time_axes = [] + # for p_i, p in enumerate(policy_list): + # for b_i, b in enumerate(benchmark_list): + # index_tot = (len(benchmark_list) * p_i + b_i) + # j = index_tot % num_col + # i = index_tot // num_col + # curr_res = data[(data["benchmark"] == b) & (data["exec_policy"] == p)].reset_index(drop=True) + # curr_res = remove_outliers_df_grouped(curr_res, column="computation_sec", group=["block_size_str", "size"]) + # exec_time_axes += [build_exec_time_plot_grcuda_cuda_2rows_multigpu3(curr_res, gs, i, j, p, palette, markers)] + + # # Legend; + # # versus = [l for l in data["versus"].unique() if l not in ["speedup_sync", "speedup_cudagraph"]] + # # names = {f"speedup_{ASYNC_POLICY_NAME}": "Hand-tuned CUDA events", "speedup_cudagraph": "CUDA Graphs + events", "speedup_sync": "CUDA synchronous", "speedup_cudagraphmanual": "CUDA Graphs, manual dep.", "speedup_cudagraphsingle": "CUDA Graphs, single stream"} + # # legend_labels = [names[l] for l in versus] + # # custom_lines = [ + # # lines.Line2D([], [], color="white", marker=markers[l], markersize=10, label=names[l], markerfacecolor=palette[l], markeredgecolor="#2f2f2f") + # # for l in versus] + # # leg = fig.legend(custom_lines, legend_labels, + # # bbox_to_anchor=(0.99, 1), fontsize=10, ncol=1, handletextpad=0.1, columnspacing=0.2) + # # leg.set_title("CUDA baseline type", prop={"size": 10}) + # # leg._legend_box.align = "left" + + # plt.annotate("Input number of elements", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # # 
plt.annotate("Speedup", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Speedup of GrCUDA against\nhand-optimized CUDA Graphs\n(higher is better)", fontsize=16, x=.05, y=0.99, ha="left") + + # l1 = lines.Line2D([0.01, 0.99], [0.455, 0.455], transform=fig.transFigure, figure=fig, color="#2f2f2f", linestyle="--", linewidth=1) + # fig.lines.extend([l1]) + + # save_plot(PLOT_DIR, "speedup_baseline_grcuda_cuda_multicuda3_{}.{}", OUTPUT_DATE) + + + + #%% Performance of GrCUDA vs CUDA Graphs on all GPUs + #################################################### + #################################################### + + data_grcuda_p100 = load_data(INPUT_DATE_GRCUDA_P100, skip_iter=3) + data_grcuda_1660 = load_data(INPUT_DATE_GRCUDA_1660, skip_iter=3) + data_grcuda_960 = load_data(INPUT_DATE_GRCUDA_960, skip_iter=3) + data_cuda_960 = load_data_cuda(INPUT_DATE_CUDA_960, skip_iter=3, add_prefetch_as_policy=False) + data_cuda_1660 = load_data_cuda(INPUT_DATE_CUDA_1660, skip_iter=3, add_prefetch_as_policy=False) + data_cuda_p100 = load_data_cuda(INPUT_DATE_CUDA_P100, skip_iter=3, add_prefetch_as_policy=False) + data_cuda_960["gpu"] = "GTX960" + data_grcuda_960["gpu"] = "GTX960" + data_cuda_1660["gpu"] = "GTX1660 Super" + data_grcuda_1660["gpu"] = "GTX1660 Super" + data_cuda_p100["gpu"] = "P100" + data_grcuda_p100["gpu"] = "P100" + + data_grcuda_p100 = data_grcuda_p100[data_grcuda_p100["force_prefetch"] == False] + data_grcuda_960 = data_grcuda_960[data_grcuda_960["force_prefetch"] == False] + data_grcuda_1660 = data_grcuda_1660[data_grcuda_1660["force_prefetch"] == False] + data_cuda_960 = data_cuda_960[data_cuda_960["force_prefetch"] == False] + data_cuda_p100 = data_cuda_p100[data_cuda_p100["force_prefetch"] == False] + data_cuda_1660 = data_cuda_1660[data_cuda_1660["force_prefetch"] == False] + + # Ignore sync policies; + # data_cuda_960 = data_cuda_960[data_cuda_960["exec_policy"] != "sync"] + # data_grcuda_960 
= data_grcuda_960[data_grcuda_960["exec_policy"] != "sync"] + # data_cuda_p100 = data_cuda_p100[data_cuda_p100["exec_policy"] != "sync"] + # data_grcuda_p100 = data_grcuda_p100[data_grcuda_p100["exec_policy"] != "sync"] + + data_960 = join_tables_baseline(data_cuda_960, data_grcuda_960) + data_1660 = join_tables_baseline(data_cuda_1660, data_grcuda_1660) + data_p100 = join_tables_baseline(data_cuda_p100, data_grcuda_p100) + + data = pd.concat([data_960, data_1660, data_p100]).reset_index(drop=True) + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['axes.titlepad'] = 20 + plt.rcParams['axes.labelpad'] = 10 + plt.rcParams['axes.titlesize'] = 22 + plt.rcParams['axes.labelsize'] = 14 + plt.rcParams['xtick.major.pad'] = 4 + + data = pd.melt(data, id_vars=["gpu", "benchmark", "exec_policy", "size", "block_size_str", "computation_sec"], value_vars=data.columns[-5:], + var_name="versus", value_name="speedup") + data = data[~data["versus"].isin(["speedup_sync", "speedup_cudagraphsingle"])] + data["size_str"] = data["size"].astype(str) + + data = data[data["exec_policy"] == ASYNC_POLICY_NAME] + + palette = {f"speedup_{ASYNC_POLICY_NAME}": COLORS["peach1"], "speedup_cudagraph": COLORS["b2"], "speedup_sync": COLORS["b8"], "speedup_cudagraphmanual": COLORS["b4"], "speedup_cudagraphsingle": COLORS["b8"]} + markers = {f"speedup_{ASYNC_POLICY_NAME}": "o", "speedup_cudagraph": "X", "speedup_sync": "D", "speedup_cudagraphmanual": "P", "speedup_cudagraphsingle": "D"} + + #%% + + # Lists of benchmarks and block sizes; + benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + policy_list = list(reversed(sorted(data["exec_policy"].unique()))) + gpu_list = list(data["gpu"].unique()) + num_col = len(benchmark_list) // 2 + num_row = len(gpu_list) 
* 2 + fig = plt.figure(figsize=(2.2 * num_col, 1.6 * num_row)) + gs = gridspec.GridSpec(num_row, num_col) + plt.subplots_adjust(top=0.875, + bottom=0.085, + left=0.10, + right=0.965, + hspace=1, + wspace=0.15) + + # Keep only 1 versus; + # data = data[data["versus"] == "speedup_cudagraph"] + + exec_time_axes = [] + for g_i, g in enumerate(gpu_list): + for b_i, b in enumerate(benchmark_list): + index_tot = (len(benchmark_list) * g_i + b_i) + j = index_tot % num_col + i = index_tot // num_col + curr_res = data[(data["benchmark"] == b) & (data["gpu"] == g)].reset_index(drop=True) + sizes = sorted(data[data["benchmark"] == b]["size"].unique()) + curr_res = remove_outliers_df_grouped(curr_res, column="computation_sec", group=["block_size_str", "size"]) + print(g, b, len(curr_res)) + exec_time_axes += [build_exec_time_plot_grcuda_cuda_3rows_multigpu(curr_res, gs, i, j, g, palette, markers, sizes)] + + # Legend; + versus = [l for l in data["versus"].unique() if l != "speedup_sync"] + names = {f"speedup_{ASYNC_POLICY_NAME}": "Hand-tuned CUDA events", "speedup_cudagraph": "CUDA Graphs + events", "speedup_sync": "CUDA synchronous", "speedup_cudagraphmanual": "CUDA Graphs, manual dep.", "speedup_cudagraphsingle": "CUDA Graphs, single stream"} + legend_labels = [names[l] for l in versus] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[l], markersize=10, label=names[l], markerfacecolor=palette[l], markeredgecolor="#2f2f2f") + for l in versus] + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.99, 1), fontsize=10, ncol=1, handletextpad=0.1, columnspacing=0.2) + leg.set_title("CUDA baseline type", prop={"size": 10}) + leg._legend_box.align = "left" + + plt.annotate("Input number of elements (x-axis not to scale)", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # plt.annotate("Speedup", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + 
plt.suptitle("Speedup of our GrCUDA scheduling\nagainst hand-optimized CUDA Graphs\n(higher is better)", fontsize=16, x=.05, y=0.99, ha="left") + + l1 = lines.Line2D([0.01, 0.99], [0.322, 0.322], transform=fig.transFigure, figure=fig, color="#2f2f2f", linestyle="--", linewidth=1) + l2 = lines.Line2D([0.01, 0.99], [0.608, 0.608], transform=fig.transFigure, figure=fig, color="#2f2f2f", linestyle="--", linewidth=1) + fig.lines.extend([l1, l2]) + + save_plot(PLOT_DIR, "speedup_baseline_grcuda_cuda_3gpu_{}.{}", OUTPUT_DATE) + \ No newline at end of file diff --git a/projects/resources/python/plotting/plot_speedup_grcuda_vs_async_baseline_2gpu.py b/projects/resources/python/plotting/plot_speedup_grcuda_vs_async_baseline_2gpu.py new file mode 100644 index 00000000..a8ce6ace --- /dev/null +++ b/projects/resources/python/plotting/plot_speedup_grcuda_vs_async_baseline_2gpu.py @@ -0,0 +1,91 @@ +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns +import matplotlib.ticker as tkr +import matplotlib.gridspec as gridspec +from matplotlib.patches import Patch +import os +import matplotlib.lines as lines +import matplotlib.ticker as ticker +from plot_utils import * + + +# INPUT_DATE = "2020_09_19_grcuda" +OUTPUT_DATE = "2020_10_14" +PLOT_DIR = "../../../../grcuda-data/plots" + +BENCHMARK_NAMES = {"b1": "Vector Squares", "b5": "B&S", "b8": "Images", "b6": "ML Ensemble", "b7": "HITS", "b10": "DL"} + +PALETTE_GW = [COLORS[r] for r in ["b1","b2","b3", "b4", "b5"]] +HATCHES = ['', '/'*4, '\\'*4, '++++', '**'] + +if __name__ == "__main__": + + data = pd.read_csv("2GPU_allParents_vs_1GPU_Async.csv", sep=';') + singleGPU = data['number_GPU']==1 + data.loc[singleGPU,'parent_stream_policy'] = ['Baseline Async 1 GPU']*len(data[singleGPU]) + + # sns.set_theme(style="whitegrid") + sns.set_style("whitegrid", {"ytick.left": True}) + plt.rcParams["font.family"] = ["serif"] + plt.rcParams["font.size"] = 12 + plt.rcParams['hatch.linewidth'] = 0.6 + 
plt.rcParams['axes.labelpad'] = 5 + plt.rcParams['pdf.fonttype'] = 42 + plt.rcParams['ps.fonttype'] = 42 + + conf = data['parent_stream_policy'].unique() + ylabels = {'computation_sec':"Computation Time [s]"} + ylims = {} + + for var in ["computation_sec"]: + g = sns.catplot(data=data, kind='bar', ci=99, x="size", + y=var, hue='parent_stream_policy', row='benchmark', + alpha=1, palette=PALETTE_GW, height=2.5, aspect=5, legend_out=False, + sharey=False, sharex=False, margin_titles=True) + g.set_axis_labels("Input Size", ylabels[var]) + #g.despine(left=True) + # .set_titles("{col_name}") + # .set_yticklabels(list(range(100,1000,100))+list(range(1000,10000,1000))+list(range(10000, 75000, 10000))) + + for i,axes in enumerate(g.axes): + for ii,ax in enumerate(axes): + # ax.set(ylim=ylims[var][i]) + # plt.sca(axes[ii]) + # #if ii!=0: + # a = list(range(0, ylims[var][i][1]+1,(ylims[var][i][1]+1)//4)) + # print(a) + # plt.yticks(a) + # if ii!=0: + # plt.yticks(a, ['']*5) + # g.set_axis_labels("Input Size", ylabels[var]) + for j, bar in enumerate(ax.patches): + bar.set_hatch(HATCHES[(j//5)]) + # bar.set_edgecolor('k') + + # for j, bar in enumerate([p for p in ax.patches if not pd.isna(p)]): + # bar.set_hatch(HATCHES[j // len(axes)]) + # ax.yaxis.set_minor_locator(tkr.LogLocator(base=10, subs='all')) + # ax.yaxis.set_minor_formatter(tkr.NullFormatter()) + # if var == 'Wall_train': + # g.set(yscale='log') + + # Add legend; + g.legend.remove() # Remove the existing legend again; + custom_lines = [Patch(facecolor=PALETTE_GW[0], hatch=HATCHES[0], edgecolor="w", label=conf[0]), + Patch(facecolor=PALETTE_GW[1], hatch=HATCHES[1], edgecolor="w", label=conf[1]), + Patch(facecolor=PALETTE_GW[2], hatch=HATCHES[2], edgecolor="w", label=conf[2]), + Patch(facecolor=PALETTE_GW[3], hatch=HATCHES[3], edgecolor="w", label=conf[3]), + Patch(facecolor=PALETTE_GW[4], hatch=HATCHES[4], edgecolor="w", label=conf[4])] + + legend_data = {a:b for a,b in zip(conf,custom_lines)} + 
g.add_legend(legend_data, loc="center left", bbox_to_anchor=(0., 0.6), fontsize=11, ncol=1, handletextpad=0.2, columnspacing=0.4, fancybox=True) + g.legend.set_title("Parent Stream Policy") + #g._legend_box.align = "left" + g.figure.suptitle("2 GPU All Parent Policies vs 1 GPU Async Baseline") + g.legend.get_frame().set_facecolor('white') + plt.subplots_adjust(left=0.07, bottom=0.065, right=0.98, top=0.96, hspace=0.2, wspace=0.14) + # plt.savefig(f"Total{var}.pdf") + plt.show() + diff --git a/projects/resources/python/plotting/plot_theoretical_performance.py b/projects/resources/python/plotting/plot_theoretical_performance.py new file mode 100755 index 00000000..6d43174d --- /dev/null +++ b/projects/resources/python/plotting/plot_theoretical_performance.py @@ -0,0 +1,885 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Mon Jun 29 12:00:01 2020 + +@author: alberto.parravicini +""" + +import json +import numpy as np +import pandas as pd +import seaborn as sns +import matplotlib.pyplot as plt +import matplotlib.gridspec as gridspec +from scipy.stats.mstats import gmean +from matplotlib.patches import Patch, Rectangle +from matplotlib.collections import PatchCollection, LineCollection +import matplotlib.lines as lines +import math + +import os +from load_data import load_data, load_data_cuda, join_tables, compute_speedup +from segretini_matplottini.src.plot_utils import COLORS, get_exp_label, get_ci_size, save_plot, remove_outliers_df_grouped +import matplotlib.ticker as ticker + +############################## +############################## + +DEFAULT_RES_DIR = "../../../../grcuda-data/results/scheduling" + +# INPUT_DATE_GRCUDA = "960/2020_09_29_20_11_01_grcuda_withphases_forceprefetch" +OUTPUT_DATE = "2020_10_14" +PLOT_DIR = "../../../../grcuda-data/plots" + +INPUT_DATE_GRCUDA_P100 = "P100/2020_10_13_13_47_28_grcuda_with_phases" +INPUT_DATE_GRCUDA_960 = "960/2020_10_11_19_14_42_grcuda_with_phases" +INPUT_DATE_GRCUDA_1660 = "1660/2020_10_13_19_11_14_grcuda_with_phases" + +B5_ITER = 10 +B7_ITER = 5 + +BENCHMARK_NAMES = {"b1": "Vector Squares", "b5": "B&S", "b8": "Images", "b6": "ML Ensemble", "b7": "HITS", "b10": "DL"} +BENCHMARK_PHASES = { + "b1": ["square_1", "square_2", "reduce"], + "b5": 
[y for x in [[f"bs_{i}"] for i in range(B5_ITER)] for y in x], + "b6": ["rr_1", "rr_2", "rr_3", "nb_1", "nb_2", "nb_3", "nb_4", "softmax_1", "softmax_2", "argmax"], + "b7": [y for x in [[f"spmv_a_{i}", f"spmv_h_{i}", f"sum_a_{i}", f"sum_h_{i}", f"divide_a_{i}", f"divide_h_{i}", f"norm_reset_{i}"] for i in range(B7_ITER)] for y in x], + "b8": ["blur_small", "blur_large", "blur_unsharpen", "sobel_small", "sobel_large", "maximum", "minimum", "extend", "unsharpen", "combine", "combine_2"], + "b10": ["conv_x1", "pool_x1", "conv_x2", "conv_y1", "pool_y1", "conv_y2", "concat", "dot_product"], + } + + # ASYNC_POLICY_NAME = "async" # If parsing new results; + ASYNC_POLICY_NAME = "default" # If parsing older results; + + ############################## + ############################## + + def theoretical_speed(input_data, group_columns, benchmark): + data = input_data.copy() + + # Only relevant for "sync" policy; + data["theoretical_time_sec"] = THEORETICAL_SPEED_FUNCTIONS[benchmark](data) + + for key, group in data.groupby(group_columns): + # Mean theoretical (critical-path) time of the "sync" runs in this group; + mean_theoretical_time = np.mean(group[group["exec_policy"] == "sync"]["theoretical_time_sec"]) + data.loc[group.index, "speedup_wrt_theoretical"] = mean_theoretical_time / group["computation_sec"] + return data + + def theoretical_speed_b1(data): + return np.maximum(data["square_1"], data["square_2"]) + data["reduce"] + + def theoretical_speed_b5(data): + return data[[f"bs_{i}" for i in range(B5_ITER)]].max(axis=1) + + def theoretical_speed_b6(data): + return np.maximum(data["rr_1"] + data["rr_2"] + data["rr_3"] + data["softmax_1"], data["nb_1"] + data["nb_2"] + data["nb_3"] + data["nb_4"] + data["softmax_2"]) + data["argmax"] + + def theoretical_speed_b7(data): + total = np.zeros(len(data)) + for i in range(B7_ITER): + total += data[f"norm_reset_{i}"] + np.maximum(data[f"divide_a_{i}"] + np.maximum(data[f"spmv_a_{i}"] + data[f"sum_a_{i}"], data[f"spmv_h_{i}"]), + data[f"divide_h_{i}"] + 
np.maximum(data[f"spmv_h_{i}"] + data[f"sum_h_{i}"], data[f"spmv_a_{i}"])) + return total + +def theoretical_speed_b8(data): + extend = np.maximum(data["maximum"], data["minimum"]) + data["extend"] + combine = data["combine"] + np.maximum(data["blur_unsharpen"] + data["unsharpen"], data["blur_large"] + data["sobel_large"] + extend) + return data["combine_2"] + np.maximum(data["blur_small"] + data["sobel_small"], combine) + +def theoretical_speed_b10(data): + return np.maximum(data["conv_x1"] + data["pool_x1"] + data["conv_x2"], data["conv_y1"] + data["pool_y1"] + data["conv_y2"]) + data["concat"] + data["dot_product"] + +THEORETICAL_SPEED_FUNCTIONS = { + "b1": theoretical_speed_b1, + "b5": theoretical_speed_b5, + "b6": theoretical_speed_b6, + "b7": theoretical_speed_b7, + "b8": theoretical_speed_b8, + "b10": theoretical_speed_b10, + } + +############################## +############################## + +def build_theoretical_time_plot(data, gridspec, x, y): + + palette = [COLORS["peach1"], COLORS["bb1"]] + markers = ["o"] * len(palette) + + data["size_str"] = data["size"].astype(str) + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="speedup_wrt_theoretical", hue="exec_policy", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, ci=None, sort=False, zorder=2) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Add rectangles to represent variance; + for p_i, p in enumerate(sorted(data["exec_policy"].unique())): + rectangles = [] + for s_i, s in enumerate(labels): + curr_data = data[(data["size"] == s) & (data["exec_policy"] == p)] + upper_ci_size, lower_ci_size, center = get_ci_size(curr_data["speedup_wrt_theoretical"], estimator=gmean, ci=0.90) + bottom = center - lower_ci_size + width = 0.1 + lower_left = [s_i - width / 2, bottom] + # Add an offset to the x position, 
to avoid overlapping; + lower_left[0] += (2 * p_i - 1) * (width / 3.5) + rectangles += [Rectangle(lower_left, width, upper_ci_size + lower_ci_size)] + + pc = PatchCollection(rectangles, facecolor=palette[p_i], edgecolor="#2f2f2f", linewidth=0.5, zorder=3, clip_on=True, alpha=0.7) + ax.add_collection(pc) + + # Set the same y limits in each plot; + ax.set_ylim((0, 1)) + + # Add a horizontal line to denote speedup = 1x; + # ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=45, ha="right", fontsize=9, rotation_mode="anchor") + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(5)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=9) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=14, ha="center", xycoords="axes fraction") + ax.annotate(f"Min. 
theoretical time (ms):", xy=(0, -0.37), fontsize=9, ha="left", xycoords="axes fraction", color=COLORS["r4"]) + + for i, l in enumerate(labels): + baseline_median = np.median(data[(data["exec_policy"] == "sync") & (data["size"] == int(l))]["theoretical_time_sec"]) + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.47), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Add block size annotation; + if y == 0: + ax.annotate(f"Block size:\n1D={data['block_size_1d'].iloc[0]}, 2D={data['block_size_2d'].iloc[0]}x{data['block_size_2d'].iloc[0]}", xy=(-0.65, 1.25), fontsize=14, ha="left", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + # Legend; + if y == 0 and x == 0: + legend_labels = ["DAG Scheduling", "Serial Scheduling"] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.91, 0.98), fontsize=12, ncol=1, handletextpad=0.1) + leg.set_title(None) + leg._legend_box.align = "left" + + + return ax + + +def build_theoretical_time_plot_compact(data, gridspec, x, y, baseline_labels=None): + + data["size_str"] = data["size"].astype(str) + + legend_labels = ["DAG Scheduling", "Serial Scheduling"] + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b4"]][:len(data["block_size_str"].unique())] + markers = ["o", "X", "D", "P"][:len(data["block_size_str"].unique())] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="speedup_wrt_theoretical", hue="block_size_str", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, zorder=2) + data_averaged = data.groupby(["size_str", "block_size_str"], 
as_index=True)["speedup_wrt_theoretical"].apply(gmean).reset_index() + order = data["block_size_str"].unique() + ax = sns.scatterplot(x="size_str", y="speedup_wrt_theoretical", hue="block_size_str", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="block_size_str", hue_order=order, style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Set the same y limits in each plot; + ax.set_ylim((0, 1)) + + # Add a horizontal line to denote speedup = 1x; + # ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(5)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=9) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + if y == 0: + ax.annotate(f"{legend_labels[x]}", xy=(-0.15, 1.25), fontsize=14, ha="left", xycoords="axes fraction") + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=14, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + # Add baseline execution time annotations (median of execution time across blocks); + if baseline_labels: + ax.annotate(f"Median baseline exec. 
time (ms):", xy=(0, -0.22), fontsize=9, ha="left", xycoords="axes fraction", color=COLORS["r4"]) + for i, l in enumerate(labels): + baseline_median = baseline_labels[i] + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.29), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if x == 0 and y == 0: + legend_labels = [f"1D={x.split(',')[0]}, 2D={x.split(',')[1]}" for x in data["block_size_str"].unique()] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.95, 1), fontsize=12, ncol=len(legend_labels) // 2, handletextpad=0.1) + leg.set_title("Block size:") + leg._legend_box.align = "left" + + return ax + + +def build_theoretical_time_plot_2rows(data, gridspec, x, y, baseline_labels=None): + + data["size_str"] = data["size"].astype(str) + + legend_labels = ["Serial Scheduler", "Parallel Scheduler"] + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b4"]][:len(data["block_size_str"].unique())] + markers = ["o", "X", "D", "P"][:len(data["block_size_str"].unique())] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="speedup_wrt_theoretical", hue="block_size_str", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, zorder=2) + data_averaged = data.groupby(["size_str", "block_size_str"], as_index=True)["speedup_wrt_theoretical"].apply(gmean).reset_index() + order = data["block_size_str"].unique() + ax = sns.scatterplot(x="size_str", y="speedup_wrt_theoretical", hue="block_size_str", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, 
markers=markers, style="block_size_str", hue_order=order, style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Set the same y limits in each plot; + ax.set_ylim((0, 1)) + + # Add a horizontal line to denote speedup = 1x; + # ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=0, ha="center", fontsize=8) + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(5)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=9) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + if y == 0 and x % 2 == 0: + ax.annotate(f"{legend_labels[x // 2]}", xy=(-0.3, -1.4), fontsize=14, ha="center", xycoords="axes fraction", rotation=90) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=10, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.xaxis.grid(False) + + # Add baseline execution time annotations (median of execution time across blocks); + if baseline_labels: + ax.annotate(f"Median baseline exec. 
time (ms):", xy=(0, -0.34), fontsize=8, ha="left", xycoords="axes fraction", color=COLORS["peach1"]) + for i, l in enumerate(labels): + baseline_median = baseline_labels[i] + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.48), fontsize=8, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if x == 0 and y == 0: + legend_labels = [f"1D={x.split(',')[0]}" for x in data["block_size_str"].unique()] + custom_lines = [ + lines.Line2D([], [], color="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.99, 1), fontsize=10, ncol=2, handletextpad=0.1, columnspacing=0.2) + leg.set_title("Block size:\n2D=8x8, 3D=4x4x4", prop={"size": 10}) + leg._legend_box.align = "left" + + return ax + + +def build_theoretical_time_plot_2rows_default(data, gridspec, x, y, baseline_labels=None): + + data["size_str"] = data["size"].astype(str) + + legend_labels = ["Parallel Scheduler"] + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b4"]][:len(data["block_size_str"].unique())] + markers = ["o", "X", "D", "P"][:len(data["block_size_str"].unique())] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="speedup_wrt_theoretical", hue="block_size_str", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, zorder=2) + data_averaged = data.groupby(["size_str", "block_size_str"], as_index=True)["speedup_wrt_theoretical"].apply(gmean).reset_index() + order = data["block_size_str"].unique() + ax = sns.scatterplot(x="size_str", y="speedup_wrt_theoretical", hue="block_size_str", data=data_averaged, palette=palette, ax=ax, edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, 
markers=markers, style="block_size_str", hue_order=order, style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Set the same y limits in each plot; + ax.set_ylim((0, 1.2)) + + # Add a horizontal line to denote speedup = 1x; + # ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + ax.set_xticks(labels_str) + ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=0, ha="center", fontsize=8) + ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(7)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=9) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + if y == 0 and x % 2 == 0: + ax.annotate(f"{legend_labels[x // 2]}", xy=(-0.3, -1.4), fontsize=14, ha="center", xycoords="axes fraction", rotation=90) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=10, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.yaxis.grid(True) + ax.xaxis.grid(False) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Add baseline execution time annotations (median of execution time across blocks); + if baseline_labels: + ax.annotate(f"Median baseline exec. 
time (ms):", xy=(0, -0.34), fontsize=8, ha="left", xycoords="axes fraction", color=COLORS["peach1"]) + for i, l in enumerate(labels): + baseline_median = baseline_labels[i] + ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.48), fontsize=8, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if x == 0 and y == 0: + legend_labels = [f"1D={x.split(',')[0]}" for x in data["block_size_str"].unique()] + custom_lines = [ + # Patch(facecolor="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + lines.Line2D([0], [0], linestyle="none", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.99, 1), fontsize=10, ncol=2, handletextpad=0.1, columnspacing=0.2) + leg.set_title("Block size:\n2D=8x8, 3D=4x4x4", prop={"size": 10}) + leg._legend_box.align = "left" + + return ax + + +def build_theoretical_time_plot_2rows_multigpu(data, gridspec, x, y, baseline_labels=None): + + data["size_str"] = data["size"].astype(str) + + legend_labels = ["Parallel Scheduler"] + + palette = [COLORS["peach1"], COLORS["b8"], COLORS["b2"], COLORS["b4"]][:len(data["gpu"].unique())] + markers = ["o", "X", "D", "P"][:len(data["gpu"].unique())] + + # Add a lineplot with the exec times; + ax = fig.add_subplot(gridspec[x, y]) + ax.axhspan(0, 1, facecolor='0.8', alpha=0.1) + + ax = sns.lineplot(x="size_str", y="speedup_wrt_theoretical", hue="gpu", data=data, palette=palette, ax=ax, estimator=gmean, + err_style="bars", linewidth=2, legend=None, sort=False, ci=None, zorder=2) + data_averaged = data.groupby(["size_str", "gpu"], as_index=True)["speedup_wrt_theoretical"].apply(gmean).reset_index() + order = data["gpu"].unique() + ax = sns.scatterplot(x="size_str", y="speedup_wrt_theoretical", hue="gpu", data=data_averaged, palette=palette, ax=ax, 
edgecolor="#0f0f0f", + size_norm=30, legend=False, zorder=3, ci=None, markers=markers, style="gpu", hue_order=order, style_order=order, linewidth=0.05) + + labels = sorted(data["size"].unique()) + labels_str = [str(x) for x in labels] + + # Set the same y limits in each plot; + ax.set_ylim((0, 1.2)) + + # Add a horizontal line to denote speedup = 1x; + # ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # Set the x ticks; + # ax.set_xticks(labels_str) + # ax.set_xticklabels(labels=[get_exp_label(l) for l in labels], rotation=0, ha="center", fontsize=8) + # ax.tick_params(labelcolor="black") + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(7)) + # if y == 0: + # ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=9) + # else: + # ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # # Hide tick markers; + # for tic in ax.yaxis.get_major_ticks(): + # tic.tick1line.set_visible(False) + # tic.tick2line.set_visible(False) + + # Set the x ticks; + odd_ticks = 0 if (len(labels_str) % 2 == 1) else 1 + ax.set_xticks([l for i, l in enumerate(labels_str) if i % 2 == odd_ticks]) + + ax.set_xticklabels(labels=[get_exp_label(l) for i, l in enumerate(labels) if i % 2 == odd_ticks], rotation=0, ha="center", fontsize=9) + ax.tick_params(labelcolor="black", pad=3) + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(7)) + if y == 0: + ax.set_yticklabels(labels=["{:.1f}x".format(l) for l in ax.get_yticks()], ha="right", fontsize=10) + else: + ax.set_yticklabels(labels=["" for l in ax.get_yticks()]) + # Hide tick markers; + for tic in ax.yaxis.get_major_ticks(): + tic.tick1line.set_visible(False) + tic.tick2line.set_visible(False) + + # Add policy annotation; + if y == 0 and x % 2 == 0: + ax.annotate(f"{legend_labels[x // 2]}", xy=(-0.3, -1.4), fontsize=14, ha="center", xycoords="axes fraction", rotation=90) + + ax.set_ylabel(None) + ax.set_xlabel(None) + + # Add 
benchmark name and baseline execution time annotations; + ax.annotate(f"{BENCHMARK_NAMES[data['benchmark'].iloc[0]]}", xy=(0.50, 1.1), fontsize=10, ha="center", xycoords="axes fraction") + + # Turn off tick lines; + ax.yaxis.grid(True) + ax.xaxis.grid(False) + + # Add a horizontal line to denote speedup = 1x; + ax.axhline(y=1, color="#2f2f2f", linestyle="--", zorder=1, linewidth=1, alpha=0.5) + + # # Add baseline execution time annotations (median of execution time across blocks); + # if baseline_labels: + # ax.annotate(f"Median baseline exec. time (ms):", xy=(0, -0.34), fontsize=8, ha="left", xycoords="axes fraction", color=COLORS["peach1"]) + # for i, l in enumerate(labels): + # baseline_median = baseline_labels[i] + # ax.annotate(f"{int(1000 * baseline_median)}", xy=(i, -0.48), fontsize=8, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + ax.annotate("Contention-free exec. time (ms):", xy=(0, -0.35), fontsize=9, ha="left", xycoords="axes fraction", color="#949494") + gpus = ["960", "1660", "P100"] + for g_i, gpu in enumerate(data["gpu"].unique()): + if g_i < len(gpus): + # Only print GPU labels in the first column (y == 0), instead of relying on the global loop variable "j"; + if y == 0: + ax.annotate(f"{gpus[g_i]}:", xy=(-0.75, -0.47 - g_i * 0.1), fontsize=9, color=palette[g_i], ha="right", xycoords=("data", "axes fraction")) + + for l_i, l in enumerate(labels): + try: + baseline_median = baseline_labels[gpu][data["benchmark"].unique()[0]][l] + except KeyError: + baseline_median = np.nan + # print(i, j, gpu, baseline_median) + if not math.isnan(baseline_median) and l_i % 2 == odd_ticks: + ax.annotate(f"{int(1000 * baseline_median)}", xy=(l_i, -0.47 - g_i * 0.1), fontsize=9, color="#2f2f2f", ha="center", xycoords=("data", "axes fraction")) + + # Legend; + if x == 0 and y == 0: + legend_labels = data["gpu"].unique() + custom_lines = [ + # Patch(facecolor="white", marker=markers[i], markersize=10, label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + lines.Line2D([0], [0], linestyle="none", marker=markers[i], markersize=10, 
label=legend_labels[i], markerfacecolor=palette[i], markeredgecolor="#2f2f2f") + for i in range(len(legend_labels))] + leg = fig.legend(custom_lines, legend_labels, + bbox_to_anchor=(0.99, 1), fontsize=10, ncol=1, handletextpad=0.1, columnspacing=0.2) + leg.set_title(None) + leg._legend_box.align = "left" + + return ax + + +############################## +############################## + +#%% + +if __name__ == "__main__": + + # Columns that uniquely identify each benchmark setup; + index_columns = ["benchmark", "exec_policy", + # "new_stream_policy", "parent_stream_policy", "dependency_policy", + "block_size_1d", "block_size_2d", + # "total_iterations", "cpu_validation", "random_init", + "size", + # "realloc", "reinit" + ] + + processed_data_p100 = [] + processed_data_summary_p100 = [] + for b in BENCHMARK_PHASES.keys(): + data_b = load_data(INPUT_DATE_GRCUDA_P100, skip_iter=3, benchmark=b, phases=BENCHMARK_PHASES[b]) + data_b = data_b[data_b["force_prefetch"] == True] + if len(data_b) > 0: + data_b = remove_outliers_df_grouped(data_b, column="computation_speedup", group=["exec_policy", "benchmark", "block_size_1d", "block_size_2d", "size", "force_prefetch"]).reset_index(drop=True) + tmp_cols = index_columns.copy() + tmp_cols.remove("exec_policy") + data_b = theoretical_speed(data_b, tmp_cols, b) + data_b["gpu"] = "P100" + data_summary = data_b.groupby(index_columns + ["gpu"])[["computation_speedup", "speedup_wrt_theoretical"]].aggregate(gmean).reset_index() + processed_data_p100 += [data_b[list(data_b.columns[:20]) + list(data_b.columns[-4:])]] + processed_data_summary_p100 += [data_summary] + + processed_data_960 = [] + processed_data_summary_960 = [] + for b in BENCHMARK_PHASES.keys(): + data_b = load_data(INPUT_DATE_GRCUDA_960, skip_iter=3, benchmark=b, phases=BENCHMARK_PHASES[b]) + if len(data_b) > 0: + data_b = remove_outliers_df_grouped(data_b, column="computation_speedup", group=["exec_policy", "benchmark", "block_size_1d", "block_size_2d", "size", 
"force_prefetch"]).reset_index(drop=True) + tmp_cols = index_columns.copy() + tmp_cols.remove("exec_policy") + data_b = theoretical_speed(data_b, tmp_cols, b) + data_b["gpu"] = "GTX960" + data_summary = data_b.groupby(index_columns + ["gpu"])[["computation_speedup", "speedup_wrt_theoretical"]].aggregate(gmean).reset_index() + processed_data_960 += [data_b[list(data_b.columns[:20]) + list(data_b.columns[-4:])]] + processed_data_summary_960 += [data_summary] + + processed_data_1660 = [] + processed_data_summary_1660 = [] + for b in BENCHMARK_PHASES.keys(): + data_b = load_data(INPUT_DATE_GRCUDA_1660, skip_iter=3, benchmark=b, phases=BENCHMARK_PHASES[b]) + data_b = data_b[data_b["force_prefetch"] == True] + if len(data_b) > 0: + data_b = remove_outliers_df_grouped(data_b, column="computation_speedup", group=["exec_policy", "benchmark", "block_size_1d", "block_size_2d", "size", "force_prefetch"]).reset_index(drop=True) + tmp_cols = index_columns.copy() + tmp_cols.remove("exec_policy") + data_b = theoretical_speed(data_b, tmp_cols, b) + data_b["gpu"] = "GTX1660 Super" + data_summary = data_b.groupby(index_columns + ["gpu"])[["computation_speedup", "speedup_wrt_theoretical"]].aggregate(gmean).reset_index() + processed_data_1660 += [data_b[list(data_b.columns[:20]) + list(data_b.columns[-4:])]] + processed_data_summary_1660 += [data_summary] + + data = pd.concat(processed_data_960 + processed_data_1660 + processed_data_p100).reset_index(drop=True) + + #%% + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # block_size_list = 
sorted(data["block_size_str"].unique(), key=lambda x: [int(y) for y in x.split(",")]) + # num_col = len(benchmark_list) + # num_row = len(block_size_list) + # fig = plt.figure(figsize=(2.5 * num_col, 4 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.85, + # bottom=0.15, + # left=0.2, + # right=0.90, + # hspace=1.2, + # wspace=0.15) + + # exec_time_axes = [] + # for b_i, b in enumerate(benchmark_list): + # for block_size_i, block_size in enumerate(block_size_list): + # curr_res = data[(data["benchmark"] == b) & (data["block_size_str"] == block_size)].reset_index(drop=True) + # exec_time_axes += [build_theoretical_time_plot(curr_res, gs, block_size_i, b_i)] + + # plt.annotate("Input number of elements", xy=(0.5, 0.03), fontsize=20, ha="center", va="center", xycoords="figure fraction") + # plt.annotate("Speedup", xy=(0.02, 0.5), fontsize=20, ha="center", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Speedup w.r.t\nminimum theoretical time", fontsize=25, x=.05, y=0.99, ha="left") + + # save_plot(PLOT_DIR, "speedup_theoretical_time_{}.{}", OUTPUT_DATE) + + + #%% Similar plot, but all block sizes are on 1 row; + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # policy_list = sorted(data["exec_policy"].unique()) + # num_col = len(benchmark_list) + # num_row = len(policy_list) + # fig = plt.figure(figsize=(2.7 * num_col, 3.9 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.8, + # bottom=0.14, + # left=0.1, + # right=0.95, + # hspace=0.8, + # wspace=0.15) + 
+ # exec_time_axes = [] + # for b_i, b in enumerate(benchmark_list): + # baselines = [] + # tmp_data = data[(data["exec_policy"] == "sync") & (data["benchmark"] == b)] + # labels = sorted(tmp_data["size"].unique()) + # for i, l in enumerate(labels): + # baselines += [np.median(tmp_data[tmp_data["size"] == int(l)]["theoretical_time_sec"])] + # for p_i, p in enumerate(policy_list): + # curr_res = data[(data["benchmark"] == b) & (data["exec_policy"] == p)].reset_index(drop=True) + # exec_time_axes += [build_theoretical_time_plot_compact(curr_res, gs, p_i, b_i, baseline_labels=baselines)] + + # plt.annotate("Input number of elements", xy=(0.5, 0.03), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # plt.annotate("Speedup", xy=(0.022, 0.44), fontsize=14, ha="left", va="center", rotation=90, xycoords="figure fraction") + # plt.suptitle("Speedup w.r.t\nminimum theoretical time", fontsize=25, x=.05, y=0.99, ha="left") + + # save_plot(PLOT_DIR, "speedup_theoretical_time_compact_{}.{}", OUTPUT_DATE) + + #%% Similar plot, but formatted for 1-column on a paper; + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + # plt.rcParams['xtick.major.pad'] = 4 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + # policy_list = list(reversed(sorted(data["exec_policy"].unique()))) + # num_col = len(benchmark_list) // 2 + # num_row = len(policy_list) * 2 + # fig = plt.figure(figsize=(2.2 * num_col, 2 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.85, + # bottom=0.10, + # left=0.10, + # right=0.98, + # hspace=0.9, + # wspace=0.15) + + # exec_time_axes = [] + # 
baselines_dict = {} + # for b_i, b in enumerate(benchmark_list): + # baselines = [] + # tmp_data = data[(data["exec_policy"] == "sync") & (data["benchmark"] == b)] + # labels = sorted(tmp_data["size"].unique()) + # for i, l in enumerate(labels): + # baselines += [np.median(tmp_data[tmp_data["size"] == int(l)]["theoretical_time_sec"])] + # baselines_dict[b] = baselines + + # for p_i, p in enumerate(policy_list): + # for b_i, b in enumerate(benchmark_list): + # index_tot = (len(benchmark_list) * p_i + b_i) + # j = index_tot % num_col + # i = index_tot // num_col + # curr_res = data[(data["benchmark"] == b) & (data["exec_policy"] == p)].reset_index(drop=True) + # exec_time_axes += [build_theoretical_time_plot_2rows(curr_res, gs, i, j, baseline_labels=baselines_dict[b])] + + # plt.annotate("Input number of elements", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # plt.suptitle("Slowdown w.r.t. execution\nwithout resource contention", fontsize=16, x=.02, y=0.99, ha="left") + + # l1 = lines.Line2D([0.01, 0.99], [0.46, 0.46], transform=fig.transFigure, figure=fig, color="#2f2f2f", linestyle="--", linewidth=1) + # fig.lines.extend([l1]) + + # save_plot(PLOT_DIR, "speedup_theoretical_time_2rows_{}.{}", OUTPUT_DATE) + + #%% Similar plot, but formatted for 1-column on a paper and without serial execution time; + + # data = pd.concat(processed_data_1660).reset_index(drop=True) + + # # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + # sns.set_style("white", {"ytick.left": True, "xtick.bottom": True}) + + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 20 + # plt.rcParams['axes.labelpad'] = 10 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + # plt.rcParams['xtick.major.pad'] = 4 + + # # Lists of benchmarks and block sizes; + # benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in 
data["benchmark"].unique()] + # policy_list = list(reversed(sorted(data["exec_policy"].unique()))) + # num_col = len(benchmark_list) // 2 + # num_row = len(policy_list) + # fig = plt.figure(figsize=(2.2 * num_col, 2.4 * num_row)) + # gs = gridspec.GridSpec(num_row, num_col) + # plt.subplots_adjust(top=0.75, + # bottom=0.18, + # left=0.10, + # right=0.98, + # hspace=0.9, + # wspace=0.15) + + # exec_time_axes = [] + # baselines_dict = {} + # for b_i, b in enumerate(benchmark_list): + # baselines = [] + # tmp_data = data[(data["exec_policy"] == "sync") & (data["benchmark"] == b)] + # labels = sorted(tmp_data["size"].unique()) + # for i, l in enumerate(labels): + # baselines += [np.median(tmp_data[tmp_data["size"] == int(l)]["theoretical_time_sec"])] + # baselines_dict[b] = baselines + + # policy_list = [ASYNC_POLICY_NAME] # Skip sync policy; + # for p_i, p in enumerate(policy_list): + # for b_i, b in enumerate(benchmark_list): + # index_tot = (len(benchmark_list) * p_i + b_i) + # j = index_tot % num_col + # i = index_tot // num_col + # curr_res = data[(data["benchmark"] == b) & (data["exec_policy"] == p)].reset_index(drop=True) + # exec_time_axes += [build_theoretical_time_plot_2rows_default(curr_res, gs, i, j, baseline_labels=baselines_dict[b])] + + # plt.annotate("Input number of elements", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + # plt.suptitle("Slowdown with respect to execution\nwithout resource contention,\nGTX 1660 Super", fontsize=16, x=.02, y=0.99, ha="left") + + # # l1 = lines.Line2D([0.01, 0.99], [0.46, 0.46], transform=fig.transFigure, figure=fig, color="#2f2f2f", linestyle="--", linewidth=1) + # # fig.lines.extend([l1]) + + # save_plot(PLOT_DIR, "speedup_theoretical_time_2rows_default_{}.{}", OUTPUT_DATE) + + #%% Similar plot, but using multiple GPUs + + # sns.set_style("whitegrid", {"xtick.bottom": True, "ytick.left": True, "xtick.color": ".8", "ytick.color": ".8"}) + sns.set_style("white", {"ytick.left": True, 
"xtick.bottom": True}) + + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['axes.titlepad'] = 20 + plt.rcParams['axes.labelpad'] = 10 + plt.rcParams['axes.titlesize'] = 22 + plt.rcParams['axes.labelsize'] = 14 + plt.rcParams['xtick.major.pad'] = 4 + + # Lists of benchmarks and block sizes; + benchmark_list = [b for b in BENCHMARK_NAMES.keys() if b in data["benchmark"].unique()] + policy_list = list(reversed(sorted(data["exec_policy"].unique()))) + num_col = len(benchmark_list) // 2 + num_row = len(policy_list) + fig = plt.figure(figsize=(2.2 * num_col, 2.5 * num_row)) + gs = gridspec.GridSpec(num_row, num_col) + plt.subplots_adjust(top=0.79, + bottom=0.18, + left=0.10, + right=0.98, + hspace=1, + wspace=0.15) + + exec_time_axes = [] + baselines_dict = {} + for g in data["gpu"].unique(): + baselines_dict[g] = {} + for b_i, b in enumerate(benchmark_list): + baselines = {} + tmp_data = data[(data["exec_policy"] == "sync") & (data["benchmark"] == b) & (data["gpu"] == g)] + labels = sorted(tmp_data["size"].unique()) + for i, l in enumerate(labels): + baselines[l] = np.median(tmp_data[tmp_data["size"] == int(l)]["theoretical_time_sec"]) + baselines_dict[g][b] = baselines + + policy_list = [ASYNC_POLICY_NAME] # Skip sync policy; + for p_i, p in enumerate(policy_list): + for b_i, b in enumerate(benchmark_list): + index_tot = (len(benchmark_list) * p_i + b_i) + j = index_tot % num_col + i = index_tot // num_col + curr_res = data[(data["benchmark"] == b) & (data["exec_policy"] == p)].reset_index(drop=True) + exec_time_axes += [build_theoretical_time_plot_2rows_multigpu(curr_res, gs, i, j, baseline_labels=baselines_dict)] + + plt.annotate("Input number of elements (x-axis not to scale)", xy=(0.5, 0.02), fontsize=14, ha="center", va="center", xycoords="figure fraction") + plt.suptitle("Slowdown with respect to execution\nwithout resource contention", fontsize=16, x=.02, y=0.99, ha="left") + + # l1 = lines.Line2D([0.01, 0.99], [0.46, 0.46], 
transform=fig.transFigure, figure=fig, color="#2f2f2f", linestyle="--", linewidth=1) + # fig.lines.extend([l1]) + + save_plot(PLOT_DIR, "speedup_theoretical_time_2rows_multigpu_{}.{}", OUTPUT_DATE) + + #%% + + # slowdown = gmean(data["grcuda_cuda_speedup"]) + # print(slowdown) + + # slowdown_dict = {"sync": [], ASYNC_POLICY_NAME: []} + # for i, g in data.groupby(["benchmark", "exec_policy"]): + # max_size = g["size"].max() + # slowdown_dict[i[1]] += [gmean(g[g["size"] == max_size]["grcuda_cuda_speedup"])] + # print(gmean(slowdown_dict["sync"]), gmean(slowdown_dict[ASYNC_POLICY_NAME])) \ No newline at end of file diff --git a/projects/resources/python/plotting/plot_transfer_computation_overlap.py b/projects/resources/python/plotting/plot_transfer_computation_overlap.py new file mode 100755 index 00000000..5d72550e --- /dev/null +++ b/projects/resources/python/plotting/plot_transfer_computation_overlap.py @@ -0,0 +1,310 @@ +# Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NECSTLab nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# * Neither the name of Politecnico di Milano nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+ +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Tue Jul 21 09:45:50 2020 + +@author: alberto.parravicini +""" + +import os +import numpy as np +import pandas as pd +import seaborn as sns +import matplotlib.pyplot as plt +import matplotlib.gridspec as gridspec +from scipy.stats.mstats import gmean +from matplotlib.patches import Patch, Rectangle +from matplotlib.collections import PatchCollection, LineCollection +import matplotlib.lines as lines +from segretini_matplottini.src.plot_utils import COLORS, get_exp_label, get_ci_size, save_plot + + +DEFAULT_RES_DIR = "../../../../grcuda-data/results/scheduling_nvprof_log" + +# 960 +INPUT_DATE_960 = "2020_10_07_960" +# P100 +INPUT_DATE_P100 = "2020_10_10_P100" +# 1660 +INPUT_DATE_1660 = "2020_10_10_1660" + +OUTPUT_DATE = "2020_10_11" +PLOT_DIR = "../../../../grcuda-data/plots" + +BENCHMARK_NAMES = {"b1": "Vector Squares", "b5": "B&S", "b8": "Images", "b6": "ML Ensemble", "b7": "HITS","b10": "DL"} + +LABEL_DICT = {"ct_overlap_perc": "CT", "tc_overlap_perc": "TC", "cc_overlap_perc": "CC", "total_overlap_perc": "TOT", "fake_perc": ""} +LABEL_LEGEND_DICT = {"ct_overlap_perc": "CT, computation w.r.t transfer", + "tc_overlap_perc": "TC, transfer 
w.r.t computation", + "cc_overlap_perc": "CC, computation w.r.t computation", + "total_overlap_perc": "TOT, any type of overlap" + } + +SPEEDUPS = { + "b1": 1.17, + "b5": 1.33, + "b6": 1.22, + "b7": 1.13, + "b8": 1.32, + "b10": 1.34, + } + +SPEEDUPS_960 = { + "b1": 1.17, + "b5": 1.33, + "b6": 1.22, + "b7": 1.13, + "b8": 1.55, + "b10": 1.34, + } + +SPEEDUPS_P100 = { + "b1": 2.55, + "b5": 2.79, + "b6": 1.39, + "b7": 1.33, + "b8": 1.49, + "b10": 1.17, + } + +SPEEDUPS_1660 = { + "b1": 2.68, + "b5": 1.83, + "b6": 1.28, + "b7": 1.38, + "b8": 1.34, + "b10": 1.19, + } + +GPU_NAMES = ["GTX 960", "GTX 1660 Super", "Tesla P100"] + +#%% +if __name__ == "__main__": + + # data = pd.read_csv(os.path.join(DEFAULT_RES_DIR, INPUT_DATE_P100, "summary.csv")) + + # # Add a fake column for visualization; + # data["fake_perc"] = 0.0 + # data["benchmark_num"] = [list(BENCHMARK_NAMES.keys()).index(x) for x in data["benchmark"]] + + # # Pivot the dataset; + # data_pivot = pd.melt(data, id_vars=[data.columns[0], data.columns[-1]], value_vars=data.columns[1:-1], + # var_name="overlap_type", value_name="overlap_perc") + # data_pivot = data_pivot.sort_values(["benchmark_num"], ignore_index=True, kind="mergesort") + + # # Remove the fake column for the last benchmark; + # last_b = data_pivot["benchmark"].unique()[-1] + # data_pivot = data_pivot[~((data_pivot["benchmark"] == last_b) & (data_pivot["overlap_type"] == "fake_perc"))] + + # # Obtain x values for the plot; + # x = np.arange(len(data_pivot)) + # # Obtain labels; + # x_labels = [LABEL_DICT[l] for l in data_pivot["overlap_type"]] + # # Obtain y; + # y = data_pivot["overlap_perc"] + + # sns.set_style("white", {"ytick.left": True}) + # plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + # plt.rcParams['axes.titlepad'] = 25 + # plt.rcParams['axes.labelpad'] = 9 + # plt.rcParams['axes.titlesize'] = 22 + # plt.rcParams['axes.labelsize'] = 14 + # plt.rcParams['xtick.major.pad'] = 1 + + # num_col = len(data_pivot["benchmark"].unique()) + 
# # fig = plt.figure(figsize=(1.2 * num_col, 3)) + # # gs = gridspec.GridSpec(1, num_col) + + # fig = plt.figure(figsize=(1.2 * num_col, 2.8)) + # ax = fig.add_subplot() + # plt.subplots_adjust(top=0.72, + # bottom=0.25, + # left=0.08, + # right=.99, + # hspace=0.9, + # wspace=0.0) + # p = [COLORS["b3"], COLORS["b8"], COLORS["y3"], COLORS["r5"], COLORS["bb4"], COLORS["bb5"]] + # # p = ["#FFEDAB", "#FFDB8C", "#FFC773", "#FFAF66"] + # p = ["#C8FCB6", "#96DE9B", "#66B784", "#469E7B"] + # palette = (p[:len(LABEL_DICT) - 1] + ["#ffffff"]) * num_col + # palette = palette[:len(x)] + # edgecolor = ["#ffffff" if (p == "#ffffff" or y[i] <= 0) else "#2f2f2f" for i, p in enumerate(palette)] + + # bar_width = 0.8 + + # white_bars = (([1] * len(LABEL_LEGEND_DICT) + [0]) * num_col)[:-1] + # edgecolor_white_bars = ["#ffffff" if p == "#ffffff" else "#0f0f0f" for i, p in enumerate(palette)] + # ax.bar(x, white_bars, bar_width, color="0.8", edgecolor=edgecolor_white_bars, alpha=0.5) + # ax.bar(x, y, bar_width, color=palette, edgecolor=edgecolor) + # ax.set_xticks(x) + # ax.set_xticklabels(x_labels, fontsize=9) + + # ax.set_xlim((0 - bar_width / 2 - 0.2, len(x) - 1 + bar_width / 2 + 0.2)) + # ax.set_ylim((0, 1)) + # # Set the y ticks; + # ax.yaxis.set_major_locator(plt.LinearLocator(6)) + # ax.set_yticklabels(labels=[f"{int(l * 100)}%" for l in ax.get_yticks()], ha="right", fontsize=11) + # ax.grid(True, axis="y") + + # # Add benchmark name; + # x_label_pos = 1 / (2 * len(BENCHMARK_NAMES)) + # def get_x_label_pos(i): + # base_pos = 2 * x_label_pos * i + x_label_pos + # if i == 0: + # return base_pos - 0.015 + # elif i == len(BENCHMARK_NAMES) - 1: + # return base_pos + 0.015 + # else: + # return base_pos + # for i, b in enumerate(BENCHMARK_NAMES): + # ax.annotate(f"{BENCHMARK_NAMES[b]}", xy=(get_x_label_pos(i), -0.28), fontsize=12, ha="center", xycoords="axes fraction") + # ax.annotate(f"Speedup: ", xy=(get_x_label_pos(i) - 0.02, -0.43), fontsize=10, ha="center", xycoords="axes 
fraction") + # ax.annotate(f"{SPEEDUPS[b]:.2f}x", xy=(get_x_label_pos(i) + 0.045, -0.43), fontsize=10, ha="center", xycoords="axes fraction", color="#469E7B") + + # # Legend; + # labels = [LABEL_LEGEND_DICT[l] for l in list(LABEL_DICT.keys())[:-1]] + # custom_lines = [Patch(facecolor=palette[i], edgecolor="#2f2f2f", label=l) + # for i, l in enumerate(labels)] + # leg = fig.legend(custom_lines, labels, bbox_to_anchor=(1, 1), fontsize=10, ncol=1) + # leg.set_title("Type of overlap") + # leg._legend_box.align = "left" + # leg.get_frame().set_facecolor('white') + + # plt.suptitle("Amount of transfer and computation\noverlap for each benchmark", fontsize=16, x=.05, y=0.95, ha="left") + + # save_plot(PLOT_DIR, "overlap_{}.{}", OUTPUT_DATE) + + + # %% Plot both GPUs; + + data_p100 = pd.read_csv(os.path.join(DEFAULT_RES_DIR, INPUT_DATE_P100, "summary.csv")) + data_960 = pd.read_csv(os.path.join(DEFAULT_RES_DIR, INPUT_DATE_960, "summary.csv")) + data_1660 = pd.read_csv(os.path.join(DEFAULT_RES_DIR, INPUT_DATE_1660, "summary.csv")) + data_list = [data_960, data_1660, data_p100] + speedups = [SPEEDUPS_960, SPEEDUPS_1660, SPEEDUPS_P100] + + sns.set_style("white", {"ytick.left": True}) + plt.rcParams["font.family"] = ["Latin Modern Roman Demi"] + plt.rcParams['axes.titlepad'] = 25 + plt.rcParams['axes.labelpad'] = 9 + plt.rcParams['axes.titlesize'] = 22 + plt.rcParams['axes.labelsize'] = 14 + plt.rcParams['xtick.major.pad'] = 1 + + num_col = len(data_p100["benchmark"].unique()) + num_row = len(data_list) + fig = plt.figure(figsize=(1.2 * num_col, 2.1 * num_row)) + gs = gridspec.GridSpec(len(data_list), 1) + + plt.subplots_adjust(top=0.77, + bottom=0.09, + left=0.08, + right=.99, + hspace=0.8, + wspace=0.0) + p = [COLORS["b3"], COLORS["b8"], COLORS["y3"], COLORS["r5"], COLORS["bb4"], COLORS["bb5"]] + # p = ["#FFEDAB", "#FFDB8C", "#FFC773", "#FFAF66"] + p = ["#C8FCB6", "#96DE9B", "#66B784", "#469E7B"] + palette = (p[:len(LABEL_DICT) - 1] + ["#ffffff"]) * num_col + # palette = 
palette[:len(x)] + + bar_width = 0.8 + + for i, data in enumerate(data_list): + + ax = fig.add_subplot(gs[i, 0]) + + # Add a fake column for visualization; + data["fake_perc"] = 0.0 + data["benchmark_num"] = [list(BENCHMARK_NAMES.keys()).index(x) for x in data["benchmark"]] + + # Pivot the dataset; + data_pivot = pd.melt(data, id_vars=[data.columns[0], data.columns[-1]], value_vars=data.columns[1:-1], + var_name="overlap_type", value_name="overlap_perc") + data_pivot = data_pivot.sort_values(["benchmark_num"], ignore_index=True, kind="mergesort") + + # Remove the fake column for the last benchmark; + last_b = data_pivot["benchmark"].unique()[-1] + data_pivot = data_pivot[~((data_pivot["benchmark"] == last_b) & (data_pivot["overlap_type"] == "fake_perc"))] + + # Obtain x values for the plot; + x = np.arange(len(data_pivot)) + # Obtain labels; + x_labels = [LABEL_DICT[l] for l in data_pivot["overlap_type"]] + # Obtain y; + y = data_pivot["overlap_perc"] + edgecolor = ["#ffffff" if (p == "#ffffff" or y[j] <= 0) else "#2f2f2f" for j, p in enumerate(palette)] + + white_bars = (([1] * len(LABEL_LEGEND_DICT) + [0]) * num_col)[:-1] + edgecolor_white_bars = ["#ffffff" if p == "#ffffff" else "#0f0f0f" for j, p in enumerate(palette)] + ax.bar(x, white_bars, bar_width, color="0.8", edgecolor=edgecolor_white_bars, alpha=0.5) + ax.bar(x, y, bar_width, color=palette, edgecolor=edgecolor) + ax.set_xticks(x) + ax.set_xticklabels(x_labels, fontsize=9) + + ax.set_xlim((0 - bar_width / 2 - 0.2, len(x) - 1 + bar_width / 2 + 0.2)) + ax.set_ylim((0, 1)) + # Set the y ticks; + ax.yaxis.set_major_locator(plt.LinearLocator(6)) + ax.set_yticklabels(labels=[f"{int(l * 100)}%" for l in ax.get_yticks()], ha="right", fontsize=11) + ax.grid(True, axis="y") + + # Add benchmark name; + x_label_pos = 1 / (2 * len(BENCHMARK_NAMES)) + def get_x_label_pos(i): + base_pos = 2 * x_label_pos * i + x_label_pos + if i == 0: + return base_pos - 0.015 + elif i == len(BENCHMARK_NAMES) - 1: + return base_pos + 
0.015 + else: + return base_pos + ax.annotate(f"{GPU_NAMES[i]}", xy=(-0.065, 1.35 if i == 0 else 1.18), fontsize=16, ha="left", xycoords="axes fraction") + for j, b in enumerate(BENCHMARK_NAMES): + if i == 0: + ax.annotate(f"{BENCHMARK_NAMES[b]}", xy=(get_x_label_pos(j), 1.1), fontsize=12, ha="center", xycoords="axes fraction") + ax.annotate("Speedup: ", xy=(get_x_label_pos(j) - 0.02, -0.35), fontsize=10, ha="center", xycoords="axes fraction") + ax.annotate(f"{speedups[i][b]:.2f}x", xy=(get_x_label_pos(j) + 0.045, -0.35), fontsize=10, ha="center", xycoords="axes fraction", color="#469E7B") + + # Legend; + labels = [LABEL_LEGEND_DICT[l] for l in list(LABEL_DICT.keys())[:-1]] + custom_lines = [Patch(facecolor=palette[i], edgecolor="#2f2f2f", label=l) + for i, l in enumerate(labels)] + leg = fig.legend(custom_lines, labels, bbox_to_anchor=(1, 1), fontsize=10, ncol=1) + leg.set_title("Type of overlap") + leg._legend_box.align = "left" + leg.get_frame().set_facecolor('white') + + plt.suptitle("Amount of transfer and computation\noverlap for each benchmark", fontsize=16, x=.02, y=0.98, ha="left") + + save_plot(PLOT_DIR, "overlap_full_{}.{}", OUTPUT_DATE) + \ No newline at end of file diff --git a/projects/resources/python/plotting/segretini_matplottini b/projects/resources/python/plotting/segretini_matplottini new file mode 160000 index 00000000..8f4e3652 --- /dev/null +++ b/projects/resources/python/plotting/segretini_matplottini @@ -0,0 +1 @@ +Subproject commit 8f4e36528083527fc72d1aad48b5ba7d3471f694 diff --git a/tensorrt/README.md b/tensorrt/README.md index 30c1568f..91d2c4bb 100644 --- a/tensorrt/README.md +++ b/tensorrt/README.md @@ -1,14 +1,14 @@ -# grCUDA and TensorRT +# GrCUDA and TensorRT This directory contains a wrapper library `libtrt.so` for TensorRT. It simplifies the use of the TensorRT inference library. ## Build libtrt -Build that the grCUDA wrapper library `libtrt` for TensorRT. +Build the GrCUDA wrapper library `libtrt` for TensorRT. 
```console -$ cd ../tensorrt +$ cd ../tensorrt $ mkdir build $ cd build $ cmake .. -DTENSORRT_DIR=/usr/local/TensorRT-7.0.0.11/ @@ -75,7 +75,7 @@ prediction: 4 destroying inference runtime... ``` -## Use libtrt in grCUDA +## Use libtrt in GrCUDA ```bash GRCUDA_JAR="$GRCUDA_BUILD_DIR/mxbuild/dists/jdk1.8/grcuda.jar"