GrCUDA 0.2.1 - Transparent asynchronous scheduling and more #43

AlbertoParravicini · 2021-10-05T11:33:59Z

Hi there, we are a research team from Politecnico di Milano, and in the past year or so we did a lot of research & development on GrCUDA, with the help of Oracle Labs and OCI.
We'd like to understand if there's interest in accepting our contributions to this repo.
We decided to open a PR with our updates so far.
The main contributions are related to the asynchronous scheduling of kernels (see below), but there are also plenty of quality-of-life improvements, tests, and code samples.
We did our best to document and test our contributions, but we are aware of the size of these changes. If there's interest from your side in our work, we'll be happy to discuss how to optimally integrate our contributions into the existing codebase.

API Changes

Added option to specify arguments in NFI kernel signatures as const
- The effect is the same as marking them as in in the NIDL syntax
- It is not strictly required to have the corresponding arguments in the CUDA kernel marked as const, although that's recommended
- Marking arguments as const or in enables the async scheduler to overlap kernels that use the same read-only arguments

New asynchronous scheduler

Added a new asynchronous scheduler for GrCUDA, enable it with --experimental-options --grcuda.ExecutionPolicy=async
- With this scheduler, GPU kernels are executed asynchronously. Once they are launched, the host execution resumes immediately
- The computation is synchronized (i.e. the host thread is stalled and waits for the kernel to finish) only once GPU data are accessed by the host thread
- Execution of multiple kernels (operating on different data, e.g. distinct DeviceArrays) is overlapped using different streams
- Data transfer and execution (on different data, e.g. distinct DeviceArrays) is overlapped using different streams
- The scheduler supports different options, see README.md for the full list
- It is the scheduler presented in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
Enabled partial support for cuBLAS and cuML in the aync scheduler
- Known limitation: functions in these libraries work with the async scheduler, although they still run on the default stream (i.e. they are not asynchronous)
- They do benefit from prefetching
Set TensorRT support to experimental
- TensorRT is currently not supported on CUDA 11.4, making it impossible to use along with a recent version of cuML
- Known limitation: due to this incompatibility, TensorRT is currently not available on the async scheduler

New features

Added generic AbstractArray data structure, which is extended by DeviceArray, MultiDimDeviceArray, MultiDimDeviceArrayView, and provides high-level array interfaces
Added API for prefetching
- If enabled (and using a GPU with architecture newer or equal than Pascal), it prefetches data to the GPU before executing a kernel, instead of relying on page-faults for data transfer. It can greatly improve performance
Added API for stream attachment
- Always enabled in GPUs with architecture older than Pascal, and the async scheduler is active. With the sync scheduler, it can be manually enabled
- It restricts the visibility of GPU data to the specified stream
- In architectures newer or equal than Pascal it can provide a small performance benefit
Added copyTo/copyFrom functions on generic arrays (Truffle interoperable objects that expose the array API)
- Internally, the copy is implemented as a for loop, instead of using CUDA's memcpy
- It is still faster than copying using loops in the host languages, in many cases, and especially if host code is not JIT-ted
- It is also used for copying data to/from DeviceArrays with column-major layout, as memcpy cannot copy non-contiguous data

Demos, benchmarks and code samples

Added demo used at SeptembeRSE 2021 (demos/image_pipeline_local and demos/image_pipeline_web)
- It shows an image processing pipeline that applies a retro look to images. We have a local version and a web version that displays results on a web page
Added a benchmark suite written in Graalpython, and used in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
- It is a collection of complex multi-kernel benchmarks meant to show the benefits of asynchronous scheduling.

Miscellaneous

Added dependency to grcuda-data submodule, used to store data, results and plots used in publications and demos.
Updated name "grCUDA" to "GrCUDA". It looks better, doesn't it?
Added support for Java 11 along with Java 8
Added option to specify the location of cuBLAS and cuML with environment variables (LIBCUBLAS_DIR and LIBCUML_DIR)
Refactored package hierarchy to reflect changes to current GrCUDA (e.g. gpu -> runtime)
Added basic support for TruffleLogger
Removed a number of existing deprecation warnings
Added around 800 unit tests, with support for extensive parametrized testing and GPU mocking
Updated documentation
- Bumped GraalVM compatibility to 21.2
- Added scripts to setup a new machine from scratch (e.g. on OCI), plus other OCI-specific utility scripts (see oci_setup/)
- Added documentation to setup IntelliJ Idea for GrCUDA development
- Added documentation about Python benchmark suite
- Added documentation on asynchronous scheduler options

… element

…d schedule interfaces to grcudacomputatioanlelement

…e function

… and isExecuted; using builder for kernel config

…ble names starting with grCUDA, i.e. grCUDAExecutionContext) and filenames (actually no file or folder had grCUDA in their names) (#6)

* added submodule for grcuda-data * removed docs files that have been moved

* Updated gitignore to ignore files related to the node project (frontend and backend) Added boilerplate and for both the backend and frontend. Communication using websockets is properly working and will be handled by the GrCUDAProxy class, which will also be in charge of mocking the communication with GrCUDA until it is ready. * Updated .gitignore to ignore images * Finished barebone implementation for the demo. Added lightbox to show the full resolution image when the thumbnail is clicked. * [GRCUDA-33] adding js image processing pipeline * [FRONTEND] Added padding between images to make the visualization more readable [BACKEND] Added documentation and extracted the whole communication with the frontend into a separate function * Added README.md * [MISC] Added README.md * added first version of grcuda image pipeline for demo, in javascript * removed 'Color' from image processing function name * extended copyFrom/copyTo to support generic truffle arrays * added simple script to measure js copy performance * added simple script to measure js copy performance * updated demo to use buffers/grcuda memcpy to have fast performance * Improved the look-and-feel of the frontend via bootstrap. Added brief description of each computation type with explainatory images * Added a call to communicateImageProcessed on last image to send the signal that the computation has completed * Implemented race mode. When race mode is activated all 3 modes run concurrently and send to the frontend only progress messages instead of both progress and image-completion messages in order to avoid cluttering the UI. TODO: Find a nice way to display the photos in this mode, show the three photos side by side? * Changed race mode display to accomodate the images processed by the three modes in three columns * Begun refactoring of GrCudaProxy * Begun refactoring of GrCudaProxy: extracted utility function (like _sleep) and options (MOCK_OPTIONS etc) into the utils file. Allowed the frontend to execute different runs one after the other, without reloading the page * Templated many inline html to improve code readability and reusability * Added opencv4nodejs as a dependency * Added setup script for OCI * Updated script for setting up the web demo * Fixed inconsistent naming in GrCudaProxy.ts (delay_jitter -> delayJitter) * Fixed css for better centering of the images * Modified backend in order to allow to be runned on different ports and on different instances concurrently. * Modified frontend to communicate with the three backends independently. * adding cuda-native image pipeline * updated native cuda implementation * integrated opencv with cuda native pipeline * finished native implementation of image pipeline * Finished implementation of async backend, sync one should be identical except for the fact that the grcuda.jar should differ\ * Changed overlay offset to accomodate for the reduced size of the images * fixed resizing of images * added lut for demo * added lut to js demo * added option to specify full input/output image paths * Finished implementation of basic backend integration. TODO: figure out why the async version is ~usually~ slower than the sync one. * Removed useless instatiation of websocket * Added experimental multigpu support to grcuda * Made setup_machine_from_scratch script executable * Updated dataset store and .gitmodules such that the pull happens with https instead of ssh * added parallel channel computation; removed resizing * Modified backend to send execution time when the computation ends * Changed frontend to use new images * Modified cuda native pipeline to be run on a different gpu than the default TODO: make it a cli arg * Updated readme and *UNTESTED* setup script * Finished version 1.0 of the web demo * Fully async mode for the backend, now images are processed in an asynchrounous fash * added fallback for missing GPUs; fixed slow array copy; updated cmake and parameters in native version; replaced port 8081 * using 1024x1024 images * removed unused modal.js; forcing images to be displayed at 512px * Solved progress bar bug. * Finished version 1.0 of the demo * improved LUTs in CUDA native * added new luts to js demo * removed unnecessary code * Fixed bug in setup script * Modified CMakelist.txt to compile on default ubuntu20.04 from OCI * [parra] updated setup_machine_from_scratch.sh * Updated readme with instructions and setup_webdemo.sh script * Updated .gitignore to ignore images * [GRCUDA-33] adding js image processing pipeline * added simple script to measure js copy performance * updated demo to use buffers/grcuda memcpy to have fast performance * added lut to js demo * added parallel channel computation; removed resizing * Fully async mode for the backend, now images are processed in an asynchrounous fash * Finished version 1.0 of the demo * added new luts to js demo * Reverted changes compromised by merge to commit f4fc112 * Enabled v21.1 in setup_machine_from_scratch.sh * removed array copy test, updated install script * added option to enable experimental multi-gpu support * updated imagepipeline.py and array copy performance test * updated demo readme * removed commented-out code * made rundemo executable * updated demo scripts * Backend now requires the support for multi GPU to be enabled manually * Fixed bug where images were showing correctly in thumbnail, but did display incorrectly when opening the lightbox * Moved web_demo and image_pipeline to examples folder and then renamed examples to demos * Moved description images to main repo and changed setup script accordingly. * Removed demo specific .gitignore things from main gitignore and moved them to a gitignore specific for the web demo * Fixed bug where cuda-native did not compute the last image * renamed image_pipeline -> image_pipeline_local web_demo -> image_pipeline_web and updated setup script accordingly * readded description images * Fixed path for cuda-native version Co-authored-by: Francesco Sgherzi <[email protected]> Co-authored-by: Alberto Parravicini <[email protected]>

* added test cases for cublas support * improved testing interface to test all input options combinations at once * first working integration of cublas with async policy; still to be improved, it requires extra device sync * moved createKernelArguments into ConfiguredKernel; moved cuda library interfaces into separate package * added support for cuml in async policy; cuml is still sync, but other computations can happen concurrently * removed outdated code * added support for cuml, still to be tested * adding test for cuml * added env variable to support cuml path * added flags to disable cuml and tensorrt when not supported * removed redundant tests for sync policy * fixed missing kernelFunctionHandle function

* Updated gitignore to ignore files related to the node project (frontend and backend) Added boilerplate and for both the backend and frontend. Communication using websockets is properly working and will be handled by the GrCUDAProxy class, which will also be in charge of mocking the communication with GrCUDA until it is ready. * Updated .gitignore to ignore images * Finished barebone implementation for the demo. Added lightbox to show the full resolution image when the thumbnail is clicked. * [GRCUDA-33] adding js image processing pipeline * [FRONTEND] Added padding between images to make the visualization more readable [BACKEND] Added documentation and extracted the whole communication with the frontend into a separate function * Added README.md * [MISC] Added README.md * added first version of grcuda image pipeline for demo, in javascript * removed 'Color' from image processing function name * extended copyFrom/copyTo to support generic truffle arrays * added simple script to measure js copy performance * added simple script to measure js copy performance * updated demo to use buffers/grcuda memcpy to have fast performance * Improved the look-and-feel of the frontend via bootstrap. Added brief description of each computation type with explainatory images * Added a call to communicateImageProcessed on last image to send the signal that the computation has completed * Implemented race mode. When race mode is activated all 3 modes run concurrently and send to the frontend only progress messages instead of both progress and image-completion messages in order to avoid cluttering the UI. TODO: Find a nice way to display the photos in this mode, show the three photos side by side? * Changed race mode display to accomodate the images processed by the three modes in three columns * Begun refactoring of GrCudaProxy * Begun refactoring of GrCudaProxy: extracted utility function (like _sleep) and options (MOCK_OPTIONS etc) into the utils file. Allowed the frontend to execute different runs one after the other, without reloading the page * Templated many inline html to improve code readability and reusability * Added opencv4nodejs as a dependency * Added setup script for OCI * Updated script for setting up the web demo * Fixed inconsistent naming in GrCudaProxy.ts (delay_jitter -> delayJitter) * Fixed css for better centering of the images * Modified backend in order to allow to be runned on different ports and on different instances concurrently. * Modified frontend to communicate with the three backends independently. * adding cuda-native image pipeline * updated native cuda implementation * integrated opencv with cuda native pipeline * finished native implementation of image pipeline * Finished implementation of async backend, sync one should be identical except for the fact that the grcuda.jar should differ\ * Changed overlay offset to accomodate for the reduced size of the images * fixed resizing of images * added lut for demo * added lut to js demo * added option to specify full input/output image paths * Finished implementation of basic backend integration. TODO: figure out why the async version is ~usually~ slower than the sync one. * Removed useless instatiation of websocket * Added experimental multigpu support to grcuda * Made setup_machine_from_scratch script executable * Updated dataset store and .gitmodules such that the pull happens with https instead of ssh * added parallel channel computation; removed resizing * Modified backend to send execution time when the computation ends * Changed frontend to use new images * Modified cuda native pipeline to be run on a different gpu than the default TODO: make it a cli arg * Updated readme and *UNTESTED* setup script * Finished version 1.0 of the web demo * Fully async mode for the backend, now images are processed in an asynchrounous fash * added fallback for missing GPUs; fixed slow array copy; updated cmake and parameters in native version; replaced port 8081 * using 1024x1024 images * removed unused modal.js; forcing images to be displayed at 512px * Solved progress bar bug. * Finished version 1.0 of the demo * improved LUTs in CUDA native * added new luts to js demo * removed unnecessary code * Fixed bug in setup script * Modified CMakelist.txt to compile on default ubuntu20.04 from OCI * [parra] updated setup_machine_from_scratch.sh * Updated readme with instructions and setup_webdemo.sh script * Updated .gitignore to ignore images * [GRCUDA-33] adding js image processing pipeline * added simple script to measure js copy performance * updated demo to use buffers/grcuda memcpy to have fast performance * added lut to js demo * added parallel channel computation; removed resizing * Fully async mode for the backend, now images are processed in an asynchrounous fash * Finished version 1.0 of the demo * added new luts to js demo * Reverted changes compromised by merge to commit f4fc112 * Enabled v21.1 in setup_machine_from_scratch.sh * removed array copy test, updated install script * added option to enable experimental multi-gpu support * updated imagepipeline.py and array copy performance test * updated demo readme * removed commented-out code * made rundemo executable * updated demo scripts * Demo For SeptembRSE Conference (#14) * Updated gitignore to ignore files related to the node project (frontend and backend) Added boilerplate and for both the backend and frontend. Communication using websockets is properly working and will be handled by the GrCUDAProxy class, which will also be in charge of mocking the communication with GrCUDA until it is ready. * Updated .gitignore to ignore images * Finished barebone implementation for the demo. Added lightbox to show the full resolution image when the thumbnail is clicked. * [GRCUDA-33] adding js image processing pipeline * [FRONTEND] Added padding between images to make the visualization more readable [BACKEND] Added documentation and extracted the whole communication with the frontend into a separate function * Added README.md * [MISC] Added README.md * added first version of grcuda image pipeline for demo, in javascript * removed 'Color' from image processing function name * extended copyFrom/copyTo to support generic truffle arrays * added simple script to measure js copy performance * added simple script to measure js copy performance * updated demo to use buffers/grcuda memcpy to have fast performance * Improved the look-and-feel of the frontend via bootstrap. Added brief description of each computation type with explainatory images * Added a call to communicateImageProcessed on last image to send the signal that the computation has completed * Implemented race mode. When race mode is activated all 3 modes run concurrently and send to the frontend only progress messages instead of both progress and image-completion messages in order to avoid cluttering the UI. TODO: Find a nice way to display the photos in this mode, show the three photos side by side? * Changed race mode display to accomodate the images processed by the three modes in three columns * Begun refactoring of GrCudaProxy * Begun refactoring of GrCudaProxy: extracted utility function (like _sleep) and options (MOCK_OPTIONS etc) into the utils file. Allowed the frontend to execute different runs one after the other, without reloading the page * Templated many inline html to improve code readability and reusability * Added opencv4nodejs as a dependency * Added setup script for OCI * Updated script for setting up the web demo * Fixed inconsistent naming in GrCudaProxy.ts (delay_jitter -> delayJitter) * Fixed css for better centering of the images * Modified backend in order to allow to be runned on different ports and on different instances concurrently. * Modified frontend to communicate with the three backends independently. * adding cuda-native image pipeline * updated native cuda implementation * integrated opencv with cuda native pipeline * finished native implementation of image pipeline * Finished implementation of async backend, sync one should be identical except for the fact that the grcuda.jar should differ\ * Changed overlay offset to accomodate for the reduced size of the images * fixed resizing of images * added lut for demo * added lut to js demo * added option to specify full input/output image paths * Finished implementation of basic backend integration. TODO: figure out why the async version is ~usually~ slower than the sync one. * Removed useless instatiation of websocket * Added experimental multigpu support to grcuda * Made setup_machine_from_scratch script executable * Updated dataset store and .gitmodules such that the pull happens with https instead of ssh * added parallel channel computation; removed resizing * Modified backend to send execution time when the computation ends * Changed frontend to use new images * Modified cuda native pipeline to be run on a different gpu than the default TODO: make it a cli arg * Updated readme and *UNTESTED* setup script * Finished version 1.0 of the web demo * Fully async mode for the backend, now images are processed in an asynchrounous fash * added fallback for missing GPUs; fixed slow array copy; updated cmake and parameters in native version; replaced port 8081 * using 1024x1024 images * removed unused modal.js; forcing images to be displayed at 512px * Solved progress bar bug. * Finished version 1.0 of the demo * improved LUTs in CUDA native * added new luts to js demo * removed unnecessary code * Fixed bug in setup script * Modified CMakelist.txt to compile on default ubuntu20.04 from OCI * [parra] updated setup_machine_from_scratch.sh * Updated readme with instructions and setup_webdemo.sh script * Updated .gitignore to ignore images * [GRCUDA-33] adding js image processing pipeline * added simple script to measure js copy performance * updated demo to use buffers/grcuda memcpy to have fast performance * added lut to js demo * added parallel channel computation; removed resizing * Fully async mode for the backend, now images are processed in an asynchrounous fash * Finished version 1.0 of the demo * added new luts to js demo * Reverted changes compromised by merge to commit f4fc112 * Enabled v21.1 in setup_machine_from_scratch.sh * removed array copy test, updated install script * added option to enable experimental multi-gpu support * updated imagepipeline.py and array copy performance test * updated demo readme * removed commented-out code * made rundemo executable * updated demo scripts * Backend now requires the support for multi GPU to be enabled manually * Fixed bug where images were showing correctly in thumbnail, but did display incorrectly when opening the lightbox * Moved web_demo and image_pipeline to examples folder and then renamed examples to demos * Moved description images to main repo and changed setup script accordingly. * Removed demo specific .gitignore things from main gitignore and moved them to a gitignore specific for the web demo * Fixed bug where cuda-native did not compute the last image * renamed image_pipeline -> image_pipeline_local web_demo -> image_pipeline_web and updated setup script accordingly * readded description images * Fixed path for cuda-native version Co-authored-by: Francesco Sgherzi <[email protected]> Co-authored-by: Alberto Parravicini <[email protected]> * added test cases for cublas support * improved testing interface to test all input options combinations at once * first working integration of cublas with async policy; still to be improved, it requires extra device sync * moved createKernelArguments into ConfiguredKernel; moved cuda library interfaces into separate package * added support for cuml in async policy; cuml is still sync, but other computations can happen concurrently * removed outdated code * added support for cuml, still to be tested * adding test for cuml * added env variable to support cuml path * added flags to disable cuml and tensorrt when not supported * removed redundant tests for sync policy * improved testing interface to test all input options combinations at once * adding test for cuml * updated names of policies to more descriptive names * updated readme to use correct new options * fixed typo in readme * restored missing filter for unnecessary sync flags Co-authored-by: Francesco Sgherzi <[email protected]> Co-authored-by: Francesco Sgherzi <[email protected]> Co-authored-by: Francesco Sgherzi <[email protected]>

* added basic logging facility, restructured tests to use logger with minimal messages * added 'assumeTrue' in skippable cuML test * fixed rebase errors; removed prints in tests

* added function to obtain pointer to full array in array slice, updated CUDA runtime to use it * added internal API to obtain full array size of array view * added slow-path memcpy for column-major matrix views * added tests to validate new memcpy fallback * fixed rebase errors

…a dependency on both arrays (#13) * fixed device-to-device memcpy not considering both arrays for dependencies * cleaned memcpy code * updated log message from warning to info * fixed rebase errors Co-authored-by: Francesco Sgherzi <[email protected]> Co-authored-by: Francesco Sgherzi <[email protected]> Co-authored-by: Francesco Sgherzi <[email protected]>

* renamed package gpu to runtime * moved package array inside runtime * moved computation argument inside computation package * moved array computations and stream attach policies to separate package * refactored tests location * fixed merge errors, removed outdated files

* updated docs to configure oci instances * updated readme/setup script to use graal 21.2 * fixed errors in setup script * updated readme * removed outdated doc, useful things moved to benchmark files * updated design documentation * fixed python benchmarks using outdated paths * updated java from 11 to 8+

* adding changelog, removed unused thread manager * fixed install.sh, now using env variables to retrieve the absolute path to grcuda.jar * added curl to install script * added license to demo * removed unused cuda code; added license to benchmarks * added license to tests * added license to functions and libraries * fixed removal of thread manager breaking build * added more updated licenses * added license to runtime files * Updated changelog * fixed typo * added grcuda-data info to readme * udpated tracking of grcuda-data * temporarily removed submodule grcuda-data * readded grcuda-data submodule * tracking master? * updated grcuda-data tracking * Added the possibility to send execution times to the frontend * Display execution times in race mode * clarified streamattach in changelog Co-authored-by: Guido Walter Di Donato <[email protected]> Co-authored-by: Francesco Sgherzi <[email protected]>

* fixed install dir (GRCUDA-67) * fixed python benchmarks not creating nested folders and not using experimental options (GRCUDA-68) * updated make for cuda benchmarks and readme (GRCUDA-68)

…tion (#21) * added support for streams in cuml and cublas libraries * minor fixes in cuml, 1 test wrong * support for cublas added, cuml work in progress * minor fixes, completed streams support for cuml * added comments * changelog updated * minor fixes * minor fixes * changelog updated * Delete async-lib_12688.csv * Delete async-lib_12801.csv * Delete async-lib_13268.csv * Delete async-lib_12901.csv * Delete async-lib_13092.csv * Delete async-lib_12981.csv * Delete report_7764 * Delete report_7884 * minor fixes to solve requested changes * added cublas scheduling test with 2 gemm and 1 axpy * moved setlibrary functions and removed librarysetstreamfunction from inheriting Function * turned dependence from LibrarySetStream into local variable for both Registry files * fixed comments to cublaswithscheduletest * fixed size in benchmarks Co-authored-by: Alberto Parravicini <[email protected]>

* updated install script to support nvswitch * updated nvswitch flag

* made all the system.out.print instances as instances of GrCUDAContext.LOGGER * added GrCUDALogger Class, and removed previous edits to preexisting classes interfaces * changed general loggers referred to CrCUDAContext with specfic loggers for each project * changed general loggers referred to CrCUDAContext with specfic loggers for each project + forgotten ones * added logging level option, commented method to parse the option in GrCUDAContext, everything must be set when launching as options in the command * added logging.md file draft to documentation * correction after first revision: fix messages, add default value to GrCUDALogger, add examples in documentation * add changes to changelog * removed unused grcuda option for logging, re-inserted suppress warning for a static method in CUDARuntime * fixing code style * fixing conflicts with maaster for PR Co-authored-by: Ginevra Cerri <[email protected]>

* added first file * New GrCUDAOptionMap class with copyright * GCOptionMap: added contructor with options of GCContext * Added public method to navigate the options in a list * Moved parseXXX functions from Context and added API to acces the Map * Modified constructor and functions to expose runtime values of some GrCUDAOptions * GrCUDAOptionMap: added internal API to access options * GetOptionFuntion: added external API for getting options from OptionMap * fixed use of methods to get values of options * GrCUDAOptionMap: minor fixes and settlement of the external API * minor fix * GrCUDAOptionMap as Singleton * Default values for options moved in OptionMap * test added * copyright added * external API retrieves strings both for keys and for values * updated option map to use strings as keys, fixing polyglot access to option map * fixed polyglot option map; iterator not working * iterator: fixed and tested * added test and minor fixes * Update CHANGELOG.md Co-authored-by: Alberto Parravicini <[email protected]>

* ProfilableElement from multiGPU + updated GrCUDAOptions * Adding ProfilableElement usage in ComputationalElement, creating option for kernel timers * porting from multiGPU branch 1 * waiting support for static option map * Porting from multiGPU branch 2 * fixed stream manager test * implemented option isEnableKernelTimers in stream manager * refactored EnableKernelTimers to TimeComputation * added kernel timer option docs * updated implementation of kernel timers logging method * updated changelog * Removed logging of profilable state * final adjustments before PR * Fixes for pull request * fixed NPE in event start; fixed tests always using sync policy; temporarily removed some optionmap tests causing NPE Co-authored-by: Alberto Parravicini <[email protected]>

* initial commit, folder creation for cusparse * added function to CUSPARSERegistry, missing: nfi functions' input settings * nfi signatures completed, missing: Desc types mgmt * final commit from my machine :( * enum addition and casting * completed enum revision * added first test cusparseCoo * added SpMV test, not working * Removed useless function instantiations in CUSPARSERegistry. For the moment being we are exposing to the user the functions to create various matrix descripts as well as creating/destroying the handle. This needs to be changed * Almost functioning version of testSpMV. The error that i'm getting now is related to the enums, but at least no more polyglot exceptions * initial support for cusparse * cleaned code, more tests added * minor additions to tests * minor fixes to context and options for enabling cusparse * formatting and cleaning complete * minor fixes to context options in cusparse tests * changelog updated * Removed unused imports * Modified libcusparse.so.11 -> libcusparse.so * updated copyright for new files * added suport for async functions, implementation of non-exposed functions * removed useless initialization * added sparseSgemvi * begun implementation of proxiesn [commit della svolta] * added basic functions to proxyspmv, to be testes * added proxy for Sgemvi * Completed functions for SpMV and Sgemvi, context creation missing * tests ready for proxies, not working (context issues) * working on contexts * context creation fails * minor additions to sparse proxy * createCoo now works * proxies all right, non valid handle in buffersize function * IT WORKS * initial steps for tests implementation for Sgemvi and SpMV with CSR format * working tests for coo and csr with spmv, sgemvi does not work (does not update the vector passed as input) * minor fixes to tests * finished testing sgemvi and spmv * finished testing sgemvi and spmv * added test for libraries integration * minor fixes, all good, streams' functioning for libraries interoperability checked with profiler * changelog updated * tests * Added breaks to switch statement * partially working tests for TGemvi and SpMV * Fixed spmv Tests for coo and csr, gemvi still needs to be fixed * TGemvi now works with data types C and S * added streams syncing to tests * added syncing, sometimes (after mx clean) crs/coo do not work, tgemvi does not work with double types, despite syncing * Removed double and double complex from tests * GrCUDAOptions updated for cuSPARSE * fixed context * removed ternary expressions * small cleanups; fixed tracking of array dependencies not working in cusparse * updated changelog Co-authored-by: Francesco Sgherzi <[email protected]> Co-authored-by: Francesco Sgherzi <[email protected]> Co-authored-by: Alberto Parravicini <[email protected]>

* removed deprecation warning for arity exception * updated changelog

* updated install for graal 21.3 * updated changelog for release 2 * updated url of grcuda repo in installation script

* Integrating multiGPU in master [TEST] (#45) * removed deprecation warning for arity exception * updated changelog * updated cuda benchmark suite with multi-gpu benchmarks * updated plotting code with multi-gpu code * minor fixed in plotting code * updated python benchmarks for multi gpu * minor cleanup * fixed benchmark tests in python, added temporary multigpu options to grcuda * added options for multi-gpu support * updated grcudaexecutioncontext to have grcudaoptionmap as input * more logging, added more policy enum for multigpu * fixed crash when shutting down grcuda and numgpus < totgpus; disabled cublas on old gpus and async scheduler; added multigpu API support to runtime * added multi-gpu option in context used in tests * GRCUDA-56 added optimized interface to load/build kernels on single/multi GPU * tests are properly skipped if the system configuration does not support them * added test for manual selection of multigpu * fixed manual gpu selection not working with async scheduler. Added tracking of currently active GPU in runtime * improved computationelement profiling interface * minor updates in naming, added default GPU id * removed unnecessary logging of timers * added location tracking inside abstractarray; added abstractdevice to distinguish cpu and gpu * added mocked tests to validate abstractarray location * fixed bug on post-pascal devices where CPU reads required unnecessary sync when a read-only GPU kernel is running * replaced 'isLastComputationArrayAccess' with device-level tracking of array updates * added fixme note on possible problem with array tracking * fixed scheduling of write array being skipped when a read-only GPU computation was ongoing * adding streampolicy class; modified FIFO retrieval policy to retrieve any stream from the set * added device manager for multigpu * [GrCUDA 96-1] update python benchmark suite for multi gpu (#30) * updated plotting code with multi-gpu code * minor fixed in plotting code * updated python benchmarks for multi gpu * minor cleanup * fixed benchmark tests in python, added temporary multigpu options to grcuda * [GrCUDA-96-2] integrate multi gpu scheduler (#31) * updated plotting code with multi-gpu code * minor fixed in plotting code * updated python benchmarks for multi gpu * minor cleanup * fixed benchmark tests in python, added temporary multigpu options to grcuda * added options for multi-gpu support * updated grcudaexecutioncontext to have grcudaoptionmap as input * more logging, added more policy enum for multigpu * integrating multi-gpu stream policies * replaced default cuda benchmark with ERR instead of B1 * removed unnecessary js file * updated infrastructure for stream policy, added mocked classes, hidden current device selection in stream policy * added stream-aware device selection policy; refactored streampolicy/devicesmanager to create streams on multiple devices * adding tests for multigpu: added base case with 1 gpu * added test for multigpu image pipeline, mocked * added tests for mocked hits * added new tests for stream-aware policy * refactored gpu test for reuse in multigpu tests * added tests for multi gpu * added multigpu-disjoint policy for parent stream retrieval * added round robin and min transfer size device selection policies * added tests for round robin and multigpu-disjoint and min-transfer policies * added minmin/max transfer time policies for device selection * added script to generate connection graph * fixed oob access in min transfer time policy * added option to manually specify connection graph location; moved connection graph script to separate folder; added test to validate connection graph loading * fixed bandwidth computation in min time device selection policy * added interface to restrict device selection to a subset of devices; changed DeviceList impl to use List instead of array * fixed round robin with specific devices; added tests for it * added filtered device selection policies * added new parent stream selection policy * added test for stream aware policy * added test for disjoint parent policy * fixed oob error when reading connection graph and the number of gpus to use is smaller than the number of gpus in the system * fixed connection_graph loading error in parsing csv * added test dataset for connection garph * [GrCUDA-96-4] multigpu device management (#33) * added location tracking inside abstractarray; added abstractdevice to distinguish cpu and gpu * added mocked tests to validate abstractarray location * fixed bug on post-pascal devices where CPU reads required unnecessary sync when a read-only GPU kernel is running * [GrCUDA-96-6] Stream policies for multi-GPU (#36) * adding streampolicy class; modified FIFO retrieval policy to retrieve any stream from the set * added device manager for multigpu * integrating multi-gpu stream policies * updated infrastructure for stream policy, added mocked classes, hidden current device selection in stream policy * added stream-aware device selection policy; refactored streampolicy/devicesmanager to create streams on multiple devices * adding tests for multigpu: added base case with 1 gpu * added test for multigpu image pipeline, mocked * added tests for mocked hits * added new tests for stream-aware policy * refactored gpu test for reuse in multigpu tests * added tests for multi gpu * added multigpu-disjoint policy for parent stream retrieval * added round robin and min transfer size device selection policies * added tests for round robin and multigpu-disjoint and min-transfer policies * added minmin/max transfer time policies for device selection * added script to generate connection graph * fixed oob access in min transfer time policy * added option to manually specify connection graph location; moved connection graph script to separate folder; added test to validate connection graph loading * fixed bandwidth computation in min time device selection policy * fixed things for pr. Added connection_graph_test.csv to git; * minr fix porting from 97-7 * Merge 96-8 on test-96-0 (#41) * moved all mocked computations to a different class * added mock vec benchmark * added connection graph dataset with 8 V100s; rounding bandwidth to floor to reduce randomness; added test for vec multigpu * added mocked b6ml * added mocked cg-B9 benchmark * added mmult mocked benchmark * added partioned z in b11, added kernel for preconditioning in b9; both in cuda * fixed zpartition in b11 cuda, added zpartition in b11 python * added preconditioning to b9 python * fixed device selection policy not using tostring * updated b11 mocked to use partitioned z * simplified round-robin policy, now we simply increase the internal state and do a % on the device list * added min data threshold to consider device for selection * added option to specify min data threshold * updated benchmark wrapper to new grcuda policies * updated python benchmark suite to use current multigpu options * restored options for experiments * updated nvprof wrapper * added connection graphs for 1 and 2 V100 * fixed wrappers * fixed prefetching in cuda being always on * fixed benchmarks in python * added v100 conneciton graph for 4 gpus * updated path of 8 v100 dataset, added command to create dataset dir in connection graph script * fixed const flags in py benchmarks multigpu * fixed kernel timing option being wrong in wrapper * updated wrapper for testing * reverted to GraalVM 21.2; fixed performance regression in DeviceArray access by putting logger to static final * replaced logging strings with lambdas * optimized init of b1m python * fixed init of b9 for large matrices * fixed parameters not being reset * fixed benchmark parameters * fixed init in b1. Irrelevant for benchmark performance * adding heatmap for gpu bandwidth * adde gpu bandwidth heatmap plot * updated heatmap, now it's smaller. Updated res loading code. Added loading of new grcuda results * updated loading of grcuda results * added grcuda plotting * added options for a100 to benchmark wrapper * Grcuda 96 9 more time logging (#42) * moved all mocked computations to a different class * added mock vec benchmark * added connection graph dataset with 8 V100s; rounding bandwidth to floor to reduce randomness; added test for vec multigpu * added mocked b6ml * added mocked cg-B9 benchmark * added mmult mocked benchmark * added partioned z in b11, added kernel for preconditioning in b9; both in cuda * fixed zpartition in b11 cuda, added zpartition in b11 python * added preconditioning to b9 python * fixed device selection policy not using tostring * updated b11 mocked to use partitioned z * simplified round-robin policy, now we simply increase the internal state and do a % on the device list * added min data threshold to consider device for selection * added option to specify min data threshold * updated benchmark wrapper to new grcuda policies * updated python benchmark suite to use current multigpu options * restored options for experiments * updated nvprof wrapper * added connection graphs for 1 and 2 V100 * fixed wrappers * fixed prefetching in cuda being always on * fixed benchmarks in python * added v100 conneciton graph for 4 gpus * updated path of 8 v100 dataset, added command to create dataset dir in connection graph script * fixed const flags in py benchmarks multigpu * fixed kernel timing option being wrong in wrapper * updated wrapper for testing * reverted to GraalVM 21.2; fixed performance regression in DeviceArray access by putting logger to static final * replaced logging strings with lambdas * optimized init of b1m python * fixed init of b9 for large matrices * fixed parameters not being reset * fixed benchmark parameters * fixed init in b1. Irrelevant for benchmark performance * adding heatmap for gpu bandwidth * adde gpu bandwidth heatmap plot * updated heatmap, now it's smaller. Updated res loading code. Added loading of new grcuda results * updated loading of grcuda results * added grcuda plotting * added options for a100 to benchmark wrapper * added logging for multiple computations on same deviceId * Merge 96-11 on test 96-0 (#43) * moved all mocked computations to a different class * added mock vec benchmark * added connection graph dataset with 8 V100s; rounding bandwidth to floor to reduce randomness; added test for vec multigpu * added mocked b6ml * added mocked cg-B9 benchmark * added mmult mocked benchmark * added partioned z in b11, added kernel for preconditioning in b9; both in cuda * fixed zpartition in b11 cuda, added zpartition in b11 python * added preconditioning to b9 python * fixed device selection policy not using tostring * updated b11 mocked to use partitioned z * simplified round-robin policy, now we simply increase the internal state and do a % on the device list * added min data threshold to consider device for selection * added option to specify min data threshold * updated benchmark wrapper to new grcuda policies * updated python benchmark suite to use current multigpu options * restored options for experiments * updated nvprof wrapper * added connection graphs for 1 and 2 V100 * fixed wrappers * fixed prefetching in cuda being always on * fixed benchmarks in python * added v100 conneciton graph for 4 gpus * updated path of 8 v100 dataset, added command to create dataset dir in connection graph script * fixed const flags in py benchmarks multigpu * fixed kernel timing option being wrong in wrapper * updated wrapper for testing * reverted to GraalVM 21.2; fixed performance regression in DeviceArray access by putting logger to static final * replaced logging strings with lambdas * optimized init of b1m python * fixed init of b9 for large matrices * fixed parameters not being reset * fixed benchmark parameters * fixed init in b1. Irrelevant for benchmark performance * adding heatmap for gpu bandwidth * adde gpu bandwidth heatmap plot * updated heatmap, now it's smaller. Updated res loading code. Added loading of new grcuda results * updated loading of grcuda results * added grcuda plotting * added options for a100 to benchmark wrapper * modified install script * fix to install.sh * updated install.sh to compute the interconnection graph * benchmark_wrapper set for V100 * updated benchmark_wrapper to retrieve connection_graph correctly * enabled min-transfer-size test in benchmark_wrapper * Merge 96-12 on test-96-0 (#44) * moved all mocked computations to a different class * added mock vec benchmark * added connection graph dataset with 8 V100s; rounding bandwidth to floor to reduce randomness; added test for vec multigpu * added mocked b6ml * added mocked cg-B9 benchmark * added mmult mocked benchmark * added partioned z in b11, added kernel for preconditioning in b9; both in cuda * fixed zpartition in b11 cuda, added zpartition in b11 python * added preconditioning to b9 python * fixed device selection policy not using tostring * updated b11 mocked to use partitioned z * simplified round-robin policy, now we simply increase the internal state and do a % on the device list * added min data threshold to consider device for selection * added option to specify min data threshold * updated benchmark wrapper to new grcuda policies * updated python benchmark suite to use current multigpu options * restored options for experiments * updated nvprof wrapper * added connection graphs for 1 and 2 V100 * fixed wrappers * fixed prefetching in cuda being always on * fixed benchmarks in python * added v100 conneciton graph for 4 gpus * updated path of 8 v100 dataset, added command to create dataset dir in connection graph script * fixed const flags in py benchmarks multigpu * fixed kernel timing option being wrong in wrapper * updated wrapper for testing * reverted to GraalVM 21.2; fixed performance regression in DeviceArray access by putting logger to static final * replaced logging strings with lambdas * optimized init of b1m python * fixed init of b9 for large matrices * fixed parameters not being reset * fixed benchmark parameters * fixed init in b1. Irrelevant for benchmark performance * adding heatmap for gpu bandwidth * adde gpu bandwidth heatmap plot * updated heatmap, now it's smaller. Updated res loading code. Added loading of new grcuda results * updated loading of grcuda results * added grcuda plotting * added options for a100 to benchmark wrapper * modified install script * fix to install.sh * updated install.sh to compute the interconnection graph * benchmark_wrapper set for V100 * updated benchmark_wrapper to retrieve connection_graph correctly * enabled min-transfer-size test in benchmark_wrapper * Added first version of dump scheduling graph functionality * fixing minor issues * fixing minor issues * Added MultiGPU support * Added option to export scheduling DAG true or false for now, path hardcoded * negative values bug fix * refactoring using java 8 streams * minor issue * minor issue * minor issue * minor issue * minor issue * modifying export path * fixing bug in export dag function * Export DAG to specific path the option now expects the path where to place the DAG file as value, if not specified, the DAG will not be exported * fixed ExportDAG option * Update README.md * visualization optimization * minor visualization optimization * minor visualization optimization * minor visualization optimization * Added documentation to GraphExport.java * removed hardcoded paths from tests * code cleanup, prepare to PR * Removed DAG export from tests (#46) * removed deprecation warning for arity exception * updated changelog * updated cuda benchmark suite with multi-gpu benchmarks * updated plotting code with multi-gpu code * minor fixed in plotting code * updated python benchmarks for multi gpu * minor cleanup * fixed benchmark tests in python, added temporary multigpu options to grcuda * added options for multi-gpu support * updated grcudaexecutioncontext to have grcudaoptionmap as input * more logging, added more policy enum for multigpu * fixed crash when shutting down grcuda and numgpus < totgpus; disabled cublas on old gpus and async scheduler; added multigpu API support to runtime * added multi-gpu option in context used in tests * GRCUDA-56 added optimized interface to load/build kernels on single/multi GPU * tests are properly skipped if the system configuration does not support them * added test for manual selection of multigpu * fixed manual gpu selection not working with async scheduler. Added tracking of currently active GPU in runtime * improved computationelement profiling interface * minor updates in naming, added default GPU id * removed unnecessary logging of timers * added location tracking inside abstractarray; added abstractdevice to distinguish cpu and gpu * added mocked tests to validate abstractarray location * fixed bug on post-pascal devices where CPU reads required unnecessary sync when a read-only GPU kernel is running * replaced 'isLastComputationArrayAccess' with device-level tracking of array updates * added fixme note on possible problem with array tracking * fixed scheduling of write array being skipped when a read-only GPU computation was ongoing * adding streampolicy class; modified FIFO retrieval policy to retrieve any stream from the set * added device manager for multigpu * [GrCUDA 96-1] update python benchmark suite for multi gpu (#30) * updated plotting code with multi-gpu code * minor fixed in plotting code * updated python benchmarks for multi gpu * minor cleanup * fixed benchmark tests in python, added temporary multigpu options to grcuda * [GrCUDA-96-2] integrate multi gpu scheduler (#31) * updated plotting code with multi-gpu code * minor fixed in plotting code * updated python benchmarks for multi gpu * minor cleanup * fixed benchmark tests in python, added temporary multigpu options to grcuda * added options for multi-gpu support * updated grcudaexecutioncontext to have grcudaoptionmap as input * more logging, added more policy enum for multigpu * integrating multi-gpu stream policies * replaced default cuda benchmark with ERR instead of B1 * removed unnecessary js file * updated infrastructure for stream policy, added mocked classes, hidden current device selection in stream policy * added stream-aware device selection policy; refactored streampolicy/devicesmanager to create streams on multiple devices * adding tests for multigpu: added base case with 1 gpu * added test for multigpu image pipeline, mocked * added tests for mocked hits * added new tests for stream-aware policy * refactored gpu test for reuse in multigpu tests * added tests for multi gpu * added multigpu-disjoint policy for parent stream retrieval * added round robin and min transfer size device selection policies * added tests for round robin and multigpu-disjoint and min-transfer policies * added minmin/max transfer time policies for device selection * added script to generate connection graph * fixed oob access in min transfer time policy * added option to manually specify connection graph location; moved connection graph script to separate folder; added test to validate connection graph loading * fixed bandwidth computation in min time device selection policy * added interface to restrict device selection to a subset of devices; changed DeviceList impl to use List instead of array * fixed round robin with specific devices; added tests for it * added filtered device selection policies * added new parent stream selection policy * added test for stream aware policy * added test for disjoint parent policy * fixed oob error when reading connection graph and the number of gpus to use is smaller than the number of gpus in the system * fixed connection_graph loading error in parsing csv * added test dataset for connection garph * [GrCUDA-96-4] multigpu device management (#33) * added location tracking inside abstractarray; added abstractdevice to distinguish cpu and gpu * added mocked tests to validate abstractarray location * fixed bug on post-pascal devices where CPU reads required unnecessary sync when a read-only GPU kernel is running * [GrCUDA-96-6] Stream policies for multi-GPU (#36) * adding streampolicy class; modified FIFO retrieval policy to retrieve any stream from the set * added device manager for multigpu * integrating multi-gpu stream policies * updated infrastructure for stream policy, added mocked classes, hidden current device selection in stream policy * added stream-aware device selection policy; refactored streampolicy/devicesmanager to create streams on multiple devices * adding tests for multigpu: added base case with 1 gpu * added test for multigpu image pipeline, mocked * added tests for mocked hits * added new tests for stream-aware policy * refactored gpu test for reuse in multigpu tests * added tests for multi gpu * added multigpu-disjoint policy for parent stream retrieval * added round robin and min transfer size device selection policies * added tests for round robin and multigpu-disjoint and min-transfer policies * added minmin/max transfer time policies for device selection * added script to generate connection graph * fixed oob access in min transfer time policy * added option to manually specify connection graph location; moved connection graph script to separate folder; added test to validate connection graph loading * fixed bandwidth computation in min time device selection policy * fixed things for pr. Added connection_graph_test.csv to git; * moved all mocked computations to a different class * added mock vec benchmark * added connection graph dataset with 8 V100s; rounding bandwidth to floor to reduce randomness; added test for vec multigpu * added mocked b6ml * added mocked cg-B9 benchmark * added mmult mocked benchmark * added partioned z in b11, added kernel for preconditioning in b9; both in cuda * fixed zpartition in b11 cuda, added zpartition in b11 python * added preconditioning to b9 python * fixed device selection policy not using tostring * updated b11 mocked to use partitioned z * simplified round-robin policy, now we simply increase the internal state and do a % on the device list * added min data threshold to consider device for selection * added option to specify min data threshold * updated benchmark wrapper to new grcuda policies * updated python benchmark suite to use current multigpu options * restored options for experiments * updated nvprof wrapper * added connection graphs for 1 and 2 V100 * fixed wrappers * fixed prefetching in cuda being always on * fixed benchmarks in python * added v100 conneciton graph for 4 gpus * updated path of 8 v100 dataset, added command to create dataset dir in connection graph script * fixed const flags in py benchmarks multigpu * fixed kernel timing option being wrong in wrapper * updated wrapper for testing * reverted to GraalVM 21.2; fixed performance regression in DeviceArray access by putting logger to static final * replaced logging strings with lambdas * optimized init of b1m python * fixed init of b9 for large matrices * fixed parameters not being reset * fixed benchmark parameters * fixed init in b1. Irrelevant for benchmark performance * adding heatmap for gpu bandwidth * adde gpu bandwidth heatmap plot * updated heatmap, now it's smaller. Updated res loading code. Added loading of new grcuda results * updated loading of grcuda results * added grcuda plotting * added options for a100 to benchmark wrapper * modified install script * fix to install.sh * updated install.sh to compute the interconnection graph * benchmark_wrapper set for V100 * updated benchmark_wrapper to retrieve connection_graph correctly * enabled min-transfer-size test in benchmark_wrapper * Added first version of dump scheduling graph functionality * fixing minor issues * fixing minor issues * Added MultiGPU support * Added option to export scheduling DAG true or false for now, path hardcoded * negative values bug fix * refactoring using java 8 streams * minor issue * minor issue * minor issue * minor issue * minor issue * modifying export path * fixing bug in export dag function * Export DAG to specific path the option now expects the path where to place the DAG file as value, if not specified, the DAG will not be exported * fixed ExportDAG option * Update README.md * visualization optimization * minor visualization optimization * minor visualization optimization * minor visualization optimization * Added documentation to GraphExport.java * removed hardcoded paths from tests * code cleanup, prepare to PR * minr fix porting from 97-7 * Merge 96-8 on test-96-0 (#41) * moved all mocked computations to a different class * added mock vec benchmark * added connection graph dataset with 8 V100s; rounding bandwidth to floor to reduce randomness; added test for vec multigpu * added mocked b6ml * added mocked cg-B9 benchmark * added mmult mocked benchmark * added partioned z in b11, added kernel for preconditioning in b9; both in cuda * fixed zpartition in b11 cuda, added zpartition in b11 python * added preconditioning to b9 python * fixed device selection policy not using tostring * updated b11 mocked to use partitioned z * simplified round-robin policy, now we simply increase the internal state and do a % on the device list * added min data threshold to consider device for selection * added option to specify min data threshold * updated benchmark wrapper to new grcuda policies * updated python benchmark suite to use current multigpu options * restored options for experiments * updated nvprof wrapper * added connection graphs for 1 and 2 V100 * fixed wrappers * fixed prefetching in cuda being always on * fixed benchmarks in python * added v100 conneciton graph for 4 gpus * updated path of 8 v100 dataset, added command to create dataset dir in connection graph script * fixed const flags in py benchmarks multigpu * fixed kernel timing option being wrong in wrapper * updated wrapper for testing * reverted to GraalVM 21.2; fixed performance regression in DeviceArray access by putting logger to static final * replaced logging strings with lambdas * optimized init of b1m python * fixed init of b9 for large matrices * fixed parameters not being reset * fixed benchmark parameters * fixed init in b1. Irrelevant for benchmark performance * adding heatmap for gpu bandwidth * adde gpu bandwidth heatmap plot * updated heatmap, now it's smaller. Updated res loading code. Added loading of new grcuda results * updated loading of grcuda results * added grcuda plotting * added options for a100 to benchmark wrapper * Grcuda 96 9 more time logging (#42) * moved all mocked computations to a different class * added mock vec benchmark * added connection graph dataset with 8 V100s; rounding bandwidth to floor to reduce randomness; added test for vec multigpu * added mocked b6ml * added mocked cg-B9 benchmark * added mmult mocked benchmark * added partioned z in b11, added kernel for preconditioning in b9; both in cuda * fixed zpartition in b11 cuda, added zpartition in b11 python * added preconditioning to b9 python * fixed device selection policy not using tostring * updated b11 mocked to use partitioned z * simplified round-robin policy, now we simply increase the internal state and do a % on the device list * added min data threshold to consider device for selection * added option to specify min data threshold * updated benchmark wrapper to new grcuda policies * updated python benchmark suite to use current multigpu options * restored options for experiments * updated nvprof wrapper * added connection graphs for 1 and 2 V100 * fixed wrappers * fixed prefetching in cuda being always on * fixed benchmarks in python * added v100 conneciton graph for 4 gpus * updated path of 8 v100 dataset, added command to create dataset dir in connection graph script * fixed const flags in py benchmarks multigpu * fixed kernel timing option being wrong in wrapper * updated wrapper for testing * reverted to GraalVM 21.2; fixed performance regression in DeviceArray access by putting logger to static final * replaced logging strings with lambdas * optimized init of b1m python * fixed init of b9 for large matrices * fixed parameters not being reset * fixed benchmark parameters * fixed init in b1. Irrelevant for benchmark performance * adding heatmap for gpu bandwidth * adde gpu bandwidth heatmap plot * updated heatmap, now it's smaller. Updated res loading code. Added loading of new grcuda results * updated loading of grcuda results * added grcuda plotting * added options for a100 to benchmark wrapper * added logging for multiple computations on same deviceId * Merge 96-11 on test 96-0 (#43) * moved all mocked computations to a different class * added mock vec benchmark * added connection graph dataset with 8 V100s; rounding bandwidth to floor to reduce randomness; added test for vec multigpu * added mocked b6ml * added mocked cg-B9 benchmark * added mmult mocked benchmark * added partioned z in b11, added kernel for preconditioning in b9; both in cuda * fixed zpartition in b11 cuda, added zpartition in b11 python * added preconditioning to b9 python * fixed device selection policy not using tostring * updated b11 mocked to use partitioned z * simplified round-robin policy, now we simply increase the internal state and do a % on the device list * added min data threshold to consider device for selection * added option to specify min data threshold * updated benchmark wrapper to new grcuda policies * updated python benchmark suite to use current multigpu options * restored options for experiments * updated nvprof wrapper * added connection graphs for 1 and 2 V100 * fixed wrappers * fixed prefetching in cuda being always on * fixed benchmarks in python * added v100 conneciton graph for 4 gpus * updated path of 8 v100 dataset, added command to create dataset dir in connection graph script * fixed const flags in py benchmarks multigpu * fixed kernel timing option being wrong in wrapper * updated wrapper for testing * reverted to GraalVM 21.2; fixed performance regression in DeviceArray access by putting logger to static final * replaced logging strings with lambdas * optimized init of b1m python * fixed init of b9 for large matrices * fixed parameters not being reset * fixed benchmark parameters * fixed init in b1. Irrelevant for benchmark performance * adding heatmap for gpu bandwidth * adde gpu bandwidth heatmap plot * updated heatmap, now it's smaller. Updated res loading code. Added loading of new grcuda results * updated loading of grcuda results * added grcuda plotting * added options for a100 to benchmark wrapper * modified install script * fix to install.sh * updated install.sh to compute the interconnection graph * benchmark_wrapper set for V100 * updated benchmark_wrapper to retrieve connection_graph correctly * enabled min-transfer-size test in benchmark_wrapper * remove execution DAG export from tests * {HOTFIX} Update install.sh * {HOTFIX} Update setup_machine_from_scratch.sh * {HOTFIX} Create setup_graalvm.sh * {HOTFIX} Update setup_machine_from_scratch.sh Install CUDA toolkit 11.7 instead of 11.4 * {HOTFIX} Update README.md * updated bindings and logging documentation * support for graalvm 21.3.2 * fix to setup script * {HOTFIX} Update CHANGELOG.md * {HOTFIX} Update CHANGELOG.md * support for graalvm 22.1.0 * moved to network version of cuda installer, removed unuseful binary * updated readme * updated documentation

Implemented the main benchmarks present in the Python suite in Java. The class Benchmark.java provides a template for future use cases. Created configuration files to easily adapt the benchmarks to different types of workloads. The suite is built as a Maven project, indeed running mvn test will execute all the benchmarks based on the configuration file of the appropriate GPU architecture.

* Each policy type has a separated class * Keep only retrieveImpl * Now TransferTimeDeviceSelectionPolicy extends DeviceSelectionPolicy * Delete previously commented methods and clean code * added license for each file

* Add B9M + updated config files * Minor fix - deleted replicated B9M entry

AlbertoParravicini added 30 commits April 12, 2020 10:39

updated readme and added install script

d85564c

added python pipeline example

b6d0caa

adding project notes

3ee93d5

updated project notes

e8ac831

added abstract base array class, adding grcuda execution context

7c75b5f

added kernel execution and collecting info about kernels

f97da0e

fixed indentation on github

8e0053e

adding execution dag

3773e91

adding kernel dependency computation, abstracted grcuda computational…

24b941f

… element

added more tests to dag

47b46b1

cleaned initialization of grcuda computational element

6b558c5

added active parameters-aware frontier computation

6bde54a

updated notes

a6f2754

moved cudaruntime inside grcudaexecutioncontext

40b0415

separating kernel scheduling from execution, added generic execute an…

837911f

…d schedule interfaces to grcudacomputatioanlelement

added initial cuda stream support

4bbf114

added stream destroy and synchronize, removed unnecessary streamcreat…

3c1749f

…e function

adding stream manager. removed some deprecation warnings

bee17ee

added stream selection. modified computational element to hold stream…

30fd61b

… and isExecuted; using builder for kernel config

adding multithreaded execution of DAG, not working yet

03d2780

added sync execution on multiple streams

48ed234

added log files to gitignore

1c3df6f

cleaned stream initialization

fe39d64

adding grcuda computational elements for read/write of devicearray

29327ed

added & exposed streamattach function

7d78e79

added automatic array-stream association in stream manager

9f4924a

added more python pipelines examples

89ac03f

disable automatically stream attach in post pascal gpus

3952d6c

moved pre/post pascal array stream mapping to separate interface

21d036d

added multidim read/write comp elem

2f84029

gwdidonato and others added 30 commits August 25, 2021 10:01

[GRCUDA-4] replaced grCUDA with GrCUDA in all files (exception: varia…

f83b562

…ble names starting with grCUDA, i.e. grCUDAExecutionContext) and filenames (actually no file or folder had grCUDA in their names) (#6)

deleted data folder; moved notes into docs

ab3ba51

Grcuda 41 repo for dataset (#7)

1ea2e15

* added submodule for grcuda-data * removed docs files that have been moved

Removed wrong import of ValueException (#8)

e2662b3

[GrCUDA-42] added basic TruffleLogger support (#11)

15f9e96

* added basic logging facility, restructured tests to use logger with minimal messages * added 'assumeTrue' in skippable cuML test * fixed rebase errors; removed prints in tests

restored device array copy function (#15)

4e5b000

fixed installation for 8+ not being done in the right folder (#18)

311ea3e

[GrCUDA-HOTFIX] fix install and benchmarks (#20)

9c9a357

* fixed install dir (GRCUDA-67) * fixed python benchmarks not creating nested folders and not using experimental options (GRCUDA-68) * updated make for cuda benchmarks and readme (GRCUDA-68)

[GRCUDA-hotfix] updated install script to support nvswitch (#22)

426b03d

* updated install script to support nvswitch * updated nvswitch flag

[GrCUDA-66] fix arity deprecation (#27)

920f52c

* removed deprecation warning for arity exception * updated changelog

[GrCUDA-82] support for graalvm 21.3 (#35)

a3e7bfe

* updated install for graal 21.3 * updated changelog for release 2 * updated url of grcuda repo in installation script

Updated graal and mx commit in readme to match setup script (#48)

e224629

Updated default config for V100 and A100 (#51)

84b07b4

[GRCUDA-hotfix] move B9M into the Java benchmark suite (#52)

ca826df

* Add B9M + updated config files * Minor fix - deleted replicated B9M entry

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GrCUDA 0.2.1 - Transparent asynchronous scheduling and more #43

GrCUDA 0.2.1 - Transparent asynchronous scheduling and more #43

AlbertoParravicini commented Oct 5, 2021

GrCUDA 0.2.1 - Transparent asynchronous scheduling and more #43

Are you sure you want to change the base?

GrCUDA 0.2.1 - Transparent asynchronous scheduling and more #43

Conversation

AlbertoParravicini commented Oct 5, 2021

API Changes

New asynchronous scheduler

New features

Demos, benchmarks and code samples

Miscellaneous