GrCUDA 0.2.1 - Transparent asynchronous scheduling and more #43

Open
wants to merge 246 commits into
base: master
246 commits
d85564c
updated readme and added install script
AlbertoParravicini Apr 12, 2020
b6d0caa
added python pipeline example
AlbertoParravicini Apr 12, 2020
3ee93d5
adding project notes
AlbertoParravicini Apr 13, 2020
e8ac831
updated project notes
AlbertoParravicini Apr 13, 2020
7c75b5f
added abstract base array class, adding grcuda execution context
AlbertoParravicini Apr 13, 2020
f97da0e
added kernel execution and collecting info about kernels
AlbertoParravicini Apr 13, 2020
8e0053e
fixed indentation on github
AlbertoParravicini Apr 13, 2020
3773e91
adding execution dag
AlbertoParravicini Apr 14, 2020
24b941f
adding kernel dependency computation, abstracted grcuda computational…
AlbertoParravicini Apr 14, 2020
47b46b1
added more tests to dag
AlbertoParravicini Apr 15, 2020
6b558c5
cleaned initialization of grcuda computational element
AlbertoParravicini Apr 15, 2020
6bde54a
added active parameters-aware frontier computation
AlbertoParravicini Apr 15, 2020
a6f2754
updated notes
AlbertoParravicini Apr 17, 2020
40b0415
moved cudaruntime inside grcudaexecutioncontext
AlbertoParravicini Apr 18, 2020
837911f
separating kernel scheduling from execution, added generic execute an…
AlbertoParravicini Apr 18, 2020
4bbf114
added initial cuda stream support
AlbertoParravicini Apr 19, 2020
3c1749f
added stream destroy and synchronize, removed unnecessary streamcreat…
AlbertoParravicini Apr 19, 2020
bee17ee
adding stream manager. removed some deprecation warnings
AlbertoParravicini Apr 19, 2020
30fd61b
added stream selection. modified computational element to hold stream…
AlbertoParravicini Apr 20, 2020
03d2780
adding multithreaded execution of DAG, not working yet
AlbertoParravicini Apr 20, 2020
48ed234
added sync execution on multiple streams
AlbertoParravicini Apr 22, 2020
1c3df6f
added log files to gitignore
AlbertoParravicini Apr 22, 2020
fe39d64
cleaned stream initialization
AlbertoParravicini Apr 22, 2020
29327ed
adding grcuda computational elements for read/write of devicearray
AlbertoParravicini Apr 22, 2020
7d78e79
added & exposed streamattach function
AlbertoParravicini Apr 23, 2020
9f4924a
added automatic array-stream association in stream manager
AlbertoParravicini Apr 23, 2020
89ac03f
added more python pipelines examples
AlbertoParravicini Apr 23, 2020
3952d6c
disable automatically stream attach in post pascal gpus
AlbertoParravicini Apr 23, 2020
21d036d
moved pre/post pascal array stream mapping to separate interface
AlbertoParravicini Apr 24, 2020
2f84029
added multidim read/write comp elem
AlbertoParravicini Apr 24, 2020
7dacf2a
optimized array accesses with fast path to skip scheduling if not nec…
AlbertoParravicini Apr 24, 2020
4b173ff
added python benchmarking
AlbertoParravicini Apr 26, 2020
0936f5f
added 2 benchmarks and random init
AlbertoParravicini Apr 26, 2020
9d5fa75
updated readme
AlbertoParravicini Apr 27, 2020
3364c72
updatnig notes
AlbertoParravicini Apr 27, 2020
fe9e091
updated notes
AlbertoParravicini Apr 27, 2020
1f7905b
Update NOTES.md
AlbertoParravicini Apr 27, 2020
17cafcc
temporarily changed printf to provide human readable output
AlbertoParravicini Apr 27, 2020
58071b8
Merge branch 'execution-model-sync' of github.com:AlbertoParravicini/…
AlbertoParravicini Apr 27, 2020
a63618f
fixed array read/write optimization not being propagated to parent ar…
AlbertoParravicini Apr 27, 2020
bbd5612
added synchronous execution policy
AlbertoParravicini Apr 28, 2020
3638e86
added bookkeeping to sync exec
AlbertoParravicini Apr 28, 2020
56972c6
added option to disable cpu validation in benchmarks
AlbertoParravicini Apr 28, 2020
bf47aa6
added 1 more benchmark, fixed benchmark options
AlbertoParravicini Apr 29, 2020
3b79ae0
wrapping kernel arguments to store if const or array
AlbertoParravicini Apr 30, 2020
e61862a
moved dependency computation to separate class, adding read-only dep …
AlbertoParravicini May 1, 2020
2295d91
added separate dependency builder
AlbertoParravicini May 1, 2020
679feb7
removed kernels using manual streams from dag scheduling
AlbertoParravicini May 3, 2020
4a1d8d0
fixed inconsistency in frontier computation, added tests for synced s…
AlbertoParravicini May 3, 2020
ab61157
tracking active comptuations for each stream
AlbertoParravicini May 3, 2020
c2c9e05
removed sync if no computation is active; moved default stream to sin…
AlbertoParravicini May 6, 2020
b2a421a
updated naming of mock test classes
AlbertoParravicini May 6, 2020
5af8cb7
added more tests for const dependency streams
AlbertoParravicini May 6, 2020
f61757f
added lifo stream retrieval policy, added option to specify policy
AlbertoParravicini May 6, 2020
30556b6
fixed empty computation map not being tracked correctly
AlbertoParravicini May 6, 2020
c1ced1a
added tests for retrive stream policies
AlbertoParravicini May 7, 2020
7c7fab0
const arrays are kept on default stream
AlbertoParravicini May 7, 2020
f31b039
fixed host array access skipping sync if the array is on global strea…
AlbertoParravicini May 7, 2020
558b19c
added new tests for sync const dag
AlbertoParravicini May 7, 2020
4efe7fb
removed unnecessary dependencies in DAG creation
AlbertoParravicini May 8, 2020
57246de
added strategy to handle redundant dependencies based on dependency c…
AlbertoParravicini May 9, 2020
6b5f703
adding complex ensemble benchmark
AlbertoParravicini May 9, 2020
a7a08ad
updated ensemble benchmark
AlbertoParravicini May 9, 2020
a98b9cb
updated notes
AlbertoParravicini May 9, 2020
b39e388
adding strategy to compute parent stream
AlbertoParravicini May 18, 2020
ad35851
added disjoint arg set stream retrieval
AlbertoParravicini May 19, 2020
93a83dd
modified async exec model with CompletableFuture, still not working
AlbertoParravicini May 24, 2020
fb582b6
temporarily switched to java 11; fixed multiple args not working with…
AlbertoParravicini May 27, 2020
941022c
removed unneccesary stream sync
AlbertoParravicini Jun 2, 2020
e4b0641
fixed streams being synced more than once; fixed const arg visibility…
AlbertoParravicini Jun 2, 2020
9118bbb
switched back to java 8
AlbertoParravicini Jun 2, 2020
15357fc
fixed stream attach being done on const arrays when not necessary
AlbertoParravicini Jun 2, 2020
f2754a5
added cuda event api
AlbertoParravicini Jun 2, 2020
5f35701
added cuda event functions in the internal runtime API
AlbertoParravicini Jun 2, 2020
5023e67
replaced kernel sync with cuda events
AlbertoParravicini Jun 2, 2020
2be2c99
updated stream test to use new sync computation; fixed parent-of-pare…
AlbertoParravicini Jun 3, 2020
5d02151
fixed not setting as finished parent computations using a different s…
AlbertoParravicini Jun 3, 2020
ae088e4
fixed fifo stream retrieval adding free streams more than once
AlbertoParravicini Jun 3, 2020
d22d75d
adding hits benchmark
AlbertoParravicini Jun 3, 2020
bcbbcee
added image pipeline benchmark
AlbertoParravicini Jun 8, 2020
dcc3f87
fixed sync of parent streams
AlbertoParravicini Jun 8, 2020
cd18e3b
added mock tests using same structure as complex becnhmarks
AlbertoParravicini Jun 8, 2020
90159a9
updated notes
AlbertoParravicini Jun 9, 2020
45c725b
updated readme with dag settings
AlbertoParravicini Jun 9, 2020
8430ec7
adding benchmark results
AlbertoParravicini Jun 9, 2020
077a435
added benchmark results
AlbertoParravicini Jun 9, 2020
264a3e2
updated bench8 with faster kernels
AlbertoParravicini Jun 11, 2020
13c72be
updated benchmarks to use grid-stride
AlbertoParravicini Jun 16, 2020
8e63eb2
updated benchmark results
AlbertoParravicini Jun 16, 2020
674448d
updated benchmark results
AlbertoParravicini Jun 16, 2020
34a2d23
modified event-based sync so that each kernel has an associated event
AlbertoParravicini Jun 17, 2020
0bee654
added support for device array copy as dag element
AlbertoParravicini Jun 17, 2020
ee8129c
added support for memcpy on multidim arrays
AlbertoParravicini Jun 17, 2020
14fd3e2
added block size as benchmark parameter
AlbertoParravicini Jun 19, 2020
7ce4cf1
added bench wrapper
AlbertoParravicini Jun 19, 2020
c380046
added output file name to benchmark wrapper; changed grcuda options t…
AlbertoParravicini Jun 19, 2020
aaf65c5
benchmark wrapper results are stored in a unique folder for each run
AlbertoParravicini Jun 20, 2020
28ee6a2
fixed wrapper overriding names and multiple iterations not being stor…
AlbertoParravicini Jun 20, 2020
fd30f20
added plot with scalability
AlbertoParravicini Jun 20, 2020
a10916b
added new scalability plot
AlbertoParravicini Jun 21, 2020
14c7b29
adding baseline cuda benchmarks and plots
AlbertoParravicini Jun 21, 2020
a749db3
added image cuda benchmarks
AlbertoParravicini Jun 21, 2020
e15a60c
added new plots
AlbertoParravicini Jun 21, 2020
48afa1b
added new benchmark results
AlbertoParravicini Jun 22, 2020
364e016
added png plots
AlbertoParravicini Jun 23, 2020
22d4473
added images to results
AlbertoParravicini Jun 23, 2020
93de41a
Update RESULTS.md
AlbertoParravicini Jun 23, 2020
fc87ca9
updated plots
AlbertoParravicini Jun 23, 2020
e9870d6
updated plots, adding ridgeplot
AlbertoParravicini Jun 23, 2020
194a68c
updated plots and benchmarks now use Java nanotime for better accuracy
AlbertoParravicini Jul 1, 2020
ea04b61
adding bs benchmark
AlbertoParravicini Jul 2, 2020
22385be
fixed memcpy not working correctly with arrays with stream-restricted…
AlbertoParravicini Jul 2, 2020
f76ca4d
added option to skip phase timing in benchmarks
AlbertoParravicini Jul 5, 2020
bfd1be6
added makefile for cuda examples
AlbertoParravicini Jul 5, 2020
b9e9f90
updated benchmarks and plots
AlbertoParravicini Jul 6, 2020
2c4ac35
added baseline times to 1-row plots
AlbertoParravicini Jul 8, 2020
c2a996d
updated results for b7 and b8
AlbertoParravicini Jul 10, 2020
fbd6a48
Update RESULTS.md
AlbertoParravicini Jul 10, 2020
4f68351
updated test for b1 and b5, fixed with-const policy non being applied…
AlbertoParravicini Jul 14, 2020
290686a
fixed dates in results
AlbertoParravicini Jul 14, 2020
57ee19c
updated results
AlbertoParravicini Jul 16, 2020
6dbfb9f
added theoretical speed plot
AlbertoParravicini Jul 16, 2020
aaf9557
added support for nvprof profiling in benchmarks
AlbertoParravicini Jul 20, 2020
b1c2117
added computation overlap analysis
AlbertoParravicini Jul 21, 2020
77e90a6
fixed overlap plot
AlbertoParravicini Jul 21, 2020
e0f3093
added speedup to overlap plot
AlbertoParravicini Jul 21, 2020
eb515dd
updated palette
AlbertoParravicini Jul 21, 2020
742bfaf
added device memory analysis
AlbertoParravicini Jul 28, 2020
3b84093
fixed missing plot
AlbertoParravicini Jul 28, 2020
2c49811
Update RESULTS.md
AlbertoParravicini Jul 28, 2020
59f572b
updated metric plot with l2 and ipc
AlbertoParravicini Jul 28, 2020
1087fb6
added dl benchmark and results
AlbertoParravicini Aug 5, 2020
6130906
updated dl benchmark
AlbertoParravicini Aug 5, 2020
1a6666e
addign plots for paper
AlbertoParravicini Aug 12, 2020
5943749
updated plots
AlbertoParravicini Aug 12, 2020
d381972
updated plots
AlbertoParravicini Aug 12, 2020
5e33e33
updated plots
AlbertoParravicini Aug 13, 2020
465432d
updated plots
AlbertoParravicini Aug 15, 2020
488ac25
added plot for gigaflops
AlbertoParravicini Aug 17, 2020
ebfcfc3
fixed plots
AlbertoParravicini Aug 19, 2020
9ebf755
updated first plot
AlbertoParravicini Aug 22, 2020
61e3f23
updated theoretical speed plot title
AlbertoParravicini Aug 25, 2020
75cc7e1
added default-only theoretical speed plots
AlbertoParravicini Aug 26, 2020
2389475
added default-only theoretical speed plots - 2
AlbertoParravicini Aug 26, 2020
0fb7c40
added option to force stream attach
AlbertoParravicini Sep 3, 2020
4be33c3
added default num blocks in benchmarks
AlbertoParravicini Sep 3, 2020
3cd60f5
benchmarks block sizes do not require full reinit now
AlbertoParravicini Sep 7, 2020
2d499cc
added option to load graph from pickle in b7
AlbertoParravicini Sep 7, 2020
4ca037e
added script tp generate graphs for B7
AlbertoParravicini Sep 8, 2020
67b7028
updated graph generation
AlbertoParravicini Sep 8, 2020
83574e3
modified b7 to load json
AlbertoParravicini Sep 9, 2020
b68ce13
modified init of b7
AlbertoParravicini Sep 9, 2020
407bb63
fixed reinit being done for different block sizes
AlbertoParravicini Sep 9, 2020
e2049bc
updated cuda benchmarks for P100
AlbertoParravicini Sep 11, 2020
f6b0438
fixed realloc being applied when prevent_reinit is true
AlbertoParravicini Sep 11, 2020
28f4b4b
minor updates in benchmarks
AlbertoParravicini Sep 13, 2020
a3a49a0
fixed block size not being updated
AlbertoParravicini Sep 14, 2020
acc07ae
adding plots for p100
AlbertoParravicini Sep 15, 2020
0b7916b
updated plotting code
AlbertoParravicini Sep 17, 2020
a5efb62
updated b8 and b10
AlbertoParravicini Sep 18, 2020
7821e76
updated cuda benchmark suite
AlbertoParravicini Sep 23, 2020
40f8208
adding cudagraph benchmarks
AlbertoParravicini Sep 24, 2020
5912623
added cudagraph b6 benchmarks
AlbertoParravicini Sep 24, 2020
f75897e
added cudagraph b8/10 benchmarks
AlbertoParravicini Sep 24, 2020
7f81c9b
added single-cudastream cudagraph benchmarks
AlbertoParravicini Sep 24, 2020
a7f4c1b
added prefetch to b1 and b5
AlbertoParravicini Sep 26, 2020
aad39d2
updated readme
AlbertoParravicini Sep 28, 2020
2cb239a
added option to force prefetching in sync execution
AlbertoParravicini Sep 29, 2020
b9b97b4
fixed benchmark sizes
AlbertoParravicini Sep 29, 2020
7f6017e
fixed name in api of prefetching
AlbertoParravicini Sep 29, 2020
e57b89f
added prefetch option to wrapper
AlbertoParravicini Sep 29, 2020
debfcab
added option to disable/enable prefetching in cuda baseline
AlbertoParravicini Sep 30, 2020
9d6cd41
added option to disable/enable prefetching in cuda test for wrapper
AlbertoParravicini Sep 30, 2020
e5e52ef
fixed stream attach in cuda benchmark
AlbertoParravicini Oct 4, 2020
68f9b93
fixed prefetch not working with sync policy
AlbertoParravicini Oct 4, 2020
326cdc2
added sync policy for prefetcher
AlbertoParravicini Oct 4, 2020
85162c2
updated nvprof wrapper to Turing
AlbertoParravicini Oct 5, 2020
c165497
fixed prefetcher being active by default
AlbertoParravicini Oct 6, 2020
06da973
Merge branch 'execution-model-sync' of github.com:AlbertoParravicini/…
AlbertoParravicini Oct 6, 2020
8008761
added block num option in benchmarks
AlbertoParravicini Oct 6, 2020
a648a44
updated wrapper scripts
AlbertoParravicini Oct 7, 2020
d435b4a
fixed exec time inconsistencies in b7
AlbertoParravicini Oct 9, 2020
0cdcc6a
updated wrapper scripts
AlbertoParravicini Oct 14, 2020
3b8e1c3
updated plotting
AlbertoParravicini Oct 14, 2020
266dd20
updated readme
AlbertoParravicini Oct 14, 2020
81839e9
updated readme
AlbertoParravicini Nov 3, 2020
3e08e2f
Merge branch 'execution-model-stable' of github.com:AlbertoParravicin…
AlbertoParravicini Nov 3, 2020
55a10a8
fixing conflicts
AlbertoParravicini Nov 3, 2020
47724d8
updated code with master nvidia
AlbertoParravicini Nov 3, 2020
3465a0d
refactor master integration
AlbertoParravicini Nov 4, 2020
a5c7a82
removed unnecessary files
AlbertoParravicini Nov 4, 2020
0dfa953
added option to use clang in makefile for cuda tests
AlbertoParravicini Nov 6, 2020
86962c5
removed stamp NIDL file that prevented compilation
AlbertoParravicini Nov 10, 2020
4ae90e1
Update README.md
AlbertoParravicini Mar 16, 2021
b79e911
bumped support to java11; removed wrong import in StreamManagerTest
AlbertoParravicini May 13, 2021
ad5eddf
added script to configure machine from scratch to use grcuda/graal. U…
AlbertoParravicini May 13, 2021
e774cc9
removed hardcoded nvvc path in test
AlbertoParravicini May 16, 2021
e3e3b3b
added graal commit checkout to setup
AlbertoParravicini May 24, 2021
2093925
changed grcuda from http to ssh in config script
AlbertoParravicini May 24, 2021
5370a8b
[GRCUDA-7] removed deprecations from grcuda exceptions
AlbertoParravicini Jul 19, 2021
31081d7
GRCUDA-7 removed deprecated isObjectOfLanguage from GrCUDALanguage
AlbertoParravicini Jul 19, 2021
85f35ec
GRCUDA-7 removed deprecation from NVRTCException
AlbertoParravicini Jul 19, 2021
2427863
GRCUDA-7 updated java8 to java11 in readme
AlbertoParravicini Jul 19, 2021
83115fe
[GrCUDA-7] update to java 11 and remove deprecations (#2)
AlbertoParravicini Jul 19, 2021
afa8252
Merge branch 'master' of https://github.com/NVIDIA/grcuda
AlbertoParravicini Jul 22, 2021
806a27e
Merge branch 'master' of github.com:AlbertoParravicini/grcuda
AlbertoParravicini Jul 22, 2021
76a7716
[GRCUDA-37] updated documentation to install grcuda in intellij. Forc…
AlbertoParravicini Jul 26, 2021
bd4abb1
Merge pull request #4 from AlbertoParravicini/grcuda-37-java11-compat…
AlbertoParravicini Jul 26, 2021
408724f
added py script to validate correctness of img pipeline
AlbertoParravicini Jul 26, 2021
efd2b76
removed outdated multithreaded grcuda context
AlbertoParravicini Aug 7, 2021
2a91520
added field to GrCudaExecutionContext that identifies the execution p…
AlbertoParravicini Aug 7, 2021
425b1bf
added py script to validate correctness of img pipeline
AlbertoParravicini Jul 26, 2021
c6ca341
removed outdated multithreaded grcuda context
AlbertoParravicini Aug 7, 2021
217c4bb
added field to GrCudaExecutionContext that identifies the execution p…
AlbertoParravicini Aug 7, 2021
32d1fee
Merge branch 'GRCUDA-13-remove-multithreaded-context' of github.com:A…
AlbertoParravicini Aug 7, 2021
f8cb9f9
Merge pull request #5 from AlbertoParravicini/GRCUDA-13-remove-multit…
AlbertoParravicini Aug 7, 2021
f83b562
[GRCUDA-4] replaced grCUDA with GrCUDA in all files (exception: varia…
gwdidonato Aug 25, 2021
ab3ba51
deleted data folder; moved notes into docs
AlbertoParravicini Aug 30, 2021
1ea2e15
Grcuda 41 repo for dataset (#7)
AlbertoParravicini Aug 30, 2021
e2662b3
Removed wrong import of ValueException (#8)
AlbertoParravicini Aug 31, 2021
21f25ac
Demo For SeptembRSE Conference (#14)
lnghrdntcr Sep 12, 2021
deeea56
[GRCUDA-2] missing support for cublas/cuml/tensorrt libraries (#9)
AlbertoParravicini Sep 13, 2021
4645edd
[GrCUDA-45] replace default option names (#10)
AlbertoParravicini Sep 14, 2021
15f9e96
[GrCUDA-42] added basic TruffleLogger support (#11)
AlbertoParravicini Sep 14, 2021
f18c46e
[GrCUDA-14] fixed memcpy on arrays with column-major memory layout (#12)
AlbertoParravicini Sep 14, 2021
4e5b000
restored device array copy function (#15)
AlbertoParravicini Sep 14, 2021
70436da
[GrCUDA-17] fixed devicearray-to-devicearray memcpy not computing dat…
AlbertoParravicini Sep 14, 2021
ab27a67
[GrCUDA-58] Refactored package hierarchy (#16)
AlbertoParravicini Sep 15, 2021
1c23b24
[GrCUDA-21] Documentation update (#17)
AlbertoParravicini Sep 20, 2021
311ea3e
fixed installation for 8+ not being done in the right folder (#18)
AlbertoParravicini Sep 21, 2021
d63678d
Grcuda 61 cleanup for release 1 (#19)
AlbertoParravicini Sep 23, 2021
9c9a357
[GrCUDA-HOTFIX] fix install and benchmarks (#20)
AlbertoParravicini Sep 26, 2021
c41af85
[GRCUDA-32] Missing support to libraries (cublas, cuml) - Async execu…
luisacicolini Oct 14, 2021
426b03d
[GRCUDA-hotfix] updated install script to support nvswitch (#22)
AlbertoParravicini Oct 18, 2021
c45e6c9
[GrCUDA-48] Logging (#23)
gwdidonato Nov 9, 2021
725a045
[GrCUDA-50] option map (#24)
sarahnastasi Nov 15, 2021
fd93a0c
Grcuda 43 support kernel timers (#26)
merklefruit Nov 18, 2021
f6a457d
Grcuda 85 support cusparse (#25)
luisacicolini Dec 8, 2021
920f52c
[GrCUDA-66] fix arity deprecation (#27)
AlbertoParravicini Dec 11, 2021
a3e7bfe
[GrCUDA-82] support for graalvm 21.3 (#35)
AlbertoParravicini Jan 3, 2022
fe830ac
GrCUDA MultiGPU Release (#47)
gwdidonato Jun 29, 2022
e224629
Updated graal and mx commit in readme to match setup script (#48)
gwdidonato Jun 29, 2022
85dcf30
Added Java implementation of the benchmark suite (#49)
ian-ofgod Sep 21, 2022
7208790
Grcuda 132 refactor deviceselectionpolicy in grcudastreampolicy (#50)
DavideMaffi Mar 20, 2023
84b07b4
Updated default config for V100 and A100 (#51)
ian-ofgod Jul 18, 2023
ca826df
[GRCUDA-hotfix] move B9M into the Java benchmark suite (#52)
ian-ofgod Jul 19, 2023
14 changes: 14 additions & 0 deletions .gitignore
@@ -40,6 +40,13 @@ mx.grcuda/eclipse-launches
/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDAListener.java
/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDAParser.java
/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/GrCUDAVisitor.java
**.log
/scratch
**.nvvp
projects/resources/cuda/bin
data/results/*
data/nvprof_log/*
data/pickle/*
/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.g4.stamp
/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.interp
/projects/com.nvidia.grcuda/src/com/nvidia/grcuda/parser/antlr/NIDL.tokens
@@ -50,3 +57,10 @@ mx.grcuda/eclipse-launches
tensorrt/build
examples/tensorrt/python/logs
examples/tensorrt/cpp/build
venv
out/
*.files
*.csv
grcuda_token.txt
projects/demos/image_pipeline/cuda/build
projects/demos/image_pipeline/img_out
8 changes: 8 additions & 0 deletions .gitmodules
@@ -0,0 +1,8 @@
[submodule "grcuda-data"]
path = grcuda-data
url = https://github.com/AlbertoParravicini/grcuda-data.git
branch = master
[submodule "projects/resources/python/plotting/segretini_matplottini"]
path = projects/resources/python/plotting/segretini_matplottini
url = git@github.com:AlbertoParravicini/segretini-matplottini.git
branch = master
189 changes: 189 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,189 @@
# 2022-06-01

* Added scheduling DAG export functionality. It is now possible to retrieve a graphical version of the scheduling DAG of the execution by adding `ExportDAG` to the startup options. The graph is exported in .dot format to the path specified by the user as the option argument.
* This information can be leveraged to better understand the achieved runtime performance and to compare the schedules derived from different policies. Moreover, poorly written applications will result in DAGs with a low level of task-parallelism regardless of the selected policy, suggesting that designers should change their applications' logic.
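
To illustrate what a .dot export of a scheduling DAG can look like, here is a minimal Python sketch. It is not GrCUDA's exporter: the function names, the vertex labels, and the edge labels are hypothetical, and the per-level width is only a rough proxy for the task-parallelism mentioned above.

```python
def dag_to_dot(vertices, edges, name="dag"):
    """Render a DAG as Graphviz DOT text.

    vertices: list of vertex labels (hypothetical kernel names).
    edges: list of (parent, child, label) tuples; label names the array
           that creates the dependency.
    """
    lines = [f"digraph {name} {{"]
    for v in vertices:
        lines.append(f'    "{v}";')
    for parent, child, label in edges:
        lines.append(f'    "{parent}" -> "{child}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


def width_per_level(vertices, edges):
    """Count the vertices at each depth of the DAG (roots have depth 0)."""
    parents = {v: [] for v in vertices}
    for parent, child, _ in edges:
        parents[child].append(parent)
    depth = {}

    def visit(v):
        # Depth is one more than the deepest parent; roots get 0.
        if v not in depth:
            depth[v] = 1 + max((visit(p) for p in parents[v]), default=-1)
        return depth[v]

    counts = {}
    for v in vertices:
        d = visit(v)
        counts[d] = counts.get(d, 0) + 1
    return [counts[d] for d in sorted(counts)]


if __name__ == "__main__":
    vs = ["K1", "K2", "K3"]
    es = [("K1", "K3", "x"), ("K2", "K3", "y")]
    print(dag_to_dot(vs, es))
    print(width_per_level(vs, es))
```

A DAG whose levels all have width 1 is a chain, so no scheduling policy can overlap its computations.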


# 2022-04-15

* Updated install.sh to compute the interconnection graph
* Updated benchmark_wrapper to retrieve connection_graph correctly
* Enabled min-transfer-size test in benchmark_wrapper
* Benchmark_wrapper set for V100

# 2022-02-16

* Added logging for multiple computations (List of floats) on the same deviceID. This information could be used in future history-based adaptive scheduling policies.

# 2022-01-26

* Added mocked benchmarks: for each multi-GPU benchmark in our suite, there is a mocked version where we check that the GPU assignment is the one we expect. Added utility functions to easily test mocked benchmarks.
* Simplified the round-robin device selection policy: it works more or less as before, but it is faster to update when using a subset of devices.
* Added a threshold parameter for data-aware device selection policies. When using min-transfer-size or minmax/min-transfer-time, consider only devices that have at least 10% (or X%) of the requested data. If a device has only a very small amount of data already available, it is not worth preferring it to other devices, and doing so can cause scheduling to converge to a single device. See B9 and B11, for example.
* Updated the Python benchmark suite to use the new options, and optimized the initialization of B1 (it is faster now) and B9 (it didn't work on matrices with 50K rows, as Python uses 32-bit array indexing).
* Fixed a performance regression in DeviceArray access. For a simple Python program that writes 160M values to a DeviceArray, execution time went from 4 sec to 20 sec when using GraalVM 21.3 instead of 21.2. Reverted GraalVM to 21.2. Using a non-static final Logger in GrCUDAComputationalElement increased the time from 4 sec to 130 sec (not sure why, they are not created in repeated array accesses): fixed this regression.
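
The threshold rule for data-aware device selection can be sketched as follows. This is an illustrative Python model, not the GrCUDA implementation: the function name, the dictionary input, and the `fallback` parameter (standing in for whatever policy is used when no device qualifies) are assumptions.

```python
def select_device(bytes_on_device, total_bytes, threshold=0.1, fallback=0):
    """Pick the device that requires the smallest data transfer.

    bytes_on_device: dict mapping device id -> bytes of the requested data
                     that are already up-to-date on that device.
    Devices holding less than `threshold` of the requested data are ignored.
    If no device qualifies, `fallback` is returned (hypothetical stand-in
    for another policy, such as round-robin).
    """
    candidates = {
        dev: size
        for dev, size in bytes_on_device.items()
        if size >= threshold * total_bytes
    }
    if not candidates:
        return fallback
    # Most data already present means the smallest transfer;
    # ties are broken by the lowest device id.
    return min(candidates, key=lambda dev: (-candidates[dev], dev))
```

Without the threshold, a device holding a handful of bytes would always win over an empty one, which is how scheduling can collapse onto a single device.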

# 2022-01-14

* Modified the "new stream creation policy FIFO" to simply reuse an existing free stream, without using a FIFO policy. Using FIFO did not give any benefit (besides a more predictable stream assignment), but it was more complex (we needed both a set and a FIFO, now we just use a set for the free streams)
* Added device manager to track devices. This is mostly an abstraction layer over CUDARuntime, and allows retrieving the currently active GPU, or retrieving a specific device.
* DeviceManager is only a "getter", it cannot change the state of the system (e.g. it does not allow changing the current GPU)
* Compared to the original multi-GPU branch, we have cleaner separation. StreamManager has access to StreamPolicy, StreamPolicy has access to DeviceManager. StreamManager still has access to the runtime (for event creation, sync etc.), but we might completely hide CUDARuntime inside DeviceManager to have even more separation.
* Re-added script to build connection graph. We might want to call it automatically from grcuda if the output CSV is not found. Otherwise we need to update the documentation to tell users how to use the script
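
The simplified stream reuse described in the first item of this entry (a set of free streams, no FIFO) can be modeled with a few lines of Python. The class and method names are hypothetical and stream ids are plain integers; this is a sketch of the idea, not the StreamManager code.

```python
class StreamPool:
    """Hands out stream ids, reusing any free stream before creating a new one."""

    def __init__(self):
        self.free = set()   # streams with no active computation
        self.created = 0    # total number of streams created so far

    def acquire(self):
        if self.free:
            # Any free stream will do; which one is returned is unspecified.
            return self.free.pop()
        stream = self.created
        self.created += 1
        return stream

    def release(self, stream):
        """Mark a stream as free once its computations have finished."""
        self.free.add(stream)
```

Dropping the FIFO means stream assignment is less predictable, but only one data structure has to be kept consistent.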

# 2022-01-12

* Modified DeviceSelectionPolicy to select a device from a specified list of GPUs, instead of looking at all GPUs.
This is useful because when we want to reuse a parent's stream we have to choose among the devices used by the parents, instead of considering all devices.
* Added a new SelectParentStreamPolicy that finds the parents' streams that can be reused, and then looks for the best device among the devices where these streams are, instead of considering all the devices in the system as in the previous policy. The old policy is still available.

# 2021-12-21, Release 2

* Added support for GraalVM 21.3.
* Removed `ProfilableElement` Boolean flag, as it was always true.

# 2021-12-09

* Replaced the old isLastComputationArrayAccess with the new device tracking API
* The old isLastComputationArrayAccess was a performance optimization used to track if the last computation on an array was an access done by the CPU (the only existing CPU computations), to skip scheduling of further array accesses done by the CPU
* Implicitly, the API tracked if a certain array was up-to-date on the CPU or on the GPU (for a 1 GPU system).
* The new API that tracks locations of arrays completely covers the old API, making it redundant. If an array is up-to-date on the CPU, we can perform read/write without any ComputationalElement scheduling.
* Checking if an array is up-to-date on the CPU requires a hashset lookup. It might be optimized if necessary, using a tracking flag.
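
As a rough model of the location-tracking idea, the sketch below keeps a set of locations where an array is up-to-date and uses a set lookup to decide whether a CPU access needs scheduling. The `CPU` constant, the class, and its methods are hypothetical, and the update rules (a write invalidates other copies, a read adds a location) are simplifying assumptions rather than GrCUDA's exact semantics.

```python
CPU = -1  # hypothetical identifier for the host; GPUs use ids 0, 1, ...


class TrackedArray:
    """Tracks the locations where the latest copy of an array is available."""

    def __init__(self):
        self.up_to_date = {CPU}  # newly allocated arrays are current on the host

    def written_by(self, location):
        # Assumption: a write invalidates every other copy.
        self.up_to_date = {location}

    def read_by(self, location):
        # Assumption: after a read, the reader also holds the latest copy.
        self.up_to_date.add(location)

    def cpu_access_needs_scheduling(self):
        # A single set lookup replaces the old boolean flag.
        return CPU not in self.up_to_date
```

In this model the fast path of the old flag falls out naturally: consecutive CPU accesses never leave the host out of date, so none of them is scheduled.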

# 2021-12-06

* Fixed a major bug that prevented CPU reads on read-only arrays in use by the GPU. The problem appeared only on Pascal and newer devices.
* Started integrating API to track on which devices a certain array is currently up-to-date. Slightly modified from the original multi-GPU API.

# 2021-12-05

* Updated options in GrCUDA to support new multi-gpu flags.
* Improved initialization of ExecutionContext, now it takes GrCUDAOptionMap as parameter.
* Improved GrCUDAOptionMap testing, and integrated preliminary multi-GPU tests.
* Renamed GrCUDAExecutionContext to AsyncGrCUDAExecutionContext.
* Integrated multi-GPU features into CUDARuntime
* Improved the interface to measure the execution time of ComputationalElements (now the role of "ProfilableElement" is clearer, and execution time logging has been moved inside ComputationalElement instead of using StreamManager)
* Improved manual selection of GPU
* Unsupported tests (e.g. tests for multi-GPU if just 1 GPU is available) are properly skipped, instead of failing or completing successfully without info
* Temporary fix for GRCUDA-56: cuBLAS is disabled on pre-Pascal GPUs if the async scheduler is selected

# 2021-11-30

* Updated python benchmark suite to integrate multi-gpu code.
* Minor updates in naming conventions (e.g. using snake_case instead of CamelCase)
* We might still want to update the python suite (for example the output dict structure), but for now this should work.

# 2021-11-29

* Removed deprecation warning for Truffle's ArityException.
* Updated the benchmark suite with CUDA multi-GPU benchmarks. Also fixed a GPU out-of-bounds access in B9.

# 2021-11-21

* Enabled support for cuSPARSE
* Added support for CSR and COO `spmv` and `gemvi`.
* **Known limitation:** Tgemvi works only with single-precision floating-point arithmetic.

# 2021-11-17

* Added support for precise timing of kernels, for debugging and complex scheduling policies
* Associated a CUDA event to the start of the computation in order to get the elapsed time from the start to the end
* Added an `ElapsedTime` function to compute the elapsed time between events, i.e. the total execution time
* Logging of kernel timers is controlled by the `grcuda.TimeComputation` option, which is false by default
* Implemented with the ProfilableElement class to store timing values in a hash table and support future business logic
* Updated documentation for the use of the new `TimeComputation` option in README
* Considerations:
  * `ProfilableElement` is profilable (`true`) by default, and any `ConfiguredKernel` is initialized with this configuration. To date, there isn't any use for a `ProfilableElement` that is not profilable (`false`)
  * To date, we are tracking only the last execution of a `ConfiguredKernel` on each device. It will be useful in the future to track all the executions and leverage this information in our scheduler

# 2021-11-15

* Added read-only polyglot map to retrieve grcuda options. Retrieve it with `getoptions`. Option names and values are provided as strings. Find the full list of options in `GrCUDAOptions`.

# 2021-11-04

* Enabled the usage of TruffleLoggers for logging the execution of grcuda code
* GrCUDA is characterized by the presence of several different types of loggers, each one with its own functionality
* Implemented the GrCUDALogger class to provide access to the loggers of interest when specific features are needed
* Replaced all print statements in the source code with log events, using different logging levels
* Added documentation about logging in docs

# 2021-10-13

* Enabled support for cuBLAS and cuML in the async scheduler
* Stream management is now supported for both cuML and cuBLAS
* This feature can be possibly applied to any library, by extending the `LibrarySetStreamFunction` class
* Set TensorRT support to experimental
* TensorRT is currently not supported on CUDA 11.4, making it impossible to use alongside a recent version of cuML
* **Known limitation:** due to this incompatibility, TensorRT is currently not available on the async scheduler

# 2021-09-30, Release 1

## API Changes

* Added option to specify arguments in NFI kernel signatures as `const`
* The effect is the same as marking them as `in` in the NIDL syntax
* It is not strictly required to have the corresponding arguments in the CUDA kernel marked as `const`, although
that's recommended
* Marking arguments as `const` or `in` enables the async scheduler to overlap kernels that use the same read-only
arguments
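
The effect of `const`/`in` arguments on scheduling can be illustrated with a small dependency check. This Python sketch is not the scheduler's code: the function name and the `(array_name, is_const)` representation are hypothetical, and it only models the rule stated above, that two kernels sharing an array can overlap when both treat it as read-only.

```python
def depends_on(new_args, old_args):
    """Return True if the new kernel must wait for the old one.

    Each argument list holds (array_name, is_const) pairs. Two kernels
    conflict on a shared array unless both of them only read it.
    """
    old = dict(old_args)
    return any(
        name in old and not (is_const and old[name])
        for name, is_const in new_args
    )


if __name__ == "__main__":
    k1 = [("x", True), ("y", False)]   # reads x, writes y
    k2 = [("x", True), ("z", False)]   # reads x, writes z
    k3 = [("y", True), ("w", False)]   # reads y, writes w
    print(depends_on(k2, k1))  # k2 shares only the read-only x with k1
    print(depends_on(k3, k1))  # k3 reads y, which k1 writes
```

If `x` were not marked as `const` in either kernel, the scheduler would have to assume a write and serialize `k1` and `k2`.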

## New asynchronous scheduler

* Added a new asynchronous scheduler for GrCUDA, enable it with `--experimental-options --grcuda.ExecutionPolicy=async`
* With this scheduler, GPU kernels are executed asynchronously. Once they are launched, the host execution resumes
immediately
* The computation is synchronized (i.e. the host thread is stalled and waits for the kernel to finish) only once GPU
data are accessed by the host thread
* Execution of multiple kernels (operating on different data, e.g. distinct DeviceArrays) is overlapped using
different streams
* Data transfer and execution (on different data, e.g. distinct DeviceArrays) are overlapped using different streams
* The scheduler supports different options, see `README.md` for the full list
* It is the scheduler presented in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a
Polyglot GPU Runtime" (IPDPS 2021)
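
The behaviour described above (the launch returns immediately, and the host blocks only when it reads the data) can be mimicked on the CPU. The following is a minimal sketch using Python threads, not GrCUDA code: `ToyDeviceArray`, `launch`, and `slow_double` are hypothetical names, and a thread pool stands in for GPU streams.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)  # stands in for a pool of GPU streams


class ToyDeviceArray:
    """An array that remembers the last background computation writing to it."""

    def __init__(self, values):
        self._values = list(values)
        self._pending = None  # future of the kernel still writing this array

    def sync(self):
        """Block the host until the pending kernel, if any, has finished."""
        if self._pending is not None:
            self._pending.result()
            self._pending = None

    def __getitem__(self, index):
        self.sync()  # host access is the only synchronization point
        return self._values[index]


def launch(kernel, output, *inputs):
    """Start `kernel` in the background and return to the host immediately."""
    output._pending = _executor.submit(kernel, output._values, *inputs)


def slow_double(out, gate):
    gate.wait()  # simulate a long-running kernel
    for i in range(len(out)):
        out[i] *= 2


gate = threading.Event()
a = ToyDeviceArray([1, 2, 3])
launch(slow_double, a, gate)
launched_without_waiting = not a._pending.done()  # the kernel is still running
gate.set()
print(a[0], a[1], a[2])  # the first access waits for the kernel: 2 4 6
```

The sketch only tracks the array a kernel writes; a real scheduler also has to track the arrays a kernel reads, which is where the dependency computation comes in.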

## New features

* Added generic AbstractArray data structure, which is extended by DeviceArray, MultiDimDeviceArray,
MultiDimDeviceArrayView, and provides high-level array interfaces
* Added API for prefetching
* If enabled (and using a GPU with the Pascal architecture or newer), it prefetches data to the GPU before
executing a kernel, instead of relying on page faults for data transfer. It can greatly improve performance
* Added API for stream attachment
* Always enabled on GPUs with an architecture older than Pascal when the async scheduler is active. With the sync
scheduler, it can be manually enabled
* It restricts the visibility of GPU data to the specified stream
* On the Pascal architecture or newer, it can provide a small performance benefit
* Added `copyTo/copyFrom` functions on generic arrays (Truffle interoperable objects that expose the array API)
* Internally, the copy is implemented as a for loop, instead of using CUDA's `memcpy`
* In many cases it is still faster than copying with loops in the host language, especially if the host code is
not JIT-compiled
* It is also used for copying data to/from DeviceArrays with column-major layout, as `memcpy` cannot copy
non-contiguous data
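
To see why a column-major array needs an element-wise copy, consider flat buffers: a row of a column-major matrix is not contiguous in memory, so a single block copy cannot produce the row-major layout. The following is a minimal sketch in plain Python, not GrCUDA's `copyTo`/`copyFrom` implementation; `copy_column_major_to_row_major` is a hypothetical helper.

```python
def copy_column_major_to_row_major(src, rows, cols):
    """Copy a flat column-major buffer into a new flat row-major buffer."""
    dst = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # Element (r, c) lives at c * rows + r in column-major order
            # and at r * cols + c in row-major order.
            dst[r * cols + c] = src[c * rows + r]
    return dst


# The 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored column by column.
column_major = [1, 4, 2, 5, 3, 6]
print(copy_column_major_to_row_major(column_major, rows=2, cols=3))  # [1, 2, 3, 4, 5, 6]
```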

## Demos, benchmarks and code samples

* Added demo used at SeptembeRSE 2021 (`demos/image_pipeline_local` and `demos/image_pipeline_web`)
* It shows an image processing pipeline that applies a retro look to images. We have a local version and a web
version that displays results in a web page
* Added benchmark suite written in Graalpython, used in "DAG-based Scheduling with Resource Sharing for Multi-task
Applications in a Polyglot GPU Runtime" (IPDPS 2021)
* It is a collection of complex multi-kernel benchmarks meant to show the benefits of asynchronous scheduling.

## Miscellaneous

* Added dependency on the `grcuda-data` submodule, used to store data, results and plots used in publications and demos.
* Updated name "grCUDA" to "GrCUDA". It looks better, doesn't it?
* Added support for Java 11 along with Java 8
* Added option to specify the location of cuBLAS and cuML with environment variables (`LIBCUBLAS_DIR` and `LIBCUML_DIR`)
* Refactored package hierarchy to reflect changes to current GrCUDA (e.g. `gpu -> runtime`)
* Added basic support for TruffleLogger
* Removed a number of existing deprecation warnings
* Added around 800 unit tests, with support for extensive parametrized testing and GPU mocking
* Updated documentation
* Bumped GraalVM version to 21.2
* Added scripts to set up a new machine from scratch (e.g. on OCI), plus other OCI-specific utility scripts
(see `oci_setup/`)
* Added documentation on setting up IntelliJ IDEA for GrCUDA development
* Added documentation about Python benchmark suite
* Added documentation on asynchronous scheduler options
12 changes: 10 additions & 2 deletions LICENSE
@@ -1,4 +1,6 @@
Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2019, 2020, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2019, 2020, Oracle and/or its affiliates. All rights reserved.
Copyright (c) 2020, 2021, NECSTLab, Politecnico di Milano. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
@@ -11,6 +13,12 @@ are met:
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
* Neither the name of NECSTLab nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
* Neither the name of Politecnico di Milano nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
@@ -25,5 +33,5 @@ OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


grCUDA depends on Truffle APIs licensed under the Universal Permissive
GrCUDA depends on Truffle APIs licensed under the Universal Permissive
License (UPL), Version 1.0 (https://opensource.org/licenses/UPL).