Porting TPC Data compression decoding to GPU #12616

Merged
merged 16 commits into from
Mar 4, 2024

Contributor

@cima22 cima22 commented Jan 30, 2024

On average, about 300 of 53175874 clusters are not decoded correctly, possibly because of a fast-math problem. GPU decoding still needs to be incorporated into the offline chain.
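
A mismatch count like the one quoted here could be obtained by comparing the GPU-decoded clusters element by element against a CPU-decoded reference. The following is only a minimal sketch under stated assumptions: `Cluster`, `sameCluster`, and `countMismatches` are hypothetical names that do not come from O2, and the three-field record is an assumed simplification of a real cluster type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified cluster record (assumption: the real type has more fields).
struct Cluster {
  std::uint16_t pad;
  std::uint16_t time;
  std::uint16_t charge;
};

// Exact field-by-field comparison of two clusters.
inline bool sameCluster(const Cluster& a, const Cluster& b)
{
  return a.pad == b.pad && a.time == b.time && a.charge == b.charge;
}

// Counts the positions at which the candidate stream differs from the reference.
// Both streams are assumed to hold the same number of clusters in the same order.
inline std::size_t countMismatches(const std::vector<Cluster>& reference,
                                   const std::vector<Cluster>& candidate)
{
  assert(reference.size() == candidate.size());
  std::size_t mismatches = 0;
  for (std::size_t i = 0; i < reference.size(); ++i) {
    if (!sameCluster(reference[i], candidate[i])) {
      ++mismatches;
    }
  }
  return mismatches;
}
```

Because the comparison is exact, any rounding difference that changes a stored integer value would be counted, which is the kind of discrepancy a fast-math setting could plausibly introduce.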

@cima22 cima22 requested review from davidrohr, wiechula, shahor02 and a team as code owners January 30, 2024 10:09
davidrohr previously approved these changes Jan 30, 2024
Collaborator

@davidrohr davidrohr left a comment

Approving to start the CI, please do not merge yet!
The code-formatting check fails. Could you please run git-clang-format to fix the formatting? Please also squash all test commits and leave only meaningful commits in. Alternatively, we can squash the full PR when merging.

@alibuild
Collaborator

Error while checking build/O2/fullCI for f0249ef at 2024-01-30 11:34:

## sw/BUILD/O2-latest/log
c++: error: unrecognized command-line option '--rtlib=compiler-rt'
c++: error: unrecognized command-line option '--rtlib=compiler-rt'
"/sw/SOURCES/O2/12616-slc8_x86-64/0/GPU/GPUTracking/Base/GPUConstantMem.h", line 73: error: 
ninja: build stopped: subcommand failed.

Full log here.

@cima22 cima22 changed the title from "Porting TPC Data compression decoding to GPU -- not ready yet" to "Porting TPC Data compression decoding to GPU" on Jan 30, 2024
@cima22 cima22 force-pushed the TPCGPUDecoding branch 2 times, most recently from c49f074 to 166b6b8 on January 30, 2024 22:07
Collaborator

@davidrohr davidrohr left a comment

Looks pretty good so far, though I didn't fully check the functionality. I mostly have a couple of cosmetic comments.

CMakeLists.txt (outdated, resolved)
GPU/CMakeLists.txt (outdated, resolved)
GPU/GPUTracking/CMakeLists.txt (outdated, resolved)
GPU/GPUTracking/DataCompression/GPUTPCDecompression.cxx (outdated, resolved)
GPU/GPUTracking/Global/GPUChainTrackingCompression.cxx (outdated, resolved)
GPU/GPUTracking/Global/GPUChainTrackingCompression.cxx (outdated, resolved)
GPU/GPUTracking/Global/GPUChainTrackingCompression.cxx (outdated, resolved)
GPU/GPUTracking/Global/GPUChainTrackingCompression.cxx (outdated, resolved)
GPU/GPUTracking/Global/GPUChainTrackingCompression.cxx (outdated, resolved)
GPU/GPUTracking/Standalone/Benchmark/standalone.cxx (outdated, resolved)
@cima22 cima22 force-pushed the TPCGPUDecoding branch 3 times, most recently from f920549 to ce91114 on February 20, 2024 15:15
@cima22 cima22 force-pushed the TPCGPUDecoding branch 2 times, most recently from 46aab55 to 60b937e on February 27, 2024 16:07
Collaborator

@davidrohr davidrohr left a comment

approved to start the CI

davidrohr previously approved these changes Feb 28, 2024
Collaborator

@davidrohr davidrohr left a comment

Looks good, let's see what the CI says. I have only 2 cosmetic comments below, which can also be fixed afterwards.

GPU/GPUTracking/Standalone/Benchmark/standalone.cxx (outdated, resolved)
GPU/GPUTracking/Base/cuda/CMakeLists.txt (outdated, resolved)
@alibuild
Collaborator

alibuild commented Feb 28, 2024

Error while checking build/O2/fullCI for 60b937e at 2024-02-29 08:55:

## sw/BUILD/O2-latest/log
c++: error: unrecognized command-line option '--rtlib=compiler-rt'
c++: error: unrecognized command-line option '--rtlib=compiler-rt'


## sw/BUILD/O2-full-system-test-latest/log
Detected critical problem in logfile reco_ASYNC.log
reco_ASYNC.log:[56150:gpu-reconstruction]: [07:55:15][ERROR] Exception caught: cluster native output ptrs out of sync 
[56139:ctp-entropy-decoder]: [07:55:09][ERROR] LM:375 L0:0 L1:2 TwI:9 Trigger classes wo input:336
[56150:gpu-reconstruction]: [07:55:15][ERROR] Exception caught: cluster native output ptrs out of sync 
[ERROR] Workflow crashed - PID 56150 (gpu-reconstruction) did not exit correctly however it's not clear why. Exit code forced to 128.
[ERROR]  - Device gpu-reconstruction: pid 56150 (exit 128)
[ERROR] SEVERE: Device gpu-reconstruction (56150) returned with 128


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/f7f0975f063caa4cd3c7259ea5a388a3baf92584/slc8_x86-64/o2checkcode/1.0-local1965/etc/modulefiles
++ cat
--

Full log here.

@davidrohr
Collaborator

@ktf: The framework core ctest seems to be failing regularly in the CI today. Is this a new problem?

@ktf
Member

ktf commented Feb 28, 2024

I am checking.

@ktf
Member

ktf commented Feb 29, 2024

Should be fixed by #12781.

@alibuild
Collaborator

alibuild commented Mar 1, 2024

Error while checking build/O2/fullCI for d3ce442 at 2024-03-01 18:19:

## sw/BUILD/O2-latest/log
c++: error: unrecognized command-line option '--rtlib=compiler-rt'
c++: error: unrecognized command-line option '--rtlib=compiler-rt'


## sw/BUILD/O2-full-system-test-latest/log
Detected critical problem in logfile reco_ASYNC.log
reco_ASYNC.log:[57300:gpu-reconstruction]: [17:18:59][ERROR] Exception caught: cluster native output ptrs out of sync 
[57289:ctp-entropy-decoder]: [17:18:45][ERROR] LM:274 L0:0 L1:1 TwI:5 Trigger classes wo input:226
[57300:gpu-reconstruction]: [17:18:59][ERROR] Exception caught: cluster native output ptrs out of sync 
[ERROR] Workflow crashed - PID 57300 (gpu-reconstruction) did not exit correctly however it's not clear why. Exit code forced to 128.


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/a676fe5b5d28da865a191682e551427da3fe0d0f/slc8_x86-64/o2checkcode/1.0-local1343/etc/modulefiles
++ cat
--

Full log here.

@alibuild
Collaborator

alibuild commented Mar 1, 2024

Error while checking build/O2/fullCI for 7ee82f7 at 2024-03-02 00:54:

## sw/BUILD/O2-latest/log
c++: error: unrecognized command-line option '--rtlib=compiler-rt'
c++: error: unrecognized command-line option '--rtlib=compiler-rt'


## sw/BUILD/O2-full-system-test-latest/log
task timeout reached .. killing all processes


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/048753e9843ae3df9b1f1af0116c6bbe04e1e974/slc8_x86-64/o2checkcode/1.0-local150/etc/modulefiles
++ cat
--

Full log here.

@alibuild
Collaborator

alibuild commented Mar 3, 2024

Error while checking build/O2/fullCI for 579d623 at 2024-03-03 20:55:

## sw/BUILD/O2-latest/log
c++: error: unrecognized command-line option '--rtlib=compiler-rt'
c++: error: unrecognized command-line option '--rtlib=compiler-rt'


## sw/BUILD/O2-full-system-test-latest/log
task timeout reached .. killing all processes


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/9ef1b1e20b4211ceeba526cedaf5ed6530849596/slc8_x86-64/o2checkcode/1.0-local162/etc/modulefiles
++ cat
--

Full log here.

Collaborator

@davidrohr davidrohr left a comment

can be merged now if the FullCI becomes green

@davidrohr
Collaborator

The full CI failed randomly again. I tried it locally and it worked, and it also works as part of #12799, so it should be fine. Merging.

@davidrohr davidrohr merged commit b4a04fd into AliceO2Group:dev Mar 4, 2024
11 of 13 checks passed
4 participants