GPU docs
Andrew-McNab-UK committed May 24, 2024
1 parent a4e7b16 commit f88dca3
Showing 9 changed files with 79 additions and 8 deletions.
1 change: 1 addition & 0 deletions agents/justin-wrapper-job
@@ -653,6 +653,7 @@ with open('home/justin-jobscript-env.sh','w') as f:
% (jobscriptDict['jobsub_id'], jobscriptDict['jobscript_secret']))

if getJobscriptDict['gpu_uuid']:
f.write('export CUDA_VISIBLE_DEVICES=%s\n' % getJobscriptDict['gpu_uuid'])
f.write('export LD_LIBRARY_PATH=/.singularity.d/libs\n')

# JSON for justin-get-file command to use
5 changes: 2 additions & 3 deletions dashboard/justin-wsgi-dashboard
@@ -2448,7 +2448,7 @@ def showJob(environ, user, cgiValues):
'<td colspan="2">%s</td></tr>'
% spaceForUnixEpoch(jobRow["heartbeat_time"]))

- output += ('<tr><td rowspan="7">From worker node</td><td>Hostname</td>'
+ output += ('<tr><td rowspan="8">From worker node</td><td>Hostname</td>'
'<td>%s</td></tr>'
% html.escape(jobRow["hostname"], quote=True))

@@ -2464,8 +2464,7 @@ def showJob(environ, user, cgiValues):
'<td>%s</td></tr>'
% jobRow["processors"])

- if jobRow['gpu_info']:
-   output += ('<tr><td>GPU</td>'
+ output += ('<tr><td>GPU</td>'
'<td>%s</td></tr>'
% jobRow["gpu_info"])

1 change: 1 addition & 0 deletions docs/index.md
@@ -20,6 +20,7 @@ design described in the DUNE Offline Computing Conceptual Design Report.
- [Jobscripts](jobscripts.md)
- [Interactive testing](jobscripts.interactive_tests.md)
- [Rapid Code Distribution Service](jobscripts.rcds.md)
+ - [Support for GPU jobscripts](jobscripts.gpu.md)

## System Components

24 changes: 24 additions & 0 deletions docs/jobscripts.gpu.md
@@ -0,0 +1,24 @@
# Support for GPU jobscripts

If your application requires an NVIDIA GPU, then you can request one by
giving the option `--gpu` to the `justin create-stage` or
`justin simple-workflow` commands as described in the
[justin man page](justin_command.man_page.md).

**Currently only a limited number of sites offer GPUs to DUNE, so you may
need to wait significantly longer than usual, possibly hours, for jobs in
the workflow to start running.**

The CUDA libraries, drivers, `/dev/nvidiaX` devices, and tools like `nvidia-smi` are
made available to your jobscript in the usual way. `$CUDA_VISIBLE_DEVICES` is
set to the UUID of the GPU allocated to your job by the site, in the newer
`GPU-uuid` form and *not* as an index such as 0, 1 or 2. Please do not try to
use any other GPUs you might be able to access: CUDA should respect
`$CUDA_VISIBLE_DEVICES` as given and so use only the GPU the site has allocated.
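
As a rough illustration of the paragraph above, a Python payload started from a
jobscript could check that `$CUDA_VISIBLE_DEVICES` holds `GPU-uuid` style
identifiers before handing control to CUDA. The function name
`allocated_gpu_uuids` and the strict rejection of numeric indices are
assumptions made for this sketch; neither is part of justIN.

```python
import os


def allocated_gpu_uuids(environ=None):
    """Return the GPU UUIDs listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper, not part of justIN. It assumes every entry
    uses the "GPU-<uuid>" form and raises ValueError for anything
    else, such as a plain index like "0".
    """
    if environ is None:
        environ = os.environ

    value = environ.get("CUDA_VISIBLE_DEVICES", "")
    # The variable may hold a comma-separated list; ignore empty entries
    uuids = [entry.strip() for entry in value.split(",") if entry.strip()]

    for entry in uuids:
        if not entry.startswith("GPU-"):
            raise ValueError("expected a GPU-uuid entry, got %r" % entry)

    return uuids
```

Calling `allocated_gpu_uuids()` with no arguments reads the real environment and
returns an empty list when the variable is unset, which a payload could treat as
"no GPU was allocated".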

Once the job starts, it reports information to justIN about the GPU it has
discovered, including the GPU model name, the driver version, the compute
capability, the VBIOS version, and the nonreserved memory in MiB.
This information is shown on the job's own page in the dashboard.
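
Similar fields can also be inspected from inside a jobscript. The sketch below
shows one possible way to do that from Python by parsing CSV output from
`nvidia-smi`; it is not a description of how justIN itself gathers the
information. The helper names are hypothetical, `memory.total` is only a
stand-in that may differ from the nonreserved figure shown in the dashboard, and
the `compute_cap` query field is assumed to be offered by the installed driver.

```python
import subprocess

# Query fields assumed for this sketch; check `nvidia-smi --help-query-gpu`
# on the worker node, since older drivers may not offer compute_cap.
FIELDS = ["name", "driver_version", "compute_cap", "vbios_version", "memory.total"]


def parse_gpu_csv(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by FIELDS."""
    values = [value.strip() for value in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def query_gpus():
    """Run nvidia-smi and return one dict per GPU it reports (hypothetical helper)."""
    output = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True).stdout
    return [parse_gpu_csv(line) for line in output.splitlines() if line.strip()]
```

Keeping the parsing separate from the `nvidia-smi` call means `parse_gpu_csv`
can be exercised on a machine without a GPU.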


File renamed without changes.
8 changes: 5 additions & 3 deletions docs/justin_command.man_page.md
@@ -105,7 +105,7 @@ This man page is distributed along with the

create-stage --workflow-id ID --stage-id ID --jobscript
FILENAME|--jobscript-git ORG/PATH:TAG [--wall-seconds N]
- [--rss-mib N] [--processors N] [--max-distance DIST]
+ [--rss-mib N] [--processors N] [--gpu] [--max-distance DIST]
[--output-pattern PATTERN[:DESTINATION]]
[--output-pattern-next-stage PATTERN[:DATASET]] [--output-rse
NAME] [--lifetime-days DAYS] [--env NAME=VALUE] [--classad
@@ -136,7 +136,9 @@ This man page is distributed along with the
$JUSTIN_RSS_MIB. If the script can make use of multiple
processors then --processors can be used to give the number
needed, with a default of 1 if not given. The value used is
- available to jobscripts as $JUSTIN_PROCESSORS.
+ available to jobscripts as $JUSTIN_PROCESSORS. If --gpu is
+ given then jobs for this stage are required to have access to
+ a GPU.

By default, input files will only be allocated to a script which
are on storages at the same site (distance=0). This can be
@@ -207,7 +209,7 @@ This man page is distributed along with the
[--scope SCOPE] [--refind-end-date YYYYMMDD]
[--refind-interval-hours HOURS] --jobscript
FILENAME|--jobscript-git ORG/PATH:TAG [--wall-seconds N]
- [--rss-mib N] [--processors N] [--max-distance DIST]
+ [--rss-mib N] [--processors N] [--gpu] [--max-distance DIST]
[--output-pattern PATTERN[:DESTINATION]] [--output-rse NAME]
[--lifetime-days DAYS] [--env NAME=VALUE] [--classad NAME=VALUE]
Combines the create-workflow, create-stage and submit-workflow
2 changes: 1 addition & 1 deletion docs/make-justin-man-pages
@@ -66,5 +66,5 @@ justin-webdav-upload command itself.
EOF
mandoc -T ascii ../commands/justin-webdav-upload.1 | col -b | sed 's/.*/ &/'

- ) > jobscripts.webdav-upload.man_page.md
+ ) > justin-webdav-upload.man_page.md

2 changes: 1 addition & 1 deletion services/justin-wsgi-allocator
@@ -720,7 +720,7 @@ def getJobscriptMethod(startResponse, jsonDict, environ):
'processors=%d,'
'wall_seconds=%d,'
'has_inner_apptainer=%s,'
- 'gpu_info="" '
+ 'gpu_info="%s" '
'WHERE jobsub_id="%s" AND '
'workflow_id=%d AND stage_id=%d AND '
'job_state="submitted"' %
44 changes: 44 additions & 0 deletions testing/hello-gpu.jobscript
@@ -0,0 +1,44 @@
#!/bin/bash
: <<'EOF'
GPU Hello World jobscript for justIN

Submit a workflow like this to run 10 jobs on workers with GPUs:

  justin simple-workflow --monte-carlo 10 --jobscript hello-gpu.jobscript --gpu

Or like this to run jobs and put the output file into Rucio-managed storage:

  justin simple-workflow \
    --monte-carlo 10 \
    --jobscript hello-gpu.jobscript \
    --gpu \
    --description 'Hello GPU!!!' \
    --scope usertests \
    --output-pattern 'hello-world-*.txt:output-test-01'
EOF

# Check the GPU environment
printenv | grep -i cuda
nvidia-smi

# Try to get an unprocessed file from this stage
did_pfn_rse=$("$JUSTIN_PATH/justin-get-file")

if [ -n "$did_pfn_rse" ] ; then
  # Split the reply into its DID, PFN and RSE fields
  did=$(echo "$did_pfn_rse" | cut -f1 -d' ')
  pfn=$(echo "$did_pfn_rse" | cut -f2 -d' ')
  rse=$(echo "$did_pfn_rse" | cut -f3 -d' ')

  # Hello world to a txt file
  echo "Hello world $pfn" > "hello-world-$(date +%s.%N).txt"

  # Hello world to the jobscript log
  if echo "Hello world $pfn" ; then
    # If echo succeeds, then say we processed the file successfully
    echo "$pfn" > justin-processed-pfns.txt
  fi
fi

exit 0
