Improving the GPU usage efficiency during training #75

Open · WJiangH opened this issue Aug 28, 2024 · 3 comments
WJiangH commented Aug 28, 2024

Dear authors,

When I use a Tesla T4 GPU to train a model, I find that the node utilization is quite low.

For example:

Job is running on nodes: inf018

Node utilization is:
    node  cores   load    pct      mem     used    pct
  inf018     32    1.6    5.1  187.5GB   43.2GB   23.1

Only 5.1% of the CPU cores are being used. While the time/eval of ~70 mcs/at (microseconds per atom) for 640 atomic functions is acceptable, I think the usage efficiency can be improved.

Here are more details about the node:

NodeName=inf018 Arch=x86_64 CoresPerSocket=16 
   CPUAlloc=28 CPUEfctv=32 CPUTot=32 CPULoad=1.62
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:turing:1
   NodeAddr=inf018 NodeHostName=inf018 Version=23.02.7
   OS=Linux 3.10.0-1160.90.1.el7.x86_64 #1 SMP Thu May 4 15:21:22 UTC 2023 
   RealMemory=191960 AllocMem=159488 FreeMem=147676 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=t4_dev_q 
   BootTime=2024-05-20T09:17:14 SlurmdStartTime=2024-05-20T09:24:21
   LastBusyTime=2024-08-28T09:03:28 ResumeAfterTime=None
   CfgTRES=cpu=32,mem=191960M,billing=32,gres/gpu=1,gres/gpu:turing=1
   AllocTRES=cpu=28,mem=159488M,gres/gpu=1,gres/gpu:turing=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Since BFGS runs entirely on the CPU, I expect that improving the CPU usage could shorten the training time. Also, would increasing the number of GPUs shorten the training time?

Best,
JJ

yury-lysogorskiy (Member) commented Aug 28, 2024
There are two stages:

  1. a GPU stage with gradient accumulation
  2. a CPU stage with the BFGS update

Depending on the moment at which you check your usage, you will see either low CPU or low GPU usage. Use nvitop to see both simultaneously, along with their history. (A toy sketch of this alternating pattern is given below.)
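A minimal toy sketch of the alternating two-stage pattern, using scipy's BFGS on a synthetic least-squares loss. This is an illustration of the general loop only, not pacemaker's actual code:

```python
# Toy illustration: stage 1 accumulates the loss and its gradient over batches
# (in pacemaker this part runs on the GPU); stage 2 is the serial, CPU-only
# BFGS update inside scipy. Not pacemaker's actual implementation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
batches = [rng.normal(size=(64, 3)) for _ in range(10)]  # synthetic data

def loss_and_grad(w):
    total_loss, total_grad = 0.0, np.zeros_like(w)
    for x in batches:                  # "GPU stage": batch-wise accumulation
        r = x @ w - 1.0
        total_loss += 0.5 * float(r @ r)
        total_grad += x.T @ r
    return total_loss, total_grad

# "CPU stage": BFGS consumes the accumulated loss/gradient once per iteration
res = minimize(loss_and_grad, np.zeros(3), jac=True, method="BFGS")
print(res.x, res.nit)
```

While the batch loop runs, only the accelerator is busy; while the BFGS update runs, only one CPU thread is busy, which is why instantaneous utilization snapshots swing between the two extremes.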

pacemaker uses only a single GPU. If you want a speed-up (both knobs are sketched in the snippet below):

  1. try to increase the batch size as much as possible (to fully utilize the GPU);
  2. maybe switch from BFGS to the L-BFGS-B optimization algorithm. But it is less efficient, i.e. it requires more steps, so the overall walltime can end up the same.
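For reference, a minimal sketch of adjusting these knobs in pacemaker's input.yaml. The key names (`backend.batch_size`, `fit.optimizer`) are assumptions based on the usual input.yaml layout; check them against the input file of your pacemaker version:

```python
# Sketch: bump the batch size and switch the optimizer in input.yaml.
# Key names are assumed, not verified against every pacemaker version.
import yaml

with open("input.yaml") as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("backend", {})["batch_size"] = 1000   # raise until GPU memory is nearly full
cfg.setdefault("fit", {})["optimizer"] = "L-BFGS-B"  # cheaper steps, but possibly more of them

with open("input.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```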

WJiangH (Author) commented Aug 28, 2024

> There are two stages:
>
>   1. a GPU stage with gradient accumulation
>   2. a CPU stage with the BFGS update
>
> Depending on the moment at which you check your usage, you will see either low CPU or low GPU usage. Use nvitop to see both simultaneously, along with their history.
>
> pacemaker uses only a single GPU. If you want a speed-up:
>
>   1. try to increase the batch size as much as possible (to fully utilize the GPU);
>   2. maybe switch from BFGS to the L-BFGS-B optimization algorithm. But it is less efficient, i.e. it requires more steps, so the overall walltime can end up the same.

I logged the CPU and GPU utilization every 5 minutes; the GPU is actually fully used:

2024/08/28 10:16:53.515, Tesla T4, 0 %, 0 %, 15360 MiB, 15101 MiB, 0 MiB
2024/08/28 10:21:53.516, Tesla T4, 100 %, 64 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:26:53.518, Tesla T4, 96 %, 61 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:31:53.519, Tesla T4, 68 %, 23 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:36:53.520, Tesla T4, 94 %, 72 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:41:53.521, Tesla T4, 93 %, 54 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:46:53.522, Tesla T4, 79 %, 32 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:51:53.522, Tesla T4, 64 %, 43 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:56:53.524, Tesla T4, 75 %, 32 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:01:53.525, Tesla T4, 91 %, 67 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:06:53.526, Tesla T4, 77 %, 53 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:11:53.527, Tesla T4, 79 %, 39 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:16:53.528, Tesla T4, 83 %, 61 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:21:53.529, Tesla T4, 96 %, 74 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:26:53.529, Tesla T4, 77 %, 30 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:31:53.530, Tesla T4, 92 %, 55 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:36:53.531, Tesla T4, 83 %, 33 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:41:53.531, Tesla T4, 70 %, 32 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:46:53.533, Tesla T4, 74 %, 34 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:51:53.534, Tesla T4, 80 %, 34 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:56:53.535, Tesla T4, 79 %, 55 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 12:01:53.536, Tesla T4, 73 %, 30 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 12:06:53.536, Tesla T4, 77 %, 50 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 12:11:53.537, Tesla T4, 95 %, 56 %, 15360 MiB, 851 MiB, 14251 MiB
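(A CSV log in this format can be produced via NVML; the sketch below, using the nvidia-ml-py bindings, is an assumption about the tooling rather than the exact command used here:)

```python
# Sketch: reproduce the columns above (timestamp, name, GPU util, memory util,
# total/used/free memory) via NVML, sampling every 300 s. Assumes the
# nvidia-ml-py package is installed; not necessarily the logger used above.
import time
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(h)
if isinstance(name, bytes):  # older bindings return bytes
    name = name.decode()
while True:
    u = pynvml.nvmlDeviceGetUtilizationRates(h)
    m = pynvml.nvmlDeviceGetMemoryInfo(h)
    print(f"{time.strftime('%Y/%m/%d %H:%M:%S')}, {name}, {u.gpu} %, {u.memory} %, "
          f"{m.total >> 20} MiB, {m.used >> 20} MiB, {m.free >> 20} MiB")
    time.sleep(300)
```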

But the CPU is under-utilized. For example:

Linux 3.10.0-1160.90.1.el7.x86_64 (inf018)      08/28/2024      _x86_64_        (32 CPU)
12:12:16 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:12:17 PM  all    6.67    0.00    1.19    0.00    0.00    0.00    0.00    0.00    0.00   92.14
12:12:17 PM    0    6.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   93.00
12:12:17 PM    1    4.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM    2    5.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   94.00
12:12:17 PM    3   10.10    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   88.89
12:12:17 PM    4    7.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   92.00
12:12:17 PM    5    8.82    0.00    1.96    0.00    0.00    0.00    0.00    0.00    0.00   89.22
12:12:17 PM    6    6.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   93.00
12:12:17 PM    7    6.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   93.00
12:12:17 PM    8    5.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00    0.00   93.00
12:12:17 PM    9    5.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM   10    2.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   97.00
12:12:17 PM   11    4.04    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   94.95
12:12:17 PM   12    4.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   96.00
12:12:17 PM   13    5.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM   14    5.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   94.00
12:12:17 PM   15    4.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM   16   20.41    0.00    5.10    0.00    0.00    0.00    0.00    0.00    0.00   74.49
12:12:17 PM   17    4.95    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   94.06
12:12:17 PM   18    4.04    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.96
12:12:17 PM   19   11.22    0.00    2.04    0.00    0.00    0.00    0.00    0.00    0.00   86.73
12:12:17 PM   20    8.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   91.00
12:12:17 PM   21    3.96    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   95.05
12:12:17 PM   22    8.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   91.00
12:12:17 PM   23    5.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM   24    4.08    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.92
12:12:17 PM   25    5.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   94.06
12:12:17 PM   26   15.00    0.00    6.00    0.00    0.00    0.00    0.00    0.00    0.00   79.00
12:12:17 PM   27    6.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   94.00
12:12:17 PM   28   14.00    0.00    3.00    0.00    0.00    0.00    0.00    0.00    0.00   83.00
12:12:17 PM   29    8.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   91.00
12:12:17 PM   30    4.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   96.00
12:12:17 PM   31    4.04    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.96

I can probably try L-BFGS-B later; I was just surprised that the CPU usage is so low.

JJ

yury-lysogorskiy (Member) commented Aug 28, 2024

Are you sure that 5 min is a fine enough sampling rate to see in-epoch CPU/GPU utilization? The easiest way is to ssh to the compute node and use an interactive tool like nvitop.
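For example, a joint 1-second CPU/GPU sampler along these lines would resolve the in-epoch alternation (a sketch assuming psutil and nvidia-ml-py are available on the node; nvitop gives the same view interactively):

```python
# Sketch: sample CPU and GPU utilization together every second, fast enough to
# resolve the alternating GPU/CPU stages within an epoch. psutil and
# nvidia-ml-py are assumed to be installed on the compute node.
import psutil
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(120):                         # two minutes of 1 s samples
    gpu = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
    cpu = psutil.cpu_percent(interval=1.0)   # blocks for the 1 s window
    print(f"cpu {cpu:5.1f} %   gpu {gpu:3d} %")
```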
