Improving the GPU usage efficiency during training #75

Open · WJiangH opened this issue Aug 28, 2024 · 3 comments
WJiangH commented Aug 28, 2024

Dear authors,

When I use a Tesla T4 GPU to train a model, I find that the node utilization is quite low.

For example:

Job is running on nodes: inf018

Node utilization is:
    node  cores   load    pct      mem     used    pct
  inf018     32    1.6    5.1  187.5GB   43.2GB   23.1

Only 5.1% of the CPU cores are being used. While the time/eval of ~70 mcs/at (microseconds per atom) for 640 atomic functions is acceptable, I think the usage efficiency can be improved.

Here are more details about the node:

NodeName=inf018 Arch=x86_64 CoresPerSocket=16 
   CPUAlloc=28 CPUEfctv=32 CPUTot=32 CPULoad=1.62
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:turing:1
   NodeAddr=inf018 NodeHostName=inf018 Version=23.02.7
   OS=Linux 3.10.0-1160.90.1.el7.x86_64 #1 SMP Thu May 4 15:21:22 UTC 2023 
   RealMemory=191960 AllocMem=159488 FreeMem=147676 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=t4_dev_q 
   BootTime=2024-05-20T09:17:14 SlurmdStartTime=2024-05-20T09:24:21
   LastBusyTime=2024-08-28T09:03:28 ResumeAfterTime=None
   CfgTRES=cpu=32,mem=191960M,billing=32,gres/gpu=1,gres/gpu:turing=1
   AllocTRES=cpu=28,mem=159488M,gres/gpu=1,gres/gpu:turing=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Since BFGS runs entirely on the CPU, I expect that improving the CPU usage could shorten the training time. Also, would increasing the number of GPUs shorten the training time?

Best,
JJ

yury-lysogorskiy (Member) commented Aug 28, 2024
There are two stages:

  1. a GPU stage with gradient accumulation
  2. a CPU stage with the BFGS update

Depending on the moment at which you check your usage, you will see either low CPU or low GPU usage. Use nvitop to see both simultaneously, along with their history. (A toy sketch of this alternating pattern is given below.)
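A minimal toy sketch of the alternating two-stage pattern, using scipy's BFGS on a synthetic least-squares loss. This is an illustration of the general loop only, not pacemaker's actual code:

```python
# Toy illustration: stage 1 accumulates the loss and its gradient over batches
# (in pacemaker this part runs on the GPU); stage 2 is the serial, CPU-only
# BFGS update inside scipy. Not pacemaker's actual implementation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
batches = [rng.normal(size=(64, 3)) for _ in range(10)]  # synthetic data

def loss_and_grad(w):
    total_loss, total_grad = 0.0, np.zeros_like(w)
    for x in batches:                  # "GPU stage": batch-wise accumulation
        r = x @ w - 1.0
        total_loss += 0.5 * float(r @ r)
        total_grad += x.T @ r
    return total_loss, total_grad

# "CPU stage": BFGS consumes the accumulated loss/gradient once per iteration
res = minimize(loss_and_grad, np.zeros(3), jac=True, method="BFGS")
print(res.x, res.nit)
```

While the batch loop runs, only the accelerator is busy; while the BFGS update runs, only one CPU thread is busy, which is why instantaneous utilization snapshots swing between the two extremes.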

pacemaker uses only a single GPU. If you want a speed-up (both knobs are sketched in the snippet below):

  1. try to increase the batch size as much as possible (to fully utilize the GPU);
  2. maybe switch from BFGS to the L-BFGS-B optimization algorithm. But it is less efficient, i.e. it requires more steps, so the overall walltime can end up the same.
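For reference, a minimal sketch of adjusting these knobs in pacemaker's input.yaml. The key names (`backend.batch_size`, `fit.optimizer`) are assumptions based on the usual input.yaml layout; check them against the input file of your pacemaker version:

```python
# Sketch: bump the batch size and switch the optimizer in input.yaml.
# Key names are assumed, not verified against every pacemaker version.
import yaml

with open("input.yaml") as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("backend", {})["batch_size"] = 1000   # raise until GPU memory is nearly full
cfg.setdefault("fit", {})["optimizer"] = "L-BFGS-B"  # cheaper steps, but possibly more of them

with open("input.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```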

WJiangH (Author) commented Aug 28, 2024

> There are two stages:
>
>   1. a GPU stage with gradient accumulation
>   2. a CPU stage with the BFGS update
>
> Depending on the moment at which you check your usage, you will see either low CPU or low GPU usage. Use nvitop to see both simultaneously, along with their history.
>
> pacemaker uses only a single GPU. If you want a speed-up:
>
>   1. try to increase the batch size as much as possible (to fully utilize the GPU);
>   2. maybe switch from BFGS to the L-BFGS-B optimization algorithm. But it is less efficient, i.e. it requires more steps, so the overall walltime can end up the same.

I logged the CPU and GPU utilization every 5 minutes; the GPU is actually fully used:

2024/08/28 10:16:53.515, Tesla T4, 0 %, 0 %, 15360 MiB, 15101 MiB, 0 MiB
2024/08/28 10:21:53.516, Tesla T4, 100 %, 64 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:26:53.518, Tesla T4, 96 %, 61 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:31:53.519, Tesla T4, 68 %, 23 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:36:53.520, Tesla T4, 94 %, 72 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:41:53.521, Tesla T4, 93 %, 54 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:46:53.522, Tesla T4, 79 %, 32 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:51:53.522, Tesla T4, 64 %, 43 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 10:56:53.524, Tesla T4, 75 %, 32 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:01:53.525, Tesla T4, 91 %, 67 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:06:53.526, Tesla T4, 77 %, 53 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:11:53.527, Tesla T4, 79 %, 39 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:16:53.528, Tesla T4, 83 %, 61 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:21:53.529, Tesla T4, 96 %, 74 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:26:53.529, Tesla T4, 77 %, 30 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:31:53.530, Tesla T4, 92 %, 55 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:36:53.531, Tesla T4, 83 %, 33 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:41:53.531, Tesla T4, 70 %, 32 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:46:53.533, Tesla T4, 74 %, 34 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:51:53.534, Tesla T4, 80 %, 34 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 11:56:53.535, Tesla T4, 79 %, 55 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 12:01:53.536, Tesla T4, 73 %, 30 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 12:06:53.536, Tesla T4, 77 %, 50 %, 15360 MiB, 851 MiB, 14251 MiB
2024/08/28 12:11:53.537, Tesla T4, 95 %, 56 %, 15360 MiB, 851 MiB, 14251 MiB
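(A CSV log in this format can be produced via NVML; the sketch below, using the nvidia-ml-py bindings, is an assumption about the tooling rather than the exact command used here:)

```python
# Sketch: reproduce the columns above (timestamp, name, GPU util, memory util,
# total/used/free memory) via NVML, sampling every 300 s. Assumes the
# nvidia-ml-py package is installed; not necessarily the logger used above.
import time
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(h)
if isinstance(name, bytes):  # older bindings return bytes
    name = name.decode()
while True:
    u = pynvml.nvmlDeviceGetUtilizationRates(h)
    m = pynvml.nvmlDeviceGetMemoryInfo(h)
    print(f"{time.strftime('%Y/%m/%d %H:%M:%S')}, {name}, {u.gpu} %, {u.memory} %, "
          f"{m.total >> 20} MiB, {m.used >> 20} MiB, {m.free >> 20} MiB")
    time.sleep(300)
```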

But the CPU is under-utilized. For example:

Linux 3.10.0-1160.90.1.el7.x86_64 (inf018)      08/28/2024      _x86_64_        (32 CPU)
12:12:16 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:12:17 PM  all    6.67    0.00    1.19    0.00    0.00    0.00    0.00    0.00    0.00   92.14
12:12:17 PM    0    6.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   93.00
12:12:17 PM    1    4.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM    2    5.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   94.00
12:12:17 PM    3   10.10    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   88.89
12:12:17 PM    4    7.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   92.00
12:12:17 PM    5    8.82    0.00    1.96    0.00    0.00    0.00    0.00    0.00    0.00   89.22
12:12:17 PM    6    6.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   93.00
12:12:17 PM    7    6.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   93.00
12:12:17 PM    8    5.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00    0.00   93.00
12:12:17 PM    9    5.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM   10    2.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   97.00
12:12:17 PM   11    4.04    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   94.95
12:12:17 PM   12    4.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   96.00
12:12:17 PM   13    5.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM   14    5.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   94.00
12:12:17 PM   15    4.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM   16   20.41    0.00    5.10    0.00    0.00    0.00    0.00    0.00    0.00   74.49
12:12:17 PM   17    4.95    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   94.06
12:12:17 PM   18    4.04    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.96
12:12:17 PM   19   11.22    0.00    2.04    0.00    0.00    0.00    0.00    0.00    0.00   86.73
12:12:17 PM   20    8.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   91.00
12:12:17 PM   21    3.96    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   95.05
12:12:17 PM   22    8.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   91.00
12:12:17 PM   23    5.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.00
12:12:17 PM   24    4.08    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.92
12:12:17 PM   25    5.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   94.06
12:12:17 PM   26   15.00    0.00    6.00    0.00    0.00    0.00    0.00    0.00    0.00   79.00
12:12:17 PM   27    6.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   94.00
12:12:17 PM   28   14.00    0.00    3.00    0.00    0.00    0.00    0.00    0.00    0.00   83.00
12:12:17 PM   29    8.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   91.00
12:12:17 PM   30    4.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   96.00
12:12:17 PM   31    4.04    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   95.96

I can probably try L-BFGS-B later; I was just surprised that the CPU usage is so low.

JJ

yury-lysogorskiy (Member) commented Aug 28, 2024

Are you sure that 5 min is a fine enough sampling rate to see in-epoch CPU/GPU utilization? The easiest way is to ssh to the compute node and use an interactive tool like nvitop.
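For example, a joint 1-second CPU/GPU sampler along these lines would resolve the in-epoch alternation (a sketch assuming psutil and nvidia-ml-py are available on the node; nvitop gives the same view interactively):

```python
# Sketch: sample CPU and GPU utilization together every second, fast enough to
# resolve the alternating GPU/CPU stages within an epoch. psutil and
# nvidia-ml-py are assumed to be installed on the compute node.
import psutil
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(120):                         # two minutes of 1 s samples
    gpu = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
    cpu = psutil.cpu_percent(interval=1.0)   # blocks for the 1 s window
    print(f"cpu {cpu:5.1f} %   gpu {gpu:3d} %")
```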
