I tried to reproduce the complete example on a Hyperstack cloud machine (A100-80G-PCIe, OS image Ubuntu Server 22.04 LTS, R535, CUDA 12.2). Since I am using a single A100, I reduced the batch size. This command starts the training:

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28 gradient_accumulation_steps=2 batch_size=16 eval_batch_size=16 trainer=FSDPTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16
Unfortunately, training fails when saving the first checkpoint at 20,000 examples, with this stack trace:
Error executing job with overrides: ['model=pythia28', 'datasets=[hh]', 'loss=sft', 'exp_name=anthropic_dpo_pythia28', 'gradient_accumulation_steps=2', 'batch_size=16', 'eval_batch_size=16', 'trainer=FSDPTrainer', 'sample_during_eval=false', 'model.fsdp_policy_mp=bfloat16']
Traceback (most recent call last):
File "/home/ubuntu/dpo-examples/direct-preference-optimization/train.py", line 111, in main
mp.spawn(worker_main, nprocs=world_size, args=(world_size, config, policy, reference_model), join=True)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/train.py", line 44, in worker_main
trainer.train()
File "/home/ubuntu/dpo-examples/direct-preference-optimization/trainers.py", line 352, in train
self.save(output_dir, mean_eval_metrics)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/trainers.py", line 501, in save
policy_state_dict = self.policy.state_dict()
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1722, in _save_to_state_dict
hook(self, prefix, keep_vars)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 669, in _pre_state_dict_hook
_pre_state_dict_hook_fn[fsdp_state._state_dict_type](
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 271, in _full_pre_state_dict_hook
_common_unshard_pre_state_dict_hook(
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 143, in _common_unshard_pre_state_dict_hook
_enter_unshard_params_ctx(
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 109, in _enter_unshard_params_ctx
fsdp_state._unshard_params_ctx[module].__enter__()
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 171, in _unshard_fsdp_state_params
_validate_unshard_params_args(
File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 140, in _validate_unshard_params_args
raise NotImplementedError(
NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
It seems that either I installed an incompatible version of a library, or an incompatible library came with the preinstalled cloud image?
What is funny is that I tried to install only the Python dependencies available by 2023-06-22 (the date of the last commit to requirements.txt) using pypi-timemachine, but it seems I still failed somewhere. Here are the versions on my cloud machine:
After some tinkering, I will answer my own question:
There was no problem with the libraries; the versions above are fine. The error ("NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet") only happens when I use a single A100 GPU. I switched to a 4 x A100 cluster, the error disappeared, and SFT training worked.
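For context: with world_size == 1, FSDP effectively runs the NO_SHARD strategy, and gathering a full state dict with offload_to_cpu=True is what raises the NotImplementedError in the save path. A minimal sketch of a possible single-GPU workaround, assuming the save code uses torch's FullStateDictConfig (the helper name save_policy_state is hypothetical, not from the repo):

```python
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def save_policy_state(policy):
    """Gather a full state dict, skipping CPU offload on a single GPU.

    With world_size == 1, FSDP falls back to NO_SHARD, where
    offload_to_cpu=True raises NotImplementedError.
    """
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    cfg = FullStateDictConfig(offload_to_cpu=(world_size > 1), rank0_only=True)
    with FSDP.state_dict_type(policy, StateDictType.FULL_STATE_DICT, cfg):
        return policy.state_dict()
```

This is only a sketch of the idea; switching to a multi-GPU cluster, as described above, avoids the problem without code changes.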
However, this brings me to another issue: SFT training is noticeably slower than expected. README.md says:
a machine with 4 80GB A100s; on this hardware, SFT takes about 1hr 30min
However, on my 4 x A100 PCIe cluster, after 2 hours of training SFT had only reached example 23,872 out of 160k.
Does anyone know why my SFT performance differs so much from the official time?
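For scale, a back-of-the-envelope comparison, assuming the README's 1h30min figure covers the full ~160k-example SFT run:

```python
# Numbers from the README and from this run (4 x A100 PCIe).
reported_examples, reported_hours = 160_000, 1.5   # README: "about 1hr 30min"
observed_examples, observed_hours = 23_872, 2.0    # this run, after 2 hours

reported_rate = reported_examples / reported_hours  # examples per hour
observed_rate = observed_examples / observed_hours
slowdown = reported_rate / observed_rate
print(f"observed throughput: {observed_rate:.0f} examples/h")
print(f"slowdown vs README: {slowdown:.1f}x")  # roughly 8.9x
```

So this is roughly a 9x throughput gap, not a small constant overhead.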