
When trying to reproduce the complete example, "NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet" is thrown #91

Open · ZSvedic opened this issue Nov 12, 2024 · 1 comment


ZSvedic commented Nov 12, 2024

I tried to reproduce the complete example on a Hyperstack cloud machine (A100-80G-PCIe, OS Image Ubuntu Server 22.04 LTS, R535 CUDA 12.2). Since I was using a single A100, I reduced the batch size. This command starts the training:
python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28 gradient_accumulation_steps=2 batch_size=16 eval_batch_size=16 trainer=FSDPTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16

Unfortunately, training fails when saving the first checkpoint at 20,000 examples, with this stack trace:

Error executing job with overrides: ['model=pythia28', 'datasets=[hh]', 'loss=sft', 'exp_name=anthropic_dpo_pythia28', 'gradient_accumulation_steps=2', 'batch_size=16', 'eval_batch_size=16', 'trainer=FSDPTrainer', 'sample_during_eval=false', 'model.fsdp_policy_mp=bfloat16']
Traceback (most recent call last):
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/train.py", line 111, in main
    mp.spawn(worker_main, nprocs=world_size, args=(world_size, config, policy, reference_model), join=True)
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/train.py", line 44, in worker_main
    trainer.train()
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/trainers.py", line 352, in train
    self.save(output_dir, mean_eval_metrics)
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/trainers.py", line 501, in save
    policy_state_dict = self.policy.state_dict()
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1722, in _save_to_state_dict
    hook(self, prefix, keep_vars)
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 669, in _pre_state_dict_hook
    _pre_state_dict_hook_fn[fsdp_state._state_dict_type](
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 271, in _full_pre_state_dict_hook
    _common_unshard_pre_state_dict_hook(
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 143, in _common_unshard_pre_state_dict_hook
    _enter_unshard_params_ctx(
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 109, in _enter_unshard_params_ctx
    fsdp_state._unshard_params_ctx[module].__enter__()
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 171, in _unshard_fsdp_state_params
    _validate_unshard_params_args(
  File "/home/ubuntu/dpo-examples/direct-preference-optimization/.venv-20230622/lib/python3.10/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 140, in _validate_unshard_params_args
    raise NotImplementedError(
NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

It seems that either I installed an incompatible version of a library, or an incompatible library came with the preinstalled cloud image.
The funny thing is that I tried to install only the Python dependencies available by 2023-06-22 (the date of the last commit to requirements.txt) using pypi-timemachine; a usage sketch follows the version list below. It seems I still got something wrong somewhere. Here are the versions on my cloud machine:

accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
asttokens==2.2.1
async-timeout==4.0.2
attrs==23.1.0
backcall==0.2.0
beautifulsoup4==4.12.2
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.4
comm==0.1.3
datasets==2.12.0
debugpy==1.6.7
decorator==5.1.1
dill==0.3.6
docker-pycreds==0.4.0
executing==1.2.0
filelock==3.12.2
frozenlist==1.3.3
fsspec==2023.6.0
gitdb==4.0.10
GitPython==3.1.31
huggingface-hub==0.15.1
hydra-core==1.3.2
idna==3.4
ipykernel==6.23.1
ipython==8.14.0
jedi==0.18.2
Jinja2==3.1.2
jupyter_client==8.2.0
jupyter_core==5.3.1
lit==16.0.6
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
nest-asyncio==1.5.6
networkx==3.1
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
omegaconf==2.3.0
packaging==23.1
pandas==2.0.2
parso==0.8.3
pathtools==0.1.2
peft==0.3.0
pexpect==4.8.0
pickleshare==0.7.5
pip==22.0.2
platformdirs==3.7.0
prompt-toolkit==3.0.38
protobuf==4.23.3
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==12.0.1
Pygments==2.15.1
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0
pyzmq==25.1.0
regex==2023.6.3
requests==2.31.0
responses==0.18.0
sentry-sdk==1.25.1
setproctitle==1.3.2
setuptools==59.6.0
six==1.16.0
smmap==5.0.0
soupsieve==2.4.1
stack-data==0.6.2
sympy==1.12
tensor-parallel==1.2.4
tokenizers==0.13.3
torch==2.0.1
tornado==6.3.2
tqdm==4.65.0
traitlets==5.9.0
transformers==4.29.2
triton==2.0.0
typing_extensions==4.6.3
tzdata==2023.3
urllib3==2.0.3
wandb==0.15.3
wcwidth==0.2.6
wheel==0.40.0
xxhash==3.2.0
yarl==1.9.2
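
For reference, the pinning was done roughly like this (a sketch of typical pypi-timemachine usage, not the exact commands from my shell history; the proxy prints its port when it starts):

# Install the time-machine proxy and start it pinned to the requirements.txt commit date.
pip install pypi-timemachine
pypi-timemachine 2023-06-22
# In another shell, install through the proxy so only pre-2023-06-22 releases are visible.
# Replace <port> with the port printed by the command above.
pip install --index-url http://localhost:<port>/ -r requirements.txt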

Do any of you good souls know what is wrong, and which library and version are causing the problem?


ZSvedic commented Nov 19, 2024

After some tinkering, I will answer my own question:

There was no problem with the libraries; the versions above are fine. The error ("NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet") only happens when using a single A100 GPU. I switched to a 4 x A100 cluster and the error disappeared; SFT was working.
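
For anyone else hitting this: with world_size=1 there is nothing to shard, so FSDP falls back to the NO_SHARD strategy, and a full-state-dict save with offload_to_cpu=True (which the stack trace shows being requested) is exactly the unsupported combination. A minimal sketch of a possible single-GPU workaround, assuming the save path sets the state dict type via the standard torch 2.0 FSDP API (I haven't verified this against the repo's exact code):

import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

# offload_to_cpu=True is rejected when the NO_SHARD strategy is active (torch 2.0.1),
# so only request the CPU offload when the model is actually sharded across ranks.
offload = dist.get_world_size() > 1
cfg = FullStateDictConfig(offload_to_cpu=offload, rank0_only=True)
with FSDP.state_dict_type(policy, StateDictType.FULL_STATE_DICT, cfg):
    policy_state_dict = policy.state_dict()  # gathers the full (unsharded) state dict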

However, this brings me to another issue: SFT training is noticeably slower than expected. README.md says:

"a machine with 4 80GB A100s; on this hardware, SFT takes about 1hr 30min"

Yet on my 4 x A100 PCIe cluster, after 2h of training SFT had only reached example 23872 out of the 160k examples.
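
A back-of-the-envelope comparison (assuming the README's 1h30 figure covers the full 160k-example pass):

readme_rate = 160_000 / (1.5 * 3600)  # ~29.6 examples/s implied by the README
my_rate = 23_872 / (2.0 * 3600)       # ~3.3 examples/s observed on the PCIe cluster
print(f"{readme_rate / my_rate:.0f}x slower")  # ~9x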

Does anyone know why my SFT performance differs so much from the official time?
