
trt-modelopt is not compatible with pytorch FSDP? #139

Open
Vieeo opened this issue Feb 21, 2025 · 3 comments

Comments


Vieeo commented Feb 21, 2025

How can I solve this?

kevalmorabia97 (Collaborator) commented

Can you please share more details on what your use case is and what errors, if any, you observe?


Vieeo commented Feb 25, 2025

> Can you please share more details on what your use case is and what errors, if any, you observe?

Basic version info:

- Python 3.12.0
- PyTorch 2.5.0
- nvidia-modelopt 0.21.0
- CUDA 12.6

I'm training a Flux model with sparsity; the forward pass seems fine, but the backward pass fails.

This is the FSDP config used with Accelerate:

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_min_num_params: 1000000
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
```
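For reference, here is a rough sketch of what this Accelerate config translates to in raw PyTorch FSDP terms. It is an approximation for illustration, not Accelerate's actual wrapping code, and it assumes a distributed process group is already initialized; `flux` is the model name used elsewhere in this thread.

```python
# Hedged sketch: approximately what the Accelerate config above maps to in
# raw PyTorch FSDP. Assumes torch.distributed is already initialized and
# `flux` is the model being trained (name taken from this thread).
import functools

from torch.distributed.fsdp import (
    BackwardPrefetch,
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

wrapped_flux = FSDP(
    flux,
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000  # fsdp_min_num_params
    ),
    sharding_strategy=ShardingStrategy.FULL_SHARD,    # fsdp_sharding_strategy
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # fsdp_backward_prefetch
    forward_prefetch=True,                            # fsdp_forward_prefetch
    cpu_offload=CPUOffload(offload_params=True),      # fsdp_offload_params
    sync_module_states=True,                          # fsdp_sync_module_states
    use_orig_params=True,                             # fsdp_use_orig_params
)
```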

```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/train_flux.py", line 438, in <module>
[rank1]:     main()
[rank1]:   File "/data/train_flux.py", line 374, in main
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/accelerate/accelerator.py", line 2196, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/_tensor.py", line 581, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 734, in _post_backward_hook
[rank1]:     handle._use_unsharded_grad_views()
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/distributed/fsdp/_flat_param.py", line 1982, in _use_unsharded_grad_views
[rank1]:     hasattr(module, param_name),
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 806, in __getattr__
[rank1]:     return manager.get_da_cb(name)(self, value)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 83, in __call__
[rank1]:     val = cb(self_module, val)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/sparsity/module.py", line 34, in _get_weight
[rank1]:     masked_weight = weight * mod._weight_mask
[rank1]:                     ~~~~~~~^~~~~~~~~~~~~~~~~~
[rank1]: RuntimeError: The size of tensor a (2360064) must match the size of tensor b (3072) at non-singleton dimension 1
```
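The mismatch in the last frame can be reproduced in isolation: FSDP stores the weights of all wrapped submodules in a single flattened 1-D `FlatParameter`, while ModelOpt's sparsity mask keeps the original per-module 2-D shape, so the elementwise multiply in `_get_weight` cannot broadcast. A minimal sketch (the tensor shapes are illustrative, taken from the error message above; this is not ModelOpt's actual code):

```python
import torch

# FSDP's FlatParameter is a single flattened 1-D tensor covering all wrapped weights.
flat_weight = torch.randn(2_360_064)
# The sparsity mask registered by ModelOpt keeps the per-module 2-D shape.
weight_mask = torch.ones(3072, 3072)

# Broadcasting (2360064,) against (3072, 3072) fails at dimension 1,
# matching the RuntimeError in the traceback above.
try:
    masked_weight = flat_weight * weight_mask
except RuntimeError as e:
    print(e)  # The size of tensor a (2360064) must match the size of tensor b (3072) ...
```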


Vieeo commented Feb 26, 2025

When I do the following:

```python
import modelopt.torch.opt as mto

flux = mto.restore(flux, sparse_ckpt)
flux = accelerator.prepare_model(flux)
print(flux)
```

The error is as follows:

```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/train_flux.py", line 447, in <module>
[rank1]:     main()
[rank1]:   File "/data/train_flux.py", line 165, in main
[rank1]:     print(dit)
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2943, in __repr__
[rank1]:     mod_str = repr(module)
[rank1]:               ^^^^^^^^^^^^
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2943, in __repr__
[rank1]:     mod_str = repr(module)
[rank1]:               ^^^^^^^^^^^^
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2937, in __repr__
[rank1]:     extra_repr = self.extra_repr()
[rank1]:                  ^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 861, in extra_repr
[rank1]:     val = getattr(self, name)
[rank1]:           ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 806, in __getattr__
[rank1]:     return manager.get_da_cb(name)(self, value)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 83, in __call__
[rank1]:     val = cb(self_module, val)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/sparsity/module.py", line 35, in _get_weight
[rank1]:     masked_weight = weight * mod._weight_mask
[rank1]:                     ~~~~~~~^~~~~~~~~~~~~~~~~~
[rank1]: RuntimeError: The size of tensor a (0) must match the size of tensor b (64) at non-singleton dimension 1
```
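This second error is consistent with the same root cause surfacing at `repr()` time: under `FULL_SHARD`, the unsharded parameter is freed outside of forward/backward, so the weight view ModelOpt multiplies has zero elements while the mask still has its real shape. A hedged workaround sketch for the printing step only (it does not address the backward-pass failure; `flux` is assumed to be the FSDP-wrapped model from the snippet above):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Temporarily gather the full (unsharded) parameters so attribute access
# inside repr() sees real-sized tensors instead of freed, zero-numel views.
with FSDP.summon_full_params(flux):
    print(flux)
```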
