
Not able to run LLaVA-Next pretraining with NeMo 2.0 using container version nemo:24.12 #11741

Open
bernardhan33 opened this issue Jan 3, 2025 · 5 comments
Labels: bug (Something isn't working)

@bernardhan33

Describe the bug

I would like to run LLaVA-Next pretraining with NeMo 2.0 by following the documentation, but it fails with various errors in nemo:24.12, nemo:24.09, and nemo:dev.

Steps/Code to reproduce bug

  1. Pull the latest NeMo container version:
docker pull nvcr.io/nvidia/nemo:24.12
  2. Start the Docker container:
docker run --gpus all -it --rm --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.12
  3. Within the container, create pretrain.py and fill it with the sample code from the documentation:
from nemo.collections import vlm

finetune = vlm.llava_next_7b.pretrain_recipe(
    name="llava_next_7b_pretrain",
    dir=f"/NeMo/new-ckpts",
    num_nodes=1,
    num_gpus_per_node=8,
    language_model_from_pretrained='/NeMo/neva/checkpoints/llama-3-8b-instruct.nemo', # This is the directory where I transformed the Llama3-8b-Instruct checkpoint to .nemo format
    # Can be None or change based on local checkpoint path
)

import nemo_run as run

run.run(finetune, executor=run.LocalExecutor())
  4. Run the code:
python3 pretrain.py
  5. Got the error:
TypeError: pretrain_recipe() got an unexpected keyword argument 'language_model_from_pretrained'
  6. Confirmed from the code path /opt/NeMo/nemo/collections/vlm/recipes/llava_next_7b.py that pretrain_recipe does not accept language_model_from_pretrained (see the quick check after this list).
  7. Removed the line that specified language_model_from_pretrained and tried again. Got the error:
AttributeError: 'MockDataModule' object has no attribute 'micro_batch_size'
  8. Also tried container versions nemo:dev and nemo:24.09; both fail with:
AttributeError: module 'nemo.collections.vlm' has no attribute 'llava_next_7b'

Confirmed from the code path that the LLaVA-Next recipes do not exist yet in those versions.
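
For reference, here is a quick way to confirm the step-6 finding inside the container (a diagnostic sketch that relies only on the file path quoted above):

# No output from this grep means the shipped recipe has no such keyword argument
grep -n "language_model_from_pretrained" /opt/NeMo/nemo/collections/vlm/recipes/llava_next_7b.py
# Show the recipe's actual signature
grep -n -A 8 "def pretrain_recipe" /opt/NeMo/nemo/collections/vlm/recipes/llava_next_7b.py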

Expected behavior

I should be able to follow the public documentation and get the LLaVA-Next pretraining to run successfully.

Environment overview (please complete the following information)

  • Environment location: GCP.
  • Method of NeMo install: Docker.
  • If method of install is [Docker], provide docker pull & docker run commands used: see above.

Environment details

N/A.

Additional context

N/A.

bernardhan33 added the bug label on Jan 3, 2025
@bernardhan33 (Author) commented on Jan 3, 2025

At step 7, when we got the error

AttributeError: 'MockDataModule' object has no attribute 'micro_batch_size'

could this be a similar issue to this Stack Overflow question, where some dependency imports are getting mixed up?

@yashaswikarnati (Collaborator)

Hello, sorry for the inconvenience. This particular PR (#11424) was missed by our cherry-picking process into the release branch. While we are actively working on fixing that, could you try with ToT main? Thank you!
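
One way to try ToT main inside the already-pulled container is to replace the preinstalled copy of NeMo with a fresh clone (a sketch following the build-from-source steps linked in the NeMo README; /opt/NeMo is assumed to be where the container ships NeMo):

# Inside the nemo:24.12 container
cd /opt && rm -rf NeMo            # drop the preinstalled checkout
git clone https://github.com/NVIDIA/NeMo
cd NeMo && ./reinstall.sh         # reinstall NeMo from the main branch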

@yashaswikarnati (Collaborator)

#11783

We will be releasing a new container with the fixes soon.

@bernardhan33 (Author)

@yashaswikarnati Sorry for the late response. I have tried both options, “building a NeMo container with the Dockerfile from source” and “reinstalling the source code from within the container”, but both give different errors.

Building a NeMo container with the Dockerfile from source

# In the A3M node.
git clone git@github.com:NVIDIA/NeMo.git

cd NeMo

DOCKER_BUILDKIT=1 docker build -f Dockerfile -t nemo:latest .

Got error

(base) bernardhan_google_com@bernardhan-a3:~/NeMo$ DOCKER_BUILDKIT=1 docker build -f Dockerfile -t nemo:latest .
[+] Building 0.0s (2/2) FINISHED                                                                                                                                                    
 => [internal] load build definition from Dockerfile                                                                                                                           0.0s
 => => transferring dockerfile: 2B                                                                                                                                             0.0s
 => CANCELED [internal] load .dockerignore                                                                                                                                     0.0s
 => => transferring context:                                                                                                                                                   0.0s
failed to solve with frontend dockerfile.v0: failed to read dockerfile: open /var/lib/docker/tmp/buildkit-mount3914441233/Dockerfile: no such file or directory

Stack Overflow questions such as this one have not been helpful.

Reinstalling the source code from within the container

Attempted to reinstall the NeMo dependency from the main branch within the prebuilt nemo:24.12 container.

docker run --gpus all -it --rm -v /home/bernardhan_google_com/nemo-multimodal:/NeMo --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.12

cd /opt
rm -rf NeMo

# Follow https://github.com/NVIDIA/NeMo?tab=readme-ov-file#build-from-source
git clone https://github.com/NVIDIA/NeMo
cd NeMo
apt-get update && apt-get install -y libsndfile1 ffmpeg
./reinstall.sh
# This prints "All Done", which indicates success
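
A quick sanity check (a sketch) that the reinstall actually took effect, i.e. that the interpreter now resolves NeMo from the fresh /opt/NeMo checkout rather than a stale site-packages copy:

python3 -c "import nemo; print(nemo.__file__)"
# Expected to print a path under /opt/NeMo after the reinstall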

However, running the following code

from nemo.collections import vlm

finetune = vlm.llava_next_7b.pretrain_recipe(
    name="llava_next_7b_pretrain",
    dir=f"/NeMo/new-ckpts",
    num_nodes=1,
    num_gpus_per_node=8,
    language_model_from_pretrained='/NeMo/neva/checkpoints/llama-3-8b-instruct.nemo', # This is the directory where I transformed the Llama3-8b-Instruct checkpoint to .nemo format
    # Can be None or change based on local checkpoint path
)

import nemo_run as run

run.run(finetune, executor=run.LocalExecutor())

yields a different error

Traceback (most recent call last):
  File "/workspace/p.py", line 1, in <module>
    from nemo.collections import vlm
  File "/opt/NeMo/nemo/collections/vlm/__init__.py", line 16, in <module>
    from nemo.collections.vlm.hf.model.hf_auto_model_for_image_text_to_text import HFAutoModelForImageTextToText
  File "/opt/NeMo/nemo/collections/vlm/hf/model/hf_auto_model_for_image_text_to_text.py", line 18, in <module>
    from transformers import AutoConfig, AutoModelForImageTextToText, AutoProcessor
ImportError: cannot import name 'AutoModelForImageTextToText' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)

Could you advise further?

@yashaswikarnati (Collaborator)

Hi @bernardhan33,

The fix for the original issue you raised was pushed into this container - nvcr.io/nvidia/nemo:24.12.rc3

Re: reinstalling from source - I think ToT needs a newer version of transformers than what ships with the container. You could try: pip install transformers==4.48
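
For reference, checking and applying this inside the container could look like the following (a sketch; the 4.48 pin comes from the suggestion above and may change as ToT moves):

python3 -c "import transformers; print(transformers.__version__)"   # version shipped in the container
pip install transformers==4.48
python3 -c "from transformers import AutoModelForImageTextToText; print('import ok')"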

Re: building the Docker image from source, are you trying to build from https://github.com/NVIDIA/NeMo/blob/main/Dockerfile.ci?
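
If so, a build against that file would look roughly like this (a sketch only; run from the repo root, the image tag is arbitrary, and Dockerfile.ci may expect additional --build-arg values):

cd NeMo
DOCKER_BUILDKIT=1 docker build -f Dockerfile.ci -t nemo:latest .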

yashaswikarnati self-assigned this on Jan 23, 2025