I was following the NanoVLM tutorial at https://www.jetson-ai-lab.com/tutorial_nano-vlm.html, but the MLC quantization step keeps getting killed and I couldn't find a way around it. I am using a Jetson Orin Nano 8GB with Efficient-Large-Model/VILA-2.7b.
Input:
$ jetson-containers run $(autotag nano_llm) python3 -m nano_llm.chat --api=mlc --model Efficient-Large-Model/VILA-2.7b --max-context-len 256 --max-new-tokens 32
Output:
Namespace(packages=['nano_llm'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.3.0 JETPACK_VERSION=6.0 CUDA_VERSION=12.2
-- Finding compatible container image for ['nano_llm']
dustynv/nano_llm:r36.2.0
localuser:root being added to access control list
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Fetching 10 files: 100%|█████████████████████| 10/10 [00:00<00:00, 62601.55it/s]
Fetching 12 files: 100%|██████████████████████| 12/12 [00:00<00:00, 9336.24it/s]
13:40:23 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA-2.7b/snapshots/2ed82105eefd5926cccb46af9e71b0ca77f12704 with MLC
13:40:26 | INFO | running MLC quantization:
python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA-2.7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA-2.7b-ctx256 --use-safetensors
Using path "/data/models/mlc/dist/models/VILA-2.7b" for model "VILA-2.7b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|          | 0/197 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while.
Set new param:   0%|          | 1/327 [00:02<15:24, 2.84s/tensors]
Get old param:   1%|▏         | 2/197 [00:02<03:51, 1.19s/tensors]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/NanoLLM/nano_llm/chat/main.py", line 32, in
model = NanoLLM.from_pretrained(
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 71, in from_pretrained
model = MLCModel(model_path, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in init
quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 274, in quantize
subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA-2.7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA-2.7b-ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
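In case it helps with triage: "died with <Signals.SIGKILL: 9>" on the quantization subprocess usually means the kernel's OOM killer terminated it, which is plausible on an 8GB Orin Nano since the q4f16_ft pass has to hold the fp16 weights in memory while converting them. A minimal sketch of how I checked and worked around it on the host (the swap file path and 8G size below are my own choices, not something from the tutorial output above):

# Confirm the OOM killer fired (run on the host, not inside the container)
sudo dmesg | grep -i -E 'killed process|out of memory'

# See how much RAM/swap is currently available
free -h

# Add a swap file so the one-time quantization pass can finish
# (path /mnt/8GB.swap and size 8G are illustrative; adjust for your board)
sudo fallocate -l 8G /mnt/8GB.swap
sudo chmod 600 /mnt/8GB.swap
sudo mkswap /mnt/8GB.swap
sudo swapon /mnt/8GB.swap

# Optionally free more RAM by dropping to a text console (stops the GUI)
sudo init 3

With swap mounted, re-running the same jetson-containers command retries the build; once the quantized artifact exists under /data/models/mlc/dist, later runs should skip this step entirely.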