[Bug] 2nd_finetune for internvl2.5 using lora ('resource_tracker: There appear to be %d ') #845

lzk9508 opened this issue Jan 12, 2025 · 1 comment
lzk9508 commented Jan 12, 2025


  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

The error occurs at this step:
[2025-01-12 17:52:25,277] [INFO] [] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:33,530] [INFO] [] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:41,724] [INFO] [] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:49,987] [INFO] [] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:58,392] [INFO] [] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 2020


Reproduction

set -x


export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export LAUNCHER=pytorch


if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi

# number of gpus: 2
# batch size per gpu: 4
# gradient accumulation steps: 2
# total batch size: 16
# epoch: 1

--model_name_or_path "/245_disk/mb_train_sft_1121/InternVL2_5-8B-MPO"
--conv_style "internvl2_5"
--use_fast_tokenizer False
--output_dir ${OUTPUT_DIR}
--meta_path "./shell/data/mb_train_sft_250110.json"
--overwrite_output_dir True
--force_image_size 448
--max_dynamic_patch 12
--down_sample_ratio 0.5
--drop_path_rate 0.0
--freeze_llm True
--freeze_mlp False
--freeze_backbone True
--use_llm_lora 8
--vision_select_layer -1
--dataloader_num_workers 16
--bf16 True
--num_train_epochs 3
--per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE}
--gradient_accumulation_steps ${GRADIENT_ACC}
--evaluation_strategy "no"
--save_strategy "steps"
--save_steps 200
--save_total_limit 1
--learning_rate 4e-5
--weight_decay 0.05
--warmup_ratio 0.03
--lr_scheduler_type "cosine"
--logging_steps 1
--max_seq_length 8192
--do_train True
--grad_checkpoint True
--group_by_length True
--dynamic_image_size True
--use_thumbnail True
--ps_version 'v2'
--deepspeed "zero_stage1_config.json"
--report_to "tensorboard"
2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
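
As a sanity check on the batch-size comments in the script, the total batch size should equal the product of the other three values. A minimal sketch in Python; `effective_batch_size` is a hypothetical helper written for this report, not part of the InternVL codebase:

```python
def effective_batch_size(num_gpus: int, per_device_batch_size: int, grad_accum_steps: int) -> int:
    """Number of samples that contribute to one optimizer step across all GPUs."""
    return num_gpus * per_device_batch_size * grad_accum_steps


# Values taken from the script comments: 2 GPUs, 4 samples per GPU, 2 accumulation steps.
total = effective_batch_size(num_gpus=2, per_device_batch_size=4, grad_accum_steps=2)
print(total)
```

With the values above this matches the stated total batch size of 16.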


Environment

absl-py                   2.0.0
accelerate                0.33.0
adabench                  1.2.64
aiofiles                  23.2.1
aiohttp                   3.9.1
aiosignal                 1.3.1
aistudio-checkpoint       0.1.241220
aistudio-notebook         2.0.128
alipay-pcache             0.1.6
aliyun-python-sdk-core    2.14.0
aliyun-python-sdk-kms     2.16.2
altair                    5.2.0
annotated-types           0.6.0
ant-couler                0.0.1rc17
anyio                     4.2.0
apex                      0.1
archspec                  0.2.1
argo-workflows            3.5.1
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
astroid                   3.0.2
asttokens                 2.4.1
async-timeout             4.0.3
atorch                    1.2.1
attrs                     23.1.0
autopep8                  2.0.4
Babel                     2.15.0
backcall                  0.2.0
beautifulsoup4            4.12.2
bigmodelvis               0.0.1
bitarray                  2.8.5
bitsandbytes              0.42.0
bleach                    6.1.0
blinker                   1.7.0
boltons                   23.0.0
boto3                     1.34.2
botocore                  1.34.2
Brotli                    1.0.9
cachetools                3.1.1
cattrs                    23.2.3
certifi                   2023.11.17
cffi                      1.16.0
charset-normalizer        2.0.4
cheroot                   10.0.0
click                     6.7
click-config-file         0.6.0
cloudpickle               3.0.0
colorama                  0.4.6
comm                      0.2.1
conda                     23.11.0
conda-content-trust       0.2.0
conda-libmamba-solver     23.12.0
conda-package-handling    2.2.0
conda_package_streaming   0.9.0
configobj                 5.0.8
configparser              6.0.0
contourpy                 1.1.1
couler-core               0.1.1rc11
crcmod                    1.7
cryptography              41.0.7
cycler                    0.12.1
Cython                    3.0.6
datasets                  2.15.0
debugpy                   1.8.0
decorator                 5.1.1
decord                    0.6.0
deepspeed                 0.15.4
defusedxml                0.7.1
delta-center-client       0.0.4
Deprecated                1.2.14
deprecation               2.1.0
dill                      0.3.7
distlib                   0.3.8
distro                    1.8.0
dlrover                   0.3.6
docker                    4.1.0
docstring-to-markdown     0.13
easydl-sdk                0.0.6
einops                    0.7.0
entrypoints               0.4
evaluate                  0.4.0
exceptiongroup            1.2.0
executing                 2.0.1
fairscale                 0.4.1
fastapi                   0.108.0
fastjsonschema            2.19.1
fastmoe                   1.0.0
fasttext                  0.9.2
fe                        0.3.33
ffmpy                     0.3.1
filelock                  3.13.1
flake8                    6.1.0
flash-attn                2.0.4
flash-attn-1              0.2.6.post2
Flask                     3.0.0
fonttools                 4.46.0
fqdn                      1.5.1
frozenlist                1.4.1
fsspec                    2023.10.0
ftfy                      6.1.3
gitdb                     4.0.11
GitPython                 3.1.40
google-auth               2.25.2
google-auth-oauthlib      0.4.6
gradio                    4.13.0
gradio_client             0.8.0
grpcio                    1.34.1
grpcio-tools              1.34.1
h11                       0.14.0
hjson                     3.1.0
httpcore                  1.0.2
httpx                     0.26.0
huggingface-hub           0.27.0
icetk                     0.0.7
idna                      3.4
imageio                   2.35.1
importlib-metadata        7.0.0
importlib-resources       6.1.1
iniconfig                 2.0.0
ipykernel                 6.29.5
ipython                   8.12.3
ipython-genutils          0.2.0
isodate                   0.6.1
isoduration               20.11.0
isort                     5.13.2
itsdangerous              2.1.2
jaraco.functools          4.0.0
jedi                      0.19.1
jedi-language-server      0.41.2
Jinja2                    2.11.3
jinjasql                  0.1.8
jmespath                  0.10.0
joblib                    1.3.2
json5                     0.9.25
jsonpatch                 1.32
jsonpath-ng               1.6.0
jsonpointer               2.1
jsonschema                4.20.0
jsonschema-specifications 2023.11.2
jupyter_client            8.6.0
jupyter_core              5.7.1
jupyter-events            0.9.0
jupyter-lsp               2.2.5
jupyter_server            2.14.2
jupyter_server_terminals  0.5.1
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.3
kiwisolver                1.4.5
kmitool                   0.0.9
kubemaker                 0.2.17
kubernetes                9.0.0
langdetect                1.0.9
libmambapy                1.5.3
libro                     0.1.11
loralib                   0.1.1
lsh                       0.1.2
lsprotocol                2023.0.0
lxml                      4.9.3
M2Crypto                  0.38.0
Markdown                  3.5.1
markdown-it-py            3.0.0
MarkupSafe                1.1.1
marshmallow               3.20.1
matplotlib                3.7.4
matplotlib-inline         0.1.6
mccabe                    0.7.0
mdurl                     0.1.2
megatron.core             0.1
menuinst                  2.0.1
mistune                   0.8.4
mock                      5.1.0
more-itertools            10.1.0
mpi4py                    3.1.5
mpmath                    1.3.0
msgpack                   1.0.7
multidict                 6.0.4
multiprocess              0.70.15
nbclient                  0.5.13
nbconvert                 6.4.4
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  3.0
nltk                      3.8.1
notebook                  6.4.6
numpy                     1.23.5
nvidia-ml-py              12.560.30
oauthlib                  3.2.2
odps                      3.5.1
opendelta                 0.3.2
orjson                    3.9.10
oss2                      2.6.0
osscmd                    0.4.5
overrides                 7.7.0
packaging                 23.1
pandas                    1.0.0
pandocfilters             1.5.0
parameterized             0.9.0
parso                     0.8.3
pathos                    0.3.0
peft                      0.3.0
peppercorn                0.6
pexpect                   4.9.0
pickleshare               0.7.5
Pillow                    9.3.0
pip                       23.3.1
pkgutil_resolve_name      1.3.10
platformdirs              3.10.0
pluggy                    1.0.0
ply                       3.11
pox                       0.3.3
prettytable               3.9.0
prometheus-client         0.19.0
prompt-toolkit            3.0.43
protobuf                  3.20.0
psutil                    5.9.6
PTable                    0.9.2
ptyprocess                0.7.0
pure-eval                 0.2.2
py                        1.11.0
py-cpuinfo                9.0.0
py-spy                    0.3.14
pyaml                     21.10.1
pyarrow                   12.0.0
pyarrow-hotfix            0.6
pyasn1                    0.5.1
pyasn1-modules            0.3.0
pybind11                  2.11.1
pycodestyle               2.11.1
pycosat                   0.6.6
pycparser                 2.21
pycryptodome              3.19.0
pydantic                  2.10.4
pydantic_core             2.27.2
pyDes                     2.0.1
pydocstyle                6.3.0
pydub                     0.25.1
pyflakes                  3.1.0
pygls                     1.2.1
Pygments                  2.17.2
pyhocon                   0.3.60
pyinotify                 0.9.6
pylint                    3.0.3
pynvml                    11.4.1
Pyomo                     6.7.0
pyOpenSSL                 23.2.0
pyparsing                 3.1.1
PySocks                   1.7.1
pytest                    7.4.3
python-dateutil           2.8.2
python-json-logger        2.0.7
python-lsp-jsonrpc        1.1.2
python-lsp-server         1.9.0
python-multipart          0.0.6
pytoolconfig              1.2.6
pytz                      2023.3.post1
PyWavelets                1.4.1
PyYAML                    6.0.1
pyzmq                     25.1.2
ray                       2.9.0
referencing               0.32.0
regex                     2023.10.3
requests                  2.31.0
requests-file             1.5.1
requests-oauthlib         1.3.1
requests-toolbelt         1.0.0
responses                 0.18.0
retry                     0.9.2
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      13.7.0
rope                      1.11.0
rouge-chinese             1.0.3
rouge-score               0.1.2
rpds-py                   0.14.1
rsa                       4.9
ruamel.yaml               0.16.10
ruamel.yaml.clib          0.2.6
ruff                      0.1.11
ruff-lsp                  0.0.54
s3transfer                0.9.0
safetensors               0.4.1
scikit-learn              1.3.2
scipy                     1.10.1
semantic-version          2.10.0
Send2Trash                1.8.2
sentencepiece             0.1.97
setuptools                68.2.2
shellingham               1.5.4
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.0
snowballstemmer           2.2.0
soupsieve                 2.5
sqlparse                  0.4.4
stack-data                0.6.3
starlette                 0.32.0.post1
stringcase                1.2.0
StringGenerator           0.4.4
sympy                     1.12
tabulate                  0.8.2
tensorboard               2.11.0
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
tensorboardX              2.6
termcolor                 2.4.0
terminado                 0.18.0
testpath                  0.6.0
threadpoolctl             3.2.0
timm                      1.0.9
tinycss2                  1.2.1
titans                    0.0.7
tldextract                5.1.1
tokenizers                0.19.1
tomli                     2.0.1
tomlkit                   0.12.0
toolz                     0.12.0
torch                     2.1.0+cu121
torchaudio                2.1.0+cu121
torchpippy                0.1.1+cecc4fc
torchvision               0.16.0+cu121
tornado                   6.4
tqdm                      4.65.0
traitlets                 5.14.1
transformers              4.40.0
triton                    2.1.0
typer                     0.9.0
typing_extensions         4.12.2
tzdata                    2023.3
ujson                     5.9.0
uncertainty-calibration   0.1.4
Unidecode                 1.3.7
unifile-sdk               0.1.14
uri-template              1.3.0
urllib3                   1.26.18
uvicorn                   0.25.0
virtualenv                20.25.0
watchdog                  2.3.1
wcwidth                   0.2.12
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.7.0
websockets                11.0.3
Werkzeug                  3.0.1
wget                      3.2
whatthepatch              1.0.5
wheel                     0.41.2
wrapt                     1.16.0
xattr                     1.0.0
xxhash                    3.4.1
yacs                      0.1.8
yapf                      0.40.2
yarl                      1.9.4
zdfs-dfs                  2.3.2
zeep                      4.2.1
zipp                      3.17.0
zstandard                 0.19.0
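
Most of the list above is unrelated to training. A minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`), that prints only the packages most likely to matter for this failure. The shortlist in `PACKAGES` is my own guess at what is relevant, not an official list:

```python
from importlib import metadata

# Hypothetical shortlist chosen for this report; adjust as needed.
PACKAGES = [
    "torch", "torchvision", "transformers", "deepspeed",
    "peft", "accelerate", "flash-attn", "apex", "timm",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<16}{version or 'not installed'}")
```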

Error traceback

Discovered apex.normalization.FusedRMSNorm - will use it instead of InternLM2RMSNorm
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
[2025-01-12 17:53:14,141] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 36605) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/", line 806, in main
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/", line 797, in run
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/", line 264, in launch_agent
    raise ChildFailedError(
internvl/train/ FAILED
Root Cause (first observed failure):
  time      : 2025-01-12_17:53:14
  host      : gpulingjun033184121099.sa127
  rank      : 0 (local_rank: 0)
  exitcode  : -8 (pid: 36605)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 36605
/opt/conda/lib/python3.8/multiprocessing/ UserWarning: resource_tracker: There appear to be 57 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
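
torchrun reports a negative exit code when a worker is killed by a signal. A minimal sketch of how the `-8` above maps to a signal name using only the standard library; `describe_exit_code` is a hypothetical helper, not a torch API:

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a process exit code into a signal name or a plain exit status."""
    if exitcode < 0:
        # A negative code means the process was terminated by signal number -exitcode.
        try:
            return signal.Signals(-exitcode).name
        except ValueError:
            return f"unknown signal {-exitcode}"
    return f"exit status {exitcode}"


print(describe_exit_code(-8))
```

Python raises `ZeroDivisionError` for its own arithmetic, so a SIGFPE most likely comes from native code (a compiled extension or kernel) rather than from the training script itself. The leaked-semaphore warning is probably a side effect of the abrupt exit rather than the cause. Setting `PYTHONFAULTHANDLER=1` before launching should make Python dump a traceback when the signal arrives, which may help locate the failing call.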
lzk9508 changed the title from "[Bug] 2nd_finetune for internvl2.5 using lora" to "[Bug] 2nd_finetune for internvl2.5 using lora ('resource_tracker: There appear to be %d ')" on Jan 13, 2025
lzk9508 commented Jan 13, 2025

Can anyone help with this?

