
Minor bumps along the way to running the Llama3-8B example workflows from the README on an Apple Silicon M3 #4341

Closed
mapix opened this issue Jun 17, 2024 · 6 comments
Labels: good first issue, solved

Comments

@mapix

mapix commented Jun 17, 2024

1. A bf16 precision issue showed up as soon as fine-tuning started; editing the config YAML and adding fp16: false got me through painlessly (minimal sketch below).
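
A minimal sketch of that change; the exact training YAML depends on which README example you run, and only the added line comes from my edit:

# appended to the training config YAML from the README fine-tuning example
fp16: false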
2. Merging the LoRA adapter for chat inference is slightly more troublesome:
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml

The following error appears:

/Users/mapix/miniconda/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py:2026: UserWarning: for base_model.model.model.layers.31.mlp.gate_proj.lora_A.default.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(f'for {key}: copying from a non-meta parameter in the checkpoint to a meta '
[... the same UserWarning repeats for the remaining LoRA A/B weights (gate_proj, up_proj, down_proj, ...) ...]
Traceback (most recent call last):
  File "/Users/mapix/miniconda/envs/llama-factory/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/Users/mapix/workspace/LLaMA-Factory/src/llamafactory/cli.py", line 81, in main
    run_chat()
  File "/Users/mapix/workspace/LLaMA-Factory/src/llamafactory/chat/chat_model.py", line 127, in run_chat
    chat_model = ChatModel()
  File "/Users/mapix/workspace/LLaMA-Factory/src/llamafactory/chat/chat_model.py", line 43, in __init__
    self.engine: "BaseEngine" = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)
  File "/Users/mapix/workspace/LLaMA-Factory/src/llamafactory/chat/hf_engine.py", line 58, in __init__
    self.model = load_model(
  File "/Users/mapix/workspace/LLaMA-Factory/src/llamafactory/model/loader.py", line 160, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
  File "/Users/mapix/workspace/LLaMA-Factory/src/llamafactory/model/adapter.py", line 301, in init_adapter
    model = _setup_lora_tuning(
  File "/Users/mapix/workspace/LLaMA-Factory/src/llamafactory/model/adapter.py", line 191, in _setup_lora_tuning
    model: "LoraModel" = PeftModel.from_pretrained(model, adapter, **init_kwargs)
  File "/Users/mapix/miniconda/envs/llama-factory/lib/python3.10/site-packages/peft/peft_model.py", line 475, in from_pretrained
    model.load_adapter(
  File "/Users/mapix/miniconda/envs/llama-factory/lib/python3.10/site-packages/peft/peft_model.py", line 1076, in load_adapter
    self._update_offload(offload_index, adapters_weights)
  File "/Users/mapix/miniconda/envs/llama-factory/lib/python3.10/site-packages/peft/peft_model.py", line 957, in _update_offload
    safe_module = dict(self.named_modules())[extended_prefix]
KeyError: 'base_model.model.model.model.layers.10.input_layernorm'

2.1 First, about all these UserWarnings: either silence warnings globally, or, if your machine has enough memory, drop the offload logic altogether by adding low_cpu_mem_usage: false to the config YAML, after which this batch of warnings disappears.
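
A minimal sketch of that second option, assuming low_cpu_mem_usage is picked up from the inference YAML (it was on my setup):

# examples/inference/llama3_lora_sft.yaml -- added line
low_cpu_mem_usage: false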

2.2 As for the KeyError, I'm not entirely sure of the root cause, but judging from the error the word "model" is spelled three times in a row in the failing key, while the keys in the dict only have it twice, so I bypassed it by patching the peft source (peft/peft_model.py) directly:

                       #extended_prefix = prefix + block_id + safe_key[:suffix_pos] 
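                        # on this setup, block_id seems to add a duplicate "model." segment, so the
                        # composed key is missing from dict(self.named_modules()), hence the KeyError above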
                       extended_prefix = prefix + safe_key[:suffix_pos]

2.3 Next an MPS compatibility problem appears; setting an environment variable on the command line resolves it.

NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

Final command:
PYTORCH_ENABLE_MPS_FALLBACK=1 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml

I'm not entirely sure whether caching played a role: when I went back to write this up and reverted the code, I could not reproduce the issue. Recording it here anyway so that others don't fall into the same pit.


@github-actions github-actions bot added the pending This problem is yet to be addressed label Jun 17, 2024
@injet-zhou
Contributor

nice work

@hiyouga hiyouga added the good first issue Good for newcomers label Jun 18, 2024
@wwwbq

wwwbq commented Jun 24, 2024

When training on a Mac M3 Pro I found it got stuck during the model's forward pass. Did you run into this?

@mapix
Author

mapix commented Jun 24, 2024

@wwwbq
At first I also thought it was stuck; after waiting a long while it did come through, just extremely slowly. After setting low_cpu_mem_usage: false it became fast (my machine has a fair amount of memory), though the first run is still a bit slow, probably because kernels need to be compiled.

@wwwbq

wwwbq commented Jun 24, 2024

@wwwbq At first I also thought it was stuck; after waiting a long while it did come through, just extremely slowly. After setting low_cpu_mem_usage: false it became fast (my machine has a fair amount of memory), though the first run is still a bit slow, probably because kernels need to be compiled.

Thanks! I'll give it a try 😊

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 28, 2024
@hiyouga hiyouga closed this as completed Jun 28, 2024
@inspirewind

@wwwbq At first I also thought it was stuck; after waiting a long while it did come through, just extremely slowly. After setting low_cpu_mem_usage: false it became fast (my machine has a fair amount of memory), though the first run is still a bit slow, probably because kernels need to be compiled.

How much memory does your machine have, and how many parameters is your model?

@jinleic

jinleic commented Nov 26, 2024

This is great.
