
How should max_steps be calculated when running SFT with multiple datasets? #6581

Closed
1 task done
128Ghe980 opened this issue Jan 9, 2025 · 1 comment

Labels
solved This problem has been already solved

Comments

@128Ghe980

Reminder

  • I have read the README and searched the existing issues.

System Info

deepspeed --hostfile=/hostfile_remote \
    src/train.py \
    --deepspeed $deepspeed_config \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL \
    --dataset XXX_img,XXX_noimg \
    --template qwen2_vl \
    --finetuning_type full \
    --streaming True \
    --buffer_size 128 \
    --preprocessing_batch_size 128 \
    --dispatch_batches False \
    --output_dir $output_dir \
    --overwrite_cache \
    --overwrite_output_dir True \
    --warmup_ratio 0.01 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --per_device_train_batch_size $per_device_train_batch_size \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --ddp_timeout 9000 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 10240 \
    --save_steps 2000 \
    --save_total_limit 3 \
    --plot_loss \
    --num_train_epochs 1 \
    --report_to 'wandb' \
    --bf16 True \
    --tf32 True

Reproduction

No reproduction needed.

Others

I am currently running SFT on qwen2-vl with two datasets. Because the data volume is large, processing it directly stalls at the tokenizer stage for a very long time (7h+), so, following other issues, I added
--streaming True \ --buffer_size 128 \ --preprocessing_batch_size 128 \ --dispatch_batches False \
However, training then raises an error:
raise ValueError("Please specify max_steps in streaming mode.")

So is max_steps mandatory here?
And how should I calculate this value, especially when using two datasets?

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jan 9, 2025
@hiyouga
Owner

hiyouga commented Jan 9, 2025

max_steps = num_examples * num_epochs // (per_device_train_batch_size * gradient_accumulation_steps * num_gpus)
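As a rough illustration of how two datasets factor in: their example counts simply add up before applying the formula. A minimal sketch of the arithmetic, using made-up dataset sizes and a made-up parallel setup (none of these numbers come from this issue):

# Hypothetical values for illustration only; substitute your real counts.
num_examples = 800_000 + 200_000       # combined size of XXX_img and XXX_noimg
num_epochs = 1
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 16

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus   # 512
max_steps = num_examples * num_epochs // effective_batch_size
print(max_steps)   # 1_000_000 // 512 = 1953

The resulting value is then passed to the training command via --max_steps.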

@hiyouga hiyouga closed this as completed Jan 9, 2025
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jan 9, 2025