

DeepSeek-V3-lite naming conventions? #594

Closed
gabrielolympie opened this issue Feb 6, 2025 · 1 comment

Comments

@gabrielolympie

Hello, I am currently working on a pruned version of DeepSeek-V3.

The methodology involves layer-wise routed expert pruning and distillation, followed by post-training on the full model.
I have already tested the pipeline on DeepSeek-V2-Lite, reducing 64@6 experts to 16@4, and it seems to give reasonable results.
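For readers unfamiliar with the notation, "64@6 => 16@4" means shrinking each MoE layer from 64 routed experts with top-6 routing to 16 experts with top-4 routing. One common way to pick which experts survive is to rank them by routing frequency on calibration tokens; the sketch below is only an illustration of that idea under assumed names and a random toy batch, not the actual pruning criterion or distillation pipeline used here, which the issue does not show:

```python
import numpy as np

def select_experts(router_logits, orig_topk, keep_total):
    """Rank one MoE layer's experts by how often the original top-k
    routing selects them, and keep the `keep_total` most-used ones.

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns (kept_expert_ids, per_expert_counts).
    """
    # Each token's top-`orig_topk` experts under the original router.
    topk_idx = np.argsort(-router_logits, axis=1)[:, :orig_topk]
    # How many times each expert was selected across the calibration batch.
    counts = np.bincount(topk_idx.ravel(),
                         minlength=router_logits.shape[1])
    # Keep the most frequently routed experts (ids returned sorted).
    kept = np.sort(np.argsort(-counts)[:keep_total])
    return kept, counts

# Toy calibration batch: 1000 tokens routed over 64 experts (V2-Lite-like).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 64))
kept, counts = select_experts(logits, orig_topk=6, keep_total=16)
print(len(kept))  # 16 surviving experts; the corresponding router rows and
                  # expert weights would be kept, top-k lowered from 6 to 4.
```

After this selection step, distillation would then recover quality by training the pruned layer to match the full layer's outputs.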

I have just started running the same method on DeepSeek-V3 with the following pruned targets:
Base Model: 256@8 => DeepSeek-V3-671B@37B-full
22@6 => DeepSeek-V3-Lite-72B@31B-large
16@4 => DeepSeek-V3-Lite-57B@26B-medium
8@2 => DeepSeek-V3-Lite-36B@21B-small
4@1 => DeepSeek-V3-Lite-26B@19B-nano

I'll upload them to Hugging Face when the pipeline finishes running (it should take about 3 days on my 2x3090 rig).

Would you authorize me to adopt the naming convention above for the uploads?

If the methodology gives good results, I'll apply it to R1 and R1-Zero as well.

@gabrielolympie
Author

Update: Distillation is faster than expected; the first stage of the pipeline has processed 37 of 61 layers.
