

DeepSeek-V3-lite naming conventions? #594

Closed
gabrielolympie opened this issue Feb 6, 2025 · 1 comment

Comments

@gabrielolympie

Hello, I am currently working on a pruned version of DeepSeek-V3.

The methodology involves layer-wise routed expert pruning and distillation, followed by post-training on the full model.
I have already tested the pipeline on DeepSeek-V2-Lite, reducing 64@6 experts to 16@4, and it seems to give reasonable results.
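For readers unfamiliar with the notation, "64@6 => 16@4" means shrinking each MoE layer from 64 routed experts with top-6 routing to 16 experts with top-4 routing. One common way to pick which experts survive is to rank them by routing frequency on calibration tokens; the sketch below is only an illustration of that idea under assumed names and a random toy batch, not the actual pruning criterion or distillation pipeline used here, which the issue does not show:

```python
import numpy as np

def select_experts(router_logits, orig_topk, keep_total):
    """Rank one MoE layer's experts by how often the original top-k
    routing selects them, and keep the `keep_total` most-used ones.

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns (kept_expert_ids, per_expert_counts).
    """
    # Each token's top-`orig_topk` experts under the original router.
    topk_idx = np.argsort(-router_logits, axis=1)[:, :orig_topk]
    # How many times each expert was selected across the calibration batch.
    counts = np.bincount(topk_idx.ravel(),
                         minlength=router_logits.shape[1])
    # Keep the most frequently routed experts (ids returned sorted).
    kept = np.sort(np.argsort(-counts)[:keep_total])
    return kept, counts

# Toy calibration batch: 1000 tokens routed over 64 experts (V2-Lite-like).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 64))
kept, counts = select_experts(logits, orig_topk=6, keep_total=16)
print(len(kept))  # 16 surviving experts; the corresponding router rows and
                  # expert weights would be kept, top-k lowered from 6 to 4.
```

After this selection step, distillation would then recover quality by training the pruned layer to match the full layer's outputs.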

I have just started running the same method on DeepSeek-V3 with the following pruned targets:
Base Model: 256@8 => DeepSeek-V3-671B@37B-full
22@6 => DeepSeek-V3-Lite-72B@31B-large
16@4 => DeepSeek-V3-Lite-57B@26B-medium
8@2 => DeepSeek-V3-Lite-36B@21B-small
4@1 => DeepSeek-V3-Lite-26B@19B-nano

I'll upload them to Hugging Face when the pipeline finishes running (it should take about 3 days on my 2x3090 rig).

Would you authorize me to adopt the naming convention above for the uploads?

If the methodology gives good results, I'll apply it to R1 and R1-Zero as well.

@gabrielolympie
Author

Update: Distillation is faster than expected; the first stage of the pipeline has processed 37 of 61 layers.
