Implement DeepSeek V2 #2744
}

pub trait SplitOp {
    fn split<D: Dim>(&self, splits: &[usize], dim: D) -> Result<Vec<Tensor>>;
It looks like `split` is only ever used to produce two tensors at a time, so a specialized `split2` op that returns a pair of tensors would seem more convenient.
Actually, my idea was to maybe eventually migrate all the `...Op` traits in this file to `candle_nn`, as they are also used in DeepSeek 3 (that PR duplicates these ops) and might be very useful. What do you think?
I would rather keep `candle_nn` on the simpler side. Here it seems that the ops can just be specialized to exactly what the model requires. I would even suggest defining helper functions rather than going through traits, and the deepseek-v3 implementation can reuse some functions from the deepseek-v2 module.
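As a rough illustration of the two suggestions above (a pair-returning `split2` and a plain helper function instead of a trait), here is a minimal sketch. It is not the PR's code and does not use candle: `Matrix` is a hypothetical row-major stand-in for `Tensor`, and the name and signature of `split2` are assumptions made only for this example.

```rust
/// Toy row-major 2-D matrix. A hypothetical stand-in for `Tensor`,
/// used only so that this sketch compiles without any external crate.
#[derive(Debug, Clone, PartialEq)]
pub struct Matrix {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f32>,
}

/// Hypothetical free-function helper: split `m` along its last dimension
/// into the first `left_cols` columns and the remaining columns.
/// Returning a pair (rather than a `Vec`) lets callers destructure directly.
pub fn split2(m: &Matrix, left_cols: usize) -> Result<(Matrix, Matrix), String> {
    if m.data.len() != m.rows * m.cols {
        return Err("data length does not match rows * cols".to_string());
    }
    if left_cols > m.cols {
        return Err(format!(
            "cannot take {left_cols} columns from a matrix with {} columns",
            m.cols
        ));
    }
    // `chunks` panics on a chunk size of 0, so guard the degenerate case.
    let width = m.cols.max(1);
    let left: Vec<f32> = m
        .data
        .chunks(width)
        .flat_map(|row| row[..left_cols].iter().copied())
        .collect();
    let right: Vec<f32> = m
        .data
        .chunks(width)
        .flat_map(|row| row[left_cols..].iter().copied())
        .collect();
    Ok((
        Matrix { rows: m.rows, cols: left_cols, data: left },
        Matrix { rows: m.rows, cols: m.cols - left_cols, data: right },
    ))
}
```

A caller would then write something like `let (a, b) = split2(&m, 1)?;` instead of indexing into a `Vec`. With a real tensor type the same shape of helper could presumably be built from two slicing calls, but that is left out here to avoid guessing at the PR's actual implementation.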
// (n, topk_group)
let group_idx = scores.topk_unsorted(self.cfg.topk_group)?.indices;
// (n, n_group)
let mut group_mask = group_scores.zeros_like()?;
Can't you just avoid this `mut` by chaining calls or using a local scope? It seems fairly easy to do. Please also review the other remaining `mut`s.
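The two ways of avoiding an outer `mut` mentioned above can be sketched on plain vectors. This is only an illustration of the idiom, not the PR's tensor code: the function names are hypothetical, and a `Vec<f32>` for a single row stands in for the `(n, n_group)` mask tensor.

```rust
/// Chaining: build the 0/1 group mask in a single expression,
/// so no mutable binding is needed at all.
pub fn group_mask_chained(n_group: usize, selected: &[usize]) -> Vec<f32> {
    (0..n_group)
        .map(|g| if selected.contains(&g) { 1.0 } else { 0.0 })
        .collect()
}

/// Local scope: the mutation is confined to an inner block, and the
/// binding visible to the rest of the function is immutable.
pub fn group_mask_scoped(n_group: usize, selected: &[usize]) -> Vec<f32> {
    let mask = {
        let mut m = vec![0.0_f32; n_group];
        for &g in selected {
            // Out-of-range indices are ignored rather than panicking.
            if let Some(slot) = m.get_mut(g) {
                *slot = 1.0;
            }
        }
        m
    };
    // From here on `mask` cannot be modified by accident.
    mask
}
```

Both functions return the same mask for the same inputs. The chained form reads more directly for small `n_group`, while the scoped form keeps an in-place update but limits where the `mut` is visible.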
This PR implements the DeepSeek V2 architecture.