feat: [Do Not Merge] Add Sparse Float Vector support to milvus #29421
Conversation
Thanks @zhengbuqian! I'm able to build successfully now. I can also run standard dense vector operations on the built image. I still have some trouble with the sparse vector search though -
Even though the inserted sparse matrix looks fine.
hi @ashkrisk I think that is caused by Milvus not allowing empty sparse vectors (0 at all dimensions). Try updating the vector dim and density to make the data denser and prevent empty rows.
Looks like that was the issue - all the matrices which failed to insert had at least one row with no elements. Everything looks fine as long as there are no zero vectors in the input!
Great! Let me know if you encounter any more issues so I can fix them!
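For readers hitting the same failure, here is a minimal sketch of a pre-insert guard that drops all-zero rows before they reach Milvus. The `sparseRow` type and `hasNonZero` helper are hypothetical illustrations, not Milvus APIs:

```go
package main

import "fmt"

// sparseRow is a hypothetical pre-conversion representation of one
// sparse vector as parallel index/value slices.
type sparseRow struct {
	Indices []uint32
	Values  []float32
}

// hasNonZero reports whether the row carries at least one nonzero value,
// which the discussion above identifies as a requirement for insertion.
func hasNonZero(r sparseRow) bool {
	for _, v := range r.Values {
		if v != 0 {
			return true
		}
	}
	return false
}

func main() {
	rows := []sparseRow{
		{Indices: []uint32{2, 7}, Values: []float32{0.5, 0.25}},
		{}, // an all-zero (empty) row: Milvus rejects these
	}
	for i, r := range rows {
		if !hasNonZero(r) {
			fmt.Printf("row %d has no nonzero values; drop it or densify the data before insert\n", i)
		}
	}
}
```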
Maybe we should split this PR into smaller chunks for ease of review.
I'm working on going through the PR today.
@xiaofan-luan I have already split this PR:
Due to the lack of stacked-PR support on GitHub, each of those PRs also includes the commits from all previous PRs, so within each PR reviewers should only look at the last commit. @czs007 is also reviewing those PRs now.
```go
// varies depending on the number of non-zeros. Using sparse vector
// generated by SPLADE as reference and returning size of a sparse
// vector with 150 non-zeros.
res += 1200
```
This is dangerous.
We should not depend on any estimation of this kind, but let's keep it for now.
Make this configurable.
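To make that request concrete, here is a minimal Go sketch of a configurable estimate; the function name and the way the default is wired are illustrative assumptions, but the arithmetic matches the comment above: 150 non-zeros at 8 bytes each (4-byte index + 4-byte value) reproduces the hard-coded 1200.

```go
package sketch

// bytesPerNonZero: a 4-byte uint32 index plus a 4-byte float32 value.
const bytesPerNonZero = 8

// estimatedSparseRowSize returns the estimated serialized size of one
// sparse row for a configurable expected number of non-zeros. With the
// SPLADE-like default of 150 non-zeros this yields 150 * 8 = 1200 bytes.
func estimatedSparseRowSize(expectedNonZeros int) int {
	if expectedNonZeros <= 0 {
		expectedNonZeros = 150 // default taken from the comment above
	}
	return expectedNonZeros * bytesPerNonZero
}
```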
```go
fieldData.Contents = append(fieldData.Contents, row)
rowDim := typeutil.SparseFloatRowDim(row)
if rowDim > fieldData.Dim {
```
Is this dimension useful anywhere, or is it just used as meta information?
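For context, a sketch of what the tracked dimension means for sparse rows: since a row is stored as index/value pairs sorted by index, its "dim" is just its largest index plus one, and the field-level `Dim` is the maximum over all rows — metadata only, since sparse search needs no fixed dimension. This mirrors the intent of `typeutil.SparseFloatRowDim`; the little-endian (uint32 index, float32 value) layout assumed here is the one settled on later in this thread.

```go
package sketch

import "encoding/binary"

// sparseRowDim returns the implied dimension of one packed sparse row:
// the index of its last (largest-index) element plus one.
func sparseRowDim(row []byte) int64 {
	if len(row) < 8 {
		return 0
	}
	// the last pair's index sits in the final 8 bytes of the row
	lastIdx := binary.LittleEndian.Uint32(row[len(row)-8:])
	return int64(lastIdx) + 1
}
```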
```diff
@@ -326,6 +337,15 @@ func (v *validateUtil) checkBinaryVectorFieldData(field *schemapb.FieldData, fie
 	return nil
 }
 
+func (v *validateUtil) checkSparseFloatFieldData(field *schemapb.FieldData, fieldSchema *schemapb.FieldSchema) error {
+	sparseRows := field.GetVectors().GetSparseFloatVector().GetContents()
```
Maybe we need to check that every sparse vector has at least one nonzero value here.
What other limitations do we have on sparse embeddings?
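A hedged sketch of the per-row checks this validator might perform, covering both questions above: at least one nonzero value, well-formed 8-byte pairs, strictly increasing (hence unique) indices, and finite values. It assumes the packed little-endian (uint32 index, float32 value) row layout; the function name and exact rule set are illustrative, not the actual validateUtil methods.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
	"math"
)

// checkSparseRow validates one packed sparse row.
func checkSparseRow(row []byte) error {
	if len(row) == 0 || len(row)%8 != 0 {
		return fmt.Errorf("sparse row must be a non-empty sequence of 8-byte (index, value) pairs")
	}
	prevIdx := int64(-1)
	hasNonZero := false
	for off := 0; off < len(row); off += 8 {
		idx := binary.LittleEndian.Uint32(row[off:])
		val := math.Float32frombits(binary.LittleEndian.Uint32(row[off+4:]))
		if int64(idx) <= prevIdx {
			return fmt.Errorf("indices must be strictly increasing")
		}
		if math.IsNaN(float64(val)) || math.IsInf(float64(val), 0) {
			return fmt.Errorf("values must be finite")
		}
		if val != 0 {
			hasNonZero = true
		}
		prevIdx = int64(idx)
	}
	if !hasNonZero {
		return fmt.Errorf("sparse row must contain at least one nonzero value")
	}
	return nil
}
```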
```diff
@@ -204,6 +204,7 @@ func checkAndSetData(body string, collSchema *schemapb.CollectionSchema) (error,
 	}
 
 	switch fieldType {
+	// TODO(SPARSE) add sparse field support in this file
```
We need a string representation of sparse embeddings, used for the RESTful API and for import.
My recommendation is to do the same as other systems:
{"indices": [102, 16, 18, ...], "values": [0.21, 0.11, 0.15, ...]}
Any thoughts?
For Parquet import, we could probably use proto directly.
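A sketch of parsing that proposed string representation into parallel slices, e.g. for the RESTful handler in checkAndSetData. The struct and helper names are illustrative assumptions, not the merged code:

```go
package sketch

import (
	"encoding/json"
	"fmt"
)

// sparseVectorJSON mirrors the proposed string representation.
type sparseVectorJSON struct {
	Indices []uint32  `json:"indices"`
	Values  []float32 `json:"values"`
}

// parseSparseVector decodes and sanity-checks one sparse vector string.
func parseSparseVector(s string) (*sparseVectorJSON, error) {
	var v sparseVectorJSON
	if err := json.Unmarshal([]byte(s), &v); err != nil {
		return nil, err
	}
	if len(v.Indices) != len(v.Values) {
		return nil, fmt.Errorf("indices and values must have equal length, got %d and %d",
			len(v.Indices), len(v.Values))
	}
	return &v, nil
}
```

Usage would look like `parseSparseVector(`{"indices": [102, 16, 18], "values": [0.21, 0.11, 0.15]}`)`.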
```cpp
explicit ConcurrentVectorImpl(ssize_t elements_per_row,
                              int64_t size_per_chunk)
    : VectorBase(size_per_chunk),
      elements_per_row_(is_type_entire_row ? 1 : elements_per_row) {
```
What is the meaning of elements_per_row?
Should it be elements_per_chunk?
`elements_per_row` denotes how many `Type` objects we need to represent a single row.

In the old code:

- for a `float` scalar field, `Type = float, is_scalar = True`: each row value has 1 float
- for a dense `float` vector field, `Type = float, is_scalar = False`: each row value has `dim` floats

It implicitly assumes all scalar fields have a dim of 1 and all vector fields have a fixed dim. This assumption does not hold for sparse vectors.

Thus I renamed `is_scalar` to `is_type_entire_row`: if true, each row value consists of 1 `Type` object; otherwise each row value consists of `elements_per_row` `Type` objects.

- for a `float` scalar field, `Type = float, is_type_entire_row = True`: each row value has `elements_per_row = 1` float
- for a dense `float` vector field, `Type = float, is_type_entire_row = False`: each row has `elements_per_row = dim` floats
- for a sparse `float` vector field, `Type = SparseRow, is_type_entire_row = True`: each row has `elements_per_row = 1` `SparseRow`

and we no longer have the notion of `dim` in the base class `ConcurrentVectorImpl`.
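As a quick restatement of that rule (purely an illustrative Go analogue of the C++ template logic, not code from this PR):

```go
package sketch

// elementsForRow restates the sizing rule described above: when
// isTypeEntireRow is true a row is a single Type object; otherwise it is
// elementsPerRow Type objects (dim, for dense vectors).
func elementsForRow(isTypeEntireRow bool, elementsPerRow int) int {
	if isTypeEntireRow {
		return 1 // float scalar (1 float) or sparse vector (1 SparseRow)
	}
	return elementsPerRow // dense vector: dim floats per row
}
```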
internal/core/src/common/Utils.h
```cpp
// serialize a sparse row to bytes that will be written into binlog file.
inline std::vector<uint8_t>
SerializeSparseRow(const knowhere::sparse::SparseRow<float>& row) {
    std::vector<uint8_t> buffer(SparseRowSerializedSize(row));
```
For all in-memory formats, let's use `knowhere::sparse::SparseRow`.
Other than that, for all serdes, let's use protobuf.
Discussed offline: we'll use a densely packed `idx, val, idx, val, ...` byte array everywhere: in `schema.proto::SparseFloatArray`, in Golang/C++ memory, in persisted binlogs, in the placeholder of search requests, etc.
Places where we expect to see a different format:
- in SDKs and import utils we accept various formats as insert/search input, but we convert those before sending them to Milvus
- in RESTful endpoints we accept a string representation like {"indices": [102, 16, 18, ...], "values": [0.21, 0.11, 0.15, ...]}
the design lgtm
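For concreteness, a sketch of that densely packed layout: each element is a little-endian uint32 index immediately followed by a little-endian float32 value, concatenated with no header. The exact widths and endianness are assumptions consistent with the thread ("idx, val, idx, val, ..."); this mirrors the design decision, not a specific Milvus function.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
	"math"
)

// packSparseRow serializes parallel index/value slices into the densely
// packed (uint32 idx, float32 val) byte array described above.
func packSparseRow(indices []uint32, values []float32) ([]byte, error) {
	if len(indices) != len(values) {
		return nil, fmt.Errorf("indices and values must have equal length")
	}
	buf := make([]byte, 8*len(indices))
	for i := range indices {
		binary.LittleEndian.PutUint32(buf[i*8:], indices[i])
		binary.LittleEndian.PutUint32(buf[i*8+4:], math.Float32bits(values[i]))
	}
	return buf, nil
}
```

One nice property of this choice is that the same byte array can flow unchanged through the proto, the Go and C++ memory representations, the binlog, and search placeholders.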
@zhengbuqian is this ready to review and merge?
This commit adds sparse float vector support to segcore, including:
1. data type enum declarations
2. corresponding data structures for handling sparse float vectors in various scenarios, including:
   * FieldData as a bridge between the binlog and the in-memory data structures;
   * mmap::Column as the in-memory representation of a sparse float vector column of a sealed segment;
   * ConcurrentVector as the in-memory representation of a sparse float vector column of a growing segment, which supports inserts
3. logic in the payload reader/writer to serialize/deserialize from/to binlog
4. the ability for the index node to build sparse float vector indexes
5. the ability for the query node to build a growing index for growing segments and a temp index for sealed segments without a built index

This commit also includes some code cleanup, comment improvements, and unit tests for sparse vectors. Signed-off-by: Buqian Zheng <[email protected]>
… raw vector by id. Added lots of unit tests; converted many segcore tests into parameterized tests that work for both dense and sparse float vectors. Signed-off-by: Buqian Zheng <[email protected]>
… milvus components, including the proxy and data node to receive and write sparse float vectors to binlog, the query node to handle search requests, the index node to build indexes for sparse float columns, etc. Signed-off-by: Buqian Zheng <[email protected]>
Signed-off-by: Buqian Zheng <[email protected]>
… to add sparse vector support Signed-off-by: Buqian Zheng <[email protected]>
Closing this PR. The code in this PR has been merged via smaller PRs; see the linked issue.
issue: #29419
Inserting, loading, building an index, searching from an index, and GetVectorByIds are supported for now. Advanced features such as range search are not yet implemented.
DO NOT MERGE this PR. I'll split it into multiple smaller PRs for code review; this PR exists so reviewers can get the whole picture when reviewing the individual PRs.