feat: [Do Not Mrege] Add Sparse Float Vector support to milvus #29421

Closed
wants to merge 5 commits into from

zhengbuqian
Collaborator

@zhengbuqian zhengbuqian commented Dec 22, 2023

issue: #29419

Supporting inserting, loading, building index, search from index, GetVectorByIds for now. Advanced features such as range search not yet implemented.

DO NOT MERGE this PR. I'll split this PR into multiple smaller PRs for code review. This PR is used for reviewers to grab a whole picture when reviewing individual PRs.

@sre-ci-robot sre-ci-robot added area/compilation size/XXL Denotes a PR that changes 1000+ lines. labels Dec 22, 2023
@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhengbuqian
To complete the pull request process, please assign congqixia after the PR has been reviewed.
You can assign the PR to them by writing /assign @congqixia in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot added dco-passed DCO check passed. kind/feature Issues related to feature request from users labels Dec 22, 2023
Contributor

mergify bot commented Dec 22, 2023

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@zhengbuqian
Collaborator Author

/hold

Contributor

mergify bot commented Dec 22, 2023

@zhengbuqian ut workflow job failed, comment rerun ut can trigger the job again.

@buqian-zilliz buqian-zilliz force-pushed the sparse-v4 branch 2 times, most recently from 42b1dad to 6792cac Compare December 28, 2023 09:21
@sre-ci-robot sre-ci-robot added the area/dependency Pull requests that update a dependency file label Dec 28, 2023
Contributor

mergify bot commented Dec 28, 2023

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Jan 9, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Jan 18, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@zhengbuqian zhengbuqian changed the title feat: Add Sparse Float Vector support to milvus [Do Not Mrege] feat: Add Sparse Float Vector support to milvus Jan 23, 2024
@zhengbuqian
Collaborator Author

/hold

Contributor

mergify bot commented Jan 23, 2024

@zhengbuqian

Invalid PR Title Format Detected

Your PR submission does not adhere to our required standards. To ensure clarity and consistency, please meet the following criteria:

  1. Title Format: The PR title must begin with one of these prefixes:
  • feat: for introducing a new feature.
  • fix: for bug fixes.
  • enhance: for improvements to existing functionality.
  • test: for add tests to existing functionality.
  • doc: for modifying documentation.
  • auto: for the pull request from bot.
  2. Description Requirement: The PR must include a non-empty description, detailing the changes and their impact.

Required Title Structure:

[Type]: [Description of the PR]

Where Type is one of feat, fix, enhance, test or doc.

Example:

enhance: improve search performance significantly 

Please review and update your PR to comply with these guidelines.

@zhengbuqian zhengbuqian changed the title [Do Not Mrege] feat: Add Sparse Float Vector support to milvus feat: [Do Not Mrege] Add Sparse Float Vector support to milvus Jan 23, 2024
Contributor

mergify bot commented Jan 30, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@ashkrisk
Contributor

Thanks @zhengbuqian! I'm able to build successfully now. I can also run standard dense vector operations on the built image.

I still have some trouble with the sparse vector search though - hello_sparse.py doesn't run cleanly. Not sure if the issue is with milvus or pymilvus - insert fails with the following error:

pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=nil indices in SparseFloatRow)>

Even though the inserted sparse matrix looks fine.

@zhengbuqian
Collaborator Author

Hi @ashkrisk, I think that is caused by Milvus not allowing empty sparse vectors (0 in all dimensions). Try adjusting the vector dim and density to make the matrix denser, so that rand(num_entities, dim, density=density, format='csr') does not generate empty rows (or at least does so with lower probability).

@ashkrisk
Contributor

Looks like that was the issue - all the matrices which failed to insert had at least one row with no elements. Everything looks fine as long as there are no zero vectors in the input!
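
A minimal sketch of a pre-insert check for the empty-row condition described above. It assumes the data is in CSR form and that `indptr` is the row-pointer array (for a scipy CSR matrix this would be its `indptr` attribute). `empty_rows` is a hypothetical helper, not part of pymilvus.

```python
def empty_rows(indptr):
    """Return the indices of rows that store no entries.

    In CSR layout, row i owns the entries in positions
    indptr[i] .. indptr[i + 1] - 1, so a row is empty exactly
    when those two pointers are equal.
    """
    return [i for i in range(len(indptr) - 1) if indptr[i] == indptr[i + 1]]


# Example: 3 rows; row 1 stores nothing.
indptr = [0, 2, 2, 5]
print(empty_rows(indptr))
```

Note that this only detects rows with no stored entries. A row that stores explicit zeros would pass this check, and whether Milvus accepts such a row is not covered in this thread.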

@zhengbuqian
Collaborator Author

Great! Let me know if you encounter any more issues so I can fix them!

@xiaofan-luan
Collaborator

Maybe we should split this PR into smaller chunks for ease of review.

@xiaofan-luan
Collaborator

I'm working on going through the pr today

@zhengbuqian
Collaborator Author

@xiaofan-luan I have already split this PR:

  1. feat: [Sparse Float Vector] segcore basics and index building #30357 segcore basics and index building
  2. feat: [Sparse Float Vector] segcore to support sparse vector search and get raw vector by id #30629 segcore search and get vector by id
  3. feat: [Sparse Float Vector] add sparse vector support to milvus components #30630 all golang code
  4. another PR to add more integration tests incoming

Due to the lack of stacked PR support in GitHub, each of those PRs also includes the commits from all previous PRs, so within each PR reviewers should only look at the last commit. @czs007 is also reviewing those PRs now.

Contributor

mergify bot commented Feb 21, 2024

@zhengbuqian ut workflow job failed, comment rerun ut can trigger the job again.

// varies depending on the number of non-zeros. Using sparse vector
// generated by SPLADE as reference and returning size of a sparse
// vector with 150 non-zeros.
res += 1200
Collaborator

This is dangerous.
We should not depend on this kind of estimation, but let's keep it for now.
Make this configurable.
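
As a rough illustration of the estimate under discussion: 1200 bytes for 150 non-zeros works out to 8 bytes per non-zero. The sketch below turns both numbers into parameters, in the spirit of the request to make this configurable. The function name, parameter names, and defaults are hypothetical and are not taken from the Milvus code.

```python
def estimate_sparse_row_size(avg_non_zeros=150, bytes_per_non_zero=8):
    """Estimate the size in bytes of one sparse row (hypothetical helper).

    Defaults reproduce the hard-coded figure from the snippet above:
    150 non-zeros at 8 bytes each.
    """
    if avg_non_zeros < 0 or bytes_per_non_zero <= 0:
        raise ValueError("avg_non_zeros must be >= 0 and bytes_per_non_zero > 0")
    return avg_non_zeros * bytes_per_non_zero


print(estimate_sparse_row_size())
```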

fieldData.Contents = append(fieldData.Contents, row)
rowDim := typeutil.SparseFloatRowDim(row)
if rowDim > fieldData.Dim {
Collaborator

Is this dimension used anywhere, or is it just meta information?

@@ -326,6 +337,15 @@ func (v *validateUtil) checkBinaryVectorFieldData(field *schemapb.FieldData, fie
return nil
}

func (v *validateUtil) checkSparseFloatFieldData(field *schemapb.FieldData, fieldSchema *schemapb.FieldSchema) error {
sparseRows := field.GetVectors().GetSparseFloatVector().GetContents()
Collaborator

Could this hit a nil reference?

Collaborator

Maybe we need to check here that every sparse vector has at least one nonzero value.

Collaborator

What other limitations do we have on sparse embeddings?

@@ -204,6 +204,7 @@ func checkAndSetData(body string, collSchema *schemapb.CollectionSchema) (error,
}

switch fieldType {
// TODO(SPARSE) add sparse field support in this file
Collaborator

We need a string representation of sparse embeddings, to be used for the RESTful API and for import.
My recommendation is to use the same format other systems use:

{"indices": [102, 16, 18, ...], "values": [0.21, 0.11, 0.15, ...]}

Any thought?

For parquet import, we could probably directly use proto.
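
A minimal sketch of how the proposed string representation could be parsed and validated. `parse_sparse_row` is a hypothetical helper, and the validation rules are assumptions: equal-length arrays, non-negative integer indices, no duplicate indices, and a non-empty row (the last one mirrors the empty-vector rejection reported earlier in this thread). Sorting the output by index is also a choice made here, not something the proposal specifies.

```python
import json


def parse_sparse_row(text):
    """Parse '{"indices": [...], "values": [...]}' into sorted (index, value) pairs."""
    obj = json.loads(text)
    indices, values = obj["indices"], obj["values"]
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if not indices:
        raise ValueError("empty sparse row")
    for i in indices:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(i, int) or isinstance(i, bool) or i < 0:
            raise ValueError(f"invalid index: {i!r}")
    if len(set(indices)) != len(indices):
        raise ValueError("duplicate indices")
    return sorted(zip(indices, (float(v) for v in values)))


row = parse_sparse_row('{"indices": [102, 16, 18], "values": [0.21, 0.11, 0.15]}')
print(row)
```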

explicit ConcurrentVectorImpl(ssize_t elements_per_row,
int64_t size_per_chunk)
: VectorBase(size_per_chunk),
elements_per_row_(is_type_entire_row ? 1 : elements_per_row) {
Collaborator

What is the meaning of elements_per_row?
Should it be elements_per_chunk?

Collaborator Author

elements_per_row denotes how many Type objects are needed to represent a single row:

In the old code:

  • for a float scalar field, Type = float, is_scalar = True: each row value has 1 float
  • for a dense float vector field, Type = float, is_scalar = False: each row value has dim floats

It implicitly assumes that all scalar fields have a dim of 1 and that all vector fields have a fixed dim. This assumption does not hold for sparse vectors.

Thus I renamed is_scalar to is_type_entire_row: if true, each row value consists of 1 Type object; otherwise each row value consists of elements_per_row Type objects.

  • for a float scalar field, Type = float, is_type_entire_row = True: each row value has elements_per_row = 1 float
  • for a dense float vector field, Type = float, is_type_entire_row = False: each row has elements_per_row = dim floats
  • for a sparse float vector field, Type = SparseRow, is_type_entire_row = True: each row has elements_per_row = 1 SparseRow

And we no longer have the notion of dim in the base class ConcurrentVectorImpl.
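
The three cases above can be modeled in a few lines. This is a Python illustration of the idea only, not the C++ class: `ConcurrentVectorSketch` and its methods are hypothetical, and a dict stands in for a SparseRow.

```python
class ConcurrentVectorSketch:
    """Toy model: rows stored back to back in one flat list of elements."""

    def __init__(self, elements_per_row, is_type_entire_row):
        # Mirrors the constructor shown above: when one element is the
        # whole row, the requested elements_per_row is ignored.
        self.elements_per_row = 1 if is_type_entire_row else elements_per_row
        self._data = []

    def append_row(self, elements):
        if len(elements) != self.elements_per_row:
            raise ValueError("wrong number of elements for one row")
        self._data.extend(elements)

    def row(self, i):
        k = self.elements_per_row
        return self._data[i * k:(i + 1) * k]


# Dense float vector with dim = 4: each row is 4 floats.
dense = ConcurrentVectorSketch(elements_per_row=4, is_type_entire_row=False)
dense.append_row([0.1, 0.2, 0.3, 0.4])

# Sparse float vector: each row is a single row object of any length.
sparse = ConcurrentVectorSketch(elements_per_row=0, is_type_entire_row=True)
sparse.append_row([{16: 0.11, 102: 0.21}])
```

The base class only needs to know how many elements make up a row; how many non-zeros a sparse row holds stays inside the row object.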

// serialize a sparse row to bytes that will be written into binlog file.
inline std::vector<uint8_t>
SerializeSparseRow(const knowhere::sparse::SparseRow<float>& row) {
std::vector<uint8_t> buffer(SparseRowSerializedSize(row));
Collaborator

For all in-memory formats, let's use knowhere::sparse::SparseRow.

Other than that, for all serdes, let's use protobuf.

Collaborator Author

discussed offline: we'll use densely packed idx, val, idx, val, ... byte array everywhere: in schema.proto::SparseFloatArray, in golang/c++ memory, in persisted binlog, in placeholder of search requests, etc.

places we expect to see different format:

  • in SDK and import utils we accept various formats as insert/search input, but we convert those before sending to milvus
  • in restful endpoints we accept string representation like {"indices": [102, 16, 18, ...], "values": [0.21, 0.11, 0.15, ...]}

Collaborator

the design lgtm

Contributor

mergify bot commented Feb 28, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Feb 28, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Feb 28, 2024

@zhengbuqian ut workflow job failed, comment rerun ut can trigger the job again.

@xiaofan-luan
Collaborator

@zhengbuqian is this ready to review and merge?

Contributor

mergify bot commented Feb 29, 2024

@zhengbuqian ut workflow job failed, comment rerun ut can trigger the job again.

This commit adds sparse float vector support to segcore with the following:

1. data type enum declarations
2. Adds corresponding data structures for handling sparse float vectors
   in various scenarios, including:
  * FieldData as a bridge between the binlog and the in memory data structures
  * mmap::Column as the in memory representation of a sparse float vector
    column of a sealed segment;
  * ConcurrentVector as the in memory representation of a sparse float
    vector of a growing segment which supports inserts.
3. Adds logic in payload reader/writer to serialize/deserialize from/to binlog
4. Adds the ability to allow the index node to build sparse float vector index
5. Adds the ability to allow the query node to build growing index for
   growing segment and temp index for sealed segment without index built

This commit also includes some code cleanup, comment improvements, and
some unit tests for sparse vectors.

Signed-off-by: Buqian Zheng <[email protected]>
… raw vector by id

added lots of unit tests, converted many segcore tests into parameterized
tests that work for both dense and sparse float vectors

Signed-off-by: Buqian Zheng <[email protected]>
milvus components, including proxy, data node to receive and write
sparse float vectors to binlog, query node to handle search requests,
index node to build index for sparse float column, etc.

Signed-off-by: Buqian Zheng <[email protected]>
… to add sparse vector support

Signed-off-by: Buqian Zheng <[email protected]>
Contributor

mergify bot commented Mar 6, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Mar 6, 2024

@zhengbuqian ut workflow job failed, comment rerun ut can trigger the job again.

@zhengbuqian
Collaborator Author

Closing this PR. The code in this PR has been merged via smaller PRs; see the linked issue.

@zhengbuqian zhengbuqian deleted the sparse-v4 branch March 14, 2024 07:28
Labels
area/compilation area/dependency Pull requests that update a dependency file area/test dco-passed DCO check passed. do-not-merge/hold kind/feature Issues related to feature request from users sig/testing size/XXL Denotes a PR that changes 1000+ lines. test/integration integration test
4 participants