feat: [Do Not Merge] Add Sparse Float Vector support to milvus #29421
Conversation
Thanks @zhengbuqian! I'm able to build successfully now. I can also run standard dense vector operations on the built image. I still have some trouble with the sparse vector search though -
Even though the inserted sparse matrix looks fine.
hi @ashkrisk I think that is caused by Milvus not allowing empty sparse vectors (0 at all dimensions). Try updating the vector dim and density to make the data denser and prevent empty rows.
Looks like that was the issue - all the matrices which failed to insert had at least one row with no elements. Everything looks fine as long as there are no zero vectors in the input!
Great! Let me know if you encounter any more issues so I can fix them!
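For readers hitting the same failure, here is a minimal sketch of a pre-insert guard that drops all-zero rows before they reach Milvus. The `sparseRow` type and `hasNonZero` helper are hypothetical illustrations, not Milvus APIs:

```go
package main

import "fmt"

// sparseRow is a hypothetical pre-conversion representation of one
// sparse vector as parallel index/value slices.
type sparseRow struct {
	Indices []uint32
	Values  []float32
}

// hasNonZero reports whether the row carries at least one nonzero value,
// which the discussion above identifies as a requirement for insertion.
func hasNonZero(r sparseRow) bool {
	for _, v := range r.Values {
		if v != 0 {
			return true
		}
	}
	return false
}

func main() {
	rows := []sparseRow{
		{Indices: []uint32{2, 7}, Values: []float32{0.5, 0.25}},
		{}, // an all-zero (empty) row: Milvus rejects these
	}
	for i, r := range rows {
		if !hasNonZero(r) {
			fmt.Printf("row %d has no nonzero values; drop it or densify the data before insert\n", i)
		}
	}
}
```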
Maybe we should split this PR into smaller chunks for ease of review.
I'm working on going through the PR today.
@xiaofan-luan I have already split this PR:
Due to the lack of stacked-PR support on GitHub, each of those PRs also includes the commits from all previous PRs, so within each PR reviewers should only look at the last commit. @czs007 is also reviewing those PRs now.
```go
// varies depending on the number of non-zeros. Using sparse vector
// generated by SPLADE as reference and returning size of a sparse
// vector with 150 non-zeros.
res += 1200
```
This is dangerous.
We should not depend on any estimation of this kind, but let's keep it for now.
Make this configurable.
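To make that request concrete, here is a minimal Go sketch of a configurable estimate; the function name and the way the default is wired are illustrative assumptions, but the arithmetic matches the comment above: 150 non-zeros at 8 bytes each (4-byte index + 4-byte value) reproduces the hard-coded 1200.

```go
package sketch

// bytesPerNonZero: a 4-byte uint32 index plus a 4-byte float32 value.
const bytesPerNonZero = 8

// estimatedSparseRowSize returns the estimated serialized size of one
// sparse row for a configurable expected number of non-zeros. With the
// SPLADE-like default of 150 non-zeros this yields 150 * 8 = 1200 bytes.
func estimatedSparseRowSize(expectedNonZeros int) int {
	if expectedNonZeros <= 0 {
		expectedNonZeros = 150 // default taken from the comment above
	}
	return expectedNonZeros * bytesPerNonZero
}
```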
```go
fieldData.Contents = append(fieldData.Contents, row)
rowDim := typeutil.SparseFloatRowDim(row)
if rowDim > fieldData.Dim {
```
Is this dimension useful anywhere, or is it just used as meta information?
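For context, a sketch of what the tracked dimension means for sparse rows: since a row is stored as index/value pairs sorted by index, its "dim" is just its largest index plus one, and the field-level `Dim` is the maximum over all rows — metadata only, since sparse search needs no fixed dimension. This mirrors the intent of `typeutil.SparseFloatRowDim`; the little-endian (uint32 index, float32 value) layout assumed here is the one settled on later in this thread.

```go
package sketch

import "encoding/binary"

// sparseRowDim returns the implied dimension of one packed sparse row:
// the index of its last (largest-index) element plus one.
func sparseRowDim(row []byte) int64 {
	if len(row) < 8 {
		return 0
	}
	// the last pair's index sits in the final 8 bytes of the row
	lastIdx := binary.LittleEndian.Uint32(row[len(row)-8:])
	return int64(lastIdx) + 1
}
```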
```diff
@@ -326,6 +337,15 @@ func (v *validateUtil) checkBinaryVectorFieldData(field *schemapb.FieldData, fie
 	return nil
 }
 
+func (v *validateUtil) checkSparseFloatFieldData(field *schemapb.FieldData, fieldSchema *schemapb.FieldSchema) error {
+	sparseRows := field.GetVectors().GetSparseFloatVector().GetContents()
```
Maybe we need to check that every sparse vector has at least one nonzero value here.
What other limitations do we have on sparse embeddings?
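A hedged sketch of the per-row checks this validator might perform, covering both questions above: at least one nonzero value, well-formed 8-byte pairs, strictly increasing (hence unique) indices, and finite values. It assumes the packed little-endian (uint32 index, float32 value) row layout; the function name and exact rule set are illustrative, not the actual validateUtil methods.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
	"math"
)

// checkSparseRow validates one packed sparse row.
func checkSparseRow(row []byte) error {
	if len(row) == 0 || len(row)%8 != 0 {
		return fmt.Errorf("sparse row must be a non-empty sequence of 8-byte (index, value) pairs")
	}
	prevIdx := int64(-1)
	hasNonZero := false
	for off := 0; off < len(row); off += 8 {
		idx := binary.LittleEndian.Uint32(row[off:])
		val := math.Float32frombits(binary.LittleEndian.Uint32(row[off+4:]))
		if int64(idx) <= prevIdx {
			return fmt.Errorf("indices must be strictly increasing")
		}
		if math.IsNaN(float64(val)) || math.IsInf(float64(val), 0) {
			return fmt.Errorf("values must be finite")
		}
		if val != 0 {
			hasNonZero = true
		}
		prevIdx = int64(idx)
	}
	if !hasNonZero {
		return fmt.Errorf("sparse row must contain at least one nonzero value")
	}
	return nil
}
```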
```diff
@@ -204,6 +204,7 @@ func checkAndSetData(body string, collSchema *schemapb.CollectionSchema) (error,
 	}
 
 	switch fieldType {
+	// TODO(SPARSE) add sparse field support in this file
```
We need a string representation of sparse embeddings, used for the RESTful API and for import.
My recommendation is to do the same as other systems:
{"indices": [102, 16, 18, ...], "values": [0.21, 0.11, 0.15, ...]}
Any thoughts?
For Parquet import, we could probably use proto directly.
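A sketch of parsing that proposed string representation into parallel slices, e.g. for the RESTful handler in checkAndSetData. The struct and helper names are illustrative assumptions, not the merged code:

```go
package sketch

import (
	"encoding/json"
	"fmt"
)

// sparseVectorJSON mirrors the proposed string representation.
type sparseVectorJSON struct {
	Indices []uint32  `json:"indices"`
	Values  []float32 `json:"values"`
}

// parseSparseVector decodes and sanity-checks one sparse vector string.
func parseSparseVector(s string) (*sparseVectorJSON, error) {
	var v sparseVectorJSON
	if err := json.Unmarshal([]byte(s), &v); err != nil {
		return nil, err
	}
	if len(v.Indices) != len(v.Values) {
		return nil, fmt.Errorf("indices and values must have equal length, got %d and %d",
			len(v.Indices), len(v.Values))
	}
	return &v, nil
}
```

Usage would look like `parseSparseVector(`{"indices": [102, 16, 18], "values": [0.21, 0.11, 0.15]}`)`.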
```cpp
explicit ConcurrentVectorImpl(ssize_t elements_per_row,
                              int64_t size_per_chunk)
    : VectorBase(size_per_chunk),
      elements_per_row_(is_type_entire_row ? 1 : elements_per_row) {
```
What is the meaning of elements_per_row?
Should it be elements_per_chunk?
`elements_per_row` denotes how many `Type` objects we need to represent a single row.

In the old code:

- for a `float` scalar field, `Type = float, is_scalar = True`: each row value has 1 float
- for a dense `float` vector field, `Type = float, is_scalar = False`: each row value has `dim` floats

It implicitly assumes all scalar fields have a dim of 1 and all vector fields have a fixed dim. This assumption does not hold for sparse vectors.

Thus I renamed `is_scalar` to `is_type_entire_row`: if true, each row value consists of 1 `Type` object; otherwise each row value consists of `elements_per_row` `Type` objects.

- for a `float` scalar field, `Type = float, is_type_entire_row = True`: each row value has `elements_per_row = 1` float
- for a dense `float` vector field, `Type = float, is_type_entire_row = False`: each row has `elements_per_row = dim` floats
- for a sparse `float` vector field, `Type = SparseRow, is_type_entire_row = True`: each row has `elements_per_row = 1` `SparseRow`

and we no longer have the notion of `dim` in the base class `ConcurrentVectorImpl`.
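As a quick restatement of that rule (purely an illustrative Go analogue of the C++ template logic, not code from this PR):

```go
package sketch

// elementsForRow restates the sizing rule described above: when
// isTypeEntireRow is true a row is a single Type object; otherwise it is
// elementsPerRow Type objects (dim, for dense vectors).
func elementsForRow(isTypeEntireRow bool, elementsPerRow int) int {
	if isTypeEntireRow {
		return 1 // float scalar (1 float) or sparse vector (1 SparseRow)
	}
	return elementsPerRow // dense vector: dim floats per row
}
```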
internal/core/src/common/Utils.h
```cpp
// serialize a sparse row to bytes that will be written into binlog file.
inline std::vector<uint8_t>
SerializeSparseRow(const knowhere::sparse::SparseRow<float>& row) {
    std::vector<uint8_t> buffer(SparseRowSerializedSize(row));
```
For all in-memory formats, let's use `knowhere::sparse::SparseRow`.
Other than that, for all serdes, let's use protobuf.
Discussed offline: we'll use a densely packed `idx, val, idx, val, ...` byte array everywhere: in `schema.proto::SparseFloatArray`, in Golang/C++ memory, in persisted binlogs, in the placeholder of search requests, etc.
Places where we expect to see a different format:
- in SDKs and import utils we accept various formats as insert/search input, but we convert those before sending them to Milvus
- in RESTful endpoints we accept a string representation like {"indices": [102, 16, 18, ...], "values": [0.21, 0.11, 0.15, ...]}
the design lgtm
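For concreteness, a sketch of that densely packed layout: each element is a little-endian uint32 index immediately followed by a little-endian float32 value, concatenated with no header. The exact widths and endianness are assumptions consistent with the thread ("idx, val, idx, val, ..."); this mirrors the design decision, not a specific Milvus function.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
	"math"
)

// packSparseRow serializes parallel index/value slices into the densely
// packed (uint32 idx, float32 val) byte array described above.
func packSparseRow(indices []uint32, values []float32) ([]byte, error) {
	if len(indices) != len(values) {
		return nil, fmt.Errorf("indices and values must have equal length")
	}
	buf := make([]byte, 8*len(indices))
	for i := range indices {
		binary.LittleEndian.PutUint32(buf[i*8:], indices[i])
		binary.LittleEndian.PutUint32(buf[i*8+4:], math.Float32bits(values[i]))
	}
	return buf, nil
}
```

One nice property of this choice is that the same byte array can flow unchanged through the proto, the Go and C++ memory representations, the binlog, and search placeholders.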
@zhengbuqian is this ready to review and merge?
This commit adds sparse float vector support to segcore, including:
1. data type enum declarations
2. corresponding data structures for handling sparse float vectors in various scenarios, including:
   * FieldData as a bridge between the binlog and the in-memory data structures;
   * mmap::Column as the in-memory representation of a sparse float vector column of a sealed segment;
   * ConcurrentVector as the in-memory representation of a sparse float vector column of a growing segment, which supports inserts
3. logic in the payload reader/writer to serialize/deserialize from/to binlog
4. the ability for the index node to build sparse float vector indexes
5. the ability for the query node to build a growing index for growing segments and a temp index for sealed segments without a built index

This commit also includes some code cleanup, comment improvements, and unit tests for sparse vectors. Signed-off-by: Buqian Zheng <[email protected]>
… raw vector by id. Added lots of unit tests; converted many segcore tests into parameterized tests that work for both dense and sparse float vectors. Signed-off-by: Buqian Zheng <[email protected]>
… milvus components, including the proxy and data node to receive and write sparse float vectors to binlog, the query node to handle search requests, the index node to build indexes for sparse float columns, etc. Signed-off-by: Buqian Zheng <[email protected]>
Signed-off-by: Buqian Zheng <[email protected]>
… to add sparse vector support Signed-off-by: Buqian Zheng <[email protected]>
Closing this PR. The code in this PR has been merged via smaller PRs; see the linked issue.
issue: #29419
Inserting, loading, building an index, searching from an index, and GetVectorByIds are supported for now. Advanced features such as range search are not yet implemented.
DO NOT MERGE this PR. I'll split it into multiple smaller PRs for code review; this PR exists so reviewers can get the whole picture when reviewing the individual PRs.