feat: [Sparse Float Vector] segcore basics and index building #30357

Merged
merged 1 commit into from
Mar 11, 2024

zhengbuqian
Collaborator

This commit adds sparse float vector support to segcore with the following:

  1. Adds data type enum declarations.
  2. Adds corresponding data structures for handling sparse float vectors in various scenarios, including:
     • FieldData as a bridge between the binlog and the in-memory data structures;
     • mmap::Column as the in-memory representation of a sparse float vector column of a sealed segment;
     • ConcurrentVector as the in-memory representation of a sparse float vector of a growing segment, which supports inserts.
  3. Adds logic in the payload reader/writer to serialize to and deserialize from the binlog.
  4. Adds the ability for the index node to build a sparse float vector index.
  5. Adds the ability for the query node to build a growing index for a growing segment and a temp index for a sealed segment that has no index built.

This commit also includes some code cleanup, comment improvements, and some unit tests for sparse vectors.
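
For readers new to the feature, here is a rough sketch of what one sparse float vector row holds and how an inner product over two rows can be computed. `SparseRow`, `IsSortedByDim`, and `SparseInnerProduct` are hypothetical names for illustration only; they are not the segcore or knowhere types touched by this PR, and the sorted (dimension, value) layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical illustration type (not the segcore/knowhere row type):
// a sparse row keeps only its non-zero entries as (dimension, value)
// pairs, assumed to be sorted by strictly increasing dimension.
using SparseRow = std::vector<std::pair<uint32_t, float>>;

// Returns true when dimension ids are strictly increasing.
inline bool
IsSortedByDim(const SparseRow& row) {
    for (std::size_t i = 1; i < row.size(); ++i) {
        if (row[i - 1].first >= row[i].first) {
            return false;
        }
    }
    return true;
}

// Inner product of two sorted sparse rows using a two-pointer merge:
// only dimensions present in both rows contribute to the sum.
inline float
SparseInnerProduct(const SparseRow& a, const SparseRow& b) {
    assert(IsSortedByDim(a) && IsSortedByDim(b));
    float sum = 0.0f;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first == b[j].first) {
            sum += a[i].second * b[j].second;
            ++i;
            ++j;
        } else if (a[i].first < b[j].first) {
            ++i;
        } else {
            ++j;
        }
    }
    return sum;
}
```

Because rows differ in length, a column of such rows cannot be addressed with a fixed stride the way a dense vector column can, which is presumably why dedicated FieldData, mmap::Column, and ConcurrentVector handling is needed.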

#29419

@sre-ci-robot sre-ci-robot added size/XXL Denotes a PR that changes 1000+ lines. area/compilation labels Jan 30, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/feature Issues related to feature request from users labels Jan 30, 2024
Contributor

mergify bot commented Jan 30, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Jan 30, 2024

@zhengbuqian ut workflow job failed, comment rerun ut can trigger the job again.

Contributor

mergify bot commented Jan 31, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Feb 18, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Feb 20, 2024

@zhengbuqian ut workflow job failed, comment rerun ut can trigger the job again.


codecov bot commented Feb 20, 2024

Codecov Report

Attention: Patch coverage is 48.55072%, with 213 lines in your changes missing coverage. Please review.

Project coverage is 80.94%. Comparing base (7e17f24) to head (1abc16c).
Report is 18 commits behind head on master.

@@            Coverage Diff             @@
##           master   #30357      +/-   ##
==========================================
- Coverage   80.99%   80.94%   -0.05%     
==========================================
  Files         977      965      -12     
  Lines      141490   142009     +519     
==========================================
+ Hits       114594   114948     +354     
- Misses      23045    23209     +164     
- Partials     3851     3852       +1     
| Files | Coverage Δ |
|---|---|
| internal/core/src/common/FieldData.cpp | 72.64% <100.00%> (+2.53%) ⬆️ |
| internal/core/src/common/FieldData.h | 100.00% <100.00%> (ø) |
| internal/core/src/common/Schema.cpp | 74.28% <100.00%> (+5.71%) ⬆️ |
| internal/core/src/common/Schema.h | 94.11% <ø> (-0.17%) ⬇️ |
| internal/core/src/common/Span.h | 100.00% <ø> (ø) |
| internal/core/src/index/IndexFactory.cpp | 77.18% <100.00%> (+3.43%) ⬆️ |
| internal/core/src/indexbuilder/IndexFactory.h | 43.47% <ø> (-2.36%) ⬇️ |
| internal/core/src/segcore/SegmentGrowing.h | 100.00% <ø> (ø) |
| internal/core/src/segcore/SegmentGrowingImpl.cpp | 73.03% <100.00%> (+0.28%) ⬆️ |
| internal/core/src/segcore/SegmentGrowingImpl.h | 81.03% <ø> (+0.67%) ⬆️ |
| ... and 24 more | |

... and 222 files with indirect coverage changes

Contributor

mergify bot commented Feb 28, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Contributor

mergify bot commented Feb 29, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@zhengbuqian zhengbuqian force-pushed the sparse-v7-pr1-official branch 2 times, most recently from 53ec559 to 2c23d04 Compare March 8, 2024 04:38
Contributor

mergify bot commented Mar 8, 2024

@zhengbuqian E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@zhengbuqian
Collaborator Author

/run-cpu-e2e

@@ -68,6 +68,10 @@ unsupported_index_combinations() {
static std::vector<std::tuple<IndexType, MetricType>> ret{
std::make_tuple(knowhere::IndexEnum::INDEX_FAISS_BIN_IVFFLAT,
knowhere::metric::L2),
std::make_tuple(knowhere::IndexEnum::INDEX_SPARSE_INVERTED_INDEX,

Collaborator

Don't quite understand.
Do we support cosine metrics?

Collaborator Author

No, IP is the only supported metric. Updated.

Collaborator

And the bin IVF index doesn't support IP or cosine either.

I think this check is meaningless. What about INDEX_SPARSE_INVERTED_INDEX plus Hamming?

Collaborator Author

Agreed. This check has never been complete, and we have always relied on checks elsewhere to guarantee the precondition. Since this check originally had only one pair, I'll see if I can remove it in a later PR.

The sparse index has no Hamming distance support.
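
One way to sidestep the incompleteness discussed in this thread is an allow-list, so that any pair not listed is rejected by default. The sketch below is only an illustration under assumptions: `IndexMetricPair`, `kSupportedCombinations`, and `IsSupportedCombination` are hypothetical names, the strings stand in for the knowhere constants, and the binary IVF entries are assumed for the example. Only the sparse entry (IP as the sole metric) comes from this conversation. It requires C++17.

```cpp
#include <array>
#include <string_view>

// Hypothetical pair type; the strings are illustrative stand-ins for the
// knowhere index and metric constants, not the constants themselves.
struct IndexMetricPair {
    std::string_view index_type;
    std::string_view metric;
};

// Allow-list of (index, metric) pairs. Anything absent is unsupported,
// so a forgotten entry fails closed instead of slipping through.
inline constexpr std::array<IndexMetricPair, 3> kSupportedCombinations{{
    // From the thread above: IP is the only metric for the sparse index.
    {"SPARSE_INVERTED_INDEX", "IP"},
    // Assumed for illustration: binary IVF accepts binary metrics only.
    {"BIN_IVF_FLAT", "HAMMING"},
    {"BIN_IVF_FLAT", "JACCARD"},
}};

constexpr bool
IsSupportedCombination(std::string_view index_type, std::string_view metric) {
    for (const auto& pair : kSupportedCombinations) {
        if (pair.index_type == index_type && pair.metric == metric) {
            return true;
        }
    }
    return false;
}

// The two cases raised in the review, checked at compile time.
static_assert(IsSupportedCombination("SPARSE_INVERTED_INDEX", "IP"));
static_assert(!IsSupportedCombination("SPARSE_INVERTED_INDEX", "HAMMING"));
```

With a deny-list such as `unsupported_index_combinations()`, every invalid pair has to be enumerated; with an allow-list, only the valid ones do.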

auto buf = std::shared_ptr<uint8_t[]>(new uint8_t[total_size]);
int64_t offset = 0;
for (auto data : field_datas) {
std::memcpy(buf.get() + offset, data->Data(), data->Size());

Collaborator

We can actually avoid this copy by introducing a wrapper?
This may save a lot of memory.
Leave it for a later optimization.

Collaborator Author

That is very true. Currently we copy for both dense and sparse. I tried changing this, but it requires major refactoring to let the field data give up ownership of its underlying data. Such a change would affect a lot of code (mostly for the better) and is well beyond the scope of this PR, so I am keeping it as is.

I am adding a TODO.
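
For context, the kind of wrapper suggested above could look roughly like the following: it shares ownership of the original chunks and presents them as one logical byte range, so no concatenated buffer is allocated. `ChunkedBytes` and its methods are hypothetical and not part of this PR. Whether the consumer of `buf` can work with non-contiguous chunks is not settled in this thread, so treat this as a sketch of the idea only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical wrapper: shares ownership of the source chunks instead of
// memcpy-ing them into a single contiguous buffer.
class ChunkedBytes {
 public:
    using Chunk = std::shared_ptr<const std::vector<uint8_t>>;

    // Registers a chunk without copying its bytes.
    void
    Append(Chunk chunk) {
        assert(chunk != nullptr);
        starts_.push_back(total_size_);
        total_size_ += chunk->size();
        chunks_.push_back(std::move(chunk));
    }

    std::size_t
    Size() const {
        return total_size_;
    }

    // Byte at a logical offset. Binary search finds the last chunk whose
    // start offset is <= offset; empty chunks are skipped naturally.
    uint8_t
    At(std::size_t offset) const {
        assert(offset < total_size_);
        auto it = std::upper_bound(starts_.begin(), starts_.end(), offset);
        auto index = static_cast<std::size_t>(it - starts_.begin()) - 1;
        return (*chunks_[index])[offset - starts_[index]];
    }

 private:
    std::vector<Chunk> chunks_;
    std::vector<std::size_t> starts_;  // logical start offset of each chunk
    std::size_t total_size_ = 0;
};
```

The trade-off is that reads go through an offset lookup, and every chunk stays alive for as long as the wrapper does.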

// TODO(SPARSE): this is for mmap to write data to disk so that
// the file can be mmaped into memory.
throw std::runtime_error(
"WriteFieldData for VECTOR_SPARSE_FLOAT not implemented");

Collaborator

This means mmap is not supported for sparse yet?

Collaborator Author

yes, no mmap support for sparse yet.

auto suc = insert_data->ParseFromArray(data_info, data_info_len);
auto insert_record_proto =
std::make_unique<milvus::InsertRecordProto>();
auto suc =

Collaborator

This might introduce too much copying here,
but it is fine for now.

Collaborator Author

This is only a naming change, from using InsertData = proto::segcore::InsertRecord; to using InsertRecordProto = proto::segcore::InsertRecord;. The name InsertData is overloaded for several purposes and is ambiguous.

@xiaofan-luan
Collaborator

/ltgm
/approve

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xiaofan-luan, zhengbuqian

@xiaofan-luan
Collaborator

Except for not supporting mmap for now, the rest of the pr looks good to me.
Let's ship it and avoid unnecessary blocking.

Signed-off-by: Buqian Zheng <[email protected]>
@zhengbuqian zhengbuqian force-pushed the sparse-v7-pr1-official branch from 2c23d04 to 1abc16c Compare March 10, 2024 06:32
@czs007 czs007 added ci-passed manual-pass manually set pass before ci-passed labeled labels Mar 11, 2024
@czs007
Collaborator

czs007 commented Mar 11, 2024

/lgtm

@sre-ci-robot sre-ci-robot merged commit 070dfc7 into milvus-io:master Mar 11, 2024
13 of 14 checks passed
@zhengbuqian zhengbuqian deleted the sparse-v7-pr1-official branch March 11, 2024 06:56
Labels
approved area/compilation area/dependency Pull requests that update a dependency file ci-passed dco-passed DCO check passed. kind/feature Issues related to feature request from users lgtm manual-pass manually set pass before ci-passed labeled size/XXL Denotes a PR that changes 1000+ lines.
4 participants