-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: [Sparse Float Vector] segcore basics and index building #30357
feat: [Sparse Float Vector] segcore basics and index building #30357
Conversation
@zhengbuqian E2e jenkins job failed, comment |
@zhengbuqian ut workflow job failed, comment |
@zhengbuqian E2e jenkins job failed, comment |
e706b12
to
049e907
Compare
049e907
to
10149dd
Compare
@zhengbuqian E2e jenkins job failed, comment |
10149dd
to
c5101b4
Compare
@zhengbuqian ut workflow job failed, comment |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #30357 +/- ##
==========================================
- Coverage 80.99% 80.94% -0.05%
==========================================
Files 977 965 -12
Lines 141490 142009 +519
==========================================
+ Hits 114594 114948 +354
- Misses 23045 23209 +164
- Partials 3851 3852 +1
|
c5101b4
to
f9bf5e4
Compare
@zhengbuqian E2e jenkins job failed, comment |
f9bf5e4
to
b64864d
Compare
@zhengbuqian E2e jenkins job failed, comment |
b64864d
to
43c15e0
Compare
53ec559
to
2c23d04
Compare
@zhengbuqian E2e jenkins job failed, comment |
/run-cpu-e2e |
@@ -68,6 +68,10 @@ unsupported_index_combinations() { | |||
static std::vector<std::tuple<IndexType, MetricType>> ret{ | |||
std::make_tuple(knowhere::IndexEnum::INDEX_FAISS_BIN_IVFFLAT, | |||
knowhere::metric::L2), | |||
std::make_tuple(knowhere::IndexEnum::INDEX_SPARSE_INVERTED_INDEX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't quite understand.
Do we support cosine metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, IP is the only supported metric. Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and bin ivf index don't support IP, cosine as well.
I think this check is meaningless.... what about INDEX_SPARSE_INVERTED_INDEX plus hamming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed.. this check is never completed since the beginning and we always rely on checks elsewhere to guarantee the precondition. Since this check originally has only 1 pair, I'll see if I can remove it in a later PR.
sparse index has no hamming distance support.
auto buf = std::shared_ptr<uint8_t[]>(new uint8_t[total_size]); | ||
int64_t offset = 0; | ||
for (auto data : field_datas) { | ||
std::memcpy(buf.get() + offset, data->Data(), data->Size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can atually avoid this copy by introducing a wrapper?
This may save tons of memories.
Leave it for optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is very true. Currently we copy for both dense and sparse. I tried updating this but it requires some major refactoring to allow the field data to give up ownership of its underlying data. such a change will affect lots of code(most of them should be beneficial) and is way beyond the scope of this PR so I just keep it as is.
I am adding a TODO.
// TODO(SPARSE): this is for mmap to write data to disk so that | ||
// the file can be mmaped into memory. | ||
throw std::runtime_error( | ||
"WriteFieldData for VECTOR_SPARSE_FLOAT not implemented"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means mmap is not supported for mmap yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, no mmap support for sparse yet.
auto suc = insert_data->ParseFromArray(data_info, data_info_len); | ||
auto insert_record_proto = | ||
std::make_unique<milvus::InsertRecordProto>(); | ||
auto suc = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might introduce too much copy here.
but so far is good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only a naming change from using InsertData = proto::segcore::InsertRecord;
to using InsertRecordProto = proto::segcore::InsertRecord;
. the name InsertData
is overloaded for several purposes and ambiguous.
/ltgm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: xiaofan-luan, zhengbuqian The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Except for not supporting mmap for now, the rest of the pr looks good to me. |
This commit adds sparse float vector support to segcore with the following: 1. data type enum declarations 2. Adds corresponding data structures for handling sparse float vectors in various scenarios, including: * FieldData as a bridge between the binlog and the in memory data structures * mmap::Column as the in memory representation of a sparse float vector column of a sealed segment; * ConcurrentVector as the in memory representation of a sparse float vector of a growing segment which supports inserts. 3. Adds logic in payload reader/writer to serialize/deserialize from/to binlog 4. Adds the ability to allow the index node to build sparse float vector index 5. Adds the ability to allow the query node to build growing index for growing segment and temp index for sealed segment without index built This commit also includes some code cleanness, comment improvement, and some unit tests for sparse vector. Signed-off-by: Buqian Zheng <[email protected]>
2c23d04
to
1abc16c
Compare
/lgtm |
This commit adds sparse float vector support to segcore with the following:
This commit also includes some code cleanness, comment improvement, and some unit tests for sparse vector.
#29419