<MilvusException: (code=65535, message=empty sparse float vector row)> #32972

shilei4260 · 2024-05-11T05:23:03Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version:
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

This error occurs when using sparse and dense vectors with the example at https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

yanliang567 · 2024-05-11T06:31:07Z

@shilei4260 Which version of Milvus are you running?
Please provide the Milvus logs for investigation, thanks.
/assign @shilei4260
/unassign

xiaofan-luan · 2024-05-11T07:22:13Z

Which model are you using: random or M3?

stale · 2024-06-11T03:07:50Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

xxxfzxxx · 2024-06-26T08:33:08Z

Hi, I hit a similar error. I use the BM25 embedding function and call encode_queries:

sparse_embeddings = self.bm25_ef.encode_queries([rewritten_query])

but sparse_embeddings comes back empty. Why is that? bm25_ef is defined as:

def bm25_ef(self):
    bm = BM25EmbeddingFunction(build_default_analyzer(language="zh"))
    bm.load("bm25_params.json")
    return bm

Note that my input query is "图片尺寸". I think the BM25 tokenizer, i.e. the default analyzer, should split it into "图片" and "尺寸". I can find the escaped codes for "图片" and "尺寸" in bm25_params.json. I think the problem is that the default analyzer does not tokenize my query.

xxxfzxxx · 2024-06-26T08:50:25Z

urgent

wxywb · 2024-06-26T10:35:08Z

@xxxfzxxx I'm checking this issue.

wxywb · 2024-06-26T11:05:12Z

@xxxfzxxx your observation is correct.

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction
import jieba

analyzer = build_default_analyzer(language="zh")

corpus = [
   "在登记册上所有的图片尺寸需要保持一致"
]

# analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print(analyzer.tokenizer.__dict__)
print("tokens:", tokens)

tokens: ['登记册', '上', '图片尺寸', '保持一致']

jieba, the popular Chinese tokenizer used by this implementation, does not split '图片尺寸' into two words. However, jieba lets users adjust its vocabulary.
You can create a new file called custom.txt with the entries below and load it with jieba.load_userdict before building the analyzer:

图片 10000
尺寸 10000

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction
import jieba

jieba.load_userdict("./custom.txt")
analyzer = build_default_analyzer(language="zh")

corpus = [
   "在登记册上所有的图片尺寸需要保持一致"
]

# analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print(analyzer.tokenizer.__dict__)
print("tokens:", tokens)

tokens: ['登记册', '上', '图片', '尺寸', '保持一致']

wxywb · 2024-06-26T11:13:14Z

Adjusting the jieba vocabulary cannot handle all corner cases. At least we could have a naive method that splits the text into single characters.

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction

class SimpleChineseTokenizer():
    def __init__(self):
        pass

    def tokenize(self, text: str):
        return list(text)


analyzer = build_default_analyzer(language="zh")
analyzer.tokenizer = SimpleChineseTokenizer()

corpus = [
   "在登记册上所有的图片尺寸需要保持一致"
]

# analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print(analyzer.tokenizer.__dict__)
print("tokens:", tokens)

tokens: ['登', '记', '册', '上', '图', '片', '尺', '寸', '需', '保', '持', '致']

xxxfzxxx · 2024-06-26T11:20:43Z

I wonder how the Milvus built-in BM25EmbeddingFunction embeds an unseen word in the query. From my observation, it gives nothing (None). What is the best solution if the tokens in the query do not occur in the previously fitted BM25 token dictionary?

wxywb · 2024-06-26T11:41:20Z

@xxxfzxxx, BM25 in this implementation computes its statistics (term frequencies, IDFs) over the tokenized words in the documents. If a word tokenized from the query was not seen in the documents, it contributes nothing to the relevance score.
If you have such concerns, I think the best strategy is to tokenize Chinese sentences into single characters. For English, you need to tokenize into subwords (like GPT's BPE tokens).
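To make the point about unseen words concrete, here is a minimal sketch of a BM25-style query encoder. It is not the milvus-model implementation: the `fit_idf` and `encode_query` helpers and the simplified IDF formula are assumptions made for illustration. It only shows that a query token missing from the fitted vocabulary yields no entry at all.

```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Return {term: idf} over a tokenized corpus (simplified BM25-style idf)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for term, df in doc_freq.items()
    }


def encode_query(tokens, idf):
    """Map query tokens to {term: weight}; terms unseen at fit time are dropped."""
    return {term: idf[term] for term in set(tokens) if term in idf}


# Hypothetical tokenized corpus in which "图片" and "尺寸" are separate tokens.
docs = [["登记册", "上", "图片", "尺寸", "保持一致"], ["图片", "格式"]]
idf = fit_idf(docs)

seen = encode_query(["图片", "尺寸"], idf)   # both tokens are in the vocabulary
unseen = encode_query(["图片尺寸"], idf)     # the unsplit token was never fitted

print(seen)
print(unseen)
```

With the unsplit token the encoded query is empty, which is the situation that later triggers the "empty sparse row" error on search.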

wxywb · 2024-06-26T12:59:42Z

Do you mean you get a zero-size sparse embedding, or a sparse embedding of all zeros (with size equal to len(idf))?

xiaofan-luan · 2024-06-26T13:36:13Z

If the corpus doesn't contain this word, you will get 0 in this dimension,
because no document in the corpus will match this word.

xxxfzxxx · 2024-06-27T02:55:22Z

@wxywb Regarding your question about whether I get a zero-size sparse embedding or an all-zero one: yes, when I print the sparse embedding for "图片尺寸", it outputs nothing. It should give me a csr matrix, right?

wxywb · 2024-06-27T02:57:08Z

please show me your full code

xxxfzxxx · 2024-06-27T04:48:26Z

from pymilvus import AnnSearchRequest, Collection, WeightedRanker

dense_embeddings = [self.bgem3_model.get_embedding([query])[0]['dense_vecs']]
rewritten_query = self.get_query_rewrite(query)
sparse_embeddings = self.bm25_ef.encode_queries([rewritten_query])
col = Collection(name=collection_name)
col.load()
search_param_dense = {
    "data": dense_embeddings,
    "anns_field": "dense_vector",
    "param": {
        "metric_type": "COSINE",
        "params": {"nprobe": 10},
    },
    "limit": 100,
}
search_param_sparse = {
    "data": sparse_embeddings,
    "anns_field": "sparse_vector",
    "param": {
        "metric_type": "IP",
        "params": {"nprobe": 10},
    },
    "limit": 100,  # TODO
}
request_dense = AnnSearchRequest(**search_param_dense)
request_sparse = AnnSearchRequest(**search_param_sparse)

reqs = [request_dense, request_sparse]
weighted_rerank = WeightedRanker(dense_weight, 1 - dense_weight)

res = col.hybrid_search(
    reqs,
    weighted_rerank,
    limit=retrieved_cnt,
    output_fields=['doc_id', 'text', 'metadata'],
)

wxywb · 2024-06-27T05:28:18Z

I wonder how you got a None sparse embedding.
https://github.com/milvus-io/milvus-model/blob/d812c9a84f2c530919ddffec8bf4024cce841e6b/milvus_model/sparse/bm25/bm25.py#L130
You get a csr_array even if self.idf is empty.

xxxfzxxx · 2024-06-27T06:07:38Z

My bad. I checked the type of sparse_embeddings with
print(">>>>", type(sparse_embeddings), sparse_embeddings)
and the output is >>>> <class 'scipy.sparse._csr.csr_matrix'>, meaning that the sparse embedding is a csr_matrix. Since all values in the matrix are zeros, it does not print anything.

Then how do I search with it? Can you tell me how to update my hybrid search? The code and the error are:
search_param_dense = {
    "data": dense_embeddings,
    "anns_field": "dense_vector",
    "param": {
        "metric_type": "COSINE",
        "params": {"nprobe": 10},
    },
    "limit": 100,
}
search_param_sparse = {
    "data": sparse_embeddings,
    "anns_field": "sparse_vector",
    "param": {
        "metric_type": "IP",
        "params": {"nprobe": 10},
    },
    "limit": 100,  # TODO
}
request_dense = AnnSearchRequest(**search_param_dense)
request_sparse = AnnSearchRequest(**search_param_sparse)

reqs = [request_dense, request_sparse]
weighted_rerank = WeightedRanker(dense_weight, 1 - dense_weight)

res = col.hybrid_search(
    reqs,
    weighted_rerank,
    limit=retrieved_cnt,
    output_fields=['doc_id', 'text', 'metadata'],
)

raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=fail to search on QueryNode 33: worker(33) query failed: Assert "size > 0" at /go/src/github.com/milvus-io/milvus/internal/core/src/common/Utils.h:227
=> Sparse row data should not be empty)>

wxywb · 2024-06-27T06:56:42Z

@xxxfzxxx Your sparse embeddings seem to have zero length. Use the following code to verify this:

print(sparse_embeddings.toarray().shape)

I expect it to be a 0-length sparse embedding. If so, verify your BM25 idf with:

print('elements in idf:', len(bm25_ef.idf))

It shouldn't be empty if you have fitted the function on your corpus.

xxxfzxxx · 2024-06-27T07:27:39Z

(1, 18722)
elements in idf: 18722

xxxfzxxx · 2024-06-27T07:28:46Z

Note that the sparse_vector schema is FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR)

sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}

xxxfzxxx · 2024-06-27T07:48:19Z

My query's sparse embeddings are not zero-length. They are actually an all-zero csr matrix.

wxywb · 2024-06-27T08:26:31Z

Milvus requires the number of non-zeros (nnz) in every sparse embedding (both documents and queries) to be greater than 0. Users need to check the nnz of every row of the sparse embeddings before inserting or searching. When it equals zero for a query, you need to fall back on dense retrieval.

sparse_embeddings.nnz  # total nnz across all rows if sparse_embeddings contains multiple rows
sparse_embeddings[0].nnz  # nnz of the first row of sparse_embeddings

The reason is that, as IP is the only available distance metric, an embedding with no non-zero values has an IP of 0 with every other embedding, so no distance judgement can be made.
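The per-row check described above can be sketched with SciPy as follows. The `split_rows_by_nnz` helper and the sample matrix are hypothetical and only illustrate the idea. Because `getnnz` counts stored entries, the sketch first drops explicitly stored zeros from a copy so that they are not mistaken for non-zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix


def split_rows_by_nnz(sparse_embeddings):
    """Return (usable_rows, empty_rows) as lists of row indices."""
    cleaned = csr_matrix(sparse_embeddings, copy=True)
    cleaned.eliminate_zeros()  # stored zeros must not count as non-zeros
    per_row_nnz = cleaned.getnnz(axis=1)
    usable = [int(i) for i in np.flatnonzero(per_row_nnz > 0)]
    empty = [int(i) for i in np.flatnonzero(per_row_nnz == 0)]
    return usable, empty


# Three query embeddings over a 6-term vocabulary; row 1 has no non-zeros.
queries = csr_matrix(
    ([0.7, 1.2, 0.4], ([0, 0, 2], [1, 4, 3])),
    shape=(3, 6),
)

usable_rows, empty_rows = split_rows_by_nnz(queries)
print(queries.shape)
print(usable_rows, empty_rows)
```

Note that the shape stays (3, 6) even though row 1 is empty, which is why checking `toarray().shape` alone does not reveal the problem.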

wxywb · 2024-06-27T08:29:38Z

It seems that for the BM25EmbeddingFunction, there is a risk of generating an all-zero query sparse embedding, which is not supported by Milvus.

xxxfzxxx · 2024-07-01T11:44:04Z

I saw that line 194 of https://github.com/milvus-io/milvus-model/blob/main/milvus_model/sparse/bm25/bm25.py downloads a JSON file (https://github.com/milvus-io/pymilvus-assets/releases/download/v0.1-bm25v1/bm25_msmarco_v1.json), but I cannot find the downloaded file anywhere. Can you provide a Chinese version?

wxywb · 2024-07-01T15:17:17Z

The file is downloaded to the directory where you execute the code. Currently I have only fitted the BM25EmbeddingFunction on the MS MARCO dataset, for English. If you can fit it on your own dataset, you will get better results. If you want a pretrained sparse embedding function for Chinese, I strongly recommend testing this: https://milvus.io/docs/embed-with-bgm-m3.md.

stale · 2024-08-04T08:06:50Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

wangyiran33 · 2024-09-03T02:49:47Z

Can this PR solve the problem?

xiaofan-luan · 2024-09-04T00:29:51Z

I think an empty sparse float vector is a good signal, usually it means your corpus didn't fit the training dataset at all.
You should think of using another model like splade or m3

wangyiran33 · 2024-09-05T09:05:05Z

I am using BM25, which has the advantage of low training costs. However, it inevitably leads to cases where user queries are not present in the corpus, resulting in empty sparse vector queries. In such cases of hybrid retrieval, sparse retrieval should return no results, while the hybrid result should be the result of dense retrieval, which may be better than throwing an error.
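Until the server accepts empty sparse queries, the behaviour proposed here can be approximated on the client side. The sketch below is only an assumption about how one might structure it: `search_with_fallback` and the two callables are hypothetical names, and in real code they would wrap `col.hybrid_search(...)` and a dense-only `col.search(...)` respectively.

```python
from typing import Callable, List


def search_with_fallback(
    sparse_nnz: int,
    hybrid_search: Callable[[], List[str]],
    dense_search: Callable[[], List[str]],
) -> List[str]:
    """Run hybrid search only when the sparse query has non-zero entries."""
    if sparse_nnz > 0:
        return hybrid_search()
    # An all-zero sparse query cannot be scored, so use dense results alone.
    return dense_search()


# Stub searchers standing in for the real Milvus calls.
def fake_hybrid() -> List[str]:
    return ["doc-hybrid-1", "doc-hybrid-2"]


def fake_dense() -> List[str]:
    return ["doc-dense-1"]


with_terms = search_with_fallback(3, fake_hybrid, fake_dense)
without_terms = search_with_fallback(0, fake_hybrid, fake_dense)
print(with_terms)
print(without_terms)
```

Passing callables keeps the decision logic testable without a running Milvus instance.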

xiaofan-luan · 2024-09-05T23:36:41Z

make sense to me @zhengbuqian
what do you think?

zhengbuqian · 2024-09-06T09:30:07Z

Yes. That is the expected behavior in Milvus with #34700, where empty sparse vectors are allowed.

Currently, if users insert empty sparse vectors as a Python list such as sparse_vecs = [{}, {}], the PyMilvus SDK does not recognize sparse_vecs as sparse vectors. Converting it to a csr_matrix should solve the issue. We are working on a fix.
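A minimal sketch of the conversion mentioned above, assuming each sparse vector is a dict of {dimension index: value}. The `dicts_to_csr` helper and the `dim` argument are hypothetical; an explicit dimension is needed because empty dicts carry no shape information. Whether the resulting matrix is accepted on insert still depends on the Milvus and PyMilvus versions in use.

```python
import numpy as np
from scipy.sparse import csr_matrix


def dicts_to_csr(sparse_vecs, dim):
    """Build a (len(sparse_vecs), dim) csr_matrix from {index: value} dicts."""
    rows, cols, data = [], [], []
    for row_idx, vec in enumerate(sparse_vecs):
        for col_idx, value in vec.items():
            rows.append(row_idx)
            cols.append(col_idx)
            data.append(value)
    return csr_matrix(
        (
            np.array(data, dtype=np.float32),
            (np.array(rows, dtype=np.int64), np.array(cols, dtype=np.int64)),
        ),
        shape=(len(sparse_vecs), dim),
    )


all_empty = dicts_to_csr([{}, {}], dim=4)
mixed = dicts_to_csr([{1: 0.5}, {}], dim=4)

print(all_empty.shape, all_empty.nnz)
print(mixed.getnnz(axis=1).tolist())
```

The all-empty input still produces a matrix with two rows, so the number of vectors is preserved even when none of them has a non-zero entry.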

shilei4260 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 11, 2024

shilei4260 assigned yanliang567 May 11, 2024

sre-ci-robot assigned shilei4260 and unassigned yanliang567 May 11, 2024

yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 11, 2024

stale bot added the stale indicates no udpates for 30 days label Jun 11, 2024

stale bot removed the stale indicates no udpates for 30 days label Jun 26, 2024

stale bot added the stale indicates no udpates for 30 days label Aug 4, 2024

stale bot closed this as completed Aug 11, 2024
