<MilvusException: (code=65535, message=empty sparse float vector row)> #32972
Comments
@shilei4260 which version of Milvus are you running?
What model are you using?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I met a similar error. I use the BM25 embedding function and call `encode_queries`: `sparse_embeddings = self.bm25_ef.encode_queries([rewritten_query])`, but the returned sparse embedding is empty. Why is that? The `bm25_ef` is defined as `def bm25_ef(self):`
urgent |
@xxxfzxxx I'm checking this issue.
@xxxfzxxx your observation is correct.

```python
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction
import jieba

analyzer = build_default_analyzer(language="zh")
corpus = [
    "在登记册上所有的图片尺寸需要保持一致"
]
# the analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print(analyzer.tokenizer.__dict__)
print("tokens:", tokens)
```

```
tokens: ['登记册', '上', '图片尺寸', '保持一致']
```

The popular Chinese tokenizer project jieba used by this implementation will not split '图片尺寸' into two words. However, jieba lets users adjust its vocabulary. With a user dictionary file `custom.txt` containing:

```
图片 10000
尺寸 10000
```

```python
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction
import jieba

jieba.load_userdict("./custom.txt")
analyzer = build_default_analyzer(language="zh")
corpus = [
    "在登记册上所有的图片尺寸需要保持一致"
]
# the analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print(analyzer.tokenizer.__dict__)
print("tokens:", tokens)
```

```
tokens: ['登记册', '上', '图片', '尺寸', '保持一致']
```
Adjusting the jieba vocabulary cannot handle all corner cases. At least we could have a naive character-level method:

```python
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction

class SimpleChineseTokenizer:
    def __init__(self):
        pass

    def tokenize(self, text: str):
        # split the text into individual characters
        return list(text)

analyzer = build_default_analyzer(language="zh")
analyzer.tokenizer = SimpleChineseTokenizer()
corpus = [
    "在登记册上所有的图片尺寸需要保持一致"
]
# the analyzer can tokenize the text into tokens
tokens = analyzer(corpus[0])
print(analyzer.tokenizer.__dict__)
print("tokens:", tokens)
```

```
tokens: ['登', '记', '册', '上', '图', '片', '尺', '寸', '需', '保', '持', '致']
```
I wonder how the Milvus built-in BM25EmbeddingFunction embeds an unseen word in the query. From my observation, it gives nothing (None). What is the best solution if the tokens in the query do not occur in the previously fitted BM25 token dict?
@xxxfzxxx, BM25 in this implementation calculates statistics (term frequencies, IDFs) over the tokenized words of the documents. If a word tokenized from the query was not seen in the documents, it contributes nothing to the relevance score.
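The behavior described above can be sketched in plain Python. This is a deliberately simplified model, not the pymilvus implementation; `fit_idf` and `score` are hypothetical helpers that ignore BM25's length normalization:

```python
import math

def fit_idf(tokenized_corpus):
    # "Fit" step: count document frequencies, then compute the classic
    # BM25 idf. Tokens never seen in the corpus get no idf entry at all.
    n = len(tokenized_corpus)
    df = {}
    for doc in tokenized_corpus:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return {t: math.log((n - d + 0.5) / (d + 0.5) + 1) for t, d in df.items()}

def score(query_tokens, doc_tokens, idf):
    # Simplified relevance: idf * term frequency. An unseen query token
    # falls back to idf 0.0 and adds nothing to the score.
    return sum(idf.get(t, 0.0) * doc_tokens.count(t) for t in query_tokens)

corpus = [["图片", "尺寸"], ["登记册", "图片"]]
idf = fit_idf(corpus)
print(score(["图片"], corpus[0], idf))  # known token: positive score
print(score(["手机"], corpus[0], idf))  # unseen token: 0.0
```

This is why a query made up entirely of out-of-vocabulary tokens ends up as an all-zero sparse vector.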
Do you mean you get a zero-size sparse embedding, or a sparse embedding of all zeros (with size equal to your `len(idf)`)?
If the corpus doesn't have this word, you will get 0 in that dimension.
Yes, I printed the "图片尺寸" sparse embedding and it output nothing. It should give me a CSR matrix, right?
Please show me your full code.
```python
dense_embeddings = [self.bgem3_model.get_embedding([query])[0]['dense_vecs']]
```
I wonder how you get the
My bad, I checked the type of the sparse_embeddings. Then how do I search with it? Can you tell me how to update my hybrid search?
```
raise MilvusException(status.code, status.reason, status.error_code)
```
@xxxfzxxx Your sparse embeddings seem to have zero length. Use the following code to verify this:

```python
print(sparse_embeddings.toarray().shape)
```

I think it will be a 0-length sparse embedding. Then you need to verify your BM25 IDF by:

```python
print('elements in idf:', len(bm25_ef.idf))
```

It shouldn't be empty if you have fitted your corpus.
(1, 18722) |
Note that the sparse_vector schema is:

```python
FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR)
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
```
My query's sparse embeddings are not zero length; it is actually an all-zero CSR matrix.
Milvus requires the number of non-zeros (nnz) in a sparse embedding (both doc and query) to be greater than 0. Users need to check the nnz of every row of the sparse embeddings before inserting/searching. When it equals zero, you need to fall back on dense retrieval.

```python
sparse_embeddings.nnz     # nnz over all rows, if sparse_embeddings contains multiple rows
sparse_embeddings[0].nnz  # nnz of the first row of sparse_embeddings
```

The reason behind this is that since IP is the only available distance metric, an embedding with zero non-zero values has an IP distance of 0 to every other embedding, so no distance judgement can be made.
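A guard along these lines can be sketched in plain Python, representing each sparse row as an `{index: value}` dict. The helpers `row_nnz` and `choose_search_mode` are hypothetical, not part of the pymilvus API:

```python
def row_nnz(sparse_row):
    """Count stored non-zero entries in an {index: value} sparse row."""
    return sum(1 for v in sparse_row.values() if v != 0)

def choose_search_mode(query_sparse, query_dense):
    # Fall back to dense-only retrieval when the sparse query embedding
    # is empty, since Milvus rejects all-zero sparse vectors under IP.
    if row_nnz(query_sparse) == 0:
        return ("dense", query_dense)
    return ("hybrid", (query_sparse, query_dense))

print(choose_search_mode({}, [0.1, 0.2])[0])        # dense
print(choose_search_mode({5: 0.7}, [0.1, 0.2])[0])  # hybrid
```

With scipy CSR matrices, the same check is the `nnz` attribute shown above.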
It seems that for the BM25EmbeddingFunction, there is a risk of generating an all-zero query sparse embedding, which is not supported by Milvus. |
I saw that https://github.com/milvus-io/milvus-model/blob/main/milvus_model/sparse/bm25/bm25.py line 194 has a JSON file to download (https://github.com/milvus-io/pymilvus-assets/releases/download/v0.1-bm25v1/bm25_msmarco_v1.json), but I cannot find it anywhere. Can you provide a Chinese version?
It will download this file to the directory where you executed the code. Currently I have only fitted the BM25EmbeddingFunction on the MS MARCO dataset for English. If you fit it on your own dataset, you will get better results. If you want a pretrained sparse embedding function for Chinese, I strongly recommend testing BGE-M3: https://milvus.io/docs/embed-with-bgm-m3.md.
Can this PR solve the problem? |
I think an empty sparse float vector is a good signal; usually it means your corpus didn't fit the training dataset at all.
I am using BM25, which has the advantage of low training cost. However, it inevitably leads to cases where a user's query terms are not present in the corpus, resulting in an empty sparse query vector. In such cases of hybrid retrieval, sparse retrieval should return no results, and the hybrid result should be the result of dense retrieval alone, which would be better than throwing an error.
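The fallback proposed here composes naturally with reciprocal rank fusion, the scheme behind Milvus's RRFRanker. A sketch (`rrf_fuse` is a hypothetical helper, not a pymilvus function): an empty sparse result list simply contributes nothing, so the hybrid result degrades gracefully to the dense ranking:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal rank fusion: each ranked list contributes 1/(k + rank)
    # per document; an empty list (e.g. a failed sparse retrieval)
    # adds no score at all.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc3", "doc1", "doc7"]
sparse_hits = []  # empty sparse query vector -> no sparse results
print(rrf_fuse([dense_hits, sparse_hits]))  # ['doc3', 'doc1', 'doc7']
```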
Makes sense to me @zhengbuqian
Yes. That is the expected behavior in Milvus with #34700, where empty sparse vectors are allowed. Currently, if users insert empty sparse vectors using a Python list
Is there an existing issue for this?
Environment
Current Behavior
Error raised when using sparse and dense vectors: https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response