When using upsert to add entities, the number of entities retrieved from query() gives all the entities and doesn't consider overwrites #249
Comments
So this is using Milvus Lite? Could you offer a script to reproduce this issue?
@xiaofan-luan The results of the query interface are deduplicated, whereas count(*) returns the segcore result directly (current version 2.4.1).
Did you insert only one entity?
There is a minor change to the code above (insert has to be changed to upsert).
I have upserted 3 vectors with the same id (primary_key). Shouldn't the upsert call override entities with the same id, so that the total number of entities after querying is 1? (Please correct me if I am wrong.) This is the output of the queries:
data = [ {"id": 1, "vector": vectors[i], "chunk_text": docs[i]} for i in range(len(vectors)) ] So all your data has the same ID?
For Milvus, each row should have a unique ID.
I referred to this documentation https://milvus.io/docs/v2.4.x/insert-update-delete.md#Upsert-entities
Usually this could be your DB primary key, a chunk ID, or an image URL ID. We did fix some upsert issues to make sure upserts are not duplicated even with the same IDs. You can try 2.5.4 and see.
Let's say I have a use case where the hash of my chunk text acts as the primary key. I want to use upsert so that some level of dedup happens at the Milvus end.
Please take a look at this line: data = [ {"id": 1, "vector": vectors[i], "chunk_text": docs[i]} for i in range(len(vectors)) ] I think it should be data = [ {"id": i, "vector": vectors[i], "chunk_text": docs[i]} for i in range(len(vectors)) ] My suggestion is to use an LLM to help you debug, as the current code seems to be buggy.
Then your ID should be hashfunction(docs[i]).
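One possible shape for such a hash function, as a minimal sketch: chunk_id is a hypothetical helper name, SHA-256 truncated to 64 bits is an illustrative choice (not something the thread prescribes), and the sample docs and vectors are made up. It assumes the collection uses an INT64 primary key, so the hash is folded into the signed 64-bit range.

```python
import hashlib


def chunk_id(text: str) -> int:
    """Hypothetical helper: derive a stable signed 64-bit ID from chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep the first 8 bytes and read them as a signed integer so the
    # value fits an INT64 primary key field (assumption about the schema).
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Made-up sample data: the first and third chunks have identical text.
docs = ["first chunk", "second chunk", "first chunk"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2]]

data = [
    {"id": chunk_id(docs[i]), "vector": vectors[i], "chunk_text": docs[i]}
    for i in range(len(vectors))
]
```

With IDs built this way, identical chunk text always maps to the same primary key, so an upsert of a repeated chunk targets the existing row rather than a new one. Truncating the hash to 64 bits makes collisions between different texts possible in principle; a VARCHAR primary key holding the full hex digest would avoid that at the cost of a larger key.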
It will be overwritten when the IDs are the same. That is true.
I tried to replicate the scenario where all of the IDs are the same and verified the query() results. With distributed Milvus, query() takes the duplicate primary keys into account. (Verified through query() and Attu.)
query will dedup PKs across different segments, but count won't.
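The difference between the two numbers can be illustrated with a toy model. This is not Milvus code: the "segments" are plain Python lists, raw_count and deduped_query are hypothetical names, and the rule that the last row seen wins is an assumption made only for the illustration.

```python
# Toy model only: three "segments", each holding one row with the same PK.
segments = [
    [{"id": 1, "chunk_text": "v1"}],
    [{"id": 1, "chunk_text": "v2"}],
    [{"id": 1, "chunk_text": "v3"}],
]


def raw_count(segs):
    """Sum per-segment row counts without looking at primary keys."""
    return sum(len(seg) for seg in segs)


def deduped_query(segs):
    """Merge rows across segments, keeping one row per primary key."""
    latest = {}
    for seg in segs:
        for row in seg:
            # Assumption for this sketch: the last row seen wins.
            latest[row["id"]] = row
    return list(latest.values())


print(raw_count(segments))           # per-segment total
print(len(deduped_query(segments)))  # rows after merging by PK
```

In this model the raw count is 3 while the merged result holds a single row, which mirrors the mismatch reported in this issue between the count and the length of the query output.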
Right now, if you insert two entities with the same PK, we don't guarantee what happens.
With pymilvus version 2.4.9, I am trying to upsert data to a collection. As a test, I upserted entities with the same primary_key to check for overrides. When I try to get the number of entities from query(), it returns the total number of entities upserted and doesn't account for the overwrites made by the upsert calls.
But when I create a dataframe from the output of the following query, the length of the dataframe is correct.
Is this the expected behaviour?
Thank you for the help!!