Hi @HenryHengZJ,

I'm currently running into trouble with the indexing feature when managing large files (8–10 MB CSV & PDF): some ids are conflicting (duplicate key in Postgres), probably (not certain, but it seems likely) because of the hashed ids:
```typescript
const contentHash = this._hashStringToUUID(this.pageContent)
try {
    const metadataHash = this._hashNestedDictToUUID(this.metadata)
    this.contentHash = contentHash
    this.metadataHash = metadataHash
} catch (e) {
    throw new Error(`Failed to hash metadata: ${e}. Please use a dict that can be serialized using json.`)
}
this.hash_ = this._hashStringToUUID(this.contentHash + this.metadataHash)
if (!this.uid) {
    this.uid = this.hash_
}
```
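For context, here is a minimal sketch of how I understand this kind of deterministic hashing to work (the namespace and helper names below are illustrative, not the actual implementation): the same pageContent and metadata always produce the same UUID, so identical chunks end up with identical ids.

```typescript
import { v5 as uuidv5 } from 'uuid'

// Hypothetical fixed namespace; any constant UUID works for v5 hashing.
const NAMESPACE = '1b671a64-40d5-491e-99b0-da01ff1f3341'

// Deterministic: the same string always maps to the same UUID.
function hashStringToUUID(content: string): string {
    return uuidv5(content, NAMESPACE)
}

// Serialize metadata with sorted keys so key order doesn't change the hash.
function hashNestedDictToUUID(metadata: Record<string, unknown>): string {
    const serialized = JSON.stringify(metadata, Object.keys(metadata).sort())
    return uuidv5(serialized, NAMESPACE)
}
```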
Long story short: I upserted a large PDF (10 MB) and everything went well. Then I tried to add one large CSV (8 MB) and got a duplicate key error, so I cleaned up and tried again with the first PDF plus another small PDF (200 KB): duplicate key again.
So, looking at the code, I was wondering why we don't simply use the chunk id as the document id? That should be much safer, as chunk ids are auto-generated and unique (see the sketch below).
In any case, deriving a hash from other hashes is bad practice, because it increases the risk of collision.
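What I have in mind is something along these lines (purely illustrative, not a patch): give every chunk a random UUID v4 at creation time and use that as the document id, so identical content can never collide.

```typescript
import { v4 as uuidv4 } from 'uuid'

// Hypothetical shape of a chunk with an auto-generated, unique id.
const chunk = {
    uid: uuidv4(), // random per chunk, independent of content
    pageContent: 'some chunk text...',
    metadata: { source: 'report.pdf', page: 1 },
}
```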
I can do a PR with backward compatibility if needed.
After further digging, it seems that using the hash as the key is a must-have in the LangChain RecordManager logic, since that hash is how existing (unchanged) docs are recognized and skipped, and only updated ones are re-indexed. It also appears that collisions in UUID v5 are very uncommon, so getting two in the same document is practically impossible.
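To make that concrete, here is roughly how I understand the dedup flow (simplified pseudologic in TypeScript; the interfaces below are assumptions, not the actual LangChain API): the deterministic hash is the key the RecordManager uses to decide which chunks are already indexed and can be skipped.

```typescript
interface HashedDoc {
    uid: string // deterministic hash of content + metadata
    pageContent: string
    metadata: Record<string, unknown>
}

// Assumed minimal surface of a record manager and a vector store.
interface RecordManager {
    exists(keys: string[]): Promise<boolean[]>
    update(keys: string[]): Promise<void>
}
interface VectorStore {
    addDocuments(docs: HashedDoc[], options: { ids: string[] }): Promise<void>
}

async function indexDocs(docs: HashedDoc[], recordManager: RecordManager, vectorStore: VectorStore) {
    const keys = docs.map((d) => d.uid)
    const alreadyIndexed = await recordManager.exists(keys)
    // Only chunks whose hash is not yet recorded get inserted; unchanged
    // chunks are skipped, which is why the id has to stay deterministic.
    const toInsert = docs.filter((_, i) => !alreadyIndexed[i])
    if (toInsert.length > 0) {
        await vectorStore.addDocuments(toInsert, { ids: toInsert.map((d) => d.uid) })
    }
    await recordManager.update(keys)
}
```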
BTW, my first assumption was probably wrong: I got a duplicate key error on insert, but if the hashes themselves were the problem, the document should have been matched (and skipped) by its hash first.
I'm still not sure what happened, which makes me uneasy, but I think keeping things as they are for now is the best solution.
I'll report back with the results of my investigation.
My guess for now is that when I upserted, it tried to re-insert chunks that were already in the DB and hadn't changed (so they still had the same key).
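To illustrate the failure mode I suspect (the table and column names below are made up, not the actual schema): if a previous run failed mid-embedding and left rows behind that the record manager doesn't know about, the next upsert re-sends chunks with the same hashed id, and a plain INSERT hits the unique constraint, whereas an idempotent variant would simply skip them.

```typescript
import { Pool } from 'pg'

const pool = new Pool() // connection settings taken from PG* env vars

// Plain insert: a second call with the same uid raises
// "duplicate key value violates unique constraint".
async function insertChunk(uid: string, content: string) {
    await pool.query('INSERT INTO documents (id, content) VALUES ($1, $2)', [uid, content])
}

// Idempotent variant: retrying the same uid becomes a no-op.
async function insertChunkIdempotent(uid: string, content: string) {
    await pool.query(
        'INSERT INTO documents (id, content) VALUES ($1, $2) ON CONFLICT (id) DO NOTHING',
        [uid, content]
    )
}
```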
I'll probably make a PR to improve DB Cleaning when embedding fails.