
[Discussion] Purpose of hashed id for indexing #3868

Open
JJK801 opened this issue Jan 14, 2025 · 4 comments
Labels
question Further information is requested

Comments

@JJK801
Contributor

JJK801 commented Jan 14, 2025

Hi @HenryHengZJ,

I'm currently running into trouble with the indexing feature when managing large files (8–10 MB CSV and PDF): some IDs conflict (duplicate key errors in Postgres), probably (I'm not sure, but it seems likely) due to the hashed IDs:

// Hash the page content and the metadata separately, as UUIDs
const contentHash = this._hashStringToUUID(this.pageContent)

try {
    const metadataHash = this._hashNestedDictToUUID(this.metadata)
    this.contentHash = contentHash
    this.metadataHash = metadataHash
} catch (e) {
    throw new Error(`Failed to hash metadata: ${e}. Please use a dict that can be serialized using json.`)
}

// The record key is a hash of the two hashes, and becomes the uid
// when no explicit uid was provided
this.hash_ = this._hashStringToUUID(this.contentHash + this.metadataHash)

if (!this.uid) {
    this.uid = this.hash_
}
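
For context, here's a minimal sketch of what a _hashStringToUUID helper might look like, assuming a deterministic UUIDv5 derived from a fixed namespace. The actual LangChain/Flowise implementation may differ in detail, and the namespace below is purely illustrative:

import { v5 as uuidv5 } from 'uuid'

// Illustrative namespace; any fixed UUID works, as long as it never changes
// between runs (otherwise previously stored keys would no longer match).
const NAMESPACE_UUID = '1b671a64-40d5-491e-99b0-da01ff1f3341'

// uuidv5 is deterministic: the same (content, namespace) pair always yields
// the same UUID, which is what makes hash-based dedup possible.
function hashStringToUUID(content: string): string {
    return uuidv5(content, NAMESPACE_UUID)
}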

Long story short: I upserted a large PDF (10 MB) and everything went well. Then I tried to add one large CSV (8 MB) and got a duplicate key error. So I cleaned up and tried again with the first PDF plus another small PDF (200 KB): duplicate key again.

So, after inspecting the code, I was wondering: why don't we simply use the chunk ID as the document ID? That should be much safer, as chunk IDs are auto-generated and unique.

In any case, it's bad practice to generate a hash from other hashes, because it drastically raises the risk of collision.

I can open a PR with backward compatibility if needed.

@HenryHengZJ
Contributor

We can definitely use the chunk ID if we have access to it, feel free to raise the PR!

@HenryHengZJ added the question (Further information is requested) label on Jan 15, 2025
@JJK801
Contributor Author

JJK801 commented Jan 15, 2025

I'll do it ASAP.

@JJK801
Contributor Author

JJK801 commented Jan 16, 2025

After further digging, it seems that using the hash as the key is a must-have in the LangChain RecordManager logic, as it's used to retrieve existing docs while skipping updated ones. It also appears that collisions in UUIDv5 are very uncommon (so having two in the same document is next to impossible).
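
For reference, here's a simplified sketch of that dedup flow, using stand-in interfaces loosely modeled on LangChain's RecordManager (exists, update, and deleteKeys do exist on it, but the exact shapes and wiring here are illustrative, not the real API surface):

// Stand-in interfaces, simplified for illustration.
interface RecordManagerLike {
    exists(keys: string[]): Promise<boolean[]>
    update(keys: string[]): Promise<void>
    deleteKeys(keys: string[]): Promise<void>
}

interface VectorStoreLike {
    addDocuments(docs: { pageContent: string; uid: string }[], options: { ids: string[] }): Promise<void>
}

// Upsert only the documents whose hash-derived uid is unseen; unchanged
// documents are matched by their hash and skipped, which is why the key
// must stay stable across runs.
async function indexDocs(docs: { pageContent: string; uid: string }[], rm: RecordManagerLike, store: VectorStoreLike): Promise<void> {
    const keys = docs.map((d) => d.uid)
    const seen = await rm.exists(keys)
    const fresh = docs.filter((_, i) => !seen[i])
    if (fresh.length > 0) {
        await store.addDocuments(fresh, { ids: fresh.map((d) => d.uid) })
    }
    // Refresh the record entries so stale keys can be cleaned up later.
    await rm.update(keys)
}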

BTW, my first assumption was probably wrong: I got a duplicate key error when inserting, but if the hashes were the problem, the document should have been matched (and ignored) by its hash first.

I'm still not sure what happened, which makes me uneasy, but I think keeping it as is for now is the best solution.

I'll report back with the results of my investigation.

@JJK801
Contributor Author

JJK801 commented Jan 16, 2025

My guess for now is that when I upserted, it tried to re-insert chunks that were already in the DB and hadn't changed (and so still had the same key).

I'll probably open a PR to improve DB cleanup when embedding fails.
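
Something along these lines, sketched against the same illustrative interfaces as above (deleteKeys does exist on LangChain's RecordManager; the failure-handling wiring is hypothetical):

// Hypothetical cleanup: if embedding or insertion fails partway through,
// drop the keys recorded during this run so that a retry doesn't collide
// with rows that were inserted before the failure (the duplicate key
// error described above).
async function indexWithCleanup(docs: { pageContent: string; uid: string }[], rm: RecordManagerLike, store: VectorStoreLike): Promise<void> {
    const keys = docs.map((d) => d.uid)
    try {
        await store.addDocuments(docs, { ids: keys })
        await rm.update(keys)
    } catch (e) {
        await rm.deleteKeys(keys)
        throw e
    }
}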
