
[Discussion] Purpose of hashed id for indexing #3868

Open
JJK801 opened this issue Jan 14, 2025 · 4 comments
Labels
question Further information is requested

Comments

@JJK801
Contributor

JJK801 commented Jan 14, 2025

Hi @HenryHengZJ,

I'm currently running into trouble with the indexing feature when managing large files (8–10 MB CSV and PDF): some IDs conflict (duplicate key errors in Postgres), probably (I'm not sure, but it seems likely) due to the hashed IDs:

// Hash the page content and the metadata separately, as UUIDs
const contentHash = this._hashStringToUUID(this.pageContent)

try {
    const metadataHash = this._hashNestedDictToUUID(this.metadata)
    this.contentHash = contentHash
    this.metadataHash = metadataHash
} catch (e) {
    throw new Error(`Failed to hash metadata: ${e}. Please use a dict that can be serialized using json.`)
}

// The record key is a hash of the two hashes, and becomes the uid
// when no explicit uid was provided
this.hash_ = this._hashStringToUUID(this.contentHash + this.metadataHash)

if (!this.uid) {
    this.uid = this.hash_
}
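
For context, here's a minimal sketch of what a _hashStringToUUID helper might look like, assuming a deterministic UUIDv5 derived from a fixed namespace. The actual LangChain/Flowise implementation may differ in detail, and the namespace below is purely illustrative:

import { v5 as uuidv5 } from 'uuid'

// Illustrative namespace; any fixed UUID works, as long as it never changes
// between runs (otherwise previously stored keys would no longer match).
const NAMESPACE_UUID = '1b671a64-40d5-491e-99b0-da01ff1f3341'

// uuidv5 is deterministic: the same (content, namespace) pair always yields
// the same UUID, which is what makes hash-based dedup possible.
function hashStringToUUID(content: string): string {
    return uuidv5(content, NAMESPACE_UUID)
}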

Long story short: I upserted a large PDF (10 MB) and everything went well. Then I tried to add one large CSV (8 MB) and got a duplicate key error. So I cleaned up and tried again with the first PDF plus another small PDF (200 KB): duplicate key again.

So, after inspecting the code, I was wondering: why don't we simply use the chunk ID as the document ID? That should be much safer, as chunk IDs are auto-generated and unique.

In any case, it's bad practice to generate a hash from other hashes, because it drastically raises the risk of collision.

I can open a PR with backward compatibility if needed.

@HenryHengZJ
Contributor

We can definitely use the chunk ID if we have access to it, feel free to raise the PR!

@HenryHengZJ added the question (Further information is requested) label on Jan 15, 2025
@JJK801
Contributor Author

JJK801 commented Jan 15, 2025

I'll do it ASAP.

@JJK801
Contributor Author

JJK801 commented Jan 16, 2025

After further digging, it seems that using the hash as the key is a must-have in the LangChain RecordManager logic, as it's used to retrieve existing docs while skipping updated ones. It also appears that collisions in UUIDv5 are very uncommon (so having two in the same document is next to impossible).
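
For reference, here's a simplified sketch of that dedup flow, using stand-in interfaces loosely modeled on LangChain's RecordManager (exists, update, and deleteKeys do exist on it, but the exact shapes and wiring here are illustrative, not the real API surface):

// Stand-in interfaces, simplified for illustration.
interface RecordManagerLike {
    exists(keys: string[]): Promise<boolean[]>
    update(keys: string[]): Promise<void>
    deleteKeys(keys: string[]): Promise<void>
}

interface VectorStoreLike {
    addDocuments(docs: { pageContent: string; uid: string }[], options: { ids: string[] }): Promise<void>
}

// Upsert only the documents whose hash-derived uid is unseen; unchanged
// documents are matched by their hash and skipped, which is why the key
// must stay stable across runs.
async function indexDocs(docs: { pageContent: string; uid: string }[], rm: RecordManagerLike, store: VectorStoreLike): Promise<void> {
    const keys = docs.map((d) => d.uid)
    const seen = await rm.exists(keys)
    const fresh = docs.filter((_, i) => !seen[i])
    if (fresh.length > 0) {
        await store.addDocuments(fresh, { ids: fresh.map((d) => d.uid) })
    }
    // Refresh the record entries so stale keys can be cleaned up later.
    await rm.update(keys)
}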

BTW, my first assumption was probably wrong: I got a duplicate key error when inserting, but if the hashes were the problem, the document should have been matched (and ignored) by its hash first.

I'm still not sure what happened, which makes me uneasy, but I think keeping it as is for now is the best solution.

I'll report back with the results of my investigation.

@JJK801
Contributor Author

JJK801 commented Jan 16, 2025

My guess for now is that when I upserted, it tried to re-insert chunks that were already in the DB and hadn't changed (and so still had the same key).

I'll probably open a PR to improve DB cleanup when embedding fails.
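
Something along these lines, sketched against the same illustrative interfaces as above (deleteKeys does exist on LangChain's RecordManager; the failure-handling wiring is hypothetical):

// Hypothetical cleanup: if embedding or insertion fails partway through,
// drop the keys recorded during this run so that a retry doesn't collide
// with rows that were inserted before the failure (the duplicate key
// error described above).
async function indexWithCleanup(docs: { pageContent: string; uid: string }[], rm: RecordManagerLike, store: VectorStoreLike): Promise<void> {
    const keys = docs.map((d) => d.uid)
    try {
        await store.addDocuments(docs, { ids: keys })
        await rm.update(keys)
    } catch (e) {
        await rm.deleteKeys(keys)
        throw e
    }
}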
