feat: reuse documents/embeddings for files with the same sha256sum (+fix pgx bug) #464

Merged



@iwilltry42 iwilltry42 commented Mar 4, 2025

  • calculate sha256sum for the content of each ingested file and record it in the DB
  • when ingesting a file, check if there's a file with the same checksum in the DB
  • if yes and the embeddingModel (config on the owning dataset) matches the incoming one, then fetch the file's documents and their embeddings to re-use for the incoming file
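The lookup described in the list above could be sketched roughly as follows. This is a minimal, in-memory illustration and not the code from this PR: `storedFile`, `checksum`, and `findReusable` are hypothetical names, and the slice stands in for the DB table that records each file's checksum.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// storedFile is a hypothetical record of a previously ingested file.
type storedFile struct {
	ID             string
	Checksum       string // hex-encoded SHA-256 of the file content
	EmbeddingModel string // embedding model configured on the owning dataset
}

// checksum returns the hex-encoded SHA-256 digest of content.
func checksum(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// findReusable returns the first stored file whose checksum and embedding
// model both match the incoming file. The slice is an assumed stand-in for
// a DB query.
func findReusable(files []storedFile, sum, model string) (storedFile, bool) {
	for _, f := range files {
		if f.Checksum == sum && f.EmbeddingModel == model {
			return f, true
		}
	}
	return storedFile{}, false
}

func main() {
	content := []byte("hello world\n")
	db := []storedFile{
		{ID: "file-1", Checksum: checksum(content), EmbeddingModel: "model-a"},
	}

	// Same content, same embedding model: documents and embeddings can be reused.
	if f, ok := findReusable(db, checksum(content), "model-a"); ok {
		fmt.Println("reuse documents and embeddings from", f.ID)
	}

	// Same content, different embedding model: the file must be embedded again.
	if _, ok := findReusable(db, checksum(content), "model-b"); !ok {
		fmt.Println("embedding model differs, embed from scratch")
	}
}
```

Matching on the embedding model in addition to the checksum matters because vectors produced by different models are not comparable, even for identical content.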

NOTE: This is fine in the Obot setup, since Obot uses a static flow config. If someone changes the flow config to test different settings, though, this may reuse documents from files that were ingested with an old config. We can add checks for that as a follow-up.

Also, I found a race condition in the pgvector implementation: appending to the pgx batch queue was not lock-protected, and I saw cases where documents got lost because of it. This is fixed now.
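For illustration, guarding the append with a mutex could look like the sketch below. It uses a hypothetical `safeBatch` type with a plain slice as a stand-in for the real batch, on the assumption that queueing boils down to an unsynchronized append; it is not the actual change in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// safeBatch is a hypothetical stand-in for a batch queue. The mutex
// serializes appends so that concurrent callers cannot overwrite each
// other's entries.
type safeBatch struct {
	mu      sync.Mutex
	queries []string
}

// Queue appends a statement to the batch while holding the lock.
func (b *safeBatch) Queue(query string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.queries = append(b.queries, query)
}

// Len reports how many statements are currently queued.
func (b *safeBatch) Len() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.queries)
}

// queueConcurrently queues one statement from each of n goroutines and
// returns how many statements ended up in the batch.
func queueConcurrently(n int) int {
	var (
		b  safeBatch
		wg sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			b.Queue(fmt.Sprintf("INSERT document %d", i))
		}(i)
	}
	wg.Wait()
	return b.Len()
}

func main() {
	fmt.Println(queueConcurrently(1000))
}
```

Without the lock, two goroutines can append to the same slice state at once and one entry silently overwrites the other, which would match the lost documents described above. Running a version without the mutex under `go run -race` should report the data race.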

Issue: obot-platform/obot#1803

@iwilltry42 iwilltry42 requested a review from g-linville March 4, 2025 19:14
@iwilltry42 iwilltry42 merged commit a6d62fe into obot-platform:main Mar 7, 2025
2 checks passed
@iwilltry42 iwilltry42 deleted the feat/knowledge-file-embeddings-reuse branch March 7, 2025 05:53