Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add contraints and embedding directives #3405

Open
wants to merge 22 commits into
base: develop
Choose a base branch
from

Conversation

fredcarle
Copy link
Collaborator

@fredcarle fredcarle commented Jan 25, 2025

Relevant issue(s)

Resolves #3350
Resolves #3351

Description

Sorry for having 3 technically separate features as part of 1 PR. The reason for this is that I started working on embeddings and the size contraint was initially part of the embedding. Since we discussed that it could be applied to all array fields (and later even to String and Blob), I extracted it into a contraints directive that has a size parameter (more contraints can be added in the future). Furthermore, embeddings returned from ML models are arrays of float32. This caused some precision issues because we only supported float64. When saving the float32 array, querying it would return an float64 array with slight precision issues. I decided to add the float32 type.

You can review the first commit for contraint and embedding related code and the second commit for the float related changes. Some float stuff might have leaked in the first commit. Sorry for this. I tried hard to separate the float32 related changes.

Note that the gql.Float type is now Float64 internally.

type User {
  points: Float
}

is the same as

type User {
  points: Float64
}

The embedding generation relies on a 3rd party package called chromem-go to call the model provider API. As long as one of the supported provider API is configured and accessible, the embeddings will be generated when adding new documents. I've added a step in the test workflow that will run the embedding specific tests on linux only (this is because installation on windows and mac is less straight forward) using Ollama (because it runs locally).

The call to the API has to be done synchronously otherwise the docID/CID won't be representative of the contents. The only alternative would be for the system to automatically update the document when returning from the API call but I see that as a inferior option as it hides the update step from the user. It could also make doc anchoring more complicated as the user would have to remember to wait on the doc update before anchoring the doc at the right CID.

We could avoid having embedding generation support and let the users do that call themselves and store the embedding vectors directly. However, having it as a feature allows us to support RAG vector search which would let users get started with AI with very little work. This seems to be something our partners are looking forward to.

I don't see the 3rd party API call inline with a mutation as a problem since this is something that has to be configured by users and users will expect the mutation calls to take more time as a result.

If you're interested in running it locally, install Ollama and define a schema like so

type User {
    name: String
    about: String
    name_v: [Float32!] @constraints(size: 768) @embedding(fields: ["name", "about"], provider: "ollama", model: "nomic-embed-text",  url: "http://localhost:11434/api") // contraint is optional and localhost:11434 is the default port for Ollama
}

Next steps:

  • Support templates for the content sent to the model.
  • Add the _similarity operation to calculate the cosine similarity between two arrays.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

make test and manual testing

Specify the platform(s) on which this was tested:

  • MacOS

@fredcarle fredcarle added feature New feature or request area/query Related to the query component area/schema Related to the schema system labels Jan 25, 2025
@fredcarle fredcarle added this to the DefraDB v0.16 milestone Jan 25, 2025
@fredcarle fredcarle requested a review from a team January 25, 2025 00:35
@fredcarle fredcarle self-assigned this Jan 25, 2025
@fredcarle fredcarle changed the title feat: Add contraints and embedding directives and Float32 type support feat: Add contraints and embedding directives Jan 25, 2025
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the calls out to 3rd party APIs needs expanded documentation and discussion with the team. I have reviewed very little so far but will continue whilst that issue is being resolved.

@@ -74,6 +74,12 @@ jobs:
DEFRA_MUTATION_TYPE: ${{ matrix.mutation-type }}

steps:
- name: Install ollama
run: curl -fsSL https://ollama.com/install.sh | sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: This is only needed for basic test matrix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Why are you asking?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because:

  1. I don't understand why this is not needed for other test types (acp, view, encryption, etc.).
  2. Is this a required dependency now? if so then:
    • document in README
    • add a make rule in Makefile
    • perhaps add this to the setup composite instead (./.github/composites/setup-defradb)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. When you see "Ollama", do you need more documentation to know what this is about? I might be wrong to assume that would be enough to understand that this is only for embedding related testing.
  2. It's not a Defra dependency. Only a testing dependency for the embedding specific tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Maybe I missed it but last I checked the embedding tests they didn't have SupportedTypes like config. So I assumed they will run with all configurations. For example when encryption or macos tests run, wouldn't they still run and hence be need this dep?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yea so similar to how we have rust dep for lens test, this should have a make rule imo and will be part of test/dev deps

client/normal_new.go Show resolved Hide resolved
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given the production code a first-pass review, and will shortly look through the tests. The code looks good, most of the comments are about moving stuff around, or questions/thoughts RE the feature design.

thought: It looks very tedious and error prone to introduce new types atm, although numerics are probably most painful. Perhaps we should review certain aspects in the future to improve this.

client/document.go Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
internal/request/graphql/schema/collection.go Outdated Show resolved Hide resolved
internal/request/graphql/schema/collection.go Show resolved Hide resolved
internal/request/graphql/schema/types/base.go Show resolved Hide resolved
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the tests, thanks for adding them, I think there are a couple of important test gaps though, please add tests for:

  1. Patching a size constraint, including removing it
  2. Patching an embedding property, and it's children:
    2.1 Adding an embedding description to a field that didnt have one before
    2.2 Mutating each of the embedding properties on a field that declared one on create
    2.3 Removing an embedding property from a field that had one on create
  3. Including a size constraint in a View declaration, if you don't/haven't blocked it off, please include a test that shows what happens if that constraint is violated
  4. Including an embedding property in a View declaration (I assume we want this to error for now).

Copy link
Member

@jsimnz jsimnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking this as "request for changes" (you can dismiss at any time), even tho there aren't any explcit todos.

I have mainly focused on the definition/description code along with the collection/doc updating. Speed ran through the encoding section and the gql parsing.

Haven't reviewed the tests.

Main question/discussion points are "when" you are generating the embedding in the call path, and the location of the EmbeddingDescription.

client/collection_field_description.go Outdated Show resolved Hide resolved
internal/db/collection.go Outdated Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
internal/request/graphql/schema/collection.go Show resolved Hide resolved
client/document.go Show resolved Hide resolved
client/normal_value_test.go Outdated Show resolved Hide resolved
client/schema_field_description.go Outdated Show resolved Hide resolved
client/schema_field_description_test.go Show resolved Hide resolved
internal/db/embedding.go Outdated Show resolved Hide resolved
@fredcarle fredcarle force-pushed the fredcarle/feat/embeddings branch from ef3b630 to dacfe46 Compare February 4, 2025 23:32
@fredcarle fredcarle force-pushed the fredcarle/feat/embeddings branch from ae1d158 to 7f1cb9b Compare February 4, 2025 23:56
Copy link

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 63.00623% with 475 lines in your changes missing coverage. Please review.

Project coverage is 55.05%. Comparing base (83e2523) to head (107576d).

Files with missing lines Patch % Lines
internal/db/embedding.go 3.37% 84 Missing and 2 partials ⚠️
client/document.go 11.27% 62 Missing and 1 partial ⚠️
internal/request/graphql/schema/collection.go 12.68% 62 Missing ⚠️
client/normal_new.go 55.45% 45 Missing ⚠️
internal/core/encoding.go 2.38% 41 Missing ⚠️
internal/planner/sum.go 15.79% 32 Missing ⚠️
client/schema_field_description.go 18.92% 30 Missing ⚠️
internal/db/definition_validation.go 32.00% 15 Missing and 2 partials ⚠️
internal/request/graphql/schema/types/scalars.go 93.28% 16 Missing ⚠️
internal/db/fetcher/indexer_matchers.go 22.22% 13 Missing and 1 partial ⚠️
... and 13 more
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##           develop    #3405       +/-   ##
============================================
- Coverage    78.31%   55.05%   -23.26%     
============================================
  Files          392      393        +1     
  Lines        36045    37080     +1035     
============================================
- Hits         28226    20412     -7814     
- Misses        6159    15165     +9006     
+ Partials      1660     1503      -157     
Flag Coverage Δ
all-tests 55.05% <63.01%> (-23.26%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
client/collection_field_description.go 71.43% <100.00%> (+1.43%) ⬆️
client/definitions.go 54.27% <100.00%> (-21.36%) ⬇️
client/normal_array.go 98.11% <100.00%> (+0.11%) ⬆️
client/normal_array_of_nillables.go 97.03% <100.00%> (+0.19%) ⬆️
client/normal_nil.go 95.12% <100.00%> (+0.84%) ⬆️
client/normal_nillable_array.go 98.80% <100.00%> (+0.09%) ⬆️
client/normal_nillable_array_of_nillables.go 100.00% <100.00%> (ø)
client/normal_nillable_scalar.go 100.00% <100.00%> (ø)
client/normal_scalar.go 100.00% <100.00%> (ø)
client/normal_util.go 93.22% <100.00%> (-3.01%) ⬇️
... and 34 more

... and 174 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 83e2523...107576d. Read the comment docs.

@fredcarle fredcarle force-pushed the fredcarle/feat/embeddings branch 6 times, most recently from f7004b7 to 117e1fd Compare February 5, 2025 16:33
@fredcarle fredcarle force-pushed the fredcarle/feat/embeddings branch from 117e1fd to e54b83e Compare February 5, 2025 17:31
@@ -74,7 +74,7 @@ func TestColDescrUpdateAddCollections_Errors(t *testing.T) {
{ "op": "add", "path": "/2", "value": {"ID": 2, "Name": "Dogs"} }
]
`,
ExpectedError: "adding collections via patch is not supported. ID: 2",
ExpectedError: "schema name can't be empty",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: Sorry, whilst my last comment was misplaced, this error was useful to the user, please revert this test and change the production code to accomodate.

Fields: []string{"name", "about"},
Provider: "ollama",
Model: "nomic-embed-text",
URL: "http://localhost:11434/api",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: Please add tests covering the omission of each property in a patch, and SDL ( e.g. what happens if the user does not provide a url?).

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The features look good, but I think there are still a few more tests and documentation to write.

Depending on any new/pending feedback from the others, it is still not too late to break this up into smaller PRs, especially if we want to deal with the collection vs schema description problem here.

tests/integration/test_case.go Show resolved Hide resolved
client/collection_description.go Show resolved Hide resolved
client/document.go Show resolved Hide resolved
internal/core/encoding.go Show resolved Hide resolved
internal/db/base/compare.go Show resolved Hide resolved
internal/db/collection_define.go Show resolved Hide resolved
internal/request/graphql/schema/collection.go Outdated Show resolved Hide resolved
internal/request/graphql/schema/collection.go Show resolved Hide resolved
internal/request/graphql/schema/errors.go Outdated Show resolved Hide resolved
@fredcarle fredcarle force-pushed the fredcarle/feat/embeddings branch from 58bce58 to 0c519c8 Compare February 6, 2025 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/query Related to the query component area/schema Related to the schema system feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add size constraint to array types Create embeddings based on document fields
5 participants