-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add contraints and embedding directives #3405
base: develop
Are you sure you want to change the base?
feat: Add contraints and embedding directives #3405
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the calls out to 3rd party APIs needs expanded documentation and discussion with the team. I have reviewed very little so far but will continue whilst that issue is being resolved.
.github/workflows/test-coverage.yml
Outdated
@@ -74,6 +74,12 @@ jobs: | |||
DEFRA_MUTATION_TYPE: ${{ matrix.mutation-type }} | |||
|
|||
steps: | |||
- name: Install ollama | |||
run: curl -fsSL https://ollama.com/install.sh | sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: This is only needed for basic test matrix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Why are you asking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because:
- I don't understand why this is not needed for other test types (acp, view, encryption, etc.).
- Is this a required dependency now? if so then:
- document in
README
- add a make rule in
Makefile
- perhaps add this to the setup composite instead (
./.github/composites/setup-defradb
)
- document in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- When you see "Ollama", do you need more documentation to know what this is about? I might be wrong to assume that would be enough to understand that this is only for embedding related testing.
- It's not a Defra dependency. Only a testing dependency for the embedding specific tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Maybe I missed it but last I checked the embedding tests they didn't have
SupportedTypes
like config. So I assumed they will run with all configurations. For example when encryption or macos tests run, wouldn't they still run and hence be need this dep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yea so similar to how we have rust dep for lens test, this should have a make rule imo and will be part of test/dev deps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've given the production code a first-pass review, and will shortly look through the tests. The code looks good, most of the comments are about moving stuff around, or questions/thoughts RE the feature design.
thought: It looks very tedious and error prone to introduce new types atm, although numerics are probably most painful. Perhaps we should review certain aspects in the future to improve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the tests, thanks for adding them, I think there are a couple of important test gaps though, please add tests for:
- Patching a
size
constraint, including removing it - Patching an
embedding
property, and it's children:
2.1 Adding an embedding description to a field that didnt have one before
2.2 Mutating each of the embedding properties on a field that declared one on create
2.3 Removing an embedding property from a field that had one on create - Including a
size
constraint in a View declaration, if you don't/haven't blocked it off, please include a test that shows what happens if that constraint is violated - Including an
embedding
property in a View declaration (I assume we want this to error for now).
tests/integration/mutation/create/constraints/size_constraint_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking this as "request for changes" (you can dismiss at any time), even tho there aren't any explcit todos
.
I have mainly focused on the definition/description code along with the collection/doc updating. Speed ran through the encoding section and the gql parsing.
Haven't reviewed the tests.
Main question/discussion points are "when" you are generating the embedding in the call path, and the location of the EmbeddingDescription
.
There is also improvement on the embedding config validation.
ef3b630
to
dacfe46
Compare
ae1d158
to
7f1cb9b
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #3405 +/- ##
============================================
- Coverage 78.31% 55.05% -23.26%
============================================
Files 392 393 +1
Lines 36045 37080 +1035
============================================
- Hits 28226 20412 -7814
- Misses 6159 15165 +9006
+ Partials 1660 1503 -157
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 174 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
f7004b7
to
117e1fd
Compare
117e1fd
to
e54b83e
Compare
tests/integration/collection_description/updates/add/collections_test.go
Show resolved
Hide resolved
@@ -74,7 +74,7 @@ func TestColDescrUpdateAddCollections_Errors(t *testing.T) { | |||
{ "op": "add", "path": "/2", "value": {"ID": 2, "Name": "Dogs"} } | |||
] | |||
`, | |||
ExpectedError: "adding collections via patch is not supported. ID: 2", | |||
ExpectedError: "schema name can't be empty", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: Sorry, whilst my last comment was misplaced, this error was useful to the user, please revert this test and change the production code to accomodate.
Fields: []string{"name", "about"}, | ||
Provider: "ollama", | ||
Model: "nomic-embed-text", | ||
URL: "http://localhost:11434/api", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: Please add tests covering the omission of each property in a patch, and SDL ( e.g. what happens if the user does not provide a url
?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The features look good, but I think there are still a few more tests and documentation to write.
Depending on any new/pending feedback from the others, it is still not too late to break this up into smaller PRs, especially if we want to deal with the collection vs schema description problem here.
58bce58
to
0c519c8
Compare
Relevant issue(s)
Resolves #3350
Resolves #3351
Description
Sorry for having 3 technically separate features as part of 1 PR. The reason for this is that I started working on embeddings and the size contraint was initially part of the embedding. Since we discussed that it could be applied to all array fields (and later even to String and Blob), I extracted it into a contraints directive that has a size parameter (more contraints can be added in the future). Furthermore, embeddings returned from ML models are arrays of float32. This caused some precision issues because we only supported float64. When saving the float32 array, querying it would return an float64 array with slight precision issues. I decided to add the float32 type.
You can review the first commit for contraint and embedding related code and the second commit for the float related changes. Some float stuff might have leaked in the first commit. Sorry for this. I tried hard to separate the float32 related changes.
Note that the
gql.Float
type is nowFloat64
internally.is the same as
The embedding generation relies on a 3rd party package called
chromem-go
to call the model provider API. As long as one of the supported provider API is configured and accessible, the embeddings will be generated when adding new documents. I've added a step in the test workflow that will run the embedding specific tests on linux only (this is because installation on windows and mac is less straight forward) using Ollama (because it runs locally).The call to the API has to be done synchronously otherwise the docID/CID won't be representative of the contents. The only alternative would be for the system to automatically update the document when returning from the API call but I see that as a inferior option as it hides the update step from the user. It could also make doc anchoring more complicated as the user would have to remember to wait on the doc update before anchoring the doc at the right CID.
We could avoid having embedding generation support and let the users do that call themselves and store the embedding vectors directly. However, having it as a feature allows us to support RAG vector search which would let users get started with AI with very little work. This seems to be something our partners are looking forward to.
I don't see the 3rd party API call inline with a mutation as a problem since this is something that has to be configured by users and users will expect the mutation calls to take more time as a result.
If you're interested in running it locally, install Ollama and define a schema like so
Next steps:
_similarity
operation to calculate the cosine similarity between two arrays.Tasks
How has this been tested?
make test and manual testing
Specify the platform(s) on which this was tested: