feat: Add constraints and embedding directives (#3405)
## Relevant issue(s)

Resolves #3350
Resolves #3351

## Description

Sorry for bundling 3 technically separate features into 1 PR. The
reason for this is that I started working on embeddings and the size
constraint was initially part of the embedding. Since we discussed that
it could be applied to all array fields (and later even to String and
Blob), I extracted it into a constraints directive that has a size
parameter (more constraints can be added in the future). Furthermore,
embeddings returned from ML models are arrays of float32, while we only
supported float64. Saving a float32 array and then querying it would
return a float64 array with slight precision issues, so I decided to add
the float32 type.
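
To illustrate the precision issue described above, here is a minimal Go sketch. The `widen` helper is hypothetical and only mimics what a float64-only store has to do with float32 input; it is not code from this PR.

```go
package main

import "fmt"

// widen converts a float32 slice to float64, mimicking a store that only
// supports float64. Hypothetical helper, not part of DefraDB.
func widen(in []float32) []float64 {
	out := make([]float64, len(in))
	for i, v := range in {
		out[i] = float64(v)
	}
	return out
}

func main() {
	embedding := []float32{0.1, 0.2, 0.3}
	for i, v := range widen(embedding) {
		// %v prints the shortest decimal that round-trips for each type,
		// so the widened value shows the extra digits.
		fmt.Printf("float32: %v  float64: %v\n", embedding[i], v)
	}
}
```

The value 0.1 stored as float32 reads back as 0.10000000149011612 once widened to float64, which is the kind of drift a native float32 type avoids.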

You can review the first commit for constraint and embedding related code
and the second commit for the float related changes. Some float changes
might have leaked into the first commit. Sorry for this. I tried hard to
separate the float32 related changes.

Note that the `gql.Float` type is now `Float64` internally.
```graphql
type User {
  points: Float
}
```
is the same as
```graphql
type User {
  points: Float64
}
```

The embedding generation relies on a 3rd party package called
`chromem-go` to call the model provider API. As long as one of the
supported provider APIs is configured and accessible, the embeddings will
be generated when adding new documents. I've added a step in the test
workflow that will run the embedding specific tests on Linux only (this
is because installation on Windows and macOS is less straightforward)
using Ollama (because it runs locally).

The call to the API has to be done synchronously, otherwise the docID/CID
won't be representative of the contents. The only alternative would be
for the system to automatically update the document when returning from
the API call, but I see that as an inferior option as it hides the update
step from the user. It could also make doc anchoring more complicated as
the user would have to remember to wait on the doc update before
anchoring the doc at the right CID.
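
A rough Go sketch of why the embedding has to exist before the ID is computed. The `contentID` function is a hypothetical stand-in that hashes the text and the vector with SHA-256; it is not how DefraDB actually derives a docID or CID.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"math"
)

// contentID is a hypothetical content-derived identifier: a SHA-256 hash
// over the text followed by the raw bits of every embedding value.
func contentID(text string, embedding []float32) string {
	h := sha256.New()
	h.Write([]byte(text))
	buf := make([]byte, 4)
	for _, v := range embedding {
		binary.LittleEndian.PutUint32(buf, math.Float32bits(v))
		h.Write(buf)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// ID computed before the (assumed) API call returns.
	before := contentID("hello", nil)
	// ID computed once the embedding is part of the document.
	after := contentID("hello", []float32{0.25, 0.5})
	fmt.Println("IDs differ:", before != after)
}
```

Any identifier derived from content changes once the vector is added, so generating the embedding asynchronously would leave the first ID pointing at an incomplete document.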

We could avoid having embedding generation support and let the users do
that call themselves and store the embedding vectors directly. However,
having it as a feature allows us to support RAG vector search which
would let users get started with AI with very little work. This seems to
be something our partners are looking forward to.

I don't see the 3rd party API call inline with a mutation as a problem
since this is something that has to be configured by users and users
will expect the mutation calls to take more time as a result.

If you're interested in running it locally, install Ollama and define a
schema like so
```graphql
type User {
    name: String
    about: String
    # The constraint is optional; 11434 is the default port for Ollama.
    name_v: [Float32!] @constraints(size: 768) @embedding(fields: ["name", "about"], provider: "ollama", model: "nomic-embed-text", url: "http://localhost:11434/api")
}
```

Next steps:
- Support templates for the content sent to the model.
- Add the `_similarity` operation to calculate the cosine similarity
  between two arrays.
fredcarle authored Feb 7, 2025
1 parent 34eab3f commit 730eb15
Showing 110 changed files with 7,380 additions and 878 deletions.
12 changes: 12 additions & 0 deletions .github/workflows/test-coverage.yml
@@ -37,6 +37,8 @@ env:

DEFRA_BADGER_ENCRYPTION: false

DEFRA_VECTOR_EMBEDDING: false

DEFRA_MUTATION_TYPE: collection-save
DEFRA_LENS_TYPE: wasm-time
DEFRA_ACP_TYPE: local
@@ -72,6 +74,7 @@ jobs:
DEFRA_BADGER_MEMORY: ${{ matrix.database-type == 'memory' }}
DEFRA_BADGER_FILE: ${{ matrix.database-type == 'file' }}
DEFRA_MUTATION_TYPE: ${{ matrix.mutation-type }}
DEFRA_VECTOR_EMBEDDING: true

steps:
- name: Checkout code into the directory
@@ -80,6 +83,15 @@ jobs:
- name: Setup defradb
uses: ./.github/composites/setup-defradb

- name: Install Ollama
run: make deps:ollama

- name: Run Ollama
run: make ollama

- name: Pull LLM model
run: make ollama:nomic

- name: Test coverage & save coverage report in an artifact
uses: ./.github/composites/test-coverage-with-artifact
with:
21 changes: 21 additions & 0 deletions Makefile
@@ -170,6 +170,16 @@ deps\:mocks:
deps\:playground:
cd $(PLAYGROUND_DIRECTORY) && npm install --legacy-peer-deps && npm run build

.PHONY: deps\:ollama
deps\:ollama:
ifeq ($(OS_GENERAL),Linux)
curl -fsSL https://ollama.com/install.sh | sh
else ifeq ($(OS_GENERAL),Darwin)
brew install ollama
else
@echo "Makefile installation of Ollama is not supported for your system. Please install manually."
endif

.PHONY: deps
deps:
@$(MAKE) deps:modules && \
@@ -185,6 +195,17 @@ mocks:
@$(MAKE) deps:mocks
mockery --config="tools/configs/mockery.yaml"

.PHONY: ollama
ollama:
# run ollama in the background
nohup ollama serve > ollama.log 2>&1 &

.PHONY: ollama\:nomic
ollama\:nomic:
# make sure ollama is running before continuing
time curl --retry 5 --retry-connrefused --retry-delay 0 -sf http://localhost:11434
ollama pull nomic-embed-text

.PHONY: dev\:start
dev\:start:
@$(MAKE) build
83 changes: 74 additions & 9 deletions client/collection_description.go
@@ -103,6 +103,16 @@ type CollectionDescription struct {
// Currently this property is immutable and can only be set on collection creation, however
// that will change in the future.
IsBranchable bool

// VectorEmbeddings contains the configuration for generating embedding vectors.
//
// This is only usable with array fields.
//
// When configured, embeddings may call 3rd party APIs inline with document mutations.
// This may increase the latency of mutation requests.
// This is necessary to ensure that the generated docID is representative of the
// content of the document.
VectorEmbeddings []VectorEmbeddingDescription
}

// QuerySource represents a collection data source from a query.
@@ -199,15 +209,16 @@ func sourcesOfType[ResultType any](col CollectionDescription) []ResultType {
// of json to a [CollectionDescription].
type collectionDescription struct {
// These properties are unmarshalled using the default json unmarshaller
-Name immutable.Option[string]
-ID uint32
-RootID uint32
-SchemaVersionID string
-IsMaterialized bool
-IsBranchable bool
-Policy immutable.Option[PolicyDescription]
-Indexes []IndexDescription
-Fields []CollectionFieldDescription
+Name immutable.Option[string]
+ID uint32
+RootID uint32
+SchemaVersionID string
+IsMaterialized bool
+IsBranchable bool
+Policy immutable.Option[PolicyDescription]
+Indexes []IndexDescription
+Fields []CollectionFieldDescription
+VectorEmbeddings []VectorEmbeddingDescription

// Properties below this line are unmarshalled using custom logic in [UnmarshalJSON]
Sources []map[string]json.RawMessage
@@ -230,6 +241,7 @@ func (c *CollectionDescription) UnmarshalJSON(bytes []byte) error {
c.Fields = descMap.Fields
c.Sources = make([]any, len(descMap.Sources))
c.Policy = descMap.Policy
c.VectorEmbeddings = descMap.VectorEmbeddings

for i, source := range descMap.Sources {
sourceJson, err := json.Marshal(source)
@@ -268,3 +280,56 @@ func (c *CollectionDescription) UnmarshalJSON(bytes []byte) error {

return nil
}

// VectorEmbeddingDescription holds the relevant information to generate embeddings.
//
// Embeddings are AI/ML specific vector representations of some content.
// In the case of DefraDB, that content is one or multiple fields, optionally added to a template.
type VectorEmbeddingDescription struct {
// FieldName is the name of the field on the collection that this embedding description applies to.
FieldName string
// Fields are the fields in the parent schema that will be used as the basis of the
// vector generation.
Fields []string
// Model is the LLM of the provider to use for generating the embeddings.
// For example: text-embedding-3-small
Model string
// Provider is the API provider to use for generating the embeddings.
// For example: openai
Provider string
// (Optional) Template is the local path of the template to use with the
// field values to form the content to send to the model.
//
// For example, with the following schema,
// ```
// type User {
// name: String
// age: Int
// name_about_v: [Float32!] @embedding(fields: ["name", "age"], ...)
// }
// ```
// we can define the following Go template.
// ```
// {{ .name }} is {{ .age }} years old.
// ```
Template string
// URL is the URL endpoint of the provider's API.
// For example: https://api.openai.com/v1
//
// Not providing a URL will result in the use of the default
// known URL for the given provider.
URL string
}

// IsSupportedVectorEmbeddingSourceKind returns true if the field used for embedding generation
// is of a supported kind.
//
// Currently, the supported kinds are Float32, Float64, Int and String.
func IsSupportedVectorEmbeddingSourceKind(fieldKind FieldKind) bool {
switch fieldKind {
case FieldKind_NILLABLE_FLOAT32, FieldKind_NILLABLE_FLOAT64, FieldKind_NILLABLE_INT, FieldKind_NILLABLE_STRING:
return true
default:
return false
}
}
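
The `Template` comment above describes combining field values with a Go template. A minimal sketch of that rendering step using the standard `text/template` package is shown below. `renderContent` is a hypothetical helper; it takes the template text directly rather than a local path, and the strict `missingkey=error` option is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderContent fills a Go template with document field values to build
// the text sent to the model. Hypothetical helper, not part of DefraDB.
func renderContent(tmplText string, fields map[string]any) (string, error) {
	// missingkey=error makes a template that references an absent field fail
	// instead of silently rendering "<no value>".
	tmpl, err := template.New("embedding").Option("missingkey=error").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, fields); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderContent(
		"{{ .name }} is {{ .age }} years old.",
		map[string]any{"name": "Alice", "age": 30},
	)
	fmt.Println(out, err)
}
```

With the sample values, the rendered content is `Alice is 30 years old.`, which would then be passed to the embedding provider.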
8 changes: 8 additions & 0 deletions client/collection_field_description.go
@@ -43,6 +43,12 @@ type CollectionFieldDescription struct {
//
// This value has no effect on views.
DefaultValue any

// Size is a constraint that can be applied to fields that are arrays.
//
// Mutations on fields with a size constraint will fail if the size of the array
// does not match the constraint.
Size int
}

func (f FieldID) String() string {
@@ -56,6 +62,7 @@ type collectionFieldDescription struct {
ID FieldID
RelationName immutable.Option[string]
DefaultValue any
Size int

// Properties below this line are unmarshalled using custom logic in [UnmarshalJSON]
Kind json.RawMessage
@@ -72,6 +79,7 @@ func (f *CollectionFieldDescription) UnmarshalJSON(bytes []byte) error {
f.ID = descMap.ID
f.DefaultValue = descMap.DefaultValue
f.RelationName = descMap.RelationName
f.Size = descMap.Size
kind, err := parseFieldKind(descMap.Kind)
if err != nil {
return err
2 changes: 1 addition & 1 deletion client/ctype.go
@@ -40,7 +40,7 @@ func (t CType) IsSupportedFieldCType() bool {
func (t CType) IsCompatibleWith(kind FieldKind) bool {
switch t {
case PN_COUNTER, P_COUNTER:
-if kind == FieldKind_NILLABLE_INT || kind == FieldKind_NILLABLE_FLOAT {
+if kind == FieldKind_NILLABLE_INT || kind == FieldKind_NILLABLE_FLOAT64 || kind == FieldKind_NILLABLE_FLOAT32 {
return true
}
return false
8 changes: 8 additions & 0 deletions client/definitions.go
@@ -148,6 +148,12 @@ type FieldDefinition struct {

// DefaultValue contains the default value for this field.
DefaultValue any

// Size is a constraint that can be applied to fields that are arrays.
//
// Mutations on fields with a size constraint will fail if the size of the array
// does not match the constraint.
Size int
}

// NewFieldDefinition returns a new [FieldDefinition], combining the given local and global elements
@@ -168,6 +174,7 @@ func NewFieldDefinition(local CollectionFieldDescription, global SchemaFieldDesc
Typ: global.Typ,
IsPrimaryRelation: kind.IsObject() && !kind.IsArray(),
DefaultValue: local.DefaultValue,
Size: local.Size,
}
}

@@ -179,6 +186,7 @@ func NewLocalFieldDefinition(local CollectionFieldDescription) FieldDefinition {
Kind: local.Kind.Value(),
RelationName: local.RelationName.Value(),
DefaultValue: local.DefaultValue,
Size: local.Size,
}
}
