feat: Add constraints and embedding directives (#3405)
## Relevant issue(s)

Resolves #3350
Resolves #3351

## Description

Sorry for having 3 technically separate features as part of 1 PR. The reason is that I started working on embeddings, and the size constraint was initially part of the embedding directive. Since we discussed that it could be applied to all array fields (and later even to `String` and `Blob`), I extracted it into a `@Constraints` directive with a `size` parameter (more constraints can be added in the future).

Furthermore, embeddings returned from ML models are arrays of float32. This caused precision issues because we only supported float64: after saving a float32 array, querying it would return a float64 array with slight precision drift. I decided to add the `Float32` type. You can review the first commit for the constraint- and embedding-related code and the second commit for the float-related changes. Some float changes may have leaked into the first commit; sorry for this, I tried hard to keep the float32-related changes separate.

Note that the `gql.Float` type is now `Float64` internally, so

```graphql
type User {
  points: Float
}
```

is the same as

```graphql
type User {
  points: Float64
}
```

The embedding generation relies on a third-party package called `chromem-go` to call the model provider API. As long as one of the supported provider APIs is configured and accessible, embeddings will be generated when adding new documents. I've added a step in the test workflow that runs the embedding-specific tests on Linux only (installation on Windows and macOS is less straightforward), using Ollama because it runs locally.

The call to the API has to be done synchronously, otherwise the docID/CID won't be representative of the contents. The only alternative would be for the system to automatically update the document when the API call returns, but I see that as an inferior option because it hides the update step from the user.
It could also make doc anchoring more complicated, as the user would have to remember to wait on the doc update before anchoring the doc at the right CID.

We could avoid supporting embedding generation and let users make that call themselves and store the embedding vectors directly. However, having it as a feature allows us to support RAG vector search, which would let users get started with AI with very little work. This seems to be something our partners are looking forward to. I don't see the third-party API call inline with a mutation as a problem, since this is something users have to configure and they will expect the mutation calls to take more time as a result.

If you're interested in running it locally, install Ollama and define a schema like so:

```graphql
type User {
  name: String
  about: String
  # The constraint is optional, and localhost:11434 is the default port for Ollama.
  name_v: [Float32!] @Constraints(size: 768) @Embedding(fields: ["name", "about"], provider: "ollama", model: "nomic-embed-text", url: "http://localhost:11434/api")
}
```

Next steps:
- Support templates for the content sent to the model.
- Add the `_similarity` operation to calculate the cosine similarity between two arrays.
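As a side note, the float32/float64 precision drift that motivated the new `Float32` type is easy to reproduce in plain Go (a minimal illustration, not DefraDB code):

```go
package main

import "fmt"

func main() {
	// A value produced by an embedding model as float32.
	v32 := float32(0.1)

	// Widening it to float64 does not recover 0.1 exactly:
	// float32 carries only ~7 decimal digits, so the float64
	// round-trip exposes the representation error.
	v64 := float64(v32)

	fmt.Println(v64 == 0.1)    // false
	fmt.Printf("%.17f\n", v64) // 0.10000000149011612
}
```

This is why storing a float32 array in a float64-only column and reading it back appears to "change" the values: the stored bits are exact, but they are the float32 approximations, not the float64 ones.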
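For reference, the cosine similarity that the planned `_similarity` operation would compute is dot(a, b) / (|a|·|b|). A small Go sketch of the math (the function name is illustrative, not DefraDB API):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns dot(a, b) / (|a| * |b|) for two
// equal-length float32 vectors, accumulating in float64.
func cosineSimilarity(a, b []float32) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	fmt.Println(cosineSimilarity([]float32{1, 0}, []float32{1, 0})) // 1
	fmt.Println(cosineSimilarity([]float32{1, 0}, []float32{0, 1})) // 0
}
```

Identical vectors score 1, orthogonal vectors score 0, which is what makes it useful for ranking embedding matches in a RAG vector search.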
Showing 110 changed files with 7,380 additions and 878 deletions.