
Exercise 07 - Create vector embeddings using an embedding model

In this exercise you will get a quick introduction to vector embeddings and you will learn:

  • What vector embeddings are.
  • How to create them.
  • How to store them in the SAP HANA Cloud vector engine.

What is a vector embedding?

AI models operate on data in a purely mathematical manner. An AI model cannot work with unstructured data such as text, audio, or images directly; the data must first be represented numerically. Vector embeddings bridge this gap: they represent unstructured data as multi-dimensional numeric vectors that an AI model can process.

To perform this transformation, an embedding model is used. Embedding models can be pre-trained on large data sets, or you can train one yourself. Either way, the embedding model produces an output vector representing the unstructured data with the goal of meaningful similarity. That means the vector embedding should capture the semantic meaning of the real-world data. This allows the mathematical assumption that two vector embeddings which are similar to each other have a high probability of representing similar real-world data. The same holds for dissimilar data.

Using an embedding model to create vector embeddings, i.e., numerical representations of real-world unstructured data, helps provide contextual information to an AI model which, based on that numerical data, can give a proper answer to a user's question.

This is what you will do in the following exercises. You will take unstructured data, transform it into many vector embeddings, and store them in the SAP HANA Cloud vector engine. These vector embeddings are then compared, using a mathematical algorithm, to a vector embedding of the user's question. This comparison returns the vector embedding most similar to the numerical representation of the user's query. The real-world data behind the resulting vector embedding can then be passed as contextual information to a chat model to retrieve a fitting answer to the user's query.

You can use two different algorithms to compare vector embeddings:

  • Euclidean distance measures the straight-line distance between two vectors.

  • Cosine similarity measures the cosine of the angle between two vectors. The result is between -1 and 1, where 1 represents complete similarity between two vectors, -1 the complete opposite, and 0 means the two vectors are unrelated. The cosine distance is simply 1 minus the cosine similarity.

These two measures differ in how they calculate the actual distance between two vectors. The smaller the distance, the higher the probability of similarity.

If you are interested in the mathematical representation of these two algorithms, read an interesting article on the topic in the Further Reading section.

An easy way to decide which algorithm to use is to simply try both and see which one provides the better result.
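To build some intuition, the two measures can be sketched in plain JavaScript. This is a toy illustration with made-up two-dimensional vectors, not how the SAP HANA Cloud vector engine computes them internally:

```javascript
// Euclidean distance: straight-line distance between two vectors.
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
}

// Cosine similarity: cosine of the angle between two vectors, in [-1, 1].
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

const query = [1, 0];
const close = [0.9, 0.1];
const far = [-1, 0];

console.log(euclideanDistance(query, close) < euclideanDistance(query, far)); // true
console.log(cosineSimilarity(query, close) > cosineSimilarity(query, far));  // true
```

With real embeddings the vectors have hundreds or thousands of dimensions, but the math is the same.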

Define a simple OData endpoint to trigger the creation of vector embeddings

You will define two OData endpoints in the provided cds file. These two endpoints help you trigger the creation and deletion of vector embeddings during this workshop. You could definitely create vector embeddings outside of such a project and pre-populate your HANA database, either via code or via the SAP generative AI Hub. For this workshop, I want you to do this on your own within this project so you get a good understanding of how this works.

👉 Open the job-posting-service.cds file.

You want to expose the entities DocumentChunks as well as the JobPostings via your OData service. To do so, you need a reference to the schema definition.

👉 Add the following line of code to create the reference to the schema definition:

using {sap.codejam as db} from '../db/schema';

👉 Create the service definition:

service JobPostingService { }

Within the service definition, you want to expose the aforementioned entities. The DocumentChunks entity has a field containing the actual embeddings, which are of type cds.Vector. The OData specification doesn't know the type cds.Vector, so it can't be exposed via the OData service. For some custom types defined by CDS it is possible to create a mapping to OData types, but not in this case: there is no corresponding OData type you could map cds.Vector to, which means you have to exclude this field from the OData service.
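For reference, the underlying entity in db/schema.cds likely resembles the following sketch. The field names are taken from this exercise; the Vector dimension of 1536 matches text-embedding-ada-002 but is an assumption, so check the actual schema file:

```cds
// Hypothetical sketch — the actual definition lives in db/schema.cds.
namespace sap.codejam;

entity DocumentChunks {
  key ID         : UUID;
      metadata   : LargeString;
      text_chunk : LargeString;
      embedding  : Vector(1536); // cds.Vector — no OData counterpart, hence excluded
}
```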

👉 Within the service definition, add an entity projection for DocumentChunks:

entity DocumentChunks as
      projection on db.DocumentChunks
      excluding {
          embedding
      };

👉 Right below the DocumentChunks entity, add a projection for the JobPostings entity:

entity JobPostings as projection on db.JobPostings;

Great! You got both entities projected in the OData service. Now, you will implement two OData functions in your service: one for the creation and one for the deletion of vector embeddings.

👉 Add two function definitions right below the JobPostings entity projection:

function createVectorEmbeddings()                   returns String;
function deleteVectorEmbeddings()                   returns String;

Implement the creation of vector embeddings

To achieve a cleaner code structure, you will implement most of the business logic in separate files, following the idea of separation of concerns. The separation is kept lightweight: you will create just two separate files, one for handling all AI-related tasks and one for handling all database-related tasks. The project provides you with these two, currently empty, files: AIHelper and DBUtils.

You will implement the function handlers, and you will call logic from within the AIHelper and the DBUtils. You will jump back and forth between these files to implement the needed business logic.

The SAP Cloud SDK for AI uses ES modules, which means you need to export the function implementations differently from what you may be used to with CAP. For this CodeJam, you can simply use the export default function () approach.

👉 Open the job-posting-service.js file.

👉 Add the following code block to the file:

export default function () {
  // implementation goes here ...
}

👉 In the code block add the following function handlers:

this.on('createVectorEmbeddings', async req => {
  // implementation goes here
});

this.on('deleteVectorEmbeddings', async req => {
  // implementation goes here
});

👉 In the createVectorEmbeddings handler, add the following lines of code, which call the creation of the vector embeddings from the AIHelper and store the result of this call in the database:

const embeddings = await AIHelper.createVectorEmbeddings();
const embeddingEntries = await DBUtils.createEmbeddingEntries(embeddings);
await DBUtils.insertVectorEmbeddings(embeddingEntries);
return 'Vector embeddings created and stored in database';

Your function should look like this now:

this.on('createVectorEmbeddings', async () => {
  const embeddings = await AIHelper.createVectorEmbeddings();
  const embeddingEntries = await DBUtils.createEmbeddingEntries(embeddings);
  await DBUtils.insertVectorEmbeddings(embeddingEntries);
  return 'Vector embeddings created and stored in database';
});

This code won't execute yet, because the corresponding functions are neither defined nor implemented in the AIHelper and DBUtils. You will do this now.

👉 Open the ai-helper.js.

Within the file you need to import the embedding client from the @sap-ai-sdk/langchain package.

👉 Add the following lines of code to the top of the file:

import {
  AzureOpenAiEmbeddingClient,
} from '@sap-ai-sdk/langchain';

👉 Right below the import statement add the following constant containing the embedding model's name:

const embeddingModelName = 'text-embedding-ada-002';

You define the embedding model's name in a constant because you will use the name again at a later point. This gives you a single point of truth in case you want to change the embedding model in the future.

👉 Do the same for the resource group:

const resourceGroup = '<your-resource-group>';

👉 To use CDS methods import CDS:

import cds from '@sap/cds';

👉 To have access to the DocumentChunks table, add the DocumentChunks constant:

const { DocumentChunks } = cds.entities;

To create vector embeddings, you need to read the contextual information file which in your case is a text document.

👉 Import the text loader from langchain to read the needed document from file:

import { TextLoader } from 'langchain/document_loaders/fs/text';

👉 Import the path tool to make definition of the file path easier:

import path from 'path';

👉 Import a text splitter for splitting the text document into meaningful chunks for the embedding model to process into vector embeddings:

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

You now have all the APIs imported to read a text file, split it into meaningful chunks, and send it to the embedding model. You will implement the createVectorEmbeddings() function now.

👉 Add the following asynchronous function createVectorEmbeddings():

async function createVectorEmbeddings() {
  try {
    // implementation goes here...
  } catch (error) {
    console.log(`Error while generating embeddings.
      Error: ${JSON.stringify(error.response)}`);
    throw error;
  }
}

👉 Within the try block implement the loading of the text document:

const loader = new TextLoader(path.resolve('db/data/demo_grounding.txt'));
const document = await loader.load();

The code is using Langchain's text loader to load the demo_grounding.txt file.

With the file at hand, you will define the text splitter and use it to split the text within the document into chunks of the specified size. The size of a text chunk is defined in characters for the specified text splitter. Finding a working chunk size is not easy; it takes experience as well as testing. If the size is set too high or too low, the embeddings might not be as usable as you want them to be. You can soften the cuts and make the chunks more applicable by also setting a chunk overlap.
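To get a feel for how chunk size and overlap interact, here is a toy sliding-window splitter in plain JavaScript. It is deliberately naive (RecursiveCharacterTextSplitter is smarter: it tries to cut at paragraph, sentence, and word boundaries first), but it shows what the two numbers control:

```javascript
// Naive character-window splitter — illustrative only, not the langchain implementation.
// chunkSize: length of each chunk in characters; chunkOverlap: characters shared
// between consecutive chunks (must be smaller than chunkSize).
function naiveSplit(text, chunkSize, chunkOverlap) {
  const chunks = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const text = 'abcdefghij'; // 10 characters
console.log(naiveSplit(text, 4, 0)); // [ 'abcd', 'efgh', 'ij' ]
console.log(naiveSplit(text, 4, 2)); // consecutive windows share 2 characters
```

The overlap makes sure a sentence cut in half by one chunk boundary still appears whole in a neighboring chunk, which tends to make the resulting embeddings more robust.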

👉 Right below the document loading call, add the initialization of the RecursiveCharacterTextSplitter:

const splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 70,
      chunkOverlap: 0,
      addStartIndex: true
});

👉 Call splitDocuments() with the loaded document:

const splitDocuments = await splitter.splitDocuments(document);

The objects returned by the text splitter are JSON objects that hold their text under the key pageContent. These values are the text chunks. You need to extract them to pass them to the embedding client.

👉 Implement a simple array loop to extract the values:

const textSplits = [];
for (const chunk of splitDocuments) {
  textSplits.push(chunk.pageContent);
}

You need a way to feed the document splits to the embedding model. If you remember, you deployed an embedding model through SAP AI Launchpad. You will use this model to create the vector embeddings. SAP provides the SAP Cloud SDK for AI, which allows you to utilize Langchain APIs enhanced with SAP functionality such as connectivity to SAP AI Core. You will create an AzureOpenAiEmbeddingClient instance, which can automatically connect to SAP AI Core using a service binding or the .env file.

👉 Create the embedding client by adding the following code:

const embeddingClient = new AzureOpenAiEmbeddingClient({
      modelName: embeddingModelName,
      maxRetries: 0,
      resourceGroup: resourceGroup
});

👉 Call the embedding client to embed the document splits:

const embeddings = await embeddingClient.embedDocuments(textSplits);

👉 Finally, return the embeddings and the document splits. These values will be stored in the DocumentChunks table in the database. Add the following code:

return [embeddings, splitDocuments];

Your complete implementation should look like this now:

async function createVectorEmbeddings() {
  try {
    const loader = new TextLoader(path.resolve('db/data/demo_grounding.txt'));
    const document = await loader.load();

    const splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 70,
      chunkOverlap: 0,
      addStartIndex: true
    });

    const splitDocuments = await splitter.splitDocuments(document);

    const textSplits = [];
    for (const chunk of splitDocuments) {
      textSplits.push(chunk.pageContent);
    }

    const embeddingClient = new AzureOpenAiEmbeddingClient({
      modelName: embeddingModelName,
      maxRetries: 0,
      resourceGroup: resourceGroup
    });
    const embeddings = await embeddingClient.embedDocuments(textSplits);

    return [embeddings, splitDocuments];
  } catch (error) {
    console.log(`Error while generating embeddings.
      Error: ${JSON.stringify(error.response)}`);
    throw error;
  }
}

👉 Lastly, export createVectorEmbeddings by adding the following as the last line of the ai-helper.js file:

export { createVectorEmbeddings };

Implement the creation of vector embedding entries

You are about to implement the storing of the generated vector embeddings to the SAP HANA Cloud vector engine.

👉 Open the db-utils.js file.

👉 Add the following import statement to the top of the file:

import cds from '@sap/cds';

👉 Add the following line of code to make SQL insert and delete querying available to you:

const { INSERT, DELETE } = cds.ql;

👉 Add the following line of code to make the database entities available to you:

const { JobPostings, DocumentChunks } = cds.entities;

To make your code easier to read, you will separate the functionality of creating the embedding entries from the insertion into the database. You will utilize the DocumentChunks entity you made available above to create the JSON objects for database insertion.

👉 Define the createEmbeddingEntries() function:

export function createEmbeddingEntries([embeddings, splitDocuments]) {
  // implementation goes here
}

👉 Within the function, define an array holding the different embedding entries:

let embeddingEntries = [];

👉 Iterate over the embeddings and create a JSON object for each embedding in the list. Right below the previous line of code add the following:

for (const [index, embedding] of embeddings.entries()) {
    const embeddingEntry = {
      metadata: splitDocuments[index].metadata.source,
      text_chunk: splitDocuments[index].pageContent,
      embedding: array2VectorBuffer(embedding)
    };
    embeddingEntries.push(embeddingEntry);
  }

👉 As the last line of createEmbeddingEntries(), return the embedding entry list:

return embeddingEntries;

As you might have noticed, you are calling a conversion function to convert each embedding into a vector buffer for database insertion. You need to implement this function next.

👉 Below the createEmbeddingEntries function, implement the following:

// Helper method to convert embeddings to buffer for insertion
let array2VectorBuffer = data => {
  const sizeFloat = 4;
  const sizeDimensions = 4;
  const bufferSize = data.length * sizeFloat + sizeDimensions;

  const buffer = Buffer.allocUnsafe(bufferSize);
  // write size into buffer
  buffer.writeUInt32LE(data.length, 0);
  data.forEach((value, index) => {
    buffer.writeFloatLE(value, index * sizeFloat + sizeDimensions);
  });
  return buffer;
};
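To sanity-check the buffer layout (a 4-byte little-endian dimension count followed by 4-byte little-endian floats), you can decode a buffer back into an array. The vectorBuffer2Array helper below is hypothetical, written just for this check; array2VectorBuffer is repeated so the snippet is self-contained:

```javascript
// Copied from the exercise so this snippet runs on its own.
let array2VectorBuffer = data => {
  const sizeFloat = 4;
  const sizeDimensions = 4;
  const buffer = Buffer.allocUnsafe(data.length * sizeFloat + sizeDimensions);
  buffer.writeUInt32LE(data.length, 0);
  data.forEach((value, index) => {
    buffer.writeFloatLE(value, index * sizeFloat + sizeDimensions);
  });
  return buffer;
};

// Hypothetical inverse: read the dimension count, then each float32 value.
const vectorBuffer2Array = buffer => {
  const dimensions = buffer.readUInt32LE(0);
  const values = [];
  for (let i = 0; i < dimensions; i++) {
    values.push(buffer.readFloatLE(4 + i * 4));
  }
  return values;
};

// 0.5, -1.25 and 2 are exactly representable as float32, so the round trip is lossless.
const roundTrip = vectorBuffer2Array(array2VectorBuffer([0.5, -1.25, 2]));
console.log(roundTrip); // [ 0.5, -1.25, 2 ]
```

Note that arbitrary doubles lose precision when narrowed to float32, so a round trip of real embedding values will be approximately, not exactly, equal.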

The complete createEmbeddingEntries function should look like this:

export function createEmbeddingEntries([embeddings, splitDocuments]) {
  let embeddingEntries = [];
  for (const [index, embedding] of embeddings.entries()) {
    const embeddingEntry = {
      metadata: splitDocuments[index].metadata.source,
      text_chunk: splitDocuments[index].pageContent,
      embedding: array2VectorBuffer(embedding)
    };
    embeddingEntries.push(embeddingEntry);
  }
  return embeddingEntries;
}

Implement the insertion of the vector embedding entries

The insertion into the database is simple. You use the CAP CQL syntax to insert all entries to the DocumentChunks table.

👉 Create the function insertVectorEmbeddings first:

export async function insertVectorEmbeddings(embeddingEntries) {
  try {
    // implementation goes here
  } catch (error) {
    console.log(
      `Error while storing the vector embeddings to SAP HANA Cloud: ${error.toString()}`
    );
    throw error;
  }
}

👉 Add the code for the database insertion within the try block:

await INSERT.into(DocumentChunks).entries(embeddingEntries);

return `Embeddings inserted successfully to table.`;

The insertVectorEmbeddings function should now look like this:

export async function insertVectorEmbeddings(embeddingEntries) {
  try {
    await INSERT.into(DocumentChunks).entries(embeddingEntries);

    return `Embeddings inserted successfully to table.`;
  } catch (error) {
    console.log(
      `Error while storing the vector embeddings to SAP HANA Cloud: ${error.toString()}`
    );
    throw error;
  }
}

Implement the deletion of the vector embeddings

You can now create and insert vector embeddings; the last step is to make it possible to delete them.

👉 Right below the insertVectorEmbeddings function, add the following code:

export async function deleteDocumentChunks() {
  try {
    // implementation goes here
  } catch (error) {
    console.log(
      `Error while deleting Document Chunks: \n ${JSON.stringify(error.response)}`
    );
  }
}

👉 Within the try block add the following code:

await DELETE.from(DocumentChunks);
return 'Successfully deleted Document Chunks!';

The deleteDocumentChunks should look like this now:

export async function deleteDocumentChunks() {
  try {
    await DELETE.from(DocumentChunks);
    return 'Successfully deleted Document Chunks!';
  } catch (error) {
    console.log(
      `Error while deleting Document Chunks: \n ${JSON.stringify(error.response)}`
    );
  }
}

Create some vector embeddings

At this point you have achieved a lot! You have defined and implemented not only the OData CAP service but also the database insertion and deletion. One thing you haven't done yet is test the code.

You will do this now!

👉 Open a new terminal if not already open.

👉 Make sure that you are still connected to the Cloud Foundry instance by checking the connection details:

cf target

If the reply from the CLI tells you to log in again, simply enter cf login:

cf login

👉 Build the project first by calling the cds build --production command.

cds build --production

Now, you will utilize the cds watch --profile hybrid command to run the project on localhost while establishing a real, live connection to the database. This speeds up local development by working around the need to deploy to BTP.

👉 From the CLI run:

cds watch --profile hybrid

Look at the console output! You can tell that it is running your project on localhost, but also that it uses the .cdsrc-private.json file to look up the binding to your HDI container within the SAP HANA Cloud instance. This lookup is used to establish a connection to the database.

👉 Open localhost in a browser, or use the prompt in Business Application Studio to open the landing page directly.

01_localhost.png

From there you can access the database tables, but you will see that they are currently empty.

You can use the URL to call your OData function handler for creating the vector embeddings.

👉 Call the following URL: http://localhost:4004/odata/v4/job-posting/createVectorEmbeddings()

👉 Take a look at the console output. You can tell that the connection to AI Core has been established.

02_ai_core_connection.png

👉 Take a look at your browser window. You can see the success message you have implemented.

03_success.png

Summary

Congratulations! The call went through and apparently the vector embeddings were stored in the database. Wouldn't it be nice to have an easy way to check the entries in the table?!

Remember the exercise where you used the hana-cli to do exactly that? Try this step on your own: use the CLI tool to check the database entries. If you need a quick recap, go back to Exercise 06 and check the instructions.

Questions for Discussion

  1. How are vector embeddings created in the SAP HANA Cloud system?
Answer: Vector embeddings are created using an embedding model, which can either be pre-trained or trained by the user. The process typically involves the following steps:
  • Load the unstructured data (e.g., text).
  • Split the text into meaningful chunks.
  • Feed these chunks into the embedding model to generate numerical vectors (embeddings).
  • Store the embeddings in the SAP HANA Cloud vector engine for later use.
  2. What are the two algorithms used to compare vector embeddings, and how do they differ?
Answer: The two algorithms used to compare vector embeddings are Euclidean Distance and Cosine Similarity.
  • Euclidean Distance measures the straight-line distance between two vectors. The closer the vectors, the more similar they are.

  • Cosine Similarity calculates the cosine of the angle between two vectors, resulting in a value between -1 and 1. A cosine similarity of 1 indicates complete similarity, while -1 means complete dissimilarity, and 0 indicates no relationship. These algorithms differ in how they compute the "distance" or "similarity" between vectors, with cosine similarity being more sensitive to the direction of the vector rather than the magnitude.

Further reading


Next exercise