From a3939638ac569b24e4f91b55ce86f6ebf5824937 Mon Sep 17 00:00:00 2001
From: Haniehz1
Date: Thu, 20 Feb 2025 17:27:49 -0500
Subject: [PATCH 1/2] Create experiments.mdx

---
 website/docs/autoeval/experiments.mdx | 200 ++++++++++++++++++++++++++
 1 file changed, 200 insertions(+)
 create mode 100644 website/docs/autoeval/experiments.mdx

diff --git a/website/docs/autoeval/experiments.mdx b/website/docs/autoeval/experiments.mdx
new file mode 100644
index 00000000..04adcfb7
--- /dev/null
+++ b/website/docs/autoeval/experiments.mdx
@@ -0,0 +1,200 @@
+# Experimentation Guide
+
+Build and manage your **evaluation experiments** with **LastMile AI AutoEval**. Use the **AutoEval library** to create **Experiments**, a structured way to organize and track evaluation runs as you make iterative changes to your **AI application**.
+
+Experiments allow you to **systematically test** the impact of changes, such as:
+- Updating the **LLM model**
+- Modifying the **retrieval strategy** for a **RAG system**
+- Adjusting **system prompts** for an agent
+- And more
+
+### Usage Guide
+This guide walks through the process of setting up and running experiments using AutoEval, including:
+1. **Setting up the API key** and **creating a project**
+2. **Preparing and uploading a dataset**
+3. **Creating an Experiment**
+4. **Evaluating the dataset** against default metrics, logging results, and iterating on changes
+5. **Visualizing the results** in the Experiments Console
+
+---
+
+## 1. Set Up AutoEval Client
+
+Before running experiments, ensure you have the latest version of AutoEval:
+
+```bash
+pip install lastmile --upgrade
+```
+
+#### Authenticate with the LastMile AI API
+
+To interact with the **LastMile AI API**, set your API key as an environment variable.
+
+πŸ“Œ **Tip:** If you don't have an API key yet, visit the **LastMile AI dashboard**, navigate to the **API section** on the sidebar, and copy your key.
+
+```python
+import os
+
+api_token = os.environ.get("LASTMILE_API_KEY")
+
+if not api_token:
+    print("Error: Please set your API key in the environment variable LASTMILE_API_KEY")
+elif api_token == "YOUR_API_KEY_HERE":
+    print("Error: Please replace 'YOUR_API_KEY_HERE' with your actual API key")
+else:
+    print("βœ“ API key successfully configured!")
+```
+
+### Initialize the AutoEval Client
+Once authenticated, initialize the **AutoEval client**:
+
+```python
+# Setup Pandas to display without truncation (for display purposes)
+import pandas as pd
+pd.set_option('display.max_columns', None)
+pd.set_option('display.max_rows', None)
+pd.set_option('display.width', None)
+pd.set_option('display.max_colwidth', None)
+
+from lastmile.lib.auto_eval import AutoEval
+
+client = AutoEval(api_token=api_token) # Optionally set project_id to scope to a specific project
+```
+
+
+
+
+
+---
+
+## 2. Create a Project or Select an Existing Project
+
+A **Project** is the container that organizes your **Experiments, Evaluation runs, and Datasets**. It typically corresponds to the **AI initiative or application** you’re building.
+
+Projects help keep evaluations structured, especially when managing multiple experiments across different AI models or applications. You can create new projects or use existing ones.
+
+To create a new project programmatically, use:
+
+```python
+project = client.create_project(
+    name="AutoEval Experiments",
+    description="Project to test AutoEval Experiments"
+)
+
+# Important - set the project_id in the client so all requests are scoped to this project
+client.project_id = project.id
+```
+
+Once a project is created, you can list all available projects, including the default **"AutoEval"** project:
+
+```python
+# List all projects in your account
+projects = client.list_projects()
+projects
+```
+
+If you already have a project and want to use it, retrieve it using the `project_id`:
+
+```python
+default_project = client.get_project(project_id="z8kfriq6cga6j0fx38znw4y6")
+default_project
+```
+
+βœ… **Next Step:** **Prepare and upload a dataset.**
+
+---
+
+## 3. Prepare and Upload Your Dataset
+
+Now that the API key is configured, it's time to **prepare and upload a dataset** for evaluation.
+
+LastMile AI AutoEval expects a **CSV file** with the following columns:
+- **`input`**: The user's query or input text
+- **`output`**: The assistant's response to the user's query
+- **`ground_truth`** *(optional)*: The correct or expected response for comparison
+
+Uploading this dataset allows you to evaluate how well the assistant's responses align with the **ground truth** using LastMile AI’s evaluation metrics.
+
+To upload your dataset, use the following code:
+
+```python
+dataset_csv_path = "ADD_YOUR_DATASET_HERE"
+
+dataset_id = client.upload_dataset(
+    file_path=dataset_csv_path,
+    name="NAME_OF_YOUR_DATASET",
+    description="DESCRIPTION_OF_DATASET"
+)
+
+print(f"Dataset created with ID: {dataset_id}")
+```
+
+βœ… **Next Step:** **Create an Experiment.**
+
+---
+
+## 4. Create an Experiment
+
+To create an experiment, use the following code:
+
+```python
+experiment = client.create_experiment(
+    name="EXPERIMENT_NAME",
+    description="EXPERIMENT_DESCRIPTION",
+    metadata={
+        "model": "gpt-4o",
+        "temperature": 0.8,
+        "misc": {
+            "dataset_version": "0.1.1",
+            "app": "customer-support"
+        }
+    }
+)
+```
+
+To retrieve an experiment by ID:
+
+```python
+experiment = client.get_experiment(experiment_id=experiment.id)
+```
+
+βœ… **Next Step:** **Evaluate the dataset against default metrics.**
+
+---
+
+## 5. Evaluate the Dataset Against Built-in Metrics
+
+```python
+from lastmile.lib.auto_eval import BuiltinMetrics
+
+default_metrics = [
+    BuiltinMetrics.FAITHFULNESS,
+    BuiltinMetrics.RELEVANCE,
+]
+
+print("Evaluation job kicked off")
+
+evaluation_results = client.evaluate_dataset(
+    dataset_id=dataset_id,
+    metrics=default_metrics,
+    experiment_id=experiment.id,
+    metadata={"extras": "Base metric tests"}
+)
+
+print("Evaluation Results:")
+evaluation_results.head(10)
+```
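+
+If you want a quick summary of a run before opening the console, you can inspect the results object locally. The snippet below is only an illustrative sketch: it assumes `evaluation_results` behaves like a pandas DataFrame with one numeric score column per metric (as the `.head(10)` call above suggests), and it uses plain pandas rather than any additional AutoEval API.
+
+```python
+# Illustrative only: summarize and persist this evaluation run.
+# Assumes evaluation_results is a pandas DataFrame whose metric scores are numeric columns.
+metric_summary = evaluation_results.mean(numeric_only=True)
+print("Average score per metric for this run:")
+print(metric_summary)
+
+# Keep a local copy so later runs in the same Experiment can be compared offline.
+evaluation_results.to_csv("experiment_run_results.csv", index=False)
+```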
+
+βœ… **Next Step:** **Visualize in the Experiments Console.**
+
+---
+
+## 6. Visualize in the Experiments Console
+
+πŸ“Š **Explore your results in the AutoEval UI:**
+- πŸ”¬ **Experiments Overview:** [View all experiments](https://lastmileai.dev/evaluations?view=experiments)
+- πŸ“ˆ **Evaluation Runs:** [See all evaluation runs](https://lastmileai.dev/evaluations?view=all_runs)
+- 🏒 **Project Dashboard:** [Manage projects and experiments](https://lastmileai.dev/dashboard)
+- πŸ“‚ **Dataset Library:** [Browse and manage uploaded datasets](https://lastmileai.dev/datasets)
+
+πŸš€ **Start iterating on your AI application based on the evaluation insights!**
+

From a0ddca4f836b89f92783bf6aea6510dc233e52e8 Mon Sep 17 00:00:00 2001
From: Andrew Hoh <129882602+andrew-lastmile@users.noreply.github.com>
Date: Thu, 20 Feb 2025 18:47:45 -0500
Subject: [PATCH 2/2] Updating metadata and small wordsmithing

---
 website/docs/autoeval/experiments.mdx | 106 +++++++++++---------------
 1 file changed, 43 insertions(+), 63 deletions(-)

diff --git a/website/docs/autoeval/experiments.mdx b/website/docs/autoeval/experiments.mdx
index 04adcfb7..3d7681dc 100644
--- a/website/docs/autoeval/experiments.mdx
+++ b/website/docs/autoeval/experiments.mdx
@@ -1,14 +1,29 @@
-# Experimentation Guide
+---
+title: "Experimentation"
+---
+
+import Tabs from "@theme/Tabs";
+import TabItem from "@theme/TabItem";
+import constants from "@site/core/tabConstants";
+
+# Experimentation
+
+Track and manage experiments on your LLMs and applications to compare performance, test changes, and validate improvements.
 
-Build and manage your **evaluation experiments** with **LastMile AI AutoEval**. Use the **AutoEval library** to create **Experiments**, a structured way to organize and track evaluation runs as you make iterative changes to your **AI application**.
+## What is an Experiment?
 
-Experiments allow you to **systematically test** the impact of changes, such as:
-- Updating the **LLM model**
+An *Experiment* is a collection of *evaluation runs*. Each *evaluation run* consists of the *dataset* and *metrics* for that run. For example, let's say you have a dataset of customer support chatbot questions, answers, and context. For your *Customer Support Experiment*, you can run this dataset against AutoEval's *relevance*, *toxicity*, and *faithfulness* metrics as one *evaluation run*. Next, you can update the chatbot's model (say, to Gemini) and generate new answers and context for the same set of questions. You can then save that as another *evaluation run* and compare the results to determine whether the model change was an improvement.
+
+*Experiments* enable you to confidently make iterative changes to your LLM application in a structured and organized way.
+
+## What types of changes can Experiments measure?
+Anything that influences the LLM application's performance is measurable through an experiment, such as:
+- Updates to the **LLM**, such as a model with a newer training date
 - Modifying the **retrieval strategy** for a **RAG system**
 - Adjusting **system prompts** for an agent
-- And more
+- And more 
 
-### Usage Guide
+## Usage Guide
 This guide walks through the process of setting up and running experiments using AutoEval, including:
 1. **Setting up the API key** and **creating a project**
 2. **Preparing and uploading a dataset**
@@ -18,7 +33,7 @@ This guide walks through the process of setting up and running experiments using
 
 ---
 
-## 1. Set Up AutoEval Client
+### 1. Set Up AutoEval Client
 
 Before running experiments, ensure you have the latest version of AutoEval:
 
@@ -39,8 +54,6 @@ api_token = os.environ.get("LASTMILE_API_KEY")
 
 if not api_token:
     print("Error: Please set your API key in the environment variable LASTMILE_API_KEY")
-elif api_token == "YOUR_API_KEY_HERE":
-    print("Error: Please replace 'YOUR_API_KEY_HERE' with your actual API key")
 else:
     print("βœ“ API key successfully configured!")
 ```
@@ -49,26 +62,15 @@ else:
 Once authenticated, initialize the **AutoEval client**:
 
 ```python
-# Setup Pandas to display without truncation (for display purposes)
-import pandas as pd
-pd.set_option('display.max_columns', None)
-pd.set_option('display.max_rows', None)
-pd.set_option('display.width', None)
-pd.set_option('display.max_colwidth', None)
-
 from lastmile.lib.auto_eval import AutoEval
-
 client = AutoEval(api_token=api_token) # Optionally set project_id to scope to a specific project
 ```
 
-
-
-
 ---
 
-## 2. Create a Project or Select an Existing Project
+### 2. Create a Project or Select an Existing Project
 
-A **Project** is the container that organizes your **Experiments, Evaluation runs, and Datasets**. It typically corresponds to the **AI initiative or application** you’re building.
+A **Project** is the container that organizes your **Experiments, Evaluation runs, and Datasets**. It typically corresponds to the **AI initiative, application, or use case** you’re building.
 
 Projects help keep evaluations structured, especially when managing multiple experiments across different AI models or applications. You can create new projects or use existing ones.
 
@@ -76,41 +78,25 @@ To create a new project programmatically, use:
 
 ```python
 project = client.create_project(
-    name="AutoEval Experiments",
-    description="Project to test AutoEval Experiments"
+    name="Example Customer Agent Project",
+    description="Example project to evaluate customer support agents"
 )
 
-# Important - set the project_id in the client so all requests are scoped to this project
 client.project_id = project.id
 ```
 
-Once a project is created, you can list all available projects, including the default **"AutoEval"** project:
-
-```python
-# List all projects in your account
-projects = client.list_projects()
-projects
-```
-
-If you already have a project and want to use it, retrieve it using the `project_id`:
-
-```python
-default_project = client.get_project(project_id="z8kfriq6cga6j0fx38znw4y6")
-default_project
-```
-
-βœ… **Next Step:** **Prepare and upload a dataset.**
-
 ---
 
-## 3. Prepare and Upload Your Dataset
+### 3. Prepare and Upload Your Dataset
 
-Now that the API key is configured, it's time to **prepare and upload a dataset** for evaluation.
+You can either use an existing dataset or upload a new dataset to run an evaluation within your new experiment. For an existing dataset, you can skip to Step 4.
+
+#### Uploading a new dataset
 
 LastMile AI AutoEval expects a **CSV file** with the following columns:
 - **`input`**: The user's query or input text
 - **`output`**: The assistant's response to the user's query
-- **`ground_truth`** *(optional)*: The correct or expected response for comparison
+- **`ground_truth`** *(optional)*: The correct or expected response for comparison. This can also be the context for metrics like *faithfulness*.
 
 Uploading this dataset allows you to evaluate how well the assistant's responses align with the **ground truth** using LastMile AI’s evaluation metrics.
 
 To upload your dataset, use the following code:
 
@@ -128,14 +114,16 @@ dataset_id = client.upload_dataset(
 print(f"Dataset created with ID: {dataset_id}")
 ```
 
-βœ… **Next Step:** **Create an Experiment.**
-
 ---
 
-## 4. Create an Experiment
+### 4. Create an Experiment
 
 To create an experiment, use the following code:
 
+:::info
+The experiment's metadata should be used to track important information (or parameters) about the application being tested. Important metadata to track includes the LLM being used, the LLM parameters, the prompt version, the dataset version, the application, etc.
+:::
+
 ```python
 experiment = client.create_experiment(
     name="EXPERIMENT_NAME",
@@ -151,17 +139,13 @@ experiment = client.create_experiment(
 )
 ```
 
-To retrieve an experiment by ID:
-
-```python
-experiment = client.get_experiment(experiment_id=experiment.id)
-```
-
-βœ… **Next Step:** **Evaluate the dataset against default metrics.**
-
 ---
 
-## 5. Evaluate the Dataset Against Built-in Metrics
+### 5. Evaluate the Dataset Against Built-in Metrics
+
+:::info
+You can also specify custom or fine-tuned metrics to run against the dataset.
+:::
 
 ```python
 from lastmile.lib.auto_eval import BuiltinMetrics
@@ -183,12 +167,9 @@ evaluation_results = client.evaluate_dataset(
 print("Evaluation Results:")
 evaluation_results.head(10)
 ```
-
-βœ… **Next Step:** **Visualize in the Experiments Console.**
-
 ---
 
-## 6. Visualize in the Experiments Console
+### 6. Visualize in the Experiments Console
 
 πŸ“Š **Explore your results in the AutoEval UI:**
 - πŸ”¬ **Experiments Overview:** [View all experiments](https://lastmileai.dev/evaluations?view=experiments)
@@ -196,5 +177,4 @@ evaluation_results.head(10)
 - 🏒 **Project Dashboard:** [Manage projects and experiments](https://lastmileai.dev/dashboard)
 - πŸ“‚ **Dataset Library:** [Browse and manage uploaded datasets](https://lastmileai.dev/datasets)
 
-πŸš€ **Start iterating on your AI application based on the evaluation insights!**
-
+**Check out the [full cookbook](https://github.com/lastmile-ai/lastmile-docs/blob/main/cookbook/AutoEval_Experiments.ipynb) for expanded details on this functionality.**