Add TAG-Bench in agent_eval #230

Open · wants to merge 12 commits into main
8 changes: 8 additions & 0 deletions evals/evaluation/agent_eval/README.md
@@ -0,0 +1,8 @@
# Benchmarks for agentic applications
We collected two benchmarks for evaluating agentic applications:
1. [CRAG](./crag_eval/README.md) (Comprehensive RAG) benchmark for RAG agents
2. [TAG-Bench](./TAG-Bench/README.md) for SQL agents

These agent benchmarks are enabled on Intel Gaudi systems using vLLM as the LLM serving framework. You can also choose to serve the models on other hardware with vLLM.

We will add more benchmarks for agents in the future. Stay tuned.
131 changes: 131 additions & 0 deletions evals/evaluation/agent_eval/TAG-Bench/README.md
@@ -0,0 +1,131 @@
# TAG-Bench for evaluating SQL agents
## Overview of TAG-Bench
[TAG-Bench](https://github.com/TAG-Research/TAG-Bench) is a benchmark published in 2024 by Stanford University and the University of California, Berkeley, and advocated by Databricks, for evaluating GenAI systems on challenging questions over SQL databases. The questions in TAG-Bench require a GenAI system not only to translate natural language queries into SQL, but also to combine information from other sources and perform reasoning. There are 80 questions in total, 20 in each of the sub-categories: match-based, comparison, ranking, and aggregation queries. The questions cover 5 databases selected from Alibaba's [BIRD](https://bird-bench.github.io/) Text2SQL benchmark: california_schools, debit_card_specializing, formula_1, codebase_community, and european_football_2. For more information, please read the [paper](https://arxiv.org/pdf/2408.14717).
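
If you want a quick look at the raw questions before running anything, the short sketch below (not part of this PR) loads `tag_queries.csv` with pandas and counts questions per database. The file name and the `DB used` column are the ones used by the preprocessing script in this PR; everything else is illustrative.
```python
# Illustrative only: inspect the TAG-Bench queries after cloning the dataset
# into $WORKDIR/TAG-Bench (see "Getting started" below).
import os

import pandas as pd

queries_path = os.path.join(os.environ["WORKDIR"], "TAG-Bench", "tag_queries.csv")
df = pd.read_csv(queries_path)

print("Total questions:", len(df))
print("Questions per database:")
print(df["DB used"].value_counts())  # "DB used" is the column split_data.py groups by
```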

## Getting started
1. Set up the environment
```bash
export WORKDIR=<your-work-directory>
mkdir $WORKDIR/hf_cache
export HF_CACHE_DIR=$WORKDIR/hf_cache
export HF_HOME=$HF_CACHE_DIR
export HF_TOKEN=<your-huggingface-api-token>
export HUGGINGFACEHUB_API_TOKEN=$HF_TOKEN
export PYTHONPATH=$PYTHONPATH:$WORKDIR/GenAIEval/
```
2. Download this repo in your work directory
```bash
cd $WORKDIR
git clone https://github.com/opea-project/GenAIEval.git
```
3. Create a conda environment
```bash
conda create -n agent-eval-env python=3.10
conda activate agent-eval-env
pip install -r $WORKDIR/GenAIEval/evals/evaluation/agent_eval/docker/requirements.txt
```
4. Download data
```bash
cd $WORKDIR
git clone https://github.com/TAG-Research/TAG-Bench.git
cd TAG-Bench/setup
chmod +x get_dbs.sh
./get_dbs.sh
```
5. Preprocess data
```bash
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/preprocess_data/
bash run_data_split.sh
```
6. Generate hints file for each database in TAG-Bench
```bash
python3 generate_hints.py
```
The hints are generated from the description files that come with the TAG-Bench dataset; they are simply the column descriptions provided in the dataset. They can be used by the SQL agent to improve performance; a sketch of how an agent might consume them is shown below.
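
The sketch below (not part of the benchmark scripts) loads a generated `<db_name>_hints.csv` and formats the rows into a prompt-style preamble. The column names match what `generate_hints.py` writes; the prompt wording is only an example.
```python
# Illustrative only: turn the generated hints into text an agent could prepend
# to its system prompt. Columns table_name, column_name, description are the
# ones written by generate_hints.py.
import os

import pandas as pd

db_name = "california_schools"
hints_path = os.path.join(os.environ["WORKDIR"], "TAG-Bench", f"{db_name}_hints.csv")
hints = pd.read_csv(hints_path)

hint_lines = [
    f"- {row.table_name}.{row.column_name}: {row.description}"
    for row in hints.itertuples(index=False)
]
print("Column hints:\n" + "\n".join(hint_lines[:5]))  # show the first few hints
```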

7. Launch the LLM endpoint on Gaudi.

This LLM will be used by the agent and will also serve as the LLM judge when scoring the agent's answers. By default, the `meta-llama/Meta-Llama-3.1-70B-Instruct` model is served using 4 Gaudi cards.
```bash
# First build vllm image for Gaudi
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/vllm-gaudi
bash build_image.sh
```
Then launch the vLLM endpoint with the command below.
```bash
bash launch_vllm_gaudi.sh
```

8. Validate that the vLLM endpoint is working properly. A quick manual check is also sketched below the command.
```bash
python3 test_vllm_endpoint.py
```
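
If you prefer a manual check, a rough equivalent of `test_vllm_endpoint.py` is sketched below. It assumes vLLM serves an OpenAI-compatible API on port 8085 (the port used by the scripts in this PR) on the local host.
```python
# Manual sanity check of the vLLM endpoint; roughly what test_vllm_endpoint.py does.
# Assumes the OpenAI-compatible server is reachable at localhost:8085.
import requests

url = "http://localhost:8085/v1/chat/completions"
payload = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Say 'hello' and nothing else."}],
    "max_tokens": 16,
}
resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```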

## Launch your SQL agent
You can create and launch your own SQL agent. Here we show the OPEA `sql_agent_llama` as an example; follow the steps below to launch it.
1. Download OPEA GenAIComps repo
```bash
cd $WORKDIR
git clone https://github.com/opea-project/GenAIComps.git
```
2. Build docker image for OPEA agent
```bash
cd $WORKDIR/GenAIComps
export agent_image="opea/agent:comps"
docker build --no-cache -t $agent_image --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -f comps/agent/src/Dockerfile .
```
3. Set up the environment for the agent's `search_web` tool.
```bash
export GOOGLE_CSE_ID=<your-GOOGLE_CSE_ID>
export GOOGLE_API_KEY=<your-GOOGLE_API_KEY>
```
For instructions on how to obtain your `GOOGLE_CSE_ID` and `GOOGLE_API_KEY`, refer to the LangChain Google Search documentation [here](https://python.langchain.com/docs/integrations/tools/google_search/). You can optionally verify the credentials with the sketch below.
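
The short check below uses the same `GoogleSearchAPIWrapper` as the `search_web` tool in this PR; it assumes the required Google client packages are installed and the two environment variables above are exported.
```python
# Optional: verify the Google Programmable Search credentials before launching the agent.
from langchain_google_community import GoogleSearchAPIWrapper

search = GoogleSearchAPIWrapper()  # reads GOOGLE_CSE_ID and GOOGLE_API_KEY from the environment
print(search.run("What year was the first Formula 1 season?")[:300])
```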

4. Launch SQL agent
```bash
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/opea_sql_agent_llama
bash launch_sql_agent.sh california_schools
```
The command above launches a SQL agent that interacts with the `california_schools` database. We also provide a script to run the benchmark on all databases. Once the container is up, you can send it a quick smoke-test query as sketched below.
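
The sketch below assumes the agent exposes an OpenAI-style `/v1/chat/completions` route on port 9095 (the port mapped in `sql_agent_llama.yaml`); check the GenAIComps agent documentation if the request schema differs in your version.
```python
# Smoke test of the running SQL agent container (assumed OpenAI-style route on port 9095).
import requests

url = "http://localhost:9095/v1/chat/completions"
payload = {
    "messages": [
        {"role": "user", "content": "How many schools are there in the California schools database?"}
    ]
}
resp = requests.post(url, json=payload, timeout=300)
print(resp.status_code)
print(resp.text[:500])  # print the beginning of the agent's response
```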

## Run the benchmark
1. Generate answers
```bash
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/run_benchmark
bash run_generate_answer.sh california_schools
```
2. Grade the answers
```bash
bash run_grading.sh california_schools
```
Here we use the ragas `answer_correctness` metric to measure performance. By default, we use `meta-llama/Meta-Llama-3.1-70B-Instruct` as the LLM judge, reusing the vLLM endpoint launched earlier (see [Getting started](#getting-started)). A rough standalone sketch of the metric computation is shown below.
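
The actual grading is done by `run_grading.sh` through this repo's evaluation utilities; the sketch below only illustrates the ragas `answer_correctness` call. The embedding model and the sample question/answer pair are purely illustrative, and the column names follow the classic ragas schema (newer ragas releases may use different names).
```python
# Illustrative standalone ragas answer_correctness computation against the vLLM judge.
from datasets import Dataset
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import answer_correctness

# LLM judge served by the vLLM endpoint from "Getting started" (port 8085).
llm = ChatOpenAI(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    base_url="http://localhost:8085/v1",
    api_key="EMPTY",
    temperature=0.0,
)
# Embedding model choice is illustrative; answer_correctness needs embeddings for its similarity part.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

data = Dataset.from_dict(
    {
        "question": ["<a TAG-Bench question>"],
        "answer": ["<the agent's answer>"],
        "ground_truth": ["<the golden answer>"],
    }
)
result = evaluate(data, metrics=[answer_correctness], llm=llm, embeddings=embeddings)
print(result)
```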

3. Run the benchmark on all databases

If you want to run all 80 questions spanning the 5 databases, run the command below.
```bash
bash run_all_databases.sh
```
This script iterates over the databases, generating and then grading answers for the questions on each one.

## Benchmark results
We tested OPEA `sql_agent_llama` on all 80 questions in TAG-Bench.

Human grading criteria:
- Score 1: exact match with golden answer
- Score 0.7: match with golden answer except for the ordering of the entities
- Score 0.5: contains only part of the information in the golden answer, and does not contain information absent from the golden answer
- Score 0: otherwise

|Database|Average human score|Average ragas `answer_correctness`|
|--------|-------------------|----------------------------------|
|california_schools|0.264|0.415|
|codebase_community|0.262|0.404|
|debit_card_specializing|0.75|0.753|
|formula_1|0.389|0.596|
|european_football_2|0.25|0.666|
|**Overall Average (ours)**|0.31 (0.28 if strict exact match)|0.511|
|**Text2SQL (TAG-Bench paper)**|0.17||
|**Human performance (TAG-Bench paper)**|0.55||

We can see that our SQL agent achieved much higher accuracy than the Text2SQL baseline reported in the TAG-Bench paper, although it is still lower than human experts.
@@ -0,0 +1,38 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

#set -xe
echo $WORKDIR
EVALDIR=${WORKDIR}/GenAIEval/evals/evaluation/agent_eval/TAG-Bench

# agent vars
export agent_image="opea/agent:comps"
export recursion_limit=15

# LLM endpoint
export ip_address=$(hostname -I | awk '{print $1}')
vllm_port=8085
export HUGGINGFACEHUB_API_TOKEN=${HF_TOKEN}
export LLM_MODEL_ID="meta-llama/Meta-Llama-3.1-70B-Instruct"
export LLM_ENDPOINT_URL="http://${ip_address}:${vllm_port}"
echo "LLM_ENDPOINT_URL=${LLM_ENDPOINT_URL}"
export temperature=0.01
export max_new_tokens=4096

# Tools
export TOOLSET_PATH=${EVALDIR}/opea_sql_agent_llama/tools/
ls ${TOOLSET_PATH}
# for using Google search API
export GOOGLE_CSE_ID=${GOOGLE_CSE_ID}
export GOOGLE_API_KEY=${GOOGLE_API_KEY}

function start_sql_agent_llama_service(){
    export db_name=$1
    export db_path="sqlite:////home/user/TAG-Bench/dev_folder/dev_databases/${db_name}/${db_name}.sqlite"
    docker compose -f ${EVALDIR}/opea_sql_agent_llama/sql_agent_llama.yaml up -d
    # sleep 1m
}

db_name=$1
start_sql_agent_llama_service $db_name
@@ -0,0 +1,38 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

services:
  agent:
    image: ${agent_image}
    container_name: sql-agent-endpoint
    volumes:
      - ${TOOLSET_PATH}:/home/user/tools/ # tools
      # - ${WORKDIR}/GenAIComps/comps:/home/user/comps # code
      - ${WORKDIR}/TAG-Bench/:/home/user/TAG-Bench # SQL database and hints_file
    ports:
      - "9095:9095"
    ipc: host
    environment:
      ip_address: ${ip_address}
      strategy: sql_agent_llama
      with_memory: false
      db_name: ${db_name}
      db_path: ${db_path}
      use_hints: true
      hints_file: /home/user/TAG-Bench/${db_name}_hints.csv
      recursion_limit: ${recursion_limit}
      llm_engine: vllm
      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      llm_endpoint_url: ${LLM_ENDPOINT_URL}
      model: ${LLM_MODEL_ID}
      temperature: ${temperature}
      max_new_tokens: ${max_new_tokens}
      stream: false
      tools: /home/user/tools/sql_agent_tools.yaml
      require_human_feedback: false
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      port: 9095
      GOOGLE_CSE_ID: ${GOOGLE_CSE_ID} # delete
      GOOGLE_API_KEY: ${GOOGLE_API_KEY} # delete
@@ -0,0 +1,19 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


def search_web(query: str) -> str:
    """Search the web for information not contained in databases."""
    from langchain_core.tools import Tool
    from langchain_google_community import GoogleSearchAPIWrapper

    search = GoogleSearchAPIWrapper()

    tool = Tool(
        name="google_search",
        description="Search Google for recent results.",
        func=search.run,
    )

    response = tool.run(query)
    return response
@@ -0,0 +1,11 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

search_web:
  description: Search the web for a given query.
  callable_api: sql_agent_tools.py:search_web
  args_schema:
    query:
      type: str
      description: query
  return_output: retrieved_data
@@ -0,0 +1,55 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import glob
import os

import pandas as pd


def generate_column_descriptions(db_name):
    """Collect column descriptions from the TAG-Bench database_description files and save them as a hints CSV."""
    descriptions = []
    working_dir = os.getenv("WORKDIR")
    assert working_dir is not None, "WORKDIR environment variable is not set."
    DESCRIPTION_FOLDER = os.path.join(
        working_dir, f"TAG-Bench/dev_folder/dev_databases/{db_name}/database_description/"
    )
    table_files = glob.glob(os.path.join(DESCRIPTION_FOLDER, "*.csv"))
    table_name_col = []
    col_name_col = []
    for table_file in table_files:
        table_name = os.path.basename(table_file).split(".")[0]
        print("Table name: ", table_name)
        df = pd.read_csv(table_file, encoding_errors="ignore")
        for _, row in df.iterrows():
            col_name = row["original_column_name"]
            if not pd.isnull(row["value_description"]):
                description = str(row["value_description"])
                if description.lower() in col_name.lower():
                    # Skip descriptions that merely repeat the column name.
                    print("Description {} is same as column name {}".format(description, col_name))
                else:
                    description = description.replace("\n", " ")
                    description = " ".join(description.split())
                    descriptions.append(description)
                    table_name_col.append(table_name)
                    col_name_col.append(col_name)
    hints_df = pd.DataFrame({"table_name": table_name_col, "column_name": col_name_col, "description": descriptions})
    tag_bench_dir = os.path.join(working_dir, "TAG-Bench")
    output_file = os.path.join(tag_bench_dir, f"{db_name}_hints.csv")
    hints_df.to_csv(output_file, index=False)
    print(f"Generated hints file: {output_file}")


if __name__ == "__main__":
    tag_bench_dir = os.path.join(os.getenv("WORKDIR"), "TAG-Bench/dev_folder/dev_databases/")
    subfolders = [f.name for f in os.scandir(tag_bench_dir) if f.is_dir()]
    print("Databases: ", subfolders)
    for db_name in subfolders:
        print("Generating hints for database: ", db_name)
        generate_column_descriptions(db_name)
        print("=" * 30)
@@ -0,0 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

DATAPATH=$WORKDIR/TAG-Bench/tag_queries.csv
OUTFOLDER=$WORKDIR/TAG-Bench/query_by_db
python3 split_data.py --path $DATAPATH --output $OUTFOLDER
@@ -0,0 +1,27 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import os

import pandas as pd

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", type=str, required=True)
    parser.add_argument("--output", type=str, required=True)
    args = parser.parse_args()

    # if output folder does not exist, create it
    if not os.path.exists(args.output):
        os.makedirs(args.output)

    # Load the data
    data = pd.read_csv(args.path)

    # Split the data by domain
    domains = data["DB used"].unique()
    for domain in domains:
        domain_data = data[data["DB used"] == domain]
        out = os.path.join(args.output, f"query_{domain}.csv")
        domain_data.to_csv(out, index=False)
@@ -0,0 +1,26 @@
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import glob
import os

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--filedir", type=str, required=True, help="Directory containing the csv files")
args = parser.parse_args()

filedir = args.filedir
csv_files = glob.glob(os.path.join(filedir, "*_graded.csv"))
print("Number of score files found: ", len(csv_files))
print(csv_files)

df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
print(df.columns)
print("Average score of all questions: ", df["answer_correctness"].mean())

# average score per csv file
for f in csv_files:
    df = pd.read_csv(f)
    print(f, df["answer_correctness"].mean())